Client Profile.
The client is a US-based publisher of digital and print periodicals for the real estate industry. Based in New England, the client delivers property insights covering numerous counties across the USA to its subscribers.
Business Needs.
The client published periodicals for the real estate industry but had a limited subscription database, which constrained its customer reach. To widen circulation and derive specific insights for tailored marketing campaigns, the client needed to aggregate larger datasets. It needed a partner to:
- Capture data such as contact details, geography, and county information from internet sources including MLS sites, county websites, and property documents
- Aggregate and cleanse the unstructured dataset for accuracy and consistency
- Standardize the database so it was fit for presentation and for deriving insights
The Challenges.
Because the information had to be captured from multiple sources, the engagement posed major data acquisition challenges:
- Managing a huge volume of data sources spanning more than 10 million property documents, and scanning each one for accurate information
- Maintaining up-to-date records, such as owner and contact details, for every property
- Developing a cleansing mechanism to make the huge unstructured dataset CRM-ready
- Finding the right skill sets to accurately capture data from different sources and formats
- Building domain knowledge of each geography to categorize the data correctly
Solution.
A mixed framework of manual and automated data aggregation methods was set up to collect, collate, and cleanse the data and to create a marketing-ready, insight-rich database. Multi-layered quality checks were designed to ensure data accuracy and consistency.
Approach.
- A data aggregation workflow was designed to allocate tasks to a team of 500+ data entry experts. The team:
- Worked in shifts to aggregate data across multiple geographies and time zones and to meet fluctuating workloads
- Could accommodate volume fluctuations of up to 30% without affecting the productivity or quality of deliverables
- According to a predefined schedule, data was scraped and aggregated from various MLS websites, sales deeds, mortgage papers, and other information available on county websites, using both automated and manual methods
- The team designed custom scraper bots to extract data from the websites, along with bots trained to pass CAPTCHA challenges whenever they were encountered (a minimal scraper sketch follows this list)
- The captured information was cleansed and populated into the client’s database in the required format
- 28,000+ potential customer records, including multiple demographic data points, were aggregated from six different states and counties
- The database was standardized to meet the client’s naming conventions, including ZIP code formats and spell-checked, correctly capitalized owner names; duplicate entries were removed and incomplete records were updated (see the standardization sketch after this list)
- Quality checks:
- Multi-layered quality checks were performed by senior QC personnel against a predefined checklist to meet ISO standard requirements
- Logical data controls and rules-driven validation algorithms were used to check and verify every item in the records (a validation sketch follows this list)
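To make the scraping step concrete, below is a minimal sketch of the kind of scraper bot described above. The URL, page structure, and field names are hypothetical, since the case study does not name the actual sites, and CAPTCHA handling, which the team's trained bots performed, is left out of scope here.

```python
import time

import requests
from bs4 import BeautifulSoup

# Hypothetical county records page; the real MLS/county URLs and the
# CSS selectors below are assumptions for illustration.
BASE_URL = "https://records.example-county.gov/property"

def fetch_page(page: int) -> str:
    """Fetch one listing page, retrying politely on transient errors."""
    for attempt in range(3):
        resp = requests.get(BASE_URL, params={"page": page}, timeout=30)
        if resp.status_code == 200:
            return resp.text
        time.sleep(2 ** attempt)  # simple backoff between retries
    resp.raise_for_status()

def parse_records(html: str) -> list[dict]:
    """Pull owner, address, and county fields out of each listing row."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for row in soup.select("div.property-row"):  # assumed page structure
        records.append({
            "owner": row.select_one(".owner-name").get_text(strip=True),
            "address": row.select_one(".address").get_text(strip=True),
            "county": row.select_one(".county").get_text(strip=True),
        })
    return records

if __name__ == "__main__":
    all_records = []
    for page in range(1, 6):  # one scheduled batch of pages
        all_records.extend(parse_records(fetch_page(page)))
    print(f"captured {len(all_records)} records")
```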
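The standardization step can be illustrated with a short sketch as well. The field names and conventions used here (five-digit ZIP codes, capitalized owner names, deduplication on owner plus address) are assumptions for illustration; the client's actual naming conventions are not detailed in the case study.

```python
import re

def standardize(record: dict) -> dict:
    """Apply naming conventions: 5-digit ZIPs and capitalized owner names."""
    rec = dict(record)
    # Normalize ZIP: keep digits only, left-pad to five ("2139" -> "02139").
    digits = re.sub(r"\D", "", rec.get("zip", ""))
    rec["zip"] = digits[:5].zfill(5) if digits else ""
    # Capitalize each part of the owner name ("jANE doE" -> "Jane Doe").
    rec["owner"] = " ".join(w.capitalize() for w in rec.get("owner", "").split())
    return rec

def deduplicate(records: list[dict]) -> list[dict]:
    """Drop duplicate entries keyed on owner + address."""
    seen, unique = set(), []
    for rec in records:
        key = (rec["owner"].lower(), rec.get("address", "").lower())
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

raw = [
    {"owner": "jANE doE", "address": "12 Main St", "zip": "2139"},
    {"owner": "Jane Doe", "address": "12 main st", "zip": "02139"},
]
print(deduplicate([standardize(r) for r in raw]))
# -> one record: {'owner': 'Jane Doe', 'address': '12 Main St', 'zip': '02139'}
```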
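Finally, a sketch of the rules-driven validation mentioned under quality checks. The specific rules and fields are illustrative; the actual predefined checklist is not published in the case study.

```python
import re

# Illustrative rule set; each rule returns True when the field passes.
RULES = {
    "owner":  lambda v: bool(v and v.strip()),            # owner must be present
    "zip":    lambda v: bool(re.fullmatch(r"\d{5}", v)),  # exactly five digits
    "county": lambda v: bool(v and v.strip()),            # county must be present
    # Phone is optional, but if present it must contain ten digits.
    "phone":  lambda v: v == "" or bool(re.fullmatch(r"\d{10}", re.sub(r"\D", "", v))),
}

def validate(record: dict) -> list[str]:
    """Run every rule against the record; return the fields that fail."""
    return [field for field, rule in RULES.items()
            if not rule(record.get(field, ""))]

record = {"owner": "Jane Doe", "zip": "02139",
          "county": "Middlesex", "phone": "(617) 555-0100"}
failures = validate(record)
print("record OK" if not failures else f"failed checks: {failures}")
```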