Client Profile.
The company is a Tennessee-based property data-solutions provider, delivering technology-driven property intelligence to customers across the USA. Through innovative products and services, the company helps realtors, appraisers, investors, and mortgage companies meet their immense and diverse property data requirements.
Business Need.
To provide comprehensive, accurate, and current property information to its customers, the company aggregates data at scale, on an ongoing basis, from multiple online and offline sources. The data points aggregated include property name, location, owner, valuation, and neighborhood, among others. This property data is captured from:
- Thousands of transactional and other property documents, including assignment, discharge, foreclosure agreements, warranty deeds, contract sales, mortgage, and auction documents. These were available in multiple formats, including text, PDF, and image.
- Online sources such as property websites across three states and 195 counties.
To manage these voluminous data aggregation activities and ensure that the quality, relevance, and currency of the data were consistently maintained, the company partnered with HitechDigital. The partnership also aimed to increase scalability and flexibility while optimizing operational costs.
Challenges.
- Identifying relevant property data points from thousands of property records required significant understanding of property terminology and domain expertise.
- Most of the documents were unstructured and relevant data points were positioned differently in each. This made the data capture process complex.
- Decoding data from handwritten documents was also a challenge.
- Different counties and states had different regulatory and compliance requirements, which necessitated a specific understanding of each.
- Several documents contained sensitive property information; maintaining data security and confidentiality was crucial.
Solution.
Data specialists at HitechDigital implemented a property data aggregation workflow for ongoing capture of hundreds and thousands of data points from multiple offline and online data sources. The structured workflow was a blend of manual and automated data capture and validation processes.
Approach.
Hiring and Training
- Given the knowledge-intensive nature of the project, a team of experienced data specialists was hired and trained to process hundreds of property document types. The training covered property types and associated terms, the interpretation of complex property documents, and state and county regulatory and compliance requirements.
Data Extraction from Documents
- Data was manually extracted from documents shared by the client in PDF, Word, or image formats and entered into a robust database in CSV format.
- Missing data fields were filled in through web research.
- The data was standardized to maintain format consistency.
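The standardization step above can be illustrated with a minimal sketch. It is not the project's actual code: the column names in `FIELDS`, the sample values, and the cleanup rules (whitespace trimming, numeric valuations) are assumptions chosen for illustration, using only Python's standard library.

```python
import csv
import io

# Hypothetical column set; the case study names these data points only loosely.
FIELDS = ["property_name", "location", "owner", "valuation"]

def standardize_record(raw: dict) -> dict:
    """Return a cleaned copy of a manually keyed record."""
    record = {}
    for field in FIELDS:
        value = str(raw.get(field) or "")
        # Trim the ends and collapse internal runs of whitespace.
        record[field] = " ".join(value.split())
    # Normalize valuations such as "$1,250,000" to a plain decimal string.
    # Assumption: non-empty valuations are numeric once "$" and "," are removed.
    digits = record["valuation"].replace("$", "").replace(",", "")
    record["valuation"] = f"{float(digits):.2f}" if digits else ""
    return record

def to_csv(records: list[dict]) -> str:
    """Serialize standardized records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(standardize_record(r) for r in records)
    return buffer.getvalue()
```

For example, `standardize_record({"property_name": "  Oak   Ridge Lot ", "valuation": "$1,250,000"})` yields a record whose name is `"Oak Ridge Lot"` and whose valuation is `"1250000.00"`, with the absent fields left empty for the web-research step to fill.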
Data Collection from Online Sources
- Custom crawlers were developed using Python-based scripts; they were trained to recognize and capture specific property data points.
- The crawlers fetched property records from online property websites covering three states and 195 counties and imported them into a robust CSV database.
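The case study does not describe how the crawlers were built, so the following is only a rough sketch of one possible approach, using Python's standard library. It assumes a hypothetical county page that lists records in an HTML table; the class and function names (`TableRowParser`, `parse_property_rows`, `fetch_county_page`) and the column names are illustrative, not taken from the project.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

class TableRowParser(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows built from <th> cells stay empty
                self.rows.append(self._row)
            self._row = None
            self._cell = None

def parse_property_rows(html: str, columns: list[str]) -> list[dict]:
    """Map each table row onto the given (hypothetical) column names."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    # Skip rows whose cell count does not match the expected layout.
    return [dict(zip(columns, row)) for row in parser.rows if len(row) == len(columns)]

def fetch_county_page(url: str) -> str:
    """Download one page; the URL is supplied by the caller."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")
```

Keeping the download and the parsing in separate functions means the parsing rules for each site can be checked against saved pages without touching the network. A production crawler would also need per-site layouts, rate limiting, and error handling, none of which is shown here.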
Audit and Quality Check
- A double-layered audit was run to ensure the accuracy of records.
- This included manual quality checks based on random sampling and rule-based validation of the complete dataset.
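The two audit layers can be sketched as follows. This is a minimal illustration under assumed rules: the field names, the specific validation checks, the 5% sample rate, and the function names `validate_record` and `audit` are hypothetical and not stated in the source.

```python
import random

def validate_record(record: dict) -> list[str]:
    """Rule-based layer: return the rule violations found in one record."""
    errors = []
    for field in ("property_name", "location", "owner"):
        if not str(record.get(field) or "").strip():
            errors.append(f"missing {field}")
    try:
        if float(record.get("valuation")) <= 0:
            errors.append("valuation must be positive")
    except (TypeError, ValueError):
        errors.append("valuation is not numeric")
    return errors

def audit(records: list[dict], sample_rate: float = 0.05, seed: int = 0):
    """Validate every record, then draw a random sample for manual review.

    Returns (failures, manual_sample): failures maps a record index to its
    rule violations; manual_sample lists indices of rule-passing records
    selected for a human quality check.
    """
    failures = {}
    for index, record in enumerate(records):
        errors = validate_record(record)
        if errors:
            failures[index] = errors
    passed = [i for i in range(len(records)) if i not in failures]
    # Review at least one record whenever any record passed the rules.
    sample_size = min(len(passed), max(1, round(len(passed) * sample_rate))) if passed else 0
    manual_sample = sorted(random.Random(seed).sample(passed, sample_size))
    return failures, manual_sample
```

Running the rules over the complete dataset catches mechanical errors cheaply, while the seeded random sample gives reviewers a reproducible subset to check for errors that rules cannot express, such as a misread handwritten name.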
Deliverables
- Validated and standardized data captured from online and document sources was uploaded to the client portal using Access DB through secured credentials.
Tools and Technology Used: Python, Scripts and Macros