Overview
At Bluebono, I designed and built an end-to-end loan-to-value (LTV) estimation and market-scoring pipeline for residential properties. The project spanned the full data science lifecycle, from scoping and data acquisition through model deployment.
Key Contributions
Data Acquisition & Vendor Management
Directed dataset acquisition efforts by:
- Evaluating multiple data vendors and their offerings
- Negotiating contracts for optimal pricing and data access
- Integrating a real-time data feed from the California Regional Multiple Listing Service (CRMLS) via the Trestle API
This established a reliable, up-to-date data pipeline for property information across California.
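A minimal sketch of how a listing feed like this could be paged through, assuming the service returns OData-style JSON pages (a `value` array plus an optional `@odata.nextLink`), as RESO Web API services generally do. The function names, the bearer-token scheme, and the URLs are illustrative assumptions, not the production integration.

```python
import json
from typing import Callable, Iterator
from urllib.request import Request, urlopen


def http_get_json(url: str, token: str) -> dict:
    """Fetch one JSON page. Bearer-token auth is an assumption here."""
    request = Request(url, headers={"Authorization": f"Bearer {token}"})
    with urlopen(request) as response:
        return json.load(response)


def iter_records(first_url: str, get_json: Callable[[str], dict]) -> Iterator[dict]:
    """Yield every record across pages by following '@odata.nextLink'.

    `get_json` is injected so the paging logic can be exercised without
    network access (for example, with a stub in tests).
    """
    url = first_url
    while url:
        page = get_json(url)
        yield from page.get("value", [])
        # The absence of a next link marks the last page.
        url = page.get("@odata.nextLink")
```

In use, `get_json` would wrap `http_get_json` with a real token, for example `lambda url: http_get_json(url, token)`. Keeping the transport separate from the paging loop makes retries and rate limiting easier to add later.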
Large-Scale Data Engineering
Built and processed a dataset of:
- 11.3 million property records
- 1,000+ market features per property
This required building robust ETL pipelines capable of handling data at scale while maintaining data quality and consistency.
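The chunked-processing idea behind an ETL step like this can be sketched with Pandas. This is an illustrative example only: the column names `property_id` and `list_price`, the validation rules, and the CSV source are assumptions rather than the actual schema.

```python
import pandas as pd

# Hypothetical required columns; the real schema is not shown here.
REQUIRED_COLUMNS = ["property_id", "list_price"]


def clean_chunk(chunk: pd.DataFrame) -> pd.DataFrame:
    """Apply basic quality checks to one chunk of records."""
    missing = [c for c in REQUIRED_COLUMNS if c not in chunk.columns]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    chunk = chunk.dropna(subset=REQUIRED_COLUMNS)
    # Drop rows with non-positive prices (an assumed sanity rule).
    return chunk[chunk["list_price"] > 0]


def process_in_chunks(source, chunksize: int = 100_000) -> pd.DataFrame:
    """Read a CSV in fixed-size chunks, clean each one, and drop repeat IDs.

    The first valid row seen for each property_id is kept. Cleaned chunks
    are concatenated in memory to keep the sketch short; at larger scale
    each chunk would be written to storage instead.
    """
    seen_ids = set()
    cleaned = []
    for chunk in pd.read_csv(source, chunksize=chunksize):
        chunk = clean_chunk(chunk)
        chunk = chunk.drop_duplicates(subset="property_id")
        chunk = chunk[~chunk["property_id"].isin(seen_ids)]
        seen_ids.update(chunk["property_id"])
        cleaned.append(chunk)
    if not cleaned:
        return pd.DataFrame(columns=REQUIRED_COLUMNS)
    return pd.concat(cleaned, ignore_index=True)
```

Processing in chunks bounds memory use per step, and tracking seen IDs across chunks keeps deduplication consistent even when duplicates land in different chunks.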
Feature Engineering & Dimensionality Reduction
Applied unsupervised feature clustering techniques to:
- Group highly correlated variables together
- Reduce dimensionality by 65%
- Improve model interpretability without sacrificing predictive power
The smaller, grouped feature set made the resulting models easier to explain to business stakeholders.
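One way to group highly correlated variables is to link any two features whose absolute correlation exceeds a threshold and treat each connected group as a cluster (a single-linkage style grouping). The sketch below illustrates that idea under assumed names and an assumed threshold of 0.9; it is not necessarily the clustering method used in the project.

```python
import numpy as np
import pandas as pd


def cluster_correlated_features(df: pd.DataFrame, threshold: float = 0.9) -> list[list[str]]:
    """Group columns connected by |Pearson correlation| >= threshold."""
    corr = df.corr().abs().to_numpy()
    n = corr.shape[0]
    parent = list(range(n))  # union-find forest over column indices

    def find(i: int) -> int:
        # Walk to the root, halving the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            # NaN correlations (e.g. constant columns) compare False and are skipped.
            if corr[i, j] >= threshold:
                parent[find(j)] = find(i)

    groups: dict[int, list[str]] = {}
    for idx, col in enumerate(df.columns):
        groups.setdefault(find(idx), []).append(col)
    return list(groups.values())


def pick_representatives(groups: list[list[str]]) -> list[str]:
    """Keep the first column of each group as its representative."""
    return [cols[0] for cols in groups]


# Synthetic demo: two pairs of near-duplicate features (hypothetical names).
rng = np.random.default_rng(0)
a, b = rng.normal(size=500), rng.normal(size=500)
demo = pd.DataFrame({
    "sqft": a,
    "living_area": a + 0.01 * rng.normal(size=500),
    "days_on_market": b,
    "days_listed": b + 0.01 * rng.normal(size=500),
})
demo_groups = cluster_correlated_features(demo)
demo_kept = pick_representatives(demo_groups)
```

Keeping one representative per group shrinks the feature set while leaving each retained feature with a clear meaning, which is what makes the reduced model easier to explain.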
Technical Stack
- Data Processing: Python, Pandas, SQL
- APIs: Trestle API (CRMLS)
- ML Techniques: Unsupervised Clustering, Feature Selection
- Scale: 11.3M records, 1,000+ features per property
Impact
The pipeline I built enables Bluebono to provide accurate property valuations and market scores, helping clients make informed decisions in the competitive California real estate market.