Data Analyst Intern at Bluebono

Jun 1, 2025

Overview

At Bluebono, I designed and implemented a loan-to-value (LTV) estimation and market-score evaluation pipeline for residential properties. The project covered the full data science lifecycle, from project scoping and data acquisition through model deployment.

Key Contributions

Data Acquisition & Vendor Management

Directed dataset acquisition efforts by:

  • Evaluating multiple data vendors and their offerings
  • Negotiating contracts for optimal pricing and data access
  • Integrating a real-time data feed from the California Regional Multiple Listing Service (CRMLS) via the Trestle API

This established a reliable, up-to-date data pipeline for property information across California.
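As a rough illustration of the ingestion step, here is a minimal sketch of paging through a listing feed. It assumes the feed returns OData-style JSON pages with `value` and `@odata.nextLink` keys, as RESO Web API services commonly do. The URLs, the `fake_fetch` stand-in, and the sample records are hypothetical and are not taken from the production integration.

```python
from typing import Callable, Iterator, Optional

# Assumed page shape (OData-style JSON), not confirmed by the original write-up:
# {"value": [...records...], "@odata.nextLink": "<url of next page>"}
Page = dict


def iter_listings(fetch_page: Callable[[str], Page], start_url: str) -> Iterator[dict]:
    """Yield every record from a paginated feed, following next-page links."""
    url: Optional[str] = start_url
    while url:
        page = fetch_page(url)
        yield from page.get("value", [])
        # The last page is assumed to omit the next link, which ends the loop.
        url = page.get("@odata.nextLink")


# Stand-in for an authenticated HTTP GET; URLs and records are made up.
_FAKE_PAGES = {
    "https://example.invalid/Property?page=1": {
        "value": [{"ListingKey": "A1", "ListPrice": 750000}],
        "@odata.nextLink": "https://example.invalid/Property?page=2",
    },
    "https://example.invalid/Property?page=2": {
        "value": [{"ListingKey": "B2", "ListPrice": 1200000}],
    },
}


def fake_fetch(url: str) -> Page:
    return _FAKE_PAGES[url]


listings = list(iter_listings(fake_fetch, "https://example.invalid/Property?page=1"))
```

Passing the fetch function in as a parameter keeps the paging logic testable without network access; a real client would swap `fake_fetch` for a function that performs the authenticated request.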

Large-Scale Data Engineering

Engineered and processed a massive dataset consisting of:

  • 11.3 million property records
  • 1,000+ market features per property

This required building ETL pipelines that could handle data at this scale while maintaining data quality and consistency.
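One common way to keep memory bounded at this scale is to stream the raw data in chunks and apply quality rules to each chunk. The sketch below shows that pattern with Pandas; the column names, validation rules, and chunk size are assumptions chosen for illustration, not the actual pipeline.

```python
import io

import pandas as pd

# Hypothetical column names used only for this example.
REQUIRED = ["property_id", "list_price", "living_area_sqft"]


def clean_chunk(chunk: pd.DataFrame) -> pd.DataFrame:
    """Apply basic quality rules to one chunk of raw records."""
    chunk = chunk.dropna(subset=REQUIRED)
    # Drop non-positive prices and areas, which cannot be valid measurements.
    return chunk[(chunk["list_price"] > 0) & (chunk["living_area_sqft"] > 0)]


def load_clean(source, chunksize: int = 100_000) -> pd.DataFrame:
    """Stream a CSV in chunks, clean each one, then de-duplicate across chunks."""
    cleaned = [clean_chunk(c) for c in pd.read_csv(source, chunksize=chunksize)]
    combined = pd.concat(cleaned, ignore_index=True)
    # Keep the last row seen for each property so later updates win.
    deduped = combined.drop_duplicates(subset="property_id", keep="last")
    return deduped.reset_index(drop=True)


raw = io.StringIO(
    "property_id,list_price,living_area_sqft\n"
    "1,500000,1200\n"
    "2,,900\n"
    "3,-1,1000\n"
    "1,510000,1200\n"
    "4,800000,2000\n"
)
clean = load_clean(raw, chunksize=2)
```

This version still concatenates the cleaned chunks in memory, which is fine for a sample; for millions of rows, writing each cleaned chunk to disk or a database before de-duplicating would be a more realistic design.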

Feature Engineering & Dimensionality Reduction

Applied unsupervised feature clustering techniques to:

  • Group highly correlated variables together
  • Reduce dimensionality by 65%
  • Improve model interpretability without sacrificing predictive power

Working with groups of related variables instead of 1,000+ individual features made the resulting models easier to explain to business stakeholders while maintaining strong performance.
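The idea of grouping correlated variables can be sketched as follows. The write-up does not name the clustering method used, so this example substitutes a simple stand-in: columns are linked whenever their absolute correlation reaches a threshold, which is equivalent to single-linkage clustering cut at that threshold. The feature names, the synthetic data, and the 0.9 threshold are all assumptions for illustration.

```python
import numpy as np
import pandas as pd


def cluster_correlated(df: pd.DataFrame, threshold: float = 0.9) -> list[list[str]]:
    """Group columns whose absolute pairwise correlation reaches the threshold."""
    corr = df.corr().abs().to_numpy()
    cols = list(df.columns)
    parent = list(range(len(cols)))  # union-find forest over column indices

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Merge the groups of every pair of columns that are strongly correlated.
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr[i, j] >= threshold:
                parent[find(j)] = find(i)

    clusters: dict[int, list[str]] = {}
    for i, name in enumerate(cols):
        clusters.setdefault(find(i), []).append(name)
    return list(clusters.values())


# Synthetic data with made-up feature names: two near-duplicates, one independent.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
features = pd.DataFrame({
    "median_price": base,
    "median_price_per_sqft": base * 2.0 + 0.01 * rng.normal(size=200),
    "days_on_market": rng.normal(size=200),
})
groups = cluster_correlated(features, threshold=0.9)
representatives = [g[0] for g in groups]  # keep one column per group
```

Keeping one representative per group is what reduces dimensionality here: three input columns collapse to two. Picking the first column is arbitrary; choosing the member most correlated with the target would be another reasonable option.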

Technical Stack

  • Data Processing: Python, Pandas, SQL
  • APIs: Trestle API (CRMLS)
  • ML Techniques: Unsupervised Clustering, Feature Selection
  • Scale: 11.3M+ records, 1,000+ features

Impact

The pipeline I built enables Bluebono to provide accurate property valuations and market scores, helping clients make informed decisions in the competitive California real estate market.