Overview
During my Data Analyst internship at Bluebono, I worked on a residential real estate analytics project connecting local housing market conditions to loan-to-value (LTV) decision-making.
The core insight: property risk is not only about the property itself. Two similar homes can carry different lending risk depending on whether they sit in a liquid, competitive ZIP code or in a slower market with rising inventory and longer days on market. My work focused on turning that local market context into a cleaner, more interpretable signal.
What I Built
I built an end-to-end data pipeline for ZIP-level residential property evaluation. The pipeline pulled together large-scale housing market data, cleaned and validated it, engineered market indicators, and fed a scoring framework that adjusts LTV recommendations by local market strength.
The project touched both data science and software engineering:
- Data scale: 11.3M+ raw property and market records across 1,000+ California ZIP codes, with history from ~2012 through 2025
- Market indicators studied: median sale price, price per square foot, homes sold, pending sales, new listings, inventory, days on market, sale-to-list ratio, off-market velocity
- Engineering: API-based property data ingestion via Trestle/CoreLogic, comparable-property matching logic
The goal was not a black-box valuation model — it was a defensible framework that could explain why one local market supports a higher LTV cap while another should be treated more conservatively.
Pipeline Details
Data preparation was the first challenge. Real estate data arrived in multiple formats: monthly market files, property records, percentage fields, price strings, dates, and region identifiers. I wrote cleaning logic to normalize columns, convert prices and percentages to numeric values, align monthly time periods, and filter to relevant California markets.
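A minimal sketch of this kind of cleaning logic, using only the Python standard library. The input formats (`$850K`, `4.2%`, `03/15/2024`) and the function names are assumptions for illustration, not the provider's actual schema or the project's code.

```python
from datetime import datetime


def parse_price(raw):
    """Convert strings like '$1,250,000' or '$850K' to a float; None if unparseable."""
    if raw is None:
        return None
    text = str(raw).strip().replace("$", "").replace(",", "")
    if not text:
        return None
    multiplier = 1.0
    suffix = text[-1].upper()
    if suffix in ("K", "M"):
        # Hypothetical shorthand: K = thousands, M = millions.
        multiplier = 1_000.0 if suffix == "K" else 1_000_000.0
        text = text[:-1]
    try:
        return float(text) * multiplier
    except ValueError:
        return None


def parse_percent(raw):
    """Convert '4.2%' or '-1.5%' to a fraction (0.042, -0.015); None if unparseable."""
    if raw is None:
        return None
    text = str(raw).strip().rstrip("%").strip()
    try:
        return float(text) / 100.0
    except ValueError:
        return None


def parse_month(raw):
    """Align a date string to the first day of its month; None if no format matches."""
    text = str(raw).strip()
    # Assumed formats; a real feed may need others.
    for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%B %Y"):
        try:
            return datetime.strptime(text, fmt).date().replace(day=1)
        except ValueError:
            continue
    return None
```

Returning `None` instead of raising keeps bad cells visible as missing values, so they can be counted and handled explicitly later rather than silently coerced.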
Feature engineering focused on signals that describe market strength from several angles:
| Dimension | Features |
|---|---|
| Price movement | Median sale price, price/sqft, YoY change |
| Liquidity | Homes sold, pending sales, inventory, new listings |
| Market speed | Median days on market |
| Buyer competition | Sale-to-list ratio, homes sold above list, off-market velocity |
| Stability | Volatility and longer-term trend behavior |
Dimensionality reduction: I applied unsupervised clustering to group correlated variables, reducing redundant market features by ~65%. Many real estate variables tell overlapping stories: sale price, list price, price/sqft, and YoY price movement are related, and counting each one separately would over-weight the same underlying signal.
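One simple way to illustrate the idea of grouping correlated variables is a greedy correlation-threshold grouping, sketched below. This is a stand-in chosen for brevity, not necessarily the clustering method used in the project; the `0.9` threshold and the feature names in the usage example are assumptions.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def group_correlated(features, threshold=0.9):
    """Group features whose |correlation| with a group's first member meets the threshold.

    features: dict mapping feature name -> list of values.
    Returns a list of groups; the first name in each group is its representative.
    """
    groups = []
    for name, values in features.items():
        for group in groups:
            representative = features[group[0]]
            if abs(pearson(representative, values)) >= threshold:
                group.append(name)
                break
        else:
            # No existing group is close enough, so this feature starts a new one.
            groups.append([name])
    return groups


# Illustrative values only, not project data.
features = {
    "median_sale_price": [500, 520, 540, 560, 600],
    "price_per_sqft": [300, 312, 325, 336, 361],
    "days_on_market": [30, 45, 28, 50, 33],
}
groups = group_correlated(features)
representatives = [group[0] for group in groups]
```

Keeping one representative per group is what shrinks the feature set: here the two price features collapse into one group, while days on market stays separate.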
Scoring layer: I combined a transparent rule-based baseline (easy to explain to stakeholders) with a gradient-boosted model to capture nonlinear relationships. Feature-importance analysis translated the model back into business-readable drivers.
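To make the rule-based half concrete, here is a minimal sketch of a transparent baseline that maps a few market indicators to an LTV cap and records the reason for each adjustment. Every threshold, the base cap, the step size, and the function names are hypothetical placeholders for illustration; they are not the project's actual rules or any lending policy.

```python
def market_strength_score(days_on_market, sale_to_list, inventory_yoy_change):
    """Score a market from -3 to +3 and list the rules that fired.

    All cutoffs below are hypothetical and chosen only to show the structure.
    """
    score = 0
    reasons = []

    # Market speed
    if days_on_market <= 30:
        score += 1
        reasons.append("homes sell quickly")
    elif days_on_market >= 60:
        score -= 1
        reasons.append("homes sit on the market")

    # Buyer competition
    if sale_to_list >= 1.0:
        score += 1
        reasons.append("sales close at or above list price")
    elif sale_to_list <= 0.97:
        score -= 1
        reasons.append("sales close well below list price")

    # Liquidity
    if inventory_yoy_change <= 0:
        score += 1
        reasons.append("inventory is flat or shrinking")
    elif inventory_yoy_change >= 0.20:
        score -= 1
        reasons.append("inventory is rising sharply")

    return score, reasons


def ltv_cap(score, base=0.70, step=0.025, floor=0.60, ceiling=0.80):
    """Shift a hypothetical base LTV cap by the market score, clamped to [floor, ceiling]."""
    return round(min(ceiling, max(floor, base + step * score)), 4)


strong_score, strong_reasons = market_strength_score(18, 1.03, -0.10)
weak_score, weak_reasons = market_strength_score(75, 0.95, 0.30)
```

Because each rule appends a plain-language reason, the output can be shown to stakeholders directly, which is the property a learned model alone does not provide.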
Comparable-property matching: I prototyped logic for matching a subject property to relevant comps using address similarity, geographic distance, property size, lot size, sale date, price, and listing status — connecting the ZIP-level market score to the practical workflow of evaluating a specific property.
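A minimal sketch of a comp filter along these lines, covering geographic distance, property size, sale date, and listing status. The `Property` fields, the `"Closed"` status value, and the default limits (1 mile, 20% size difference, 180 days) are assumptions for illustration rather than the prototype's actual criteria.

```python
from dataclasses import dataclass
from datetime import date
from math import asin, cos, radians, sin, sqrt
from typing import Optional


@dataclass
class Property:
    address: str
    lat: float
    lon: float
    living_sqft: float
    status: str = "Closed"             # assumed status label
    sale_date: Optional[date] = None   # None for the subject property


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two latitude/longitude points."""
    earth_radius_miles = 3958.8
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * earth_radius_miles * asin(sqrt(a))


def find_comps(subject, candidates, as_of,
               max_miles=1.0, max_sqft_diff=0.20, max_age_days=180):
    """Return closed sales near the subject, similar in size and recent, nearest first."""
    matches = []
    for cand in candidates:
        if cand.status != "Closed" or cand.sale_date is None:
            continue
        age_days = (as_of - cand.sale_date).days
        if age_days < 0 or age_days > max_age_days:
            continue  # sold after the evaluation date, or too long ago
        size_diff = abs(cand.living_sqft - subject.living_sqft) / subject.living_sqft
        if size_diff > max_sqft_diff:
            continue
        distance = haversine_miles(subject.lat, subject.lon, cand.lat, cand.lon)
        if distance > max_miles:
            continue
        matches.append((distance, cand))
    matches.sort(key=lambda pair: pair[0])
    return [cand for _, cand in matches]
```

Rejecting sales dated after `as_of` keeps the comp set consistent with the time-aware evaluation described under Key Design Decisions.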
Results
- Transformed messy provider data into structured, interpretable market features
- Reduced feature complexity by ~65% through clustering
- Created a dynamic LTV framework: instead of treating every property market the same, the system explains why stronger, more liquid markets can support a higher LTV cap while volatile markets warrant more conservative lending assumptions
- Built reusable components for market data cleaning, API-based ingestion, feature preparation, and comp filtering
Key Design Decisions
Market-level risk modeling. The most important choice was modeling risk at the ZIP-code level, not just the individual property. Valuation tells you what an asset may be worth today, but LTV risk also depends on how resilient that value is — a liquid, high-demand market is easier to exit than one with rising inventory and slow sales.
Respecting time. Housing data is naturally time-series data. A random train-test split would let future market behavior leak into the training data. Time-based train/validation/test splits better matched the real question: given only historical data up to this point, how well can we score the next period?
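A time-based split can be sketched in a few lines. The `period` key, the function name, and the cutoff dates in the usage example are assumptions for illustration.

```python
from datetime import date


def time_split(rows, train_end, valid_end):
    """Split rows by their 'period' date so later data never lands in an earlier set.

    train: period <= train_end
    valid: train_end < period <= valid_end
    test:  period > valid_end
    """
    train, valid, test = [], [], []
    for row in rows:
        period = row["period"]
        if period <= train_end:
            train.append(row)
        elif period <= valid_end:
            valid.append(row)
        else:
            test.append(row)
    return train, valid, test


# Illustrative monthly rows for one ZIP code; values are made up.
rows = [
    {"zip": "90001", "period": date(2021, 12, 1), "days_on_market": 30},
    {"zip": "90001", "period": date(2022, 6, 1), "days_on_market": 35},
    {"zip": "90001", "period": date(2023, 6, 1), "days_on_market": 42},
    {"zip": "90001", "period": date(2024, 6, 1), "days_on_market": 38},
]
train, valid, test = time_split(rows, date(2022, 12, 31), date(2023, 12, 31))
```

Unlike a random split, every training row here predates every validation row, and every validation row predates every test row.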
Conservative missing data handling. Short gaps in monthly indicators can be filled from nearby months since these metrics move gradually. Long gaps at the start or end of a ZIP’s history are different — filling those aggressively creates false confidence, so I treated them carefully and avoided future information in validation periods.
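The distinction between short interior gaps and long edge gaps can be sketched as follows. The function name and the `max_gap=2` limit are assumptions, and linear interpolation is one reasonable choice of fill rather than a confirmed detail of the project.

```python
def fill_short_gaps(values, max_gap=2):
    """Linearly interpolate interior runs of None that are at most max_gap long.

    Leading gaps, trailing gaps, and longer runs are left as None so that
    missing history is not papered over.
    """
    filled = list(values)
    n = len(filled)
    i = 0
    while i < n:
        if filled[i] is not None:
            i += 1
            continue
        start = i
        while i < n and filled[i] is None:
            i += 1
        end = i  # index of the first known value after the gap, or n
        gap_len = end - start
        if start == 0 or end == n or gap_len > max_gap:
            continue  # edge gap or long gap: leave it missing
        left, right = filled[start - 1], filled[end]
        step = (right - left) / (gap_len + 1)
        for k in range(gap_len):
            filled[start + k] = left + step * (k + 1)
    return filled


# Illustrative monthly series with a leading gap, a short gap, a long gap, and a trailing gap.
series = [None, 10.0, None, 14.0, None, None, None, 22.0, None]
result = fill_short_gaps(series)
# result == [None, 10.0, 12.0, 14.0, None, None, None, 22.0, None]
```

Note that interpolation uses the value after the gap, so it looks ahead in time. Inside a validation or test period, a forward fill from the last known value would be the safer choice to avoid using future information.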
Interpretability over complexity. In a lending-adjacent workflow, a score is only useful if stakeholders understand what drives it. Combining a rule-based baseline with ML-assisted discovery let the model reveal nonlinear patterns while keeping the final explanation grounded in business logic.