Real estate data is messy. Property records, mortgage data, tax assessments, and ownership information come from dozens of sources, update at different frequencies, and need to be queryable in ways that vary by client. Building an analytics platform that handles that volume reliably — without becoming a maintenance burden — requires making deliberate architectural choices from the start.
This is how we approached it for one of our clients, and why the stack we chose works better than the alternatives we considered.
The client needed to process millions of real estate records daily and serve custom analytics on top of that data. The requirements were straightforward but demanding: cost-effective storage, fast query performance, a schema flexible enough to evolve as the data model changed, and a platform maintainable by a small team without dedicated infrastructure engineers.
The temptation in this type of project is to reach for distributed systems — Spark, Databricks, a managed data warehouse. We evaluated those options and decided against them. The complexity they introduce doesn't pay off at this data volume, and the operational overhead would have consumed the team.
S3 with Parquet as the file format gives you cost-effective storage for large datasets, fast query performance through columnar reads, flexible schema evolution without migrations, and built-in compression. It's not a novel choice — but it's the right one for this use case, and choosing boring technology deliberately is a valid architectural decision.
DuckDB handles the analytical processing layer. It runs in-process with zero configuration, uses SQL natively, and delivers exceptional query performance on Parquet files without requiring a cluster. For a team that knows SQL, the learning curve is minimal.
dbt sits on top as the transformation layer. It gives the project built-in testing, documentation, and clear data lineage — things that matter enormously when you're processing data that feeds business decisions. Every transformation is versioned, tested, and documented automatically.
Dagster manages the pipeline end-to-end. We chose it over Airflow because it's built around assets rather than tasks — you define what data you're producing, not just what code you're running. That makes monitoring and debugging significantly easier, and it integrates naturally with dbt.
The stack processes millions of property records daily with quick turnaround for custom analytics requests. Cost is significantly lower than traditional data warehouse approaches because there's no cluster management, no per-query pricing, and no idle compute cost.
The developer experience matters here too. A team that understands the tools and can debug problems quickly ships faster and makes fewer mistakes. We deliberately avoided tools that require specialized expertise to operate.
The architecture also handles change well. When the client needed to add a new data source or modify the schema, the impact was isolated and the change was testable before deployment.
The pipeline runs daily, ingesting property records from multiple sources, transforming and validating the data through dbt models, and making the results queryable for the client's analytics layer. Dagster provides visibility into every run — what succeeded, what failed, and why.
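The per-run visibility idea can be sketched in a few lines, with hypothetical stage names: each stage is wrapped so the run record captures what succeeded, what failed, and why, which is the shape of what the orchestrator surfaces for every daily run.

```python
def ingest():
    # Stand-in for pulling property records from source systems.
    return ["record-1", "record-2"]

def transform(records):
    # Stand-in for the dbt transformation and validation layer.
    return [r.upper() for r in records]

def run_pipeline():
    """Run stages in order, recording a status entry per stage."""
    run_log, data = [], None
    for stage in (ingest, transform):
        try:
            data = stage(data) if data is not None else stage()
            run_log.append({"stage": stage.__name__, "status": "ok"})
        except Exception as exc:
            run_log.append({"stage": stage.__name__,
                            "status": "failed", "error": repr(exc)})
            break  # downstream stages never run on bad inputs
    return run_log, data

log, result = run_pipeline()
print(log)
```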
When something breaks — and it will — the error handling and monitoring built into the stack make diagnosis straightforward. There's no black box.
Modern data platforms don't need to be complex or expensive to handle serious data volumes. Five years ago, this scale genuinely called for distributed systems; with DuckDB, dbt, and Dagster, it no longer does. Choosing the right tools for the actual problem — not for the hypothetical future problem — is the difference between a platform that works and one that becomes a liability.
If you're building a data platform for real estate, financial services, or any domain with high-volume structured data, we're happy to talk through the architecture.
Is your team building a data platform? Let's talk →