The goal of this document is to guide the reader through the rationale behind a technical recommendation, to build trust with the client, and, hopefully, a long-term relationship.
There were two clients: a consultancy partner and the ultimate client, a real estate company based in the US. They had spent a lot of time and money trying to replace Salesforce, but failed. Therefore, we needed quick wins to ensure the continuation of the project.
The mission: build a data pipeline that would fetch data from multiple sources and produce reliable information, freeing up the agents’ time. The input data is text.
The CTO was clear that the technologies must be open-source and compatible with the existing infrastructure (AWS, Terraform, Docker, K8s, PostgreSQL, etc.). On the other hand, we had to stay flexible about the data mappings and mart configurations.
With commercial products off the table, I focused on Apache Airflow and Dagster, an open-source project built by data engineers.
On top of that, I view self-hosted applications as an asset for custom requirements, thus honoring the flexibility principle. Both frameworks ship an official Helm chart.
Developer Experience (DX) and speed: Dagster was the winner because it shines on this aspect by being declarative, whereas Airflow is imperative, which reduces agility and makes it harder to combine pieces of logic. A minimal illustration of the declarative style follows.
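As a quick illustration (a minimal sketch with hypothetical asset names, not the client’s actual pipeline), Dagster lets you declare what data assets exist and how they depend on each other, and the framework derives the execution graph:

```python
from dagster import asset, materialize


@asset
def raw_properties() -> list[dict]:
    # In the real pipeline this would fetch text records from the sources;
    # a hard-coded sample keeps the sketch self-contained.
    return [{"id": 1, "city": "Austin"}, {"id": 2, "city": "Denver"}]


@asset
def properties_by_city(raw_properties: list[dict]) -> dict[str, int]:
    # The dependency is declared simply by naming the upstream asset
    # as a parameter; Dagster wires up the execution order.
    counts: dict[str, int] = {}
    for record in raw_properties:
        counts[record["city"]] = counts.get(record["city"], 0) + 1
    return counts


if __name__ == "__main__":
    # Materialize both assets locally in dependency order.
    materialize([raw_properties, properties_by_city])
```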
The RBAC argument: Airflow lets you implement your own solution for that. Dagster, meanwhile, charges $100 per three seats and provides out-of-the-box integration with Okta and GitHub SSO, which the client already uses. As it turns out, the client was not only fine with paying that amount in the future but also did not require RBAC for engineers to access the Dagster console. They access the self-hosted Dagster console by running kubectl port-forward svc/dagster-webserver port:port after logging in with AWS SSO via the CLI.
Regarding performance, we want the database to do the heavy lifting. In many benchmarks, Dask was outperformed by Spark and dbt. Spark applications are typically not containerized or executed on Kubernetes; running Spark code usually means submitting it to a Databricks or AWS EMR cluster. On the other hand, the simplicity of dbt (pure SQL) and the abundance of Python libraries such as boto3, pandas, etc. made it a clear winner.
[Benchmark chart: lower is better; dbt was running on BigQuery. Source: Working with Large Datasets: BigQuery (with dbt) vs. Spark vs. Dask]
It’s worth mentioning that we were using CSV and Parquet as data formats; a sketch of that glue code follows.
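To give a flavor of that glue code (a minimal sketch with a hypothetical bucket and object keys, not the client’s actual pipeline), a typical ingestion step pulls a CSV from S3 with boto3, normalizes it with pandas, and writes it back as Parquet:

```python
import io

import boto3
import pandas as pd

BUCKET = "example-real-estate-data"  # hypothetical bucket name


def csv_to_parquet(source_key: str, target_key: str) -> None:
    """Fetch a CSV object from S3, normalize it, and store it back as Parquet."""
    s3 = boto3.client("s3")

    # Download the raw CSV into memory.
    body = s3.get_object(Bucket=BUCKET, Key=source_key)["Body"].read()
    df = pd.read_csv(io.BytesIO(body))

    # Light normalization; the heavy lifting stays in SQL downstream.
    df.columns = [col.strip().lower() for col in df.columns]

    # Serialize as Parquet (requires pyarrow or fastparquet) and upload.
    buffer = io.BytesIO()
    df.to_parquet(buffer, index=False)
    s3.put_object(Bucket=BUCKET, Key=target_key, Body=buffer.getvalue())


if __name__ == "__main__":
    csv_to_parquet("raw/properties.csv", "curated/properties.parquet")
```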
Speaking of testability, pytest and moto (mock_aws) would be enough to exercise the business logic; after all, we are writing plain functions. The same cannot be said for PySpark (Python’s API for Apache Spark), which comes with a complex setup around EMR and Databricks.
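Here is what such a test can look like (a minimal sketch that reuses the hypothetical csv_to_parquet function above, assumed to live in a pipeline module); moto’s mock_aws keeps the whole S3 round trip in memory:

```python
import io

import boto3
import pandas as pd
from moto import mock_aws

# Hypothetical module containing the csv_to_parquet sketch shown earlier.
from pipeline import BUCKET, csv_to_parquet


@mock_aws
def test_csv_to_parquet_round_trip():
    # moto intercepts boto3 calls, so no real AWS resources are touched.
    s3 = boto3.client("s3", region_name="us-east-1")
    s3.create_bucket(Bucket=BUCKET)
    s3.put_object(Bucket=BUCKET, Key="raw/properties.csv", Body=b"Id,City\n1,Austin\n")

    csv_to_parquet("raw/properties.csv", "curated/properties.parquet")

    # Read the Parquet output back and check that column names were normalized.
    parquet_bytes = s3.get_object(Bucket=BUCKET, Key="curated/properties.parquet")["Body"].read()
    df = pd.read_parquet(io.BytesIO(parquet_bytes))
    assert list(df.columns) == ["id", "city"]
    assert len(df) == 1
```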
Fortunately, all of this analysis paid off, and we quickly adapted to pivotal changes in data mappings and customer requirements.
We are now celebrating the SOW extension that followed from the end customer’s satisfaction.
Modern Data Architecture
How We Built a Scalable Real Estate Analytics Platform
I'm excited to share insights from our recent success in building a modern data platform for real estate analytics. Here's how we approached it and why it worked so well.
We needed to process millions of real estate records daily, including property data, mortgages, tax assessments, and ownership information. The solution had to be cost-effective, maintainable, and flexible enough for custom analytics.
We built our solution on three pillars:
1. Smart Storage Choices
We chose Amazon S3 with the Parquet file format as our foundation. This combination gives us cheap, durable storage and an efficient columnar format that analytical engines can query directly.
2. Powerful Processing Engine
For data processing, we combined DuckDB with dbt (data build tool): DuckDB runs analytical SQL in-process with no cluster to manage, while dbt keeps the transformations as versioned, testable SQL models (see the sketch after this list).
3. Modern Orchestration
We selected Dagster for orchestration because it offers a declarative, asset-based programming model and an excellent developer experience.
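To make the pillars concrete (a minimal sketch with a hypothetical bucket path, not the production code), here is roughly how a Dagster asset can push an aggregation down to DuckDB over Parquet files sitting in S3:

```python
import duckdb
from dagster import asset

# Hypothetical location of the curated Parquet data on S3.
PARQUET_PATH = "s3://example-real-estate-data/curated/properties.parquet"


@asset
def properties_per_city() -> dict[str, int]:
    """Count properties per city, letting DuckDB do the heavy lifting in SQL."""
    con = duckdb.connect()
    # httpfs lets DuckDB read directly from S3; S3 credentials still need to be
    # configured (for example via DuckDB's s3_* settings).
    con.execute("INSTALL httpfs; LOAD httpfs;")
    rows = con.execute(
        f"""
        SELECT city, COUNT(*) AS n
        FROM read_parquet('{PARQUET_PATH}')
        GROUP BY city
        ORDER BY n DESC
        """
    ).fetchall()
    return {city: n for city, n in rows}
```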
This architecture proves that modern data platforms don't need to be complex or expensive to be powerful. By choosing the right tools and focusing on simplicity, we've built a solution that's both robust and a joy to work with.
What's your experience with modern data architectures? I'd love to hear your thoughts!
#DataEngineering #ModernDataStack #RealEstate #Technology #Innovation #DataAnalytics #Engineering #TechArchitecture
Our client, a real estate company, faced the daunting task of processing millions of records daily, including property data, mortgages, tax assessments, and ownership information. The existing infrastructure struggled to keep up, and the client needed a solution that was cost-effective, maintainable, and flexible enough for custom analytics.
Without a streamlined system in place, delays in data processing were impacting business agility and analytics-driven decision-making.
We designed and implemented a modern data platform focused on simplicity, efficiency, and scalability. The solution was built on three foundational pillars: cost-efficient storage (Amazon S3 with Parquet), an in-process SQL engine (DuckDB with dbt), and modern orchestration (Dagster).
The new data platform delivered significant improvements in both operations and business outcomes:
Key Metrics:
To address the challenge of processing large volumes of real estate data, we built a modern data platform leveraging Amazon S3, dbt, DuckDB, and Dagster. This solution provided the client with a scalable, cost-effective, and developer-friendly system capable of handling their data needs with efficiency and reliability. The project’s success ensured faster analytics, improved operations, and positioned the client for long-term growth.