The goal of this document is to guide the reader through the rationale behind a technical recommendation, to build trust with the client, and, hopefully, a long-term relationship.
There were two clients: a consultancy partner and the ultimate client, a real estate company based in the US. They had spent a lot of time and money trying to replace Salesforce, but failed. Therefore, we needed quick wins to ensure the continuation of the project.
The mission: build a data pipeline that would fetch data from multiple sources and produce reliable information, freeing up the agents’ time. The input data is text.
The CTO was clear that the technologies must be open-source and compatible with the existing infrastructure (AWS, Terraform, Docker, K8s, PostgreSQL, etc.). On the other hand, we had to stay flexible about the data mappings and mart configurations.
With commercial products off the table, I focused on Apache Airflow and Dagster, an open-source project built by data engineers.
On top of that, I view self-hosted applications as an asset for custom requirements, thus honoring the flexibility principle. Both frameworks ship an official Helm chart.
Developer Experience (DX) and speed: Dagster was the winner because it shines on this aspect by being declarative, whereas Airflow is imperative, which reduces agility and makes it harder to combine pieces of logic. A minimal illustration of the declarative style follows.
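As a quick illustration (a minimal sketch with hypothetical asset names, not the client’s actual pipeline), Dagster lets you declare what data assets exist and how they depend on each other, and the framework derives the execution graph:

```python
from dagster import asset, materialize


@asset
def raw_properties() -> list[dict]:
    # In the real pipeline this would fetch text records from the sources;
    # a hard-coded sample keeps the sketch self-contained.
    return [{"id": 1, "city": "Austin"}, {"id": 2, "city": "Denver"}]


@asset
def properties_by_city(raw_properties: list[dict]) -> dict[str, int]:
    # The dependency is declared simply by naming the upstream asset
    # as a parameter; Dagster wires up the execution order.
    counts: dict[str, int] = {}
    for record in raw_properties:
        counts[record["city"]] = counts.get(record["city"], 0) + 1
    return counts


if __name__ == "__main__":
    # Materialize both assets locally in dependency order.
    materialize([raw_properties, properties_by_city])
```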
The RBAC argument: Airflow lets you implement your own solution for that. Dagster, meanwhile, charges $100 per three seats and provides out-of-the-box integration with Okta and GitHub SSO, which the client already uses. As it turns out, the client was not only fine with paying that amount in the future but also did not require RBAC for engineers to access the Dagster console. They access the self-hosted Dagster console by running kubectl port-forward svc/dagster-webserver port:port after logging in with AWS SSO via the CLI.
Regarding performance, we want the database to do the heavy lifting. In many benchmarks, Dask was outperformed by Spark and dbt. Spark applications are typically not containerized or executed on Kubernetes; running Spark code usually means submitting it to a Databricks or AWS EMR cluster. On the other hand, the simplicity of dbt (pure SQL) and the abundance of Python libraries such as boto3, pandas, etc. made it a clear winner.
[Benchmark chart: lower is better; dbt was running on BigQuery. Source: Working with Large Datasets: BigQuery (with dbt) vs. Spark vs. Dask]
It’s worth mentioning that we were using CSV and Parquet as data formats; a sketch of that glue code follows.
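To give a flavor of that glue code (a minimal sketch with a hypothetical bucket and object keys, not the client’s actual pipeline), a typical ingestion step pulls a CSV from S3 with boto3, normalizes it with pandas, and writes it back as Parquet:

```python
import io

import boto3
import pandas as pd

BUCKET = "example-real-estate-data"  # hypothetical bucket name


def csv_to_parquet(source_key: str, target_key: str) -> None:
    """Fetch a CSV object from S3, normalize it, and store it back as Parquet."""
    s3 = boto3.client("s3")

    # Download the raw CSV into memory.
    body = s3.get_object(Bucket=BUCKET, Key=source_key)["Body"].read()
    df = pd.read_csv(io.BytesIO(body))

    # Light normalization; the heavy lifting stays in SQL downstream.
    df.columns = [col.strip().lower() for col in df.columns]

    # Serialize as Parquet (requires pyarrow or fastparquet) and upload.
    buffer = io.BytesIO()
    df.to_parquet(buffer, index=False)
    s3.put_object(Bucket=BUCKET, Key=target_key, Body=buffer.getvalue())


if __name__ == "__main__":
    csv_to_parquet("raw/properties.csv", "curated/properties.parquet")
```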
Speaking of testability, pytest and moto (mock_aws) would be enough to exercise the business logic; after all, we are writing plain functions. The same cannot be said for PySpark (Python’s API for Apache Spark), which comes with a complex setup around EMR and Databricks.
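Here is what such a test can look like (a minimal sketch that reuses the hypothetical csv_to_parquet function above, assumed to live in a pipeline module); moto’s mock_aws keeps the whole S3 round trip in memory:

```python
import io

import boto3
import pandas as pd
from moto import mock_aws

# Hypothetical module containing the csv_to_parquet sketch shown earlier.
from pipeline import BUCKET, csv_to_parquet


@mock_aws
def test_csv_to_parquet_round_trip():
    # moto intercepts boto3 calls, so no real AWS resources are touched.
    s3 = boto3.client("s3", region_name="us-east-1")
    s3.create_bucket(Bucket=BUCKET)
    s3.put_object(Bucket=BUCKET, Key="raw/properties.csv", Body=b"Id,City\n1,Austin\n")

    csv_to_parquet("raw/properties.csv", "curated/properties.parquet")

    # Read the Parquet output back and check that column names were normalized.
    parquet_bytes = s3.get_object(Bucket=BUCKET, Key="curated/properties.parquet")["Body"].read()
    df = pd.read_parquet(io.BytesIO(parquet_bytes))
    assert list(df.columns) == ["id", "city"]
    assert len(df) == 1
```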
Fortunately, all of this analysis paid off, and we quickly adapted to pivotal changes in data mappings and customer requirements.
We are now celebrating the SOW extension that followed from the end customer’s satisfaction.
Modern Data Architecture
How We Built a Scalable Real Estate Analytics Platform
I'm excited to share insights from our recent success in building a modern data platform for real estate analytics. Here's how we approached it and why it worked so well.
We needed to process millions of real estate records daily, including property data, mortgages, tax assessments, and ownership information. The solution had to be cost-effective, maintainable, and flexible enough for custom analytics.
We built our solution on three pillars:
1. Smart Storage Choices
We chose Amazon S3 with the Parquet file format as our foundation. This combination gives us cheap, durable storage and an efficient columnar format that analytical engines can query directly.
2. Powerful Processing Engine
For data processing, we combined DuckDB with dbt (data build tool): DuckDB runs analytical SQL in-process with no cluster to manage, while dbt keeps the transformations as versioned, testable SQL models (see the sketch after this list).
3. Modern Orchestration
We selected Dagster for orchestration because it offers a declarative, asset-based programming model and an excellent developer experience.
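To make the pillars concrete (a minimal sketch with a hypothetical bucket path, not the production code), here is roughly how a Dagster asset can push an aggregation down to DuckDB over Parquet files sitting in S3:

```python
import duckdb
from dagster import asset

# Hypothetical location of the curated Parquet data on S3.
PARQUET_PATH = "s3://example-real-estate-data/curated/properties.parquet"


@asset
def properties_per_city() -> dict[str, int]:
    """Count properties per city, letting DuckDB do the heavy lifting in SQL."""
    con = duckdb.connect()
    # httpfs lets DuckDB read directly from S3; S3 credentials still need to be
    # configured (for example via DuckDB's s3_* settings).
    con.execute("INSTALL httpfs; LOAD httpfs;")
    rows = con.execute(
        f"""
        SELECT city, COUNT(*) AS n
        FROM read_parquet('{PARQUET_PATH}')
        GROUP BY city
        ORDER BY n DESC
        """
    ).fetchall()
    return {city: n for city, n in rows}
```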
This architecture proves that modern data platforms don't need to be complex or expensive to be powerful. By choosing the right tools and focusing on simplicity, we've built a solution that's both robust and a joy to work with.
What's your experience with modern data architectures? I'd love to hear your thoughts!
#DataEngineering #ModernDataStack #RealEstate #Technology #Innovation #DataAnalytics #Engineering #TechArchitecture
Our client, a real estate company, faced the daunting task of processing millions of records daily, including property data, mortgages, tax assessments, and ownership information. The existing infrastructure struggled to keep up, and the client needed a solution that was cost-effective, maintainable, and flexible enough for custom analytics.
Without a streamlined system in place, delays in data processing were impacting business agility and analytics-driven decision-making.
We designed and implemented a modern data platform focused on simplicity, efficiency, and scalability. The solution was built on three foundational pillars: cost-efficient storage (Amazon S3 with Parquet), an in-process SQL engine (DuckDB with dbt), and modern orchestration (Dagster).
The new data platform delivered significant improvements in both operations and business outcomes:
Key Metrics:
To address the challenge of processing large volumes of real estate data, we built a modern data platform leveraging Amazon S3, dbt, DuckDB, and Dagster. This solution provided the client with a scalable, cost-effective, and developer-friendly system capable of handling their data needs with efficiency and reliability. The project’s success ensured faster analytics, improved operations, and positioned the client for long-term growth.