
Patterns to recommend a framework

The goal of this document is to guide the reader through the rationale behind a technical recommendation, building trust with the client and, hopefully, a long-term relationship.

Context

There were two clients: a consultancy partner and the end client, a US-based real estate company. They had spent a lot of time and money trying to replace Salesforce but failed, so we needed quick wins to ensure the continuation of the project.

Challenge

Build a data pipeline that would fetch data from multiple sources and produce reliable information, freeing up the agents’ time. The input data is text.

Constraints

The CTO was clear that the technologies had to be open-source and compatible with the existing infrastructure (AWS, Terraform, Docker, Kubernetes, PostgreSQL, etc.). At the same time, we had to stay flexible about the data mappings and mart configurations.

Decision process

With commercial products off the table, I focused on Apache Airflow and Dagster, an open-source project built by data engineers.

On top of that, I view self-hosted applications as an asset for custom requirements, thus honoring the flexibility principle. Both frameworks ship an official Helm chart.

Developer Experience (DX) and speed: Dagster was the winner because it shines here by being declarative. Airflow, by contrast, is imperative, which reduces agility and makes it harder to combine logic.

The RBAC argument: Airflow lets you implement your own custom solution for access control. Dagster, meanwhile, charges $100 per three seats and provides out-of-the-box integration with Okta and GitHub SSO, which the client already uses. As it turns out, the client was not only fine with paying that amount in the future but also did not require it for engineers to access the Dagster console with RBAC. They access the self-hosted Dagster console by running kubectl port-forward svc/dagster-webserver port:port after AWS SSO via the CLI.
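The access path described above can be sketched as a short CLI procedure (profile, cluster, and port values are illustrative placeholders; the service name comes from the official Helm chart):

```shell
# Authenticate against AWS via SSO, then point kubectl at the cluster
# (profile and cluster names below are hypothetical)
aws sso login --profile client-profile
aws eks update-kubeconfig --name client-cluster --profile client-profile

# Forward the Dagster webserver service to localhost; ports are placeholders
kubectl port-forward svc/dagster-webserver 3000:80
```

After this, the Dagster console is reachable at http://localhost:3000 for as long as the port-forward runs.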

Regarding performance, we wanted the database to do the heavy lifting. In many benchmarks, Dask was outperformed by Spark and dbt. Spark applications are typically not containerized or executed on Kubernetes; running Spark code often means submitting jobs to a Databricks or AWS EMR cluster. On the other hand, the simplicity of dbt (pure SQL) and the abundance of Python libraries such as boto3 and pandas made it a clear winner.
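The "let the database do the heavy lifting" principle amounts to pushing aggregation into SQL instead of looping in Python. A minimal sketch, using the standard library's sqlite3 as a stand-in for PostgreSQL (table and column names are invented for illustration):

```python
import sqlite3

# In-memory database stands in for the warehouse (PostgreSQL in our case).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE listings (city TEXT, price REAL)")
conn.executemany(
    "INSERT INTO listings VALUES (?, ?)",
    [("Austin", 450000.0), ("Austin", 350000.0), ("Dallas", 300000.0)],
)

# The aggregation happens inside the database engine, not in Python loops.
rows = conn.execute(
    "SELECT city, AVG(price) FROM listings GROUP BY city ORDER BY city"
).fetchall()
print(rows)  # [('Austin', 400000.0), ('Dallas', 300000.0)]
```

The same query shape maps directly onto a dbt model, where it would live as a SQL file rather than an inline string.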

[Benchmark chart: lower is better; dbt was running on BigQuery. Source: Working with Large Datasets: BigQuery (with dbt) vs. Spark vs. Dask]

It’s worth mentioning that we were using CSV and Parquet as data formats.

Speaking of testability, pytest and moto (mock_aws) would be enough to exercise the business logic; after all, we are writing plain functions. The same cannot be said for PySpark (Python’s API for Apache Spark), which has a complex setup with EMR and Databricks.
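Because the business logic lives in plain functions, a test needs nothing more than pytest. A minimal sketch (the record format and function name are hypothetical):

```python
# Plain function holding the business logic: parse a raw text record
# into a structured dict. Nothing here needs AWS or Spark.
def parse_listing(line: str) -> dict:
    """Parse a 'city|price' text record; the format is invented for illustration."""
    city, price = line.split("|")
    return {"city": city.strip(), "price": float(price)}


# A test is just another function; pytest discovers it by the test_ prefix.
def test_parse_listing():
    assert parse_listing("Austin | 450000") == {"city": "Austin", "price": 450000.0}
```

When a function does touch S3 through boto3, moto's mock_aws decorator patches the AWS calls so the same test style still applies, with no real AWS account involved.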

Outcome

Fortunately, all of this analysis paid off, and we quickly adapted to pivotal changes such as new data mappings and customer requirements.

We are celebrating the SOW extension earned by the end customer’s satisfaction.

Modern Data Architecture

How We Built a Scalable Real Estate Analytics Platform

I'm excited to share insights from our recent success in building a modern data platform for real estate analytics. Here's how we approached it and why it worked so well.

The Challenge

We needed to process millions of real estate records daily, including property data, mortgages, tax assessments, and ownership information. The solution had to be cost-effective, maintainable, and flexible enough for custom analytics.

Our Data Stack

We built our solution on three pillars:

1. Smart Storage Choices

We chose Amazon S3 with Parquet file format as our foundation. This combination gives us:

  • Cost-effective storage for large datasets
  • Fast query performance
  • Flexible schema evolution
  • Built-in compression

2. Powerful Processing Engine

For data processing, we combined DuckDB with dbt (data build tool):

  • Zero-configuration analytics
  • SQL-first approach for accessibility
  • Built-in testing and documentation
  • Clear data lineage
  • Exceptional query performance
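In this stack a dbt model is just a SQL file; dbt materializes it as a table or view and derives lineage from the ref() calls. A hypothetical model (file, model, and column names are invented for illustration):

```sql
-- models/marts/avg_price_by_city.sql
-- dbt resolves ref() to the upstream staging model and records the lineage.
select
    city,
    avg(price) as avg_price
from {{ ref('stg_listings') }}
group by city
```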

3. Modern Orchestration

We selected Dagster for orchestration because it offers:

  • Clear visibility into data flows
  • Developer-friendly experience
  • Easy testing and debugging
  • Robust error handling
  • Cost-effective deployment

Why It's Revolutionary

  1. Simplicity Wins: We avoided complex distributed systems in favor of simple, powerful tools.
  2. Cost-Effective: No expensive cluster management or infrastructure.
  3. Developer Joy: Our team can focus on data logic instead of infrastructure.
  4. Business Agility: Changes and new features can be implemented quickly.
  5. Reliable Operations: Built-in monitoring and error handling keep things running smoothly.

Real-World Impact

  • Processing millions of property records daily
  • Quick turnaround for custom analytics requests
  • Significant cost savings compared to traditional solutions
  • Happy developers with improved productivity
  • Flexible system that grows with business needs

Key Learnings

  1. Simple > Complex: Choose simplicity when possible
  2. Developer Experience Matters: Happy developers = better products
  3. Cost-Effectiveness: Modern tools can do more with less
  4. Future-Proof: Build for change, not permanence

This architecture proves that modern data platforms don't need to be complex or expensive to be powerful. By choosing the right tools and focusing on simplicity, we've built a solution that's both robust and a joy to work with.

What's your experience with modern data architectures? I'd love to hear your thoughts!

#DataEngineering #ModernDataStack #RealEstate #Technology #Innovation #DataAnalytics #Engineering #TechArchitecture

Building a Scalable Real Estate Analytics Platform

Client Challenge

Our client, a real estate company, faced the daunting task of processing millions of records daily, including property data, mortgages, tax assessments, and ownership information. The existing infrastructure struggled to keep up, and the client needed a solution that was:

  • Cost-effective: To reduce operational expenses.
  • Maintainable: For long-term reliability and ease of use.
  • Flexible: To support custom analytics and adapt to future needs.

Without a streamlined system in place, delays in data processing were impacting business agility and analytics-driven decision-making.

Proposed Solution

We designed and implemented a modern data platform focused on simplicity, efficiency, and scalability. The solution was built on three foundational pillars:

  1. Smart Storage Choices
    • Selected Amazon S3 with the Parquet file format to manage large datasets.
    • Key benefits:
      • Cost-effective storage.
      • Fast query performance.
      • Flexible schema evolution for changing data structures.
      • Built-in compression for efficient data storage.
  2. Powerful Processing Engine
    • Combined DuckDB with dbt (data build tool) for seamless data processing and analytics.
    • Delivered:
      • Zero-configuration analytics for ease of deployment.
      • A SQL-first approach to make data accessible to the client’s teams.
      • Built-in testing, documentation, and data lineage for reliability and traceability.
      • Exceptional query performance without the complexity of distributed systems.
  3. Modern Orchestration
    • Used Dagster to orchestrate workflows and manage data pipelines effectively.
    • Provided:
      • Clear visibility into data flows for better monitoring.
      • A developer-friendly experience with easy debugging and testing.
      • Robust error handling to ensure consistent operations.
      • Cost-effective deployment through integration with the client’s existing infrastructure.

Results Achieved

The new data platform delivered significant improvements in both operations and business outcomes:

  • Operational Efficiency: Processed millions of real estate records daily, enabling faster insights and quicker decision-making.
  • Cost Savings: Eliminated the need for expensive distributed systems and reduced infrastructure overhead.
  • Developer Productivity: Simplified workflows allowed developers to focus on data logic rather than infrastructure management.
  • Agility: The client could implement changes and new features quickly, keeping up with dynamic business requirements.
  • Scalability: Built a flexible system that can grow with the client’s evolving needs.

Key Metrics:

  • Daily Record Processing: Millions of records processed efficiently.
  • Cost Efficiency: Achieved significant savings compared to traditional approaches.
  • Developer Satisfaction: Improved productivity and reduced complexity in day-to-day tasks.

Executive Summary

To address the challenge of processing large volumes of real estate data, we built a modern data platform leveraging Amazon S3, dbt, DuckDB, and Dagster. This solution provided the client with a scalable, cost-effective, and developer-friendly system capable of handling their data needs with efficiency and reliability. The project’s success ensured faster analytics, improved operations, and positioned the client for long-term growth.
