Coalesce 2024 in Las Vegas was tons of fun, meeting old and new friends. We co-hosted a happy hour and gave out some very cool fans. It was action-packed, and a few of us are co-conspiring some new initiatives. Stay tuned!
The One dbt Vision
There are already quite a few summaries of what dbt announced and of how the event felt different this year, transformed into an industry conference.
In the keynote, we saw dbt’s vision to bring software engineering practices to data and to broaden its audience. We also saw feature announcements like a visual editor, multi-platform and Iceberg support, collaboration, and an AI copilot. dbt Labs is expanding from just the transform layer (the T in ELT) to a unified data control plane.
To be that platform, it has to be seen doing a bit of everything across the board - orchestration, governance, and observability. The platform might still encourage best-in-class tools for integration, but all basic needs are provided, so customers don’t have to assemble 5 different tools to get the business value.
Is this alarming to vendors in the adjacency? Yes. Eventually, customers will only have an incentive to add another tool to the stack if they need something unconventional or if they are not using dbt alone.
The Challenge of Applying Software Development Practices to Data
I want to discuss the advanced CI feature in dbt Cloud, as it directly relates to Recce. CI means continuously running integration tests during development before deployment. dbt’s Advanced CI is an Enterprise feature showing potential data changes after code changes. Users can be better informed and confident about pushing code changes to production. This classic software development approach boosts productivity by previewing change impacts.
We’re thrilled to see more tools and main players like dbt addressing the data development workflow and code change confidence issues.
In my opinion, the CI experience is the pivotal difference between software engineering and data/analytics engineering, because how we test data systems is fundamentally different from how we test software. dbt pioneered easier scaffolding for testing data, providing building blocks like branched development and unit testing.
Like Advanced CI, data impact analysis is something we built a year ago. We got polite responses like “this looks cool.” Users said the approach buries important changes in too much noise, making review harder. There was a related talk at Coalesce this year by Aiven about how more tests erode data quality.
That’s why directly translating software best practices onto the data workflow has felt like fitting a square peg in a round hole. Yes, we should unit test and do CI for data, but in reality, adoption is far from ideal.
I spoke with
at one of the after-parties. He pointed out that data quality shouldn’t be a standalone tool but part of the data platform. I agree to some extent. That’s why I avoid talking about data quality in general. Most people will say data quality is their priority, but secretly keep it as a scapegoat, because it’s multifaceted and easy to call Somebody Else’s Problem.
During the conversation, I said: “dbt has always been about bringing software practices to data, but as someone with experience in both worlds, I see software development becoming a more experimental workflow, especially with ML and LLM-based systems. The data workflow can inspire software engineering in the future.”
Reimagining Data Workflows with Recce
Recce is not meant to be a data quality tool but a reimagination of the software development workflow for data-centric systems. The first step is supporting declarative data systems like dbt, and users like City of Rio find the workflow natural and productive in their change review process.
How is Recce different or sufficiently advanced? We learn from mature data workflows like the Cal-ITP project, which defines what good looks like upfront and how to validate correctness by comparing data or query results. It is a major shift from just dumping data differences. This approach empowers domain experts, who are already familiar with the context of the change, to curate the proof of correctness with useful tools like distribution changes and query result comparisons. The result? Stakeholders, reviewers, and authors share an understanding of what correctness means, so they can effectively validate changes from both technical and business perspectives, reducing 1 day of scrutiny to 1 hour.
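To make "comparing query results" concrete, here is a minimal sketch of the idea in Python. It is not Recce's implementation or API. The table names (`prod_orders`, `dev_orders`), the query, and the helper functions are all hypothetical, and an in-memory SQLite database stands in for the production and development environments.

```python
import sqlite3


def run_query(conn, sql):
    """Run a query and return each row as a dict keyed by column name."""
    cur = conn.execute(sql)
    columns = [d[0] for d in cur.description]
    return [dict(zip(columns, row)) for row in cur.fetchall()]


def compare_results(base_rows, current_rows, key):
    """Summarize which keys were added, removed, or changed between two results."""
    base = {r[key]: r for r in base_rows}
    current = {r[key]: r for r in current_rows}
    return {
        "added": sorted(set(current) - set(base)),
        "removed": sorted(set(base) - set(current)),
        "changed": sorted(k for k in set(base) & set(current) if base[k] != current[k]),
    }


# Hypothetical data: one table per environment (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE prod_orders (customer TEXT, amount REAL);
CREATE TABLE dev_orders  (customer TEXT, amount REAL);
INSERT INTO prod_orders VALUES ('a', 10), ('a', 5), ('b', 7);
INSERT INTO dev_orders  VALUES ('a', 10), ('a', 5), ('b', 9), ('c', 1);
""")

# The author of the change picks the query that captures what "correct" means.
query = "SELECT customer, SUM(amount) AS revenue FROM {table} GROUP BY customer"
base = run_query(conn, query.format(table="prod_orders"))
current = run_query(conn, query.format(table="dev_orders"))

diff = compare_results(base, current, key="customer")
print(diff)
```

The point of the sketch is that the reviewer sees a targeted summary for a query the author chose, rather than a dump of every row-level difference.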
Best of all, the main Recce experience is open source. If you have an open dbt pull request you’re hesitant to merge, try Recce and let us know your thoughts!