DAGWorks is based on Hamilton, an open-source project that we created and recently forked (https://github.com/dagworks-inc/hamilton). Hamilton is a set of high-level conventions for Python functions that can be automatically converted into working ETL pipelines. To that, we're adding a closed-source offering that goes a step further, plugging these functions into a wide array of production ML stacks.
ML pipelines consist of computational steps (code + data) that produce a working statistical model that a business can use. A typical pipeline might be (1) pull raw data (Extract), (2) transform that data into inputs for the model (Transform), (3) define a statistical model (Transform), (4) use that statistical model to predict on another data set (Transform) and (5) push that data for downstream use (Load). Instead of “pipeline” you might hear people call this “workflow”, “ETL” (Extract-Transform-Load), and so on.
Maintaining these in production is insanely inefficient because you need both data scientists and software engineers to do it. Data scientists know the models and data, but most can't write the code needed to get things working in production infrastructure - for example, many mid-size companies use Snowflake to store data, Pandas/Spark to transform it, and something like Databricks' MLflow to handle model serving. Engineers can handle the latter, but mostly aren't experts in the ML side. It's a classic impedance mismatch, with all the horror stories you'd expect - e.g. when data scientists make a change, engineers (or data scientists who aren't engineers) have to manually propagate the change in production. We've talked to teams who are spending as much as 50% of their time doing this. That's not just expensive, it's gruntwork - those engineers should be working on something else! Basically, maintaining ML pipelines over time sucks for most teams.
One way out is to hire people who combine both skills, i.e. data scientists who can also write production code. But these people are rare and expensive, and in our experience they're usually expert on one side of the equation and weaker on the other.
The other way is to build your own platform to automatically integrate models + data into your production stack. That way the data scientists can maintain their own work without needing to hand things off to engineers. However, most companies can't afford to make this investment, and even for the ones that can, such in-house layers tend to end up in spaghetti code and tech debt hell, because they're not the company's core product.
Elijah and I have been building data and ML tooling for the last 7 years, most recently at Stitch Fix, where we built an ML platform that served over 100 data scientists from various modeling disciplines (some of our blog posts, like [1], hit the front page of HN - thanks!). We saw firsthand the issues teams encounter with ML pipelines.
Most companies running ML in production need a ratio of 1:1 or 2:1 data scientists to engineers. At bigger companies like Stitch Fix, the ratio is more like 10:1—way more efficient—because they can afford to build the kind of platform described above. With DAGWorks, we want to bring the power of an intuitive ML Pipeline platform to all data science teams, so a ratio of 1:1 is no longer required. A junior data scientist should be able to easily and safely write production code without deep knowledge of underlying infrastructure.
We decided to build our startup around Hamilton, in large part due to the reception it got here [2] - thanks HN! We came up with Hamilton while we were at Stitch Fix (note: if you start an open-source project at an employer, we recommend forking it right away when you start a company; we only just did that and left behind ~900 stars...). We're betting on it as the abstraction layer that enables our vision of how to build and maintain ML pipelines, given what we learned at Stitch Fix. We believe a solution has to have an open-source component to succeed (we invite you to check out the code). As for the name DAGWorks: we named the company after directed acyclic graphs because we think the DAG representation, which Hamilton also provides, is key.
A quick primer on Hamilton. With Hamilton we use a new paradigm in Python (well, not quite "new" - pytest fixtures use this approach) for defining model pipelines: users write declarative functions instead of procedural code. For example, rather than writing the following pandas code:
```python
df['col_c'] = df['col_a'] + df['col_b']
```
You would write:
```python
import pandas as pd

def col_c(col_a: pd.Series, col_b: pd.Series) -> pd.Series:
    """Creates column c by summing column a and column b."""
    return col_a + col_b
```
Then if you wanted to create a new column that used `col_c` you would write:
```python
def col_d(col_c: pd.Series) -> pd.Series:
    """Example downstream column built from col_c."""
    return col_c * 2  # illustrative logic; the original post elides the body
```
These functions define a "dataflow", i.e. a directed acyclic graph (DAG): we can build a graph with nodes col_a, col_b, col_c, and col_d, and connect them with edges to know the order in which to call the functions to compute any result. Since you're forced to write functions, everything becomes unit testable and documentation friendly, with the ability to display lineage. You can think of Hamilton as "dbt for Python functions", if you know dbt. Have we piqued your interest? Want to go play with Hamilton? We created https://www.tryhamilton.dev/ using Pyodide (note: it can take a while to load) so you can play around with the basics without leaving your browser - it even works on mobile!
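To make this concrete, here's a minimal sketch of how you'd execute that DAG with Hamilton's driver - the module name `my_functions` and the sample inputs are ours, for illustration:

```python
import pandas as pd
from hamilton import driver

import my_functions  # hypothetical module containing col_c and col_d from above

# First argument is a config dict, then the module(s) whose functions define the DAG.
dr = driver.Driver({}, my_functions)

df = dr.execute(
    ["col_c", "col_d"],  # the outputs we want
    inputs={
        "col_a": pd.Series([1, 2, 3]),
        "col_b": pd.Series([10, 20, 30]),
    },
)
print(df)
```

The driver inspects the functions in the module, wires up the graph from function and parameter names, and only computes the nodes needed for the outputs you request.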
What we think is cool about Hamilton is that you don’t need to specify an “explicit pipeline declaration step”, because it’s all encoded in the function and parameter names! Moreover, everything is encapsulated in functions. So from a framework perspective, if we wanted to (for example) log timing information, or introspect inputs/outputs, delegate the function to Dask or Ray, we can inject that at a framework level, without having to pollute user code. Additionally, we can expose "decorators" (e.g. @tag(...)) that can specify extra metadata to annotate the DAG with, or for use at run time. This is where our DAGWorks Platform fits in, providing off-the-shelf closed source extras in this way.
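As a quick sketch of what tagging looks like (the tag keys and values below are our own illustrative choices, not a required schema):

```python
import pandas as pd
from hamilton.function_modifiers import tag

@tag(owner="data-science", importance="production")  # illustrative metadata
def col_c(col_a: pd.Series, col_b: pd.Series) -> pd.Series:
    """Creates column c by summing column a and column b."""
    return col_a + col_b
```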
Now, for those of you thinking there's a lot of competition in this space, or that what we're proposing sounds very similar to existing solutions, here are some thoughts to help distinguish Hamilton from other approaches/technology:

(1) Hamilton's core design principle is helping people write more maintainable code; at a nuts-and-bolts level, what Hamilton replaces is the procedural code one would otherwise write.

(2) Hamilton runs anywhere Python runs: a notebook, a Python script, within Airflow, within your Python web service, PySpark, etc. People use Hamilton for executing code in both batch tasks and online web services.

(3) Hamilton doesn't replace a macro orchestration system like Airflow, Prefect, Dagster, Metaflow, or ZenML; it runs within/uses them. Hamilton helps you model not only the micro (e.g. feature engineering) but also the macro (e.g. model pipelines). That said, given how big machines are these days, model pipelines can commonly run on a single machine - Hamilton is perfect for this.

(4) Hamilton doesn't replace things like Dask, Ray, or Spark - it can run on them, or delegate to them.

(5) Hamilton isn't just for building dataframes, though it's quite good at that; you can model any Python object creation with it. Hamilton is data-type agnostic.
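To make point (1) concrete: because each Hamilton node is a plain Python function, unit testing needs nothing framework-specific. A minimal sketch, assuming the functions live in the hypothetical `my_functions` module from above:

```python
import pandas as pd
from my_functions import col_c  # hypothetical module holding the functions above

def test_col_c():
    col_a = pd.Series([1, 2, 3])
    col_b = pd.Series([10, 20, 30])
    expected = pd.Series([11, 22, 33])
    # col_c is just a function: call it directly with test inputs.
    pd.testing.assert_series_equal(col_c(col_a, col_b), expected)
```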
Our closed-source offering is currently in private beta, but we'd love to include you in it (see the next paragraph). Hamilton is free to use (BSD-3 license) and we're investing in it heavily. We're still working through pricing options for the closed-source platform; we think we'll follow the lead of others in the space, like Weights & Biases and Hex.tech, in how they price. For those interested, here's a video walkthrough of Hamilton, which includes a teaser of what we're building on the closed-source side - https://www.loom.com/share/5d30a96b3261490d91713a18ab27d3b7.
Lastly:

(1) We'd love feedback on Hamilton (https://github.com/dagworks-inc/hamilton), on any of the above, and on what we could do better. To stress the importance of your feedback: we're going all-in on Hamilton. If Hamilton fails, DAGWorks fails. Given that Hamilton is a bit of a "Swiss Army knife" in terms of what you could do with it, we need help prioritizing features. E.g. we just released experimental PySpark UDF map support - is that useful? Perhaps you have streaming feature-engineering needs we could support better? Want a feature to auto-generate unit test stubs? Or maybe you're doing a lot of time-series forecasting and want more power features in Hamilton to help you manage inputs to your model? We'd love to hear from you!

(2) For those interested in the closed-source DAGWorks Platform, you can sign up for early access via www.dagworks.io (leave your email, or schedule a call with me) - we apologize for not having a self-serve way to onboard just yet.

(3) If there's something this post hasn't answered, do ask and we'll try to give you an answer! We look forward to any and all of your comments!
Here’s how we did it:
Last summer (after having conversations with a few dozen college students), we stumbled across a very interesting insight:
*80% of college students aren’t on Facebook anymore*.
Sure, tools like Snapchat and Instagram represent someone's social graph, but there isn't a central place for students to get guidance from their peers. Even simple transactions like exchanging textbooks, finding roommates, or buying sports tickets are messy for students.
This gave us the idea that just maybe students were hungry for a new social network to fill the void Facebook left behind.
Our goal was to create a digital community where students could discuss campus life, discover housing options, and buy/sell from other students.
### Step 1: Build the Platform
To make this happen, we started by setting up a [Circle](https://circle.so/) community specifically for our first college, the University of Michigan (Go Blue!). We then customized the community's branding and design to fit UofM's colors and overall aesthetic.
Next, we created different spaces within the community for different topics, such as "Ask UMich," "Class Reviews," "Buy/Sell," and "Events."
Lastly, we threw together a marketing page using Webflow and used Zapier to sync all user content into Airtable to run analytics (see the sketch after the stack list below).
*Tech Stack:*

- Marketing website: Webflow (https://www.peervine.io/)
- Community: Circle
- Database: Airtable
- Syncing: Zapier
- Operations: Retool
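Since Zapier lands all of the user content in Airtable, pulling it back out for ad-hoc analysis is a small script against Airtable's REST API. A minimal sketch in Python - the base ID, table name, and token below are placeholders, not our real values:

```python
import os
import requests

# Placeholders - swap in your own base ID, table name, and API token.
AIRTABLE_TOKEN = os.environ["AIRTABLE_TOKEN"]
BASE_ID = "appXXXXXXXXXXXXXX"
TABLE_NAME = "Posts"

url = f"https://api.airtable.com/v0/{BASE_ID}/{TABLE_NAME}"
headers = {"Authorization": f"Bearer {AIRTABLE_TOKEN}"}

records, offset = [], None
while True:
    params = {"offset": offset} if offset else {}
    resp = requests.get(url, headers=headers, params=params)
    resp.raise_for_status()
    data = resp.json()
    records.extend(data["records"])
    offset = data.get("offset")  # Airtable pages through results 100 at a time
    if not offset:
        break

print(f"Pulled {len(records)} records")
```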
### Step 2: Launch
Getting students into the platform was easy enough. We just showed up on campus and bribed them with free cookies and pizza. This got us to our first 500 students on the platform.
We made sure to promote the community heavily on campus through flyers, social media posts, and announcements at student events.
### Step 2.5 (optional): Build a mobile app
We didn’t have an app when we launched and quickly realized that was going to be a problem.
Our hack was simply to create a React Native app that rendered a web view wrapped around our Circle community. It wasn't ideal, but it got the job done.
### Step 3: Growth
After launch, most of our effort focused on growing the community. We used a variety of tactics, from referral contests to a matchmaking service that 3x'd our growth.
Within 12 weeks, we had more than 10% of Michigan's undergrad population on the platform (3,000+ students).
## WHY did we do this?
We get asked this a lot: since we have our own engineering team, why not build this in-house? It came down to this:
It was cheaper and faster.
We've learned that most of the components of a social network (feed, profiles, messaging, events, etc.) are a commodity. What matters much more is who is on the platform and whether or not they're deriving value from the network.
So we decided to focus energy on building the community instead of reinventing the wheel.
Plus, we really didn’t know what our audience would resonate with from a product perspective. Circle allowed us to spin up experiments in just hours rather than weeks. When you’re a small startup, these savings make a *monumental* difference.
### What’s Next?
We’re fundraising! We’ve proven that the model can work on one campus. Now, we want to launch at new campuses.
## Reflections
Through this process, we learned the power of the No Code ecosystem.
Too many founders (including myself) feel the need to build products for the sake of building instead of finding scrappy ways to prove (or disprove) their hypothesis.
But users don’t care if you built your own platform. They only care whether or not you solve their problem.
In the end, that’s what really matters.