What An MLOps Engineer Does
And What The Week Can Look Like
Sec 1: Introduction - Who am I and why this blog post?
Hey there! My name is Mikiko and (at the time of writing) I lead MLOps at Featureform.
Before joining Featureform, I worked as a:
Sr MLOps Engineer at Mailchimp (Intuit)
Data Scientist at Teladoc
Data Analyst at Sunrun & WalkMe
A hybrid Data Analyst/Data Scientist at Autodesk
And a hodgepodge of roles at various early-stage startups and SMBs.
Most importantly, I've sat on the engineering hiring committee at Mailchimp, including designing and leading technical interviews for MLOps & Data Engineers, and mentored at two data science bootcamps.
As a senior member of the MLOps team, I regularly conducted interviews where half the candidates didn't seem to understand the role they were interviewing for and were unprepared, both in their answers and in their pre-interview study.
And who can blame them?!
Heck, I remember from my own transition from data scientist to MLE how varied and unclear the definition of the MLE role was, especially when planning a learning roadmap.
Hopefully, by the end of this blog post, you'll have a clearer understanding of what the role entails and even how an MLOps Engineer's day-to-day is spent.
Sec 2: What is an MLOps Engineer?
In the post "Defining MLOps As Simply As Possible" I defined MLOps (or Machine Learning Operations) as follows:
- MLOps is the practice of productionizing machine learning artifacts in a scalable and reliable manner, where "artifacts" can include projects, applications, services, and pipelines.
And then noted:
An MLOps System or Platform is a collection of tooling and processes that enables the systematic development and productionization of machine learning artifacts.
An MLOps Team is a collection of individuals focused on the design, development, and maintenance of the MLOps System (or Platform).
What about an MLOps Engineer then?
Sec 3: What is their scope of responsibilities?
Based on the definitions I've put forth, it follows that I define an MLOps Engineer as a:
- Developer & maintainer of the tooling & infrastructure that supports data science development and deployment.
In larger and more mature companies, these are NOT the actual model developers themselves or the productionizers of models (however, there are always exceptions).
In many companies, especially startups or SMBs (or even new teams within a more established company), it's often expected that individuals wear multiple hats. The model developer is often also building the tools and infrastructure that facilitate their work.
In companies and industries that are incredibly specialized (like self-driving cars and robotics in the early days), this was (and still is) unavoidable because the tools didn't exist, and where they did exist in the open-source ecosystem, they still needed further customization.
This is why it's important not to define an MLOps engineer strictly by title but as an individual or team that supports the set of practices and tools that collectively solve the unique challenges machine learning introduces to software.
With that being said, in many cases, asking the people flying the airplane to build it at the same time is a quick way to burn out an individual or a team. Restructuring or adapting an existing team or org can also be tricky for various reasons (political, strategic, etc).
In this blog post we're going to talk about the MLOps role as if it's the more specialized persona.
Most of what I note will still apply to hybrid roles, but assume those roles carry some additional tasks or responsibilities not described below.
Sec 4: Detailed Breakdown of responsibilities
The Overall Buckets
Based on the definitions described above, the two most obvious categories of work an MLOps Engineer is responsible for fall into:
Dev - Creating, optimizing, maintaining, and deprecating the components of the infrastructure that supports model development, productionization & deployment.
Ops - Ensuring adoption and enablement of the tools and processes that are meant to facilitate the data science teams' work, including unblocking projects as needed or offering support in areas where tooling and automation haven't been built yet.
Depending on how senior an engineer is, they may also have additional responsibilities such as mentoring, hiring and interviewing, sitting on cross-functional initiatives, and interfacing with other engineering teams. These are responsibilities that cut across all software engineering disciplines and arenāt specific to MLOps.
Although the goal is to minimize operational work as much as possible through automation, in some roles the ratio of "Dev" to "Ops" work has been roughly 30% vs 70% (especially on teams with low automation), whereas in other teams it's been closer to 60% vs 40%.
Breaking Down the Dev + Ops Buckets of MLOps
Let's break down the responsibilities & categories of work even further.
For all the buckets, the goal is to enable new capabilities or efficiencies for data scientists & ML engineers.
How this goal is tackled can be broken out into tasks that are either focused on the tooling & infrastructure layer or people & workflow layers.
Dev Bucket
For the "Dev" bucket, the responsibilities are targeted at the platform and tool level. They include:
Developing infrastructure & tools - using a combination of custom internal tools, public cloud, open-source, or external proprietary offerings;
Refactoring & optimizing existing infrastructure;
Including fixing bugs in our tooling (which usually come up during on-call or data science project consulting)
Minimizing or paying back tech debt
Maintenance of tools (and occasionally pipelines);
- Usually captured as backlog tickets that we need to get to but that weren't critical at the time
Deprecating tools & components (always fun & satisfying).
Some examples of how these activities are performed or how they come up on the platform or tool level:
➡️ Meeting with data science and data engineering to figure out whether there are gaps in our MLOps stack that need to be addressed by new tooling, whether open-source or cloud-vendor-based. This kicks off the process of requirements gathering, writing a tech spec, building a POC, and then testing and evaluating the tool before releasing it for use by the data scientists.
➡️ Prioritizing and implementing bug fixes for our current processes and tooling. Maybe we didn't pin a specific version for a dependency and now it's causing havoc for the data scientists. Maybe we didn't test whether our dev environments have issues with GPU support. Maybe an upstream dependency in a really popular data science library changed without being communicated, and now lots of people across the internet are having issues using the latest version and we need to give the data science team guidance in the short term.
➡️ Finally getting to those tickets we parked because they weren't a high priority at the time, either because they were nice-to-have features or because they didn't block main development. Maybe the data scientists wanted a different documentation generation tool. Maybe there were new testing libraries that looked interesting and that we could implement in our packaging process.
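The dependency-pinning pain above is one place a small amount of tooling goes a long way. As a minimal sketch (the function name and the example pins are mine, not from any real platform), a check like this can run at environment startup and fail fast when installed packages have drifted from what the platform team pinned:

```python
# Sketch of a fail-fast dependency-pin check; the PINNED mapping is illustrative.
from importlib import metadata

def find_pin_mismatches(pins):
    """Return (package, expected, installed) tuples for any drifted pins.

    `installed` is None when the package is missing entirely.
    """
    mismatches = []
    for package, expected in pins.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            mismatches.append((package, expected, None))
            continue
        if installed != expected:
            mismatches.append((package, expected, installed))
    return mismatches

# Example: a platform team might pin the core DS stack like this
# (version numbers here are made up for illustration).
PINNED = {
    "pandas": "1.5.3",
    "scikit-learn": "1.2.2",
}
```

In practice a non-empty result would abort the job with a message pointing at the team's environment docs, rather than letting a mismatched library fail mysteriously mid-pipeline.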
Ops Bucket
For the Ops bucket, which is targeted at the workflows and processes level (i.e. the interaction of People with Technology), this includes:
Defining best practices and ensuring pipelines & models adhere to those best practices through code as well as non-technical processes;
Driving adoption of best practices & tools through workshops, office hours, documentation, & code;
Enablement through manual bridging, i.e. if there are gaps or rough edges in the current toolchain or areas that are hard to automate, helping push projects over the line;
Internal consulting, i.e. assisting data scientists in navigating engineering decisions, code reviews, etc.
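To make the first bullet concrete, enforcing best practices "through code" often just means a small, CI-runnable check. A minimal sketch, assuming a hypothetical convention (the required file list is mine) where every model repo must ship a README, pinned requirements, and a tests directory:

```python
# Hypothetical repo-convention check an MLOps team might run in CI.
# The REQUIRED list is an assumed convention, not a universal standard.
from pathlib import Path

REQUIRED = ["README.md", "requirements.txt", "tests"]

def missing_required_paths(repo_root, required=REQUIRED):
    """Return the required files/dirs that are absent from repo_root."""
    root = Path(repo_root)
    return [name for name in required if not (root / name).exists()]
```

In CI, a non-empty return value would fail the build with a pointer to the team's best-practices doc, turning a code-review nag into an automated gate.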
Specific examples of how an MLOps engineer can provide support at the workflow & processes level:
➡️ Embedded project consulting - This is where the MLOps team acts as an internal solutions consultant, helping to guide the data scientists to the right pattern or architecture for their model package or pipeline.
This can also include (using GCP as an example):
Refactoring code;
Helping them get set up with GCP via Terraform;
Helping them develop tests;
Helping them navigate any difficulties with Docker, Airflow, BigQuery, and writing any shell scripts;
Helping them with adding the necessary credentials to their projects.
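For the credentials bullet in particular, a lot of the help is fail-fast plumbing. A minimal sketch (the helper name and error message are mine; `GOOGLE_APPLICATION_CREDENTIALS` is the standard environment variable GCP client libraries read):

```python
import os

def require_gcp_credentials():
    """Fail fast if the GCP service-account key isn't wired up.

    Returns the key path when the env var is set and the file exists.
    """
    path = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path or not os.path.exists(path):
        raise RuntimeError(
            "GOOGLE_APPLICATION_CREDENTIALS is unset or points to a "
            "missing file - ask the MLOps team to set up a key."
        )
    return path
```

Calling this at the top of a pipeline turns a cryptic auth failure deep inside a client library into an immediate, actionable error.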
We can also help them answer questions like:
- "Is it a super complex project? Does it need a different tool than we support, like Spark or Dataflow?"
We can figure out additional tools we need to be looking at by understanding the unique requirements of their project, such as serving and latency requirements, and by pair coding.
This approach is sometimes called the "Embedded" or "Squad" model, because you might have specific MLOps Engineers (usually those with prior experience as data scientists) assigned to specific data scientists or data science teams, sharing in the wins & lows of their data scientists' projects.
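"Helping them develop tests" usually starts small. Here's a sketch of the kind of pytest-style boundary test an MLOps engineer might pair on with a data scientist; the feature function itself is hypothetical:

```python
# Hypothetical feature-engineering function plus the boundary tests an
# MLOps engineer might help a data scientist add during a pairing session.

def bucket_age(age):
    """Map a raw age into a coarse categorical feature."""
    if age < 0:
        raise ValueError("age must be non-negative")
    if age < 18:
        return "minor"
    if age < 65:
        return "adult"
    return "senior"

def test_bucket_age_boundaries():
    # Off-by-one errors at bucket edges are a classic silent model bug.
    assert bucket_age(0) == "minor"
    assert bucket_age(17) == "minor"
    assert bucket_age(18) == "adult"
    assert bucket_age(65) == "senior"

def test_bucket_age_rejects_bad_input():
    try:
        bucket_age(-1)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for negative age")
```

Under pytest these would be discovered and run automatically; they're written as plain functions here so the sketch stays self-contained.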
➡️ On-call Support - On teams that tend to be more "service-based" (and are still responsible for "dev" work), getting pinged about bugs and issues can be super disruptive and results in constant context switching, jumping from one fire to another.
On-call support is a pattern that is mostly seen in teams or organizations that are responsible for production products and environments.
However, on-call support can also be found in teams that deal with pre-production model development and deployment tooling and environments (as well as serving environments).
Sec 5: Distribution of Responsibilities Through the Week
How do these responsibilities shake out in the day-to-day or week-to-week?
I'm going to lean very heavily on my own work experiences in this section but would like to note that an individual MLOps Engineer's day-to-day is going to depend very heavily on the following factors:
Their total work hours (40 hrs+?)
Their seniority (i.e. are they managing other engineers, leading initiatives, mentoring, etc.?)
The team culture (Meetings galore or plenty of heads down time?)
Additional responsibilities outside of their engineering role (Sitting on hiring panels, hackathons, contributing to the tech blog, etc.)
Maturity of toolchain and workflows.
All these factors will contribute to how much of the time theyāre expected to focus on a single component or ticket, how much time is spent interfacing with other key stakeholders, etc.
I'm going to describe two radically different schedules from my own experience: working as an MLOps Engineer at an established company, and working as the "Data & ML infra gal wearing multiple hats" at a very early-stage startup.
Persona 1: Sr MLOps Engineer at Established Company
Relevant Factors
Company: Email marketing, 20 yrs old
Total workweek: ~40hrs
Remote role (Team based in Atlanta, I'm based in SF)
Team sizes:
Data Scientist: ~20+
MLOpsy Engineers: ~15+
- My team: ~6
Data Engineers: ~15+
Not agile-based - planning was ad hoc or weekly/bi-weekly, without real scrum or backlog grooming
Types of Meetings
Aside from the responsibilities outlined in the earlier section of this blog post:
Recurring Meetings:
Team Meetings: Talk about problems that have come up, asks that might have an impact on our current roadmap, update on OOO time, & update the Brag Deck
1-1s: With manager and teammates - These are super important with a bunch of us being remote. When new members join the team, I try to make sure their 1-1s for at least the first couple of months are an hour in length weekly.
Org meetings: Town hall updates, important company announcements
Variable or ad-hoc:
Project-specific check-ins: When a project is being productionized we'd usually have a 30 min check-in twice a week (this also includes internship projects)
Company events like hackathons, etc
Time Breakdown
| Category | As In | Hrs per Week (~40 hr) |
| --- | --- | --- |
| Dev Bucket | See: Dev Bucket | 5-6+ |
| Ops Bucket | See: Ops Bucket | 8-9+ |
| Sr Eng | Mentoring & managing intern | 8 |
| Admin | Team meetings (or meetings on projects & with other teams) | 8 |
Not pictured in the schedule
Any ad-hoc meetings or follow-ups;
Self-development time;
Special planning: Quarterly planning, monthly read-outs;
Bug-fixing & patching time;
On-call rotations;
Commuting time for in-person meetings (as rare as they were).
Persona 2: Data + ML Engineer (Part-Time to Full-Time) at Startup
Relevant Factors
Company: Pre-round A Real-Estate Tech
Total workweek: ~20 ➡️ ~40 hrs
Remote role (Team based in LA, I'm based in SF)
Team size:
- <10 people
Build the plane while flying it - Operating & accumulating strategic technical debt
Biggest Differences Between Early-Stage Startup & Established Company
The main differences between the schedule pictured above and working as the "Data & ML Person" at a startup that's building out its ML platform while building the main product:
Swap out most meetings for heads-down dev time - with a small team, anything you need to say or ask, you just say or ask it directly;
Less legacy code or technical debt because you're building from scratch - so less cross-functional time;
No org meetings or updates;
Check-ins are largely done online;
Code is shipped as soon as possible;
You spend as much time talking about the product and potential monetization streams as you do writing code.
Time Breakdown
| Category | As In | Hrs per Week (~40 hr) |
| --- | --- | --- |
| Dev Bucket | See: Dev Bucket | Most of the time |
| Ops Bucket | See: Ops Bucket | 3-4 hrs |
| Admin + Strategy/Product | Strategy & product meetings | 4-5 hrs |
Sec 6: Closing
Although this was a long read, I'm hoping that by the end of this post you have a clear idea of what the week-to-week of an MLOps Engineer could look like, either at an early-stage startup or at a more established company.
Before closing, I want to emphasize a few points, however:
Titles - Titles are a finicky thing. They're meant to serve as useful heuristics so that when someone asks what you do at a dinner party, you can shorthand the myriad responsibilities, tasks, and ways you provide value in your role. But like all heuristics, titles are imperfect. They're meant to be useful rather than accurate. Keep that in mind, especially as you meet MLOps engineers with the title, and MLOps engineers without.
Implementation - The maturity and size of your org and company will determine the vastness and impact of your responsibilities. Data scientists aren't a monolith, and neither are MLOps engineers.
Evolution - Like all roles, MLOps responsibilities will continue to change as the landscape changes.
And finally, I've added some useful links below if you're interested in learning more about what an MLOps engineer does.
Let me know what you think at any of the following places!
LinkedIn: https://www.linkedin.com/in/mikikobazeley/
Medium: https://bit.ly/3wKUwym
Substack: https://mikikobazeley.substack.com/
Blog: https://mikiko.hashnode.dev/
YouTube: https://bit.ly/3MBR8N3
Twitter: https://twitter.com/BazeleyMikiko
GitHub: https://github.com/MMBazel
Twitch: https://bit.ly/3Akmwfe
Mastodon: https://data-folks.masto.host/@mikiko
Sec 7: Essential Links & Readings
My Prior Writings about #MLOpsCareers
Relevant Talks & Papers About #MLOpsInAction
Emily Curtin's Talk at Data Council - A former colleague talks about some of the awesome work she's done at Mailchimp!
Machine Learning Operations (MLOps): Overview, Definition, and Architecture - A paper that defines the different components of an MLOps system and provides some perspectives on what it currently looks like in industry.
Operationalizing Machine Learning: An Interview Study - An interview study that also includes interesting quoted comments from practitioners across a number of different companies and industries.
Is MLOps Engineer a Thing? We Asked 6 Engineers About It (neptune.ai/blog/mlops-engineer) - A blog post summarizing 6 diverse perspectives from MLOps practitioners and leaders.
Learn More About #MLOps With Me
Learning MLOps (Gradually) for Free Through Blog Posts & Podcasts | by Mikiko Bazeley - I list some of the best podcasts, blog posts, and tweets for new and aspiring MLOps practitioners.
What is MLOps Series - I try to describe the important parts of MLOps to new and aspiring practitioners.
Parts 4-6 to be published soon!
The Eng Side of #MLOpsCareers
An MLOps Engineer is still an engineer. I link some resources that I think are really useful for folks thinking about what an engineering career looks like, with or without ML.
Sec 8: Footnotes
Caveats about "What is an MLOps Engineer"
There is a ton of back-&-forth on social media platforms like LinkedIn and YouTube about whether the role of an "MLOps Engineer" is real, how useful having such a specialized role in a company is, and whether DevOps engineers or data scientists should be filling the function. Most of these discussions (and flame wars) tend to neglect the nuance of size, maturity, and age of a company, i.e. larger and older companies will tend to have more specialized roles, smaller companies more general ones.
There's also an obsessive fixation on titles. Is someone an MLOps Engineer? An ML Engineer? A full-stack data scientist? My goal isn't to wade into the alligator-infested swamp of those types of discussions.
My takes are based on my specific work experiences, discussions Iāve had with recruiters and hiring managers, job postings, and surveys that have been published.
Engineer sentiment toward "sh*t-ops work" or "manual sh*t"
There are plenty of engineers I've talked to who believe that the embedded MLOps engineer pattern is an anti-pattern (as is the on-call support pattern). The belief is: "That which can be automated should absolutely be automated. And if it's being done manually, it's because the automation is bad." In theory, I agree with their position. In practice, different teams are at different levels of automation maturity. Aspire to automation, but expect some level of manual support, at least until the automation is built out.
Further explanation of MLOps on-call support
On-call is a practice where engineers are assigned to be available for a specific period of time to solve any blockers that come up (usually in production and on mission-critical systems).
Typically they are the first line of defense for ticket support. They also provide initial triage to understand whether an issue is due to PEBCAK (problem exists between chair & keyboard) or because of a real bug that needs to be ticketed and patched in the underlying tooling and platform.
MLOps Engineers might be assigned to on-call for a week or even a day and the schedule rotates through the team unless the team has some special shadowing going on for new hires as part of their onboarding process.
In many teams, a specific Slack channel is created where data scientists can post questions and get help from all the MLOps eyeballs watching the channel (as well as from their fellow data scientists). If they really need help, they can also tag the engineer that's assigned to on-call, with the expectation that an SLA will be met (i.e. they'll get a response within X time).
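That "response within X time" expectation is easy to make checkable. A minimal sketch, assuming an illustrative 4-hour SLA window (the number is mine, not any standard):

```python
from datetime import datetime, timedelta

# Illustrative SLA window; real teams pick their own number.
ON_CALL_SLA = timedelta(hours=4)

def met_sla(posted_at, first_reply_at, sla=ON_CALL_SLA):
    """True if the first on-call reply landed inside the SLA window."""
    return (first_reply_at - posted_at) <= sla
```

A check like this, run over message timestamps pulled from the support channel, is one way a team can track how well its on-call rotation is actually holding up.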