🤖 What An MLOps Engineer Does 💻

📆 And What The Week Can Look Like


Sec 1: Introduction – Who am I and why this blog post?

Hey there! My name is Mikiko and (at the time of writing) I lead MLOps at Featureform.

Before joining Featureform, I worked as a:

  • Sr MLOps Engineer at Mailchimp (Intuit)

  • Data Scientist at Teladoc

  • Data Analyst at Sunrun & WalkMe

  • Hybrid Data Analyst/Data Scientist at Autodesk

And a hodgepodge of roles at various early-stage startups and SMBs.

Most importantly, I’ve sat on the engineering hiring committee at Mailchimp (including designing and leading technical interviews for MLOps & Data Engineers) and mentored at two data science bootcamps.

As a senior member of the MLOps team, I regularly conducted interviews where half the candidates didn’t seem to understand the role they were interviewing for and came in unprepared, both in their answers and in their pre-interview study.

And who can blame them?!

Heck, I remember how varied and unclear the definition of the MLE role was when I was trying to make the transition from data scientist to MLE, especially when planning a learning roadmap.

Hopefully, by the end of this blog post, you’ll have a clearer understanding of what the role entails and even how an MLOps Engineer’s day-to-day is spent.


Sec 2: What is an MLOps Engineer?

In the post “Defining MLOps as Simply As Possible,” I defined MLOps (or Machine Learning Operations) as follows:

  • MLOps is the practice of productionizing machine learning artifacts in a scalable and reliable manner, where “artifacts” can include projects, applications, services, and pipelines.

And then noted:

  • An MLOps System or Platform is a collection of tooling and processes that enables the systematic development and productionization of machine learning artifacts.

  • An MLOps Team is a collection of individuals focused on the design, development, and maintenance of the MLOps System (or Platform).

What about an MLOps Engineer then?


Sec 3: What is their scope of responsibilities?

Based on the definitions I’ve put forth, it follows that I define an MLOps Engineer as a:

  • Developer & maintainer of the tooling & infrastructure that supports data science development and deployment.

In larger and more mature companies, these are NOT the actual model developers themselves or the productionizers of models (however, there are always exceptions).

In many companies, especially startups or SMBs (or even new teams within a more established company), it’s often expected that individuals wear multiple hats. The model developer is often also building the tools and infrastructure that facilitate their work.

In companies and industries that are incredibly specialized (like self-driving cars and robotics in the early days), this was (and still is) unavoidable: the tools didn’t exist, and when they did exist in the open-source ecosystem, they still needed further customization.

This is why it’s important not to define an MLOps engineer strictly by title, but as an individual or team that supports the set of practices and tools that collectively solve the unique challenges machine learning introduces to software.

With that being said, in many cases, asking the people flying the airplane to build it at the same time is a quick way to burn out an individual or a team. Restructuring or adapting an existing team or org can also be tricky for various reasons (political, strategic, etc.).

In this blog post we’re going to speak about the MLOps role as if it’s the more specialized persona.

Most of what I note will still apply to the hybrid roles, but assume that some additional tasks or responsibilities are not described below.


Sec 4: Detailed Breakdown of Responsibilities

The Overall Buckets

Based on the definitions described above, the work an MLOps Engineer is responsible for falls into two obvious categories:

  • Dev - Creating, optimizing, maintaining, and deprecating the components of the infrastructure that supports model development, productionization & deployment.

  • Ops - Ensuring adoption and enablement of the tools and processes that are meant to facilitate the data science teams' work, including unblocking projects as needed or offering support in areas where tooling and automation haven’t been created yet.

Depending on how senior an engineer is, they may also have additional responsibilities such as mentoring, hiring and interviewing, sitting on cross-functional initiatives, and interfacing with other engineering teams. These are responsibilities that cut across all software engineering disciplines and arenā€™t specific to MLOps.

Although the goal is to minimize operational work as much as possible through automation, in some roles the ratio of “Dev” to “Ops” work has been roughly 30% vs 70% (especially on teams with low automation), whereas on other teams it’s been closer to 60% vs 40%.

Breaking Down the Dev + Ops Buckets of MLOps

Let’s break down these responsibilities & categories of work even further.

For all the buckets, the goal is to enable new capabilities or efficiencies for data scientists & ML engineers.

How this goal is tackled can be broken out into tasks that are either focused on the tooling & infrastructure layer or people & workflow layers.

Dev Bucket

For the “Dev” bucket, the responsibilities are targeted at the platform and tool level. They include:

  • Developing infrastructure & tools – Using a combination of custom internal tools, public cloud services, open-source projects, or external proprietary products;

  • Refactoring & optimizing existing infrastructure;

    1. Including fixing bugs in our tooling (which usually come up during on-call or data science project consulting)

    2. Minimizing or paying back tech debt

  • Maintenance of tools (and occasionally pipelines);

    1. Usually captured as tickets in the backlog that we need to get to but that weren’t critical at the time
  • Deprecating tools & components (always fun & satisfying).

Some examples of how these activities are performed or how they come up on the platform or tool level:

  • āž”ļø Meeting with data science and data engineering to figure out if there are gaps in our MLOps stack that need to be addressed by new tooling, whether open-source or cloud vendor based. This kicks off the process of requirements gathering, writing a tech spec, building a POC, and then testing and evaluating the tool before releasing for use by the data scientists.

  • āž”ļø Prioritizing and implementing bug fixes for our current processes and tooling. Maybe we didnā€™t pin a specific version for a dependency and now itā€™s causing havoc for the data scientists. Maybe we didnā€™t test whether our dev environments are having issues with GPU support. Maybe an upstream dependency in a really popular data science library was changed, wasnā€™t communicated, and now lots of people across the internet are having issues using the latest version and we need to give the data science team guidance in the short-term.

  • āž”ļø Finally getting to those tickets we parked because they werenā€™t a high priority at the time, either because they were nice-to-have features or they didnā€™t block main development. Maybe the data scientists wanted a different documentation generation tool. Maybe there were new testing libraries that looked interesting and that we could implement in our packaging process.

Ops Bucket

For the Ops bucket, which is targeted at the workflows and processes level (i.e. the interaction of People with Technology), this includes:

  • Defining best practices and ensuring pipelines & models adhere to those best practices through code (see the sketch after this list) as well as non-technical processes;

  • Driving adoption of best practices & tools through workshops, office hours, documentation, & code;

  • Enablement through manual bridging, i.e. if there are gaps or rough edges in the current toolchain, or areas that are hard to automate, helping push projects over the line;

  • Internal consulting i.e. assisting data scientists in navigating engineering decisions, code reviews, etc.
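
As one concrete (and hypothetical) flavor of enforcing best practices through code, here's a minimal Python sketch that checks a model project against some made-up scaffolding conventions before it ships. Real teams would encode their own conventions, often as a CI step or a project template.

```python
# check_scaffolding.py – hypothetical sketch of best practices enforced in code.
# Verifies a model project contains the files our (made-up) conventions require.
import sys
from pathlib import Path

REQUIRED = [
    "README.md",         # what the model does and who owns it
    "requirements.txt",  # pinned dependencies
    "tests",             # a tests/ directory
    "Dockerfile",        # reproducible runtime
]

def missing(project_dir: str) -> list[str]:
    """Return the required files or directories missing from the project."""
    root = Path(project_dir)
    return [name for name in REQUIRED if not (root / name).exists()]

if __name__ == "__main__":
    gaps = missing(sys.argv[1] if len(sys.argv) > 1 else ".")
    if gaps:
        print(f"Project is missing: {', '.join(gaps)}")
        sys.exit(1)  # block productionization until conventions are met
    print("Scaffolding checks passed.")
```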

Specific examples of how an MLOps engineer can provide support at the workflow & processes level:

  • āž”ļø Embedded project consulting ā€“ This is where the MLOps team acts as an internal solutions consultant, helping to guide the data scientists to the right pattern or architecture for their model package or pipeline.

    • This can also include (using GCP as an example):

      • Refactoring code;

      • Helping them get set up with GCP via Terraform;

      • Helping them develop tests (see the sketch after this list);

      • Helping them navigate any difficulties with Docker, Airflow, BigQuery, and writing any shell scripts;

      • Helping them with adding the necessary credentials to their projects.

    • We can also help them answer questions like:

      • “Is it a super complex project? Does it need a different tool than we support, like Spark or Dataflow?”
    • We can figure out additional tools we need to be looking at by understanding the unique requirements of their project, such as serving and latency requirements, and by pair coding.
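
To give a flavor of what “helping them develop tests” looks like in practice, here's a minimal pytest sketch. The clean_features() function is a hypothetical stand-in for a data scientist's preprocessing step (included inline so the example runs on its own); the point is the pattern: small, fast unit tests around the transformation logic.

```python
# test_preprocessing.py – hypothetical pytest sketch for a model package.
# clean_features() stands in for the data scientist's real preprocessing step.

def clean_features(rows: list[dict]) -> list[dict]:
    """Drop rows missing the target and impute missing ages with 0."""
    cleaned = []
    for row in rows:
        if row.get("target") is None:
            continue  # can't train on rows without a label
        cleaned.append({**row, "age": row.get("age") or 0})
    return cleaned

def test_rows_without_target_are_dropped():
    rows = [{"age": 30, "target": 1}, {"age": 25, "target": None}]
    assert len(clean_features(rows)) == 1

def test_missing_age_is_imputed_with_zero():
    rows = [{"age": None, "target": 0}]
    assert clean_features(rows)[0]["age"] == 0

def test_empty_input_returns_empty_list():
    assert clean_features([]) == []
```

Run with `pytest test_preprocessing.py`; the same structure scales up to fixtures and golden files as the pipeline grows.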

This approach is sometimes called the “Embedded” or “Squad” model: you might have specific MLOps Engineers (usually those with prior experience working as data scientists) assigned to specific data scientists or data science teams, and they share in the wins & lows of their data scientists’ projects.

  • āž”ļø On-call Support ā€“ On teams that tend to be more ā€œservice-basedā€ (and are still responsible for ā€œdevā€ work) getting pinged about bugs and issues can be super disruptive and results in recursive context switching, jumping from one fire to another.

    • On-call support is a pattern that is mostly seen in teams or organizations that are responsible for production products and environments.

    • However, on-call support can also be found in teams that deal with pre-production model development and deployment tooling and environments (as well as serving environments).


Sec 5: Distribution of Responsibilities Through the Week

How do these responsibilities shake out in the day-to-day or week-to-week?

I’m going to lean very heavily into my own work experiences in this section, but would like to note that an individual MLOps Engineer’s day-to-day is going to depend very heavily on the following factors:

  • Their total work hours (40+ hrs?)

  • Their seniority (i.e. are they managing other engineers, leading initiatives, mentoring, etc.?)

  • The team culture (Meetings galore or plenty of heads down time?)

  • Additional responsibilities outside of their engineering role (sitting on hiring panels, hackathons, contributing to the tech blog, etc.)

  • Maturity of toolchain and workflows.

All these factors will contribute to how much of their time they’re expected to focus on a single component or ticket, how much time is spent interfacing with other key stakeholders, etc.

I’m going to describe two radically different schedules from my experiences: working as an MLOps Engineer at an established company, and working as the “Data & ML infra gal wearing multiple hats” at a very early-stage startup.

Persona 1: Sr MLOps Engineer at Established Company

Relevant Factors

  • Company: Email marketing, 20 yrs old

  • Total workweek: ~40hrs

  • Remote role (Team based in Atlanta, I’m based in SF)

  • Team sizes:

    • Data Scientists: ~20+

    • MLOpsy Engineers: ~15+

      • My team: ~6
    • Data Engineers: ~15+

  • Not agile-based – planning was ad-hoc or weekly/bi-weekly, without real scrum or backlog grooming

Types of Meetings

Aside from the responsibilities outlined in the earlier section of this blog post:

  • Recurring Meetings:

    • Team Meetings: Talk about problems that have come up, asks that might have an impact on our current roadmap, update on OOO time, & update the Brag Deck

    • 1-1’s: With manager and teammates – These are super important with a bunch of us being remote. When new members join the team, I try to make sure 1-1s are weekly and at least 1 hr long for the first couple of months

    • Org meetings: Town hall updates, important company announcements

  • Variable or ad-hoc:

    • Project-specific check-ins: When a project is being productionized we’d usually have a 30 min check-in twice a week (this also includes internship projects)

    • Company events like hackathons, etc

Time Breakdown

| Category | As In | Hrs per Week (~40 hr) |
| --- | --- | --- |
| Dev Bucket | See: Dev Bucket (this category has an inverse relationship with the Admin & Sr Eng buckets – fewer meetings, more quality work time) | 5-6+ |
| Ops Bucket | See: Ops Bucket | 8-9+ |
| Sr Eng | Mentoring & managing an intern | 8 |
| Admin | Team meetings (or meetings on projects & with other teams) | 8 |

Not pictured in the schedule:

  • Any ad-hoc meetings or follow-ups;

  • Self-development time;

  • Special planning: Quarterly planning, monthly read-outs;

  • Bug-fixing & patching time;

  • On-call rotations;

  • Commuting time for in-person meetings (as rare as they were).

Persona 2: Data + ML Engineer (Part-Time to Full-Time) at Startup

Relevant Factors

  • Company: Pre-Series A real-estate tech

  • Total workweek: ~20 ➡️ ~40 hrs

  • Remote role (Team based in LA, I’m based in SF)

  • Team size:

    • <10 people
  • Build the plane while flying it - Operating & accumulating strategic technical debt

Biggest Differences Between Early-Stage Startup & Established Company

The main differences between the schedule pictured above and working as the “Data & ML Person” at a startup that’s building out its ML platform while building the main product:

  • Swap out most meetings for heads-down dev time – small team, so anything you need to say or ask, you just do it directly;

  • Less legacy code or technical debt because you’re building from scratch – so less cross-functional time;

  • No org meetings or updates;

  • Check-ins are largely done online;

  • Code is shipped as soon as possible;

  • You spend as much time talking about the product and potential monetization streams as you do writing code.

Time Breakdown

| Category | As In | Hrs per Week (~40 hr) |
| --- | --- | --- |
| Dev Bucket | See: Dev Bucket (this category has an inverse relationship with the Admin + Strategy/Product bucket – fewer meetings, more quality work time) | Most of the time |
| Ops Bucket | See: Ops Bucket | 3-4 hrs |
| Admin + Strategy/Product | Strategy & product meetings | 4-5 hrs |


Sec 6: Closing

Although this was a long read, I’m hoping that by the end of this post you have a really good idea of what the week-to-week of an MLOps Engineer could look like, either at an early-stage startup or at a more established company.

Before closing, however, I want to emphasize a few points:

  • Titles – Titles are a finicky thing. They’re meant to serve as useful heuristics so that when someone asks what you do at a dinner party, you can shorthand the myriad of responsibilities, tasks, and ways you provide value in your role. But like all heuristics, titles are imperfect. They’re meant to be useful rather than accurate. Keep that in mind, especially as you meet MLOps engineers with the title and MLOps engineers without it.

  • Implementation – The maturity and size of your org and company will determine the vastness and impact of your responsibilities. Data scientists aren’t a monolith, and neither are MLOps engineers.

  • Evolution – Like all roles, the MLOps role and its responsibilities will continue to change as the landscape changes.

And finally, I’ve added some useful links below if you’re interested in learning more about what an MLOps engineer does.

Let me know what you think at any of the following places!


Sec 7: Essential Links & Readings

My Prior Writings about #MLOpsCareers

Relevant Talks & Papers About #MLOpsInAction

Learn More About #MLOps With Me

The Eng Side of #MLOpsCareers

An MLOps Engineer is still an engineer. I link some resources that I think are really useful for folks who are thinking about what an engineering career looks like, with or without ML.


Sec 8: Footnotes

Caveats about “What is an MLOps Engineer”

  • There is a ton of back-&-forth on social media platforms like LinkedIn and YouTube about whether the role of an “MLOps Engineer” is real and how useful having a specialized role in a company is, as well as whether DevOps engineers or data scientists should be filling the function. Most of these discussions (and flame wars) tend to neglect the nuance of size, maturity, and age of a company, i.e. larger and older companies will tend to have more specialized roles, smaller companies more general roles.

  • There’s also an obsessive fixation on titles. Is someone an MLOps Engineer? An ML Engineer? A full-stack data scientist? My goal isn’t to wade into the alligator-infested swamp of those types of discussions.

  • My takes are based on my specific work experiences, discussions Iā€™ve had with recruiters and hiring managers, job postings, and surveys that have been published.

Engineer sentiment toward “sh*t-ops work” or “manual sh*t”

There are plenty of engineers I’ve talked to who believe that the embedded MLOps engineer pattern is an anti-pattern (as is the on-call support pattern). The belief is: “That which can be automated should absolutely be automated. And if it’s being done manually, it’s because the automation is bad.” In theory, I agree with that position. In practice, different teams are at different levels of automation maturity. Aspire to automation, but expect some level of manual support, at least until the automation is built.

Further explanation of MLOps on-call support

  • On-call is a practice where engineers are assigned to be available for a specific period of time to solve any blockers that come up (usually in production and on mission-critical systems).

  • Typically they are the first line of defense for ticket support. They also provide initial triage to understand whether an issue is due to PEBCAK (problem exists between chair & keyboard) or because of a real bug that needs to be ticketed and patched in the underlying tooling and platform.

  • MLOps Engineers might be assigned to on-call for a week or even a day, and the schedule rotates through the team (see the sketch after this list), unless the team has some special shadowing going on for new hires as part of their onboarding process.

  • In many teams, a specific Slack channel is created where data scientists can post questions and get help from all the MLOps eyeballs watching the channel (as well as from their fellow data scientists). If they really need help, they can also tag the engineer that’s assigned to on-call, with the expectation that an SLA will be met (i.e. they'll get a response within X time).
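
For illustration, here's a minimal Python sketch of the weekly rotation described above: given a roster and a date, it tells you who's on call. The roster and anchor date are made up, and real teams usually manage this with a paging tool or a shared calendar rather than a homegrown script.

```python
# oncall.py – hypothetical sketch of a weekly on-call rotation.
# Cycles through the roster one engineer per week, starting from a
# made-up anchor Monday.
from datetime import date

TEAM = ["ana", "bo", "chen", "devi"]  # hypothetical roster
ROTATION_START = date(2023, 1, 2)     # a Monday anchoring week 0

def on_call(day: date) -> str:
    """Return the engineer on call for the week containing `day`."""
    weeks_elapsed = (day - ROTATION_START).days // 7
    return TEAM[weeks_elapsed % len(TEAM)]

if __name__ == "__main__":
    print(f"On call this week: {on_call(date.today())}")
```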
