🤖 What An MLOps Engineer Does 💻

📆 And What The Week Can Look Like


Sec 1: Introduction – Who am I and why this blog post?

Hey there! My name is Mikiko and (at the time of writing) I lead MLOps at Featureform.

Before joining Featureform, I worked as a:

  • Sr MLOps Engineer at Mailchimp (Intuit)

  • Data Scientist at Teladoc

  • Data Analyst at Sunrun & WalkMe

  • Hybrid Data Analyst/Data Scientist at Autodesk

And a hodgepodge of roles at various early-stage startups and SMBs.

Most importantly, I’ve sat on the engineering hiring committee at Mailchimp (including designing and leading technical interviews for MLOps & Data Engineers) and mentored at two data science bootcamps.

As a senior member of the MLOps team, I regularly conducted interviews where half the candidates didn’t seem to understand the role they were interviewing for and came in unprepared, both in their answers and in their pre-interview study.

And who can blame them?!

Heck, I remember how varied and unclear the definition of the MLE role was when I was trying to make the transition from data scientist to MLE, especially when planning a learning roadmap.

Hopefully, by the end of this blog post, you’ll have a clearer understanding of what the role entails and even how an MLOps Engineer’s day-to-day is spent.


Sec 2: What is an MLOps Engineer?

In the post “Defining MLOps as Simply As Possible,” I defined MLOps (or Machine Learning Operations) as follows:

  • MLOps is the practice of productionizing machine learning artifacts in a scalable and reliable manner, where “artifacts” can include projects, applications, services, and pipelines.

And then noted:

  • An MLOps System or Platform is a collection of tooling and processes that enables the systematic development and productionization of machine learning artifacts.

  • An MLOps Team is a collection of individuals focused on the design, development, and maintenance of the MLOps System (or Platform).

What about an MLOps Engineer then?


Sec 3: What is their scope of responsibilities?

Based on the definitions I’ve put forth, it follows that I define an MLOps Engineer as a:

  • Developer & maintainer of the tooling & infrastructure that supports data science development and deployment.

In larger and more mature companies, these are NOT the actual model developers themselves or the productionizers of models (however, there are always exceptions).

In many companies, especially startups or SMBs (or even new teams within a more established company), it’s often expected that individuals wear multiple hats. The model developer is often also building the tools and infrastructure that facilitate their work.

In companies and industries that are incredibly specialized (like self-driving cars and robotics in the early days), this was (and still is) unavoidable: the tools didn’t exist, and when they did exist in the open-source ecosystem, they still needed further customization.

This is why it’s important not to define an MLOps engineer strictly by title, but as an individual or team that supports the set of practices and tools that collectively solve the unique challenges machine learning introduces to software.

With that being said, in many cases, asking the people flying the airplane to build it at the same time is a quick way to burn out an individual or a team. Restructuring or adapting an existing team or org can also be tricky for various reasons (political, strategic, etc.).

In this blog post we’re going to speak about the MLOps role as if it’s the more specialized persona.

Most of what I note will still apply to the hybrid roles, but assume that some additional tasks or responsibilities are not described below.


Sec 4: Detailed Breakdown of Responsibilities

The Overall Buckets

Based on the definitions described above, the work an MLOps Engineer is responsible for falls into two obvious categories:

  • Dev - Creating, optimizing, maintaining, and deprecating the components of the infrastructure that supports model development, productionization & deployment.

  • Ops - Ensuring adoption and enablement of the tools and processes that are meant to facilitate the data science teams' work, including unblocking projects as needed or offering support in areas where tooling and automation haven’t been created yet.

Depending on how senior an engineer is, they may also have additional responsibilities such as mentoring, hiring and interviewing, sitting on cross-functional initiatives, and interfacing with other engineering teams. These are responsibilities that cut across all software engineering disciplines and arenā€™t specific to MLOps.

Although the goal is to minimize operational work as much as possible through automation, in some roles the ratio of “Dev” to “Ops” work has been roughly 30% vs 70% (especially on teams with low automation), whereas on other teams it’s been closer to 60% vs 40%.

Breaking Down the Dev + Ops Buckets of MLOps

Let’s break down these responsibilities & categories of work even further.

For all the buckets, the goal is to enable new capabilities or efficiencies for data scientists & ML engineers.

How this goal is tackled can be broken out into tasks that are either focused on the tooling & infrastructure layer or people & workflow layers.

Dev Bucket

For the “Dev” bucket, the responsibilities are targeted at the platform and tool level. They include:

  • Developing infrastructure & tools – Using a combination of custom internal tools, public cloud services, open-source projects, or external proprietary products;

  • Refactoring & optimizing existing infrastructure;

    1. Including fixing bugs in our tooling (which usually come up during on-call or data science project consulting)

    2. Minimizing or paying back tech debt

  • Maintenance of tools (and occasionally pipelines);

    1. Usually captured as tickets in the backlog that we need to get to but that weren’t critical at the time
  • Deprecating tools & components (always fun & satisfying).

Some examples of how these activities are performed or how they come up on the platform or tool level:

  • āž”ļø Meeting with data science and data engineering to figure out if there are gaps in our MLOps stack that need to be addressed by new tooling, whether open-source or cloud vendor based. This kicks off the process of requirements gathering, writing a tech spec, building a POC, and then testing and evaluating the tool before releasing for use by the data scientists.

  • āž”ļø Prioritizing and implementing bug fixes for our current processes and tooling. Maybe we didnā€™t pin a specific version for a dependency and now itā€™s causing havoc for the data scientists. Maybe we didnā€™t test whether our dev environments are having issues with GPU support. Maybe an upstream dependency in a really popular data science library was changed, wasnā€™t communicated, and now lots of people across the internet are having issues using the latest version and we need to give the data science team guidance in the short-term.

  • āž”ļø Finally getting to those tickets we parked because they werenā€™t a high priority at the time, either because they were nice-to-have features or they didnā€™t block main development. Maybe the data scientists wanted a different documentation generation tool. Maybe there were new testing libraries that looked interesting and that we could implement in our packaging process.

Ops Bucket

For the Ops bucket, which is targeted at the workflows and processes level (i.e. the interaction of People with Technology), this includes:

  • Defining best practices and ensuring pipelines & models adhere to those best practices through code (see the sketch after this list) as well as non-technical processes;

  • Driving adoption of best practices & tools through workshops, office hours, documentation, & code;

  • Enablement through manual bridging, i.e. if there are gaps or rough edges in the current toolchain, or areas that are hard to automate, helping push projects over the line;

  • Internal consulting i.e. assisting data scientists in navigating engineering decisions, code reviews, etc.
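
As one concrete (and hypothetical) flavor of enforcing best practices through code, here's a minimal Python sketch that checks a model project against some made-up scaffolding conventions before it ships. Real teams would encode their own conventions, often as a CI step or a project template.

```python
# check_scaffolding.py – hypothetical sketch of best practices enforced in code.
# Verifies a model project contains the files our (made-up) conventions require.
import sys
from pathlib import Path

REQUIRED = [
    "README.md",         # what the model does and who owns it
    "requirements.txt",  # pinned dependencies
    "tests",             # a tests/ directory
    "Dockerfile",        # reproducible runtime
]

def missing(project_dir: str) -> list[str]:
    """Return the required files or directories missing from the project."""
    root = Path(project_dir)
    return [name for name in REQUIRED if not (root / name).exists()]

if __name__ == "__main__":
    gaps = missing(sys.argv[1] if len(sys.argv) > 1 else ".")
    if gaps:
        print(f"Project is missing: {', '.join(gaps)}")
        sys.exit(1)  # block productionization until conventions are met
    print("Scaffolding checks passed.")
```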

Specific examples of how an MLOps engineer can provide support at the workflow & processes level:

  • āž”ļø Embedded project consulting ā€“ This is where the MLOps team acts as an internal solutions consultant, helping to guide the data scientists to the right pattern or architecture for their model package or pipeline.

    • This can also include (using GCP as an example):

      • Refactoring code;

      • Helping them get set up with GCP via Terraform;

      • Helping them develop tests (see the sketch after this list);

      • Helping them navigate any difficulties with Docker, Airflow, BigQuery, and writing any shell scripts;

      • Helping them with adding the necessary credentials to their projects.

    • We can also help them answer questions like:

      • “Is it a super complex project? Does it need a different tool than we support, like Spark or Dataflow?”
    • We can figure out additional tools we need to be looking at by understanding the unique requirements of their project, such as serving and latency requirements, and by pair coding.
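
To give a flavor of what “helping them develop tests” looks like in practice, here's a minimal pytest sketch. The clean_features() function is a hypothetical stand-in for a data scientist's preprocessing step (included inline so the example runs on its own); the point is the pattern: small, fast unit tests around the transformation logic.

```python
# test_preprocessing.py – hypothetical pytest sketch for a model package.
# clean_features() stands in for the data scientist's real preprocessing step.

def clean_features(rows: list[dict]) -> list[dict]:
    """Drop rows missing the target and impute missing ages with 0."""
    cleaned = []
    for row in rows:
        if row.get("target") is None:
            continue  # can't train on rows without a label
        cleaned.append({**row, "age": row.get("age") or 0})
    return cleaned

def test_rows_without_target_are_dropped():
    rows = [{"age": 30, "target": 1}, {"age": 25, "target": None}]
    assert len(clean_features(rows)) == 1

def test_missing_age_is_imputed_with_zero():
    rows = [{"age": None, "target": 0}]
    assert clean_features(rows)[0]["age"] == 0

def test_empty_input_returns_empty_list():
    assert clean_features([]) == []
```

Run with `pytest test_preprocessing.py`; the same structure scales up to fixtures and golden files as the pipeline grows.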

This approach is sometimes called the “Embedded” or “Squad” model: you might have specific MLOps Engineers (usually those with prior experience working as data scientists) assigned to specific data scientists or data science teams, and they share in the wins & lows of their data scientists’ projects.

  • āž”ļø On-call Support ā€“ On teams that tend to be more ā€œservice-basedā€ (and are still responsible for ā€œdevā€ work) getting pinged about bugs and issues can be super disruptive and results in recursive context switching, jumping from one fire to another.

    • On-call support is a pattern that is mostly seen in teams or organizations that are responsible for production products and environments.

    • However, on-call support can also be found in teams that deal with pre-production model development and deployment tooling and environments (as well as serving environments).


Sec 5: Distribution of Responsibilities Through the Week

How do these responsibilities shake out in the day-to-day or week-to-week?

I’m going to lean very heavily into my own work experiences in this section, but would like to note that an individual MLOps Engineer’s day-to-day is going to depend very heavily on the following factors:

  • Their total work hours (40+ hrs?)

  • Their seniority (i.e. are they managing other engineers, leading initiatives, mentoring, etc.?)

  • The team culture (Meetings galore or plenty of heads down time?)

  • Additional responsibilities outside of their engineering role (sitting on hiring panels, hackathons, contributing to the tech blog, etc.)

  • Maturity of toolchain and workflows.

All these factors will contribute to how much of their time they’re expected to focus on a single component or ticket, how much time is spent interfacing with other key stakeholders, etc.

I’m going to describe two radically different schedules from my experiences: working as an MLOps Engineer at an established company, and working as the “Data & ML infra gal wearing multiple hats” at a very early-stage startup.

Persona 1: Sr MLOps Engineer at Established Company

Relevant Factors

  • Company: Email marketing, 20 yrs old

  • Total workweek: ~40hrs

  • Remote role (Team based in Atlanta, I’m based in SF)

  • Team sizes:

    • Data Scientists: ~20+

    • MLOpsy Engineers: ~15+

      • My team: ~6
    • Data Engineers: ~15+

  • Not agile-based – planning was ad-hoc or weekly/bi-weekly, without real scrum or backlog grooming

Types of Meetings

Aside from the responsibilities outlined in the earlier section of this blog post:

  • Recurring Meetings:

    • Team Meetings: Talk about problems that have come up, asks that might have an impact on our current roadmap, update on OOO time, & update the Brag Deck

    • 1-1’s: With manager and teammates – These are super important with a bunch of us being remote. When new members join the team, I try to make sure 1-1s are weekly and at least 1 hr long for the first couple of months

    • Org meetings: Town hall updates, important company announcements

  • Variable or ad-hoc:

    • Project-specific check-ins: When a project is being productionized we’d usually have a 30 min check-in twice a week (this also includes internship projects)

    • Company events like hackathons, etc

Time Breakdown

| Category | As In | Hrs per Week (~40 hr) |
| --- | --- | --- |
| Dev Bucket | See: Dev Bucket (this category has an inverse relationship with the Admin & Sr Eng buckets – fewer meetings, more quality work time) | 5-6+ |
| Ops Bucket | See: Ops Bucket | 8-9+ |
| Sr Eng | Mentoring & managing an intern | 8 |
| Admin | Team meetings (or meetings on projects & with other teams) | 8 |

Not pictured in the schedule:

  • Any ad-hoc meetings or follow-ups;

  • Self-development time;

  • Special planning: Quarterly planning, monthly read-outs;

  • Bug-fixing & patching time;

  • On-call rotations;

  • Commuting time for in-person meetings (as rare as they were).

Persona 2: Data + ML Engineer (Part-Time to Full-Time) at Startup

Relevant Factors

  • Company: Pre-Series A real-estate tech

  • Total workweek: ~20 ➡️ ~40 hrs

  • Remote role (Team based in LA, I’m based in SF)

  • Team size:

    • <10 people
  • Build the plane while flying it - Operating & accumulating strategic technical debt

Biggest Differences Between Early-Stage Startup & Established Company

The main differences between the schedule pictured above and working as the “Data & ML Person” at a startup that’s building out its ML platform while building the main product:

  • Swap out most meetings for heads-down dev time – small team, so anything you need to say or ask, you just do it directly;

  • Less legacy code or technical debt because you’re building from scratch – so less cross-functional time;

  • No org meetings or updates;

  • Check-ins are largely done online;

  • Code is shipped as soon as possible;

  • You spend as much time talking about the product and potential monetization streams as you do writing code.

Time Breakdown

| Category | As In | Hrs per Week (~40 hr) |
| --- | --- | --- |
| Dev Bucket | See: Dev Bucket (this category has an inverse relationship with the Admin + Strategy/Product bucket – fewer meetings, more quality work time) | Most of the time |
| Ops Bucket | See: Ops Bucket | 3-4 hrs |
| Admin + Strategy/Product | Strategy & product meetings | 4-5 hrs |


Sec 6: Closing

Although this was a long read, I’m hoping that by the end of this post you have a really good idea of what the week-to-week of an MLOps Engineer could look like, either at an early-stage startup or at a more established company.

Before closing, however, I want to emphasize a few points:

  • Titles – Titles are a finicky thing. They’re meant to serve as useful heuristics so that when someone asks what you do at a dinner party, you can shorthand the myriad of responsibilities, tasks, and ways you provide value in your role. But like all heuristics, titles are imperfect. They’re meant to be useful rather than accurate. Keep that in mind, especially as you meet MLOps engineers with the title and MLOps engineers without it.

  • Implementation – The maturity and size of your org and company will determine the vastness and impact of your responsibilities. Data scientists aren’t a monolith, and neither are MLOps engineers.

  • Evolution – Like all roles, the MLOps role and its responsibilities will continue to change as the landscape changes.

And finally, I’ve added some useful links below if you’re interested in learning more about what an MLOps engineer does.

Let me know what you think at any of the following places!


Sec 7: Essential Links & Readings

My Prior Writings about #MLOpsCareers

Relevant Talks & Papers About #MLOpsInAction

Learn More About #MLOps With Me

The Eng Side of #MLOpsCareers

An MLOps Engineer is still an engineer. I link some resources that I think are really useful for folks who are thinking about what an engineering career looks like, with or without ML.


Sec 8: Footnotes

Caveats about “What is an MLOps Engineer”

  • There is a ton of back-&-forth on social media platforms like LinkedIn and YouTube about whether the role of an “MLOps Engineer” is real and how useful having a specialized role in a company is, as well as whether DevOps engineers or data scientists should be filling the function. Most of these discussions (and flame wars) tend to neglect the nuance of size, maturity, and age of a company, i.e. larger and older companies will tend to have more specialized roles, smaller companies more general roles.

  • There’s also an obsessive fixation on titles. Is someone an MLOps Engineer? An ML Engineer? A full-stack data scientist? My goal isn’t to wade into the alligator-infested swamp of those types of discussions.

  • My takes are based on my specific work experiences, discussions Iā€™ve had with recruiters and hiring managers, job postings, and surveys that have been published.

Engineer sentiment toward “sh*t-ops work” or “manual sh*t”

There are plenty of engineers I’ve talked to who believe that the embedded MLOps engineer pattern is an anti-pattern (as is the on-call support pattern). The belief is: “That which can be automated should absolutely be automated. And if it’s being done manually, it’s because the automation is bad.” In theory, I agree with that position. In practice, different teams are at different levels of automation maturity. Aspire to automation, but expect some level of manual support, at least until the automation is built.

Further explanation of MLOps on-call support

  • On-call is a practice where engineers are assigned to be available for a specific period of time to solve any blockers that come up (usually in production and on mission-critical systems).

  • Typically they are the first line of defense for ticket support. They also provide initial triage to understand whether an issue is due to PEBCAK (problem exists between chair & keyboard) or because of a real bug that needs to be ticketed and patched in the underlying tooling and platform.

  • MLOps Engineers might be assigned to on-call for a week or even a day, and the schedule rotates through the team (see the sketch after this list), unless the team has some special shadowing going on for new hires as part of their onboarding process.

  • In many teams, a specific Slack channel is created where data scientists can post questions and get help from all the MLOps eyeballs watching the channel (as well as from their fellow data scientists). If they really need help, they can also tag the engineer that’s assigned to on-call, with the expectation that an SLA will be met (i.e. they'll get a response within X time).
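
For illustration, here's a minimal Python sketch of the weekly rotation described above: given a roster and a date, it tells you who's on call. The roster and anchor date are made up, and real teams usually manage this with a paging tool or a shared calendar rather than a homegrown script.

```python
# oncall.py – hypothetical sketch of a weekly on-call rotation.
# Cycles through the roster one engineer per week, starting from a
# made-up anchor Monday.
from datetime import date

TEAM = ["ana", "bo", "chen", "devi"]  # hypothetical roster
ROTATION_START = date(2023, 1, 2)     # a Monday anchoring week 0

def on_call(day: date) -> str:
    """Return the engineer on call for the week containing `day`."""
    weeks_elapsed = (day - ROTATION_START).days // 7
    return TEAM[weeks_elapsed % len(TEAM)]

if __name__ == "__main__":
    print(f"On call this week: {on_call(date.today())}")
```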
