Moving across the typical machine learning lifecycle can be a nightmare. From gathering and processing data to building models through experiments, deploying the best ones, and managing them at scale for continuous value in production—it’s a lot.
Machine learning platforms are increasingly seen as the “fix” that consolidates all the components of MLOps, from development to production. Not only does the platform give your team the tools and infrastructure they need to build and operate models at scale, but it also applies standard engineering and MLOps principles to all use cases.
In this comprehensive guide, we’ll explore everything you need to know about machine learning platforms, including:
Components that make up an ML platform.
How to understand your users (data scientists, ML engineers, etc.).
Gathering requirements from your users.
Deciding the best approach to build or adopt ML platforms.
An ML platform standardizes the technology stack for your data team around best practices to reduce incidental complexities with machine learning and better enable teams across projects and workflows.
Why are you building an ML platform? We ask this during product demos, user and support calls, and on our ML Platform podcast. Generally, people say they do MLOps to make the development and maintenance of production machine learning seamless and efficient.
With an ML platform, machine learning operations (MLOps) should become easier at every stage of a machine learning project’s life cycle, from prototyping to production at scale, even as the number of models in production grows from one or a few to tens, hundreds, or thousands that have a positive effect on the business.
The platform should be designed to orchestrate your machine learning workflow, be environment-agnostic (portable to multiple environments), and work with different libraries and frameworks.
ML platform architecture
The ML platform architecture serves as a blueprint for your machine learning system. This article defines architecture as the way the highest-level components are wired together.
The features of an ML platform and the core components that make up its architecture are:
Data stack and model development stack.
Model deployment and operationalization stack.
Workflow management component.
Administrative and security component.
Core technology stack.
1. Data and model development stacks
Main components of the data and model development stacks include:
Data and feature store.
Experimentation component.
Model registry.
ML metadata and artifact repository.
Data and feature store
In a machine learning platform, a feature store (or repository) gives your data scientists a place to find and share the features they build from their datasets. It also ensures they use the same code to compute feature values for model training and inference, avoiding training-serving skew.
Different teams may be involved in extracting features from different dataset sources, so a centralized storage would ensure they could all use the same set of features to train models for different use cases.
The feature stores can be offline (for finding features, training models, and batch inference services) or online (for real-time model inference with low latency).
The key benefit that a feature store brings to your platform is that it decouples feature engineering from feature usage, allowing independent development and consumption of features. Features added to a feature store become immediately available for training and serving.
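To make that decoupling concrete, here is a minimal sketch using Feast-style calls; the repository path, feature view, feature names, and entity keys are all hypothetical, and other feature stores expose similar training/serving interfaces.

```python
import pandas as pd
from feast import FeatureStore

# Point at a (hypothetical) feature repository that already defines a
# "user_stats" feature view keyed by "user_id".
store = FeatureStore(repo_path=".")

# Training / batch: join point-in-time-correct features onto labeled events.
labeled_events = pd.DataFrame({
    "user_id": [1234, 5678],
    "event_timestamp": pd.to_datetime(["2023-06-01", "2023-06-02"]),
    "label": [1, 0],
})
training_df = store.get_historical_features(
    entity_df=labeled_events,
    features=["user_stats:avg_order_value", "user_stats:orders_last_30d"],
).to_df()

# Serving: fetch the same feature definitions online, with low latency.
online_features = store.get_online_features(
    features=["user_stats:avg_order_value", "user_stats:orders_last_30d"],
    entity_rows=[{"user_id": 1234}],
).to_dict()
```

Because both paths read the same feature definitions, the team that engineers a feature and the teams that consume it can work independently without re-implementing the computation.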
Related post: How to Solve the Data Ingestion and Feature Store Component of the MLOps Stack
Experimentation component
Experiment tracking can help manage how an ML model changes over time to meet your data scientists’ performance goals during training. Your data scientists develop models on this component, which stores all parameters, feature definitions, artifacts, and other experiment-related information they care about for every experiment they run.
Along with the code for training the model, this component is where they write code for data selection, exploration, and feature engineering. Based on the results of the experiments, your data scientists may decide to change the problem statement, switch the ML task, or use a different evaluation metric.
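As a rough illustration of what gets captured per run, here is a minimal sketch using MLflow as one example of a tracking backend; the experiment name, parameters, metrics, and artifact path are placeholders.

```python
import mlflow

# A hypothetical experiment grouping all churn-model runs.
mlflow.set_experiment("churn-prediction")

with mlflow.start_run(run_name="baseline-logreg"):
    # Record the configuration the data scientist cares about for this run.
    mlflow.log_params({"model_type": "logistic_regression", "C": 0.1, "features_version": "v3"})

    # ... train and evaluate the model here ...

    # Record the resulting metrics and any artifacts the team wants to keep.
    mlflow.log_metric("val_auc", 0.87)
    mlflow.log_artifact("feature_definitions.yaml")  # assumes this file exists locally
```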
Model registry
The model registry component helps you put some structure into the process of productionalizing ML models for your data scientists. The model registry stores the validated training model and the metadata and artifacts that go with it.
This central repository stores and organizes models in a way that makes them easier to manage and deploy across the team and, in most cases, helps avoid production errors (for example, putting the wrong model into production).
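A minimal sketch of that workflow, assuming an MLflow-style registry; the model name, run ID, and stage are placeholders, and other registries offer equivalent register/promote operations.

```python
import mlflow
from mlflow.tracking import MlflowClient

# Register the model artifact produced by a (hypothetical) training run.
model_version = mlflow.register_model(
    model_uri="runs:/<run_id>/model",  # placeholder run ID from the experiment tracker
    name="churn-prediction",
)

# Attach metadata that reviewers and deployment tooling can rely on,
# then mark the version as ready for production.
client = MlflowClient()
client.update_model_version(
    name="churn-prediction",
    version=model_version.version,
    description="Validated on the June holdout set, val_auc=0.87",
)
client.transition_model_version_stage(
    name="churn-prediction", version=model_version.version, stage="Production"
)
```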
Related post: Best ML Model Registry Tools
ML metadata and artifact repository
You might need the ML metadata and artifact repository to make it easier to compare model performance and test them in the production environment. A model can be tested against the production model, drawing from the ML metadata and artifact store to make those comparisons.
Learn more about this component in this blog post about the ML metadata store, what it is, why it matters, and how to implement it.
Here’s a high-level structure of how the data stack fits into the model development environment:
2. Model deployment and operationalization stack
The main components of the model deployment and operationalization stack include the following:
Production environment.
Model serving.
Monitoring and observability.
Responsible AI and explainability.
Production environment
Your data scientists can manually build and test models that you deploy to the production environment. In an ideal situation, pipelines and orchestrators take a model from the model registry, package it, test it, and then put it into production.
The production environment component lets the model be tested against the production models (if they exist) by using the ML metadata and artifact store to compare the models. You could also decide to build configurations for deployment methods like canary, shadow, and A/B deployment in the production environment.
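As a rough illustration of one of those methods, here is a sketch of a weighted canary split between the current production model and a candidate at the routing level; the placeholder predict functions and the 10% weight are arbitrary choices, not a prescribed setup.

```python
import random

def production_predict(features):
    """Placeholder for a call to the current production model service."""
    return {"score": 0.42, "model": "production"}

def candidate_predict(features):
    """Placeholder for a call to the candidate model pulled from the registry."""
    return {"score": 0.57, "model": "candidate"}

CANARY_WEIGHT = 0.10  # send roughly 10% of traffic to the candidate

def route(features):
    # Canary routing: a small, configurable share of requests goes to the
    # candidate; everything else stays on the production model.
    if random.random() < CANARY_WEIGHT:
        return candidate_predict(features)
    return production_predict(features)

print(route({"orders_last_30d": 3}))
```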
Model serving component
When your DSs (data scientists) or MLEs (machine learning engineers) deploy the models to their target environments as services, they can serve predictions to consumers through different modalities.
The model serving component helps organize the models in production so you can have a unified view of all your models and successfully operationalize them. It integrates with the feature store for retrieving production features and the model registry for serving candidate models.
You can go through this guide to learn how to solve the model serving component of your MLOps platform.
These are the popular model serving modalities:
Online inference.
Streaming inference.
Offline batch inference.
Embedded inference.
Online inference
The ML service serves real-time predictions to clients as an API (a function call, REST API, gRPC, or similar) for every request on demand. The main concern with this service is scalability, but that’s a typical operational challenge for software.
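A minimal sketch of such a real-time endpoint, using FastAPI; the request schema and the model-loading step are stand-ins for whatever your serving component actually provides.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PredictionRequest(BaseModel):
    user_id: int
    features: dict

def load_model():
    """Placeholder: in practice the model comes from the model registry."""
    return lambda features: {"score": 0.5}

model = load_model()

@app.post("/predict")
def predict(request: PredictionRequest):
    # One synchronous prediction per request; scaling is handled by running
    # more replicas behind a load balancer, as with any stateless service.
    return model(request.features)
```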
Streaming inference
The clients push the prediction request and input features into the feature store in real time. The service will consume the features in real time, generate predictions in near real-time, such as in an event processing pipeline, and write the outputs to a prediction queue.
The clients can read back predictions from the queue in real time and asynchronously.
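Here is a simplified sketch of that loop, using in-memory queues to stand in for the event stream and the prediction queue; a real setup would use a message broker such as Kafka, and the event fields are illustrative.

```python
import queue

request_stream = queue.Queue()    # stands in for the incoming feature/event stream
prediction_queue = queue.Queue()  # stands in for the queue clients read predictions from

def predict(features):
    """Placeholder for the model pulled from the registry."""
    return {"score": 0.5}

def streaming_worker():
    # Consume feature events as they arrive, score them in near real time,
    # and publish predictions for clients to read back asynchronously.
    while True:
        event = request_stream.get()
        if event is None:  # simple shutdown signal for this sketch
            break
        prediction = predict(event["features"])
        prediction_queue.put({"request_id": event["request_id"], "prediction": prediction})

# Example: enqueue one event, run the worker once, read the prediction back.
request_stream.put({"request_id": "abc-123", "features": {"clicks_last_hour": 4}})
request_stream.put(None)
streaming_worker()
print(prediction_queue.get())
```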
Offline batch inference
The client updates features in the feature store. An ML batch job runs periodically to perform inference. The job reads features, generates predictions, and writes them to a database. The client queries and reads the predictions from the database when needed.
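A sketch of such a periodic batch job, using pandas and SQLite as stand-ins for the feature store and the serving database; the table name and scoring logic are illustrative.

```python
import sqlite3
import pandas as pd

def load_features():
    """Placeholder for reading the latest features from the feature store."""
    return pd.DataFrame({"user_id": [1, 2, 3], "orders_last_30d": [4, 0, 7]})

def predict(df):
    """Placeholder scoring function for the model pulled from the registry."""
    return (df["orders_last_30d"] > 2).astype(float)

def run_batch_inference(db_path="predictions.db"):
    features = load_features()
    features["score"] = predict(features)

    # Write predictions to a database the client application can query later.
    with sqlite3.connect(db_path) as conn:
        features[["user_id", "score"]].to_sql(
            "predictions", conn, if_exists="replace", index=False
        )

run_batch_inference()
```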
Embedded inference
The ML service runs an embedded function that serves models on an edge device or embedded system.
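In practice, this often means loading a compiled model artifact directly into the application process, as in this TensorFlow Lite-style sketch using the tflite_runtime package; the model path and input shape are hypothetical.

```python
import numpy as np
import tflite_runtime.interpreter as tflite

# Load a (hypothetical) compiled model shipped with the device firmware.
interpreter = tflite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Run inference locally, with no network call to a serving backend.
sample = np.zeros(input_details[0]["shape"], dtype=np.float32)
interpreter.set_tensor(input_details[0]["index"], sample)
interpreter.invoke()
prediction = interpreter.get_tensor(output_details[0]["index"])
```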
Monitoring component
Implementing effective monitoring is key to successfully operating machine learning projects. This component works through a monitoring agent that regularly collects telemetry data, such as audit trails, service resource utilization, application statistics, logs, and errors, and sends it to the model monitoring engine, which consumes and manages it.
Inside the engine is a metrics data processor that:
Reads the telemetry data,
Calculates different operational metrics at regular intervals,
And stores them in a metrics database.
The monitoring engine also has access to production data, runs an ML metrics computer, and stores the model performance metrics in the metrics database.
An analytics service provides reports and visualizations of the metrics data. When certain thresholds are passed in the computed metrics, an alerting service can send a message.
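A toy sketch of that metrics-processor-plus-alerting idea: compute a metric from recent telemetry, store it, and alert when it crosses a threshold. The metric, threshold, and notification hook are all illustrative.

```python
import statistics
import time

metrics_db = []  # stands in for the metrics database
LATENCY_ALERT_MS = 250  # illustrative threshold

def read_recent_latencies():
    """Placeholder for reading telemetry collected by the monitoring agent."""
    return [120, 180, 310, 95, 240]

def send_alert(message):
    """Placeholder for the alerting service (Slack, PagerDuty, email, ...)."""
    print(f"ALERT: {message}")

def run_metrics_cycle():
    latencies = read_recent_latencies()
    p95 = statistics.quantiles(latencies, n=20)[18]  # approximate 95th percentile
    metrics_db.append({"metric": "latency_p95_ms", "value": p95, "ts": time.time()})
    if p95 > LATENCY_ALERT_MS:
        send_alert(f"p95 latency {p95:.0f} ms exceeds {LATENCY_ALERT_MS} ms")

run_metrics_cycle()
```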
Related post: A Comprehensive Guide on How to Monitor Your Models in Production
Responsible AI and explainability component
To fully trust ML systems, it’s important to be able to interpret their predictions. You’d need to build your platform to perform feature attribution for a given model prediction; these explanations show why the prediction was made.
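As one common approach, here is a sketch of per-prediction feature attribution with SHAP; the model and dataset are stand-ins for whatever your pipeline produces.

```python
import pandas as pd
import shap
from sklearn.ensemble import RandomForestClassifier

# Stand-in training data and model; in practice these come from your pipeline.
X = pd.DataFrame({"tenure_months": [1, 24, 36, 3], "orders_last_30d": [0, 5, 2, 1]})
y = [1, 0, 0, 1]
model = RandomForestClassifier(random_state=0).fit(X, y)

# Attribute a single prediction (probability of the positive class) to its features.
explainer = shap.Explainer(lambda df: model.predict_proba(df)[:, 1], X)
attribution = explainer(X.iloc[[0]])
print(dict(zip(X.columns, attribution.values[0])))
```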
You and your data scientist must implement this part together to make sure that the models and products meet the governance requirements, policies, and processes.
Since ML solutions also face threats from adversarial attacks that compromise the model and data used for training and inference, it makes sense to inculcate a culture of security for your ML assets too, and not just at the application layer (the administrative component).
Related post: Explainability and Auditability in ML: Definitions, Techniques, and Tools
3. Workflow management component
The main components here include:
Model deployment CI/CD pipeline.
Training formalization (training pipeline).
Orchestrators.
Test environment.
Model deployment CI/CD pipeline
ML models that are used in production don’t work as stand-alone software solutions. Instead, they must be built into other software components to work as a whole. This requires integration with components like APIs, edge devices, databases, microservices, etc.
The CI/CD pipeline retrieves the model from the registry, packages it as executable software, tests it for regression, and then deploys it to the production environment, which could be embedded software or ML-as-a-service.
The idea of this component is automation, and the goal is to quickly rebuild pipeline assets ready for production when you push new training code to the corresponding repository.
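As a rough sketch of the regression-test step such a pipeline might run before deploying, assuming an MLflow-style registry and a held-out reference dataset; the model name, dataset path, and quality threshold are placeholders.

```python
import sys

import mlflow
import pandas as pd
from sklearn.metrics import roc_auc_score

MODEL_URI = "models:/churn-prediction/Staging"  # placeholder registry URI
MIN_AUC = 0.85  # placeholder quality gate

def main():
    # Pull the packaged candidate model from the registry.
    model = mlflow.pyfunc.load_model(MODEL_URI)

    # Score a fixed reference dataset and gate the deployment on the result.
    reference = pd.read_parquet("reference_holdout.parquet")  # placeholder path
    scores = model.predict(reference.drop(columns=["label"]))
    auc = roc_auc_score(reference["label"], scores)

    if auc < MIN_AUC:
        print(f"Regression test failed: AUC {auc:.3f} < {MIN_AUC}")
        sys.exit(1)  # fail the CI job so the model is not deployed
    print(f"Regression test passed: AUC {auc:.3f}")

if __name__ == "__main__":
    main()
```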
Related post: 4 Ways Machine Learning Teams Use CI/CD in Production
Training formalization (training pipeline)
In cases where your data scientists need to retrain models, this component helps you manage repeatable ML training and testing workflows with little human intervention.
The training pipeline automates those workflows, from:
Collecting data from the feature store,
Setting hyperparameter combinations for training,
Building and evaluating the model,
Retrieving the test data from the feature store component,
Testing the model and reviewing results to validate the model’s quality,
If needed, updating the model parameters and repeating the entire process.
The pipelines primarily use schedulers and would help manage the training lifecycle through a DAG (directed acyclic graph). This makes the experimentation process traceable and reproducible, provided the other components discussed earlier have been implemented alongside it.
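A stripped-down sketch of those steps wired together as a small DAG of Python functions; a real platform would express this in an orchestrator such as Airflow or Kubeflow Pipelines, and the dataset, model, and validation threshold here are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def collect_data():
    # Placeholder for pulling training and test features from the feature store.
    X, y = make_classification(n_samples=500, n_features=8, random_state=0)
    return train_test_split(X, y, test_size=0.2, random_state=0)

def train(X_train, y_train, C=1.0):
    # Placeholder hyperparameters; a real pipeline might sweep several combinations.
    return LogisticRegression(C=C, max_iter=500).fit(X_train, y_train)

def evaluate(model, X_test, y_test):
    return accuracy_score(y_test, model.predict(X_test))

def training_pipeline(min_accuracy=0.8):
    # The steps form a simple DAG: collect -> train -> evaluate -> validate.
    X_train, X_test, y_train, y_test = collect_data()
    model = train(X_train, y_train)
    accuracy = evaluate(model, X_test, y_test)
    if accuracy < min_accuracy:
        raise ValueError(f"Model failed validation: accuracy={accuracy:.3f}")
    return model, accuracy

model, accuracy = training_pipeline()
```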
Related post: Building ML Pipeline: 6 Problems & Solutions
Orchestrators
The orchestrators coordinate how ML tasks run and where they get the resources to run their jobs. Orchestrators are concerned with lower-level abstractions like machines, instances, clusters, service-level grouping, replication, and so on.
Along with the schedulers, they are integral to managing the regular workflows your data scientists run and how the tasks in those workflows communicate with the ML platform.
Test environment
The test environment gives your data scientists the infrastructure and tools they need to test their models against reference or production data, usually at the sub-class level, to see how they might work in the real world before moving them to production. In this environment, you can have different test cases for your ML models and pipelines.
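For instance, one such test case might pin a minimum quality bar on a specific slice of reference data, sketched here in pytest style; the loading helper, slice, and threshold are illustrative.

```python
import pandas as pd

def load_reference_data():
    """Placeholder for reading a curated reference dataset with labels and candidate scores."""
    return pd.DataFrame({
        "country": ["US", "US", "DE", "DE"],
        "label": [1, 0, 1, 0],
        "score": [0.9, 0.2, 0.4, 0.3],  # pretend these came from the candidate model
    })

def test_accuracy_on_de_slice():
    # Sub-class (slice-level) check: the model must not silently degrade on DE users.
    df = load_reference_data()
    de = df[df["country"] == "DE"]
    accuracy = ((de["score"] > 0.5).astype(int) == de["label"]).mean()
    assert accuracy >= 0.5, f"DE slice accuracy too low: {accuracy:.2f}"
```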
This article by Jeremy Jordan delves deeper into how you can effectively test your machine learning systems.
If you want to learn how others in the wild are testing their ML systems, you can check out this article focused on ML model testing I curated.
4. Administrative and security components
This component is in the application layer of the platform and handles the user workspace and interaction with the platform. Your data scientists, who in most cases are your users (barring other stakeholders), need an interface to, for example, select compute resources, estimate costs, and manage resources and the different projects they work on.
In addition, you also need to provide some identity and access management (IAM) service so the platform only provides the necessary access level to different components and workspaces for certain users. This is a typical software design task to ensure your platform and users are secured.
5. Core technology stack
The main components of this stack include:
Programming Language.
Collaboration.
Libraries and Frameworks.
Infrastructure and Compute.
Programming language
The programming language is another crucial component of the ML platform. For one, there’s the language you would use to develop the ML platform itself, and, equally important, the language your users would use for ML development.
The most popular language, with strong community support and the best chance of making your users’ workflows efficient, is likely Python. But then again, understand their existing stack and skill set, so you know how to complement or migrate it.