Data Engineering Pipelines / Data pipelines
A machine learning model is only as good as its data. Hence, it is important to set up data pipelines for extracting, exploring, validating, wrangling/cleaning and splitting the raw data to make it machine learning ready. The objective of the data pipeline is to perform the operations that produce the training and testing datasets (a minimal sketch follows the list below).
- Data Extraction: Extract data from databases, APIs, message queuing systems, and web or network locations.
- Data Validation: Validate data ranges and data schemas.
- Data Preprocessing: Clean and transform the data.
- Feature Engineering: Extract features from the datasets.
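A minimal sketch of these steps, assuming a hypothetical `raw_orders.csv` extract and pandas/pyarrow available; the column names are only illustrative:

```python
import numpy as np
import pandas as pd

def extract(path: str) -> pd.DataFrame:
    # Data Extraction: a CSV file stands in for a database/API/queue source
    return pd.read_csv(path)

def validate(df: pd.DataFrame) -> pd.DataFrame:
    # Data Validation: check the schema and simple value ranges
    missing = {"order_id", "amount", "created_at"} - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {missing}")
    if (df["amount"] < 0).any():
        raise ValueError("negative order amounts found")
    return df

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # Data Preprocessing: clean and transform
    df = df.dropna(subset=["amount"]).copy()
    df["created_at"] = pd.to_datetime(df["created_at"])
    return df

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    # Feature Engineering: derive model-ready columns
    df["order_dayofweek"] = df["created_at"].dt.dayofweek
    df["log_amount"] = np.log1p(df["amount"])
    return df

if __name__ == "__main__":
    features = engineer_features(preprocess(validate(extract("raw_orders.csv"))))
    features.to_parquet("features.parquet")  # pipeline output to be version controlled
```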
It is best practice to version control the output of this pipeline for reproducibility and governance (auditing). To version control features, a feature registry is needed.
Feature Registry:
A feature registry/store is a backend where features are stored along with timestamps tracking every entry's creation and updates. Once the feature registry backend is set up, Python scripts can connect to the feature store to read and write features.
Level 1: A simple feature registry can be maintained with a relational database design by adding timestamp columns to track when each feature was created and updated.
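A minimal Level 1 sketch using SQLite (any relational database would do); the table and feature names are placeholders:

```python
import sqlite3

conn = sqlite3.connect("feature_registry.db")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS features (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        feature_name TEXT NOT NULL,
        feature_value TEXT NOT NULL,
        created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
        updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    )
    """
)

# write a feature entry; the timestamp columns track creation and updates
conn.execute(
    "INSERT INTO features (feature_name, feature_value) VALUES (?, ?)",
    ("avg_order_amount", "42.5"),
)
conn.commit()

# read back the latest entry for a feature
row = conn.execute(
    "SELECT feature_value, created_at FROM features "
    "WHERE feature_name = ? ORDER BY created_at DESC, id DESC LIMIT 1",
    ("avg_order_amount",),
).fetchone()
print(row)
conn.close()
```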
Level 2: As the data moves from structured to unstructured, it is a good idea to save the data as objects. Save snapshots of the features in a backend (AWS S3, shared network locations).
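A sketch of the Level 2 idea with boto3, assuming an S3 bucket named `my-feature-store` exists and AWS credentials are configured:

```python
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
bucket = "my-feature-store"  # assumed bucket name

# the key encodes a timestamp so older snapshots are never overwritten
snapshot_key = f"features/{datetime.now(timezone.utc):%Y-%m-%dT%H-%M-%S}/features.parquet"

# upload the feature snapshot produced by the data pipeline
with open("features.parquet", "rb") as f:
    s3.put_object(Bucket=bucket, Key=snapshot_key, Body=f)

# list available snapshots to pick one for training
response = s3.list_objects_v2(Bucket=bucket, Prefix="features/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["LastModified"])
```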
Level 3: Use a data lake data versioning tool like Data Version Control (DVC). Is there an existing DVC setup used in the data lake?
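If DVC is already in place, versioned features can be read back in Python via `dvc.api`; a rough sketch, with the repo URL, file path, and tag all assumed:

```python
import dvc.api

# read a pinned version (git tag or commit) of a DVC-tracked feature file
with dvc.api.open(
    "features/features.parquet",                 # path inside the repo (assumed)
    repo="https://github.com/org/feature-repo",  # assumed repo URL
    rev="v1.0",                                  # tag/commit pinning the snapshot
    mode="rb",
) as f:
    snapshot_bytes = f.read()
```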
Like the data pipeline, other processes and pipelines could write to the feature store as well.
Machine Learning (ML) pipelines
- Feature Extraction: Extract the features from the feature store.
- Feature Preprocessing: feature transformation and splitting the data into training, testing and validation sets.
- Model Training
- Model Testing
- Model Evaluation
The setup for the machine learning pipeline is similar to the setup of the data pipelines; a minimal sketch is shown below.
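A minimal sketch of the ML pipeline steps with scikit-learn; the feature file and the `label` column are placeholders:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Feature Extraction: read features from the feature store output
features = pd.read_parquet("features.parquet")
X, y = features.drop(columns=["label"]), features["label"]

# Feature Preprocessing: split into training, validation and testing sets
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)

# Model Training
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Model Testing / Evaluation
print("validation accuracy:", accuracy_score(y_val, model.predict(X_val)))
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```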
Model Registry
We can serialize the model files and use the same concept as the feature registry to store and version control the model images and artifacts. The machine learning pipeline evaluates the model and writes it to the model registry. The model serving code checks out the latest or tagged model from the model registry and makes predictions with it.
Level 1 implementation
model table
id | model name | serving version |
---|---|---|
1 | nlp | 1.0 |
2 | randomforest | 1.2 |
3 | RNN | 1.0 |
model artifact table
id | model_id | version | image | metrics |
---|---|---|---|---|
a1 | 1 | 1.0 | 01010 | ——- |
a2 | 2 | 1.0 | 10111 | ——- |
a3 | 2 | 1.1 | 00001 | ——- |
a4 | 2 | 1.2 | 00010 | ——- |
a5 | 3 | 1.0 | 00100 | ——- |
a6 | 3 | 1.1 | 10011 | ——- |
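A sketch of writing a serialized model into such a Level 1 registry and checking out a tagged version for serving, using SQLite and pickle for illustration:

```python
import pickle
import sqlite3

conn = sqlite3.connect("model_registry.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS model_artifact ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, model_name TEXT, version TEXT, "
    "image BLOB, metrics TEXT, created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP)"
)

def register_model(model, name: str, version: str, metrics: str) -> None:
    # serialize the trained model and store it as a blob with its metrics
    conn.execute(
        "INSERT INTO model_artifact (model_name, version, image, metrics) VALUES (?, ?, ?, ?)",
        (name, version, pickle.dumps(model), metrics),
    )
    conn.commit()

def checkout_model(name: str, version: str):
    # serving code loads a tagged model version back into memory
    row = conn.execute(
        "SELECT image FROM model_artifact WHERE model_name = ? AND version = ?",
        (name, version),
    ).fetchone()
    return pickle.loads(row[0])
```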
Level 2:
Open source tools like MLflow come with a built-in model tracking service. One can host the model registry service by setting up a custom tracking server for registering models. Moreover, MLflow also has a built-in UI for model tracking.
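A rough sketch of registering a model through MLflow, assuming a tracking server is running at `http://localhost:5000`:

```python
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier

mlflow.set_tracking_uri("http://localhost:5000")  # assumed custom tracking server

model = RandomForestClassifier().fit([[0], [1]], [0, 1])  # toy model for illustration

with mlflow.start_run():
    mlflow.log_metric("accuracy", 0.95)  # example metric
    # log the model and register it in the model registry in one step
    mlflow.sklearn.log_model(model, "model", registered_model_name="randomforest")
```

Serving code can then load a registered version back with `mlflow.pyfunc.load_model("models:/randomforest/1")`.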
AWS SageMaker also provides a model registry service, which is easy to set up and use.
Model Serving
Model serving can take multiple forms, or a combination of them, depending on the nature of the serving requirements.
Level 1: Batch processing jobs
Batch processing Python scripts can be scheduled to check out the model from the model registry and make predictions by feeding data to the models. For Tableau reporting, the Python scripts can be scheduled to connect to the Tableau server and write the predictions. The structure of the batch script looks like the list below (a skeleton sketch follows it):
- Extract the data to make predictions
- Transform the data using the feature engineering pipeline script
- Check out the serving model from the model registry
- Make predictions
- Transform the predictions to make them ready for the database
- Update the database with the prediction values
- Provide feedback and store the new features in the feature registry
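A skeleton of such a batch script, assuming the data pipeline and model registry helpers sketched earlier are packaged as importable modules (module, file, and table names are placeholders):

```python
import sqlite3

import pandas as pd

# assumed modules wrapping the helpers sketched in earlier sections
from data_pipeline import engineer_features, preprocess, validate
from model_registry import checkout_model

def run_batch_job() -> None:
    # 1. Extract the data to make predictions (source is a placeholder)
    raw = pd.read_csv("new_orders.csv")

    # 2. Transform the data using the feature engineering pipeline
    features = engineer_features(preprocess(validate(raw)))

    # 3. Check out the serving model from the model registry
    model = checkout_model("randomforest", version="1.2")

    # 4. Make predictions
    features["prediction"] = model.predict(features.drop(columns=["order_id"]))

    # 5./6. Transform the predictions and update the reporting database
    with sqlite3.connect("reporting.db") as conn:
        features[["order_id", "prediction"]].to_sql(
            "predictions", conn, if_exists="append", index=False
        )

    # 7. Store the new features back for the feature registry / feedback loop
    features.to_parquet("features_with_predictions.parquet")

if __name__ == "__main__":
    run_batch_job()
```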
Level 2: API
The model can be packaged and served as an API (Model as a Service). Users can call the API and post their data for prediction. The API endpoint triggers the pipeline to make predictions. The batch processing job could still call the API to make predictions. The API can be integrated with databases, other APIs, microservices, Tableau, and Excel. APIs can be served with server-side templating (Jinja, HTML, Markdown) or a full front end (React, Angular, Vue).
Options:
FastAPI: rapid development, asynchronous design, automatic documentation and data validation through Pydantic data models, server-side templating compatible (a minimal sketch follows the options).
Flask: has a larger support community and has been on the market longer, server-side templating compatible.
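A minimal FastAPI sketch of Model as a Service; the request fields and the pickled model path are placeholders:

```python
import pickle

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Model as a Service")

# load the serving model once at startup (registry checkout is a placeholder)
with open("serving_model.pkl", "rb") as f:
    model = pickle.load(f)

class PredictionRequest(BaseModel):
    # Pydantic model gives automatic validation and documentation
    order_dayofweek: int
    log_amount: float

@app.post("/predict")
def predict(request: PredictionRequest):
    features = [[request.order_dayofweek, request.log_amount]]
    return {"prediction": float(model.predict(features)[0])}
```

If this is saved as `app.py`, it can be run with `uvicorn app:app`, and the auto-generated docs appear at `/docs`.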
Level 3: Adding a message queuing system to the API backend or batch processing workflow.
As the volume and velocity of the data grow, the API may no longer be able to handle the incoming requests. This problem can be solved by implementing a message broker system: the incoming API requests are published to message queues, the model makes predictions by consuming the data from the queues, and the results are published back to a message queue. This way, large datasets can be consumed and published/streamed (a sketch follows the options below).
Options:
- Kafka
- Amazon SQS
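A rough sketch of the queue-based flow using the `kafka-python` client; the topic names, broker address, and payload format are assumptions:

```python
import json
import pickle

from kafka import KafkaConsumer, KafkaProducer

# load the serving model (registry checkout is a placeholder)
with open("serving_model.pkl", "rb") as f:
    model = pickle.load(f)

consumer = KafkaConsumer(
    "prediction-requests",               # assumed input topic
    bootstrap_servers="localhost:9092",  # assumed broker address
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# consume feature payloads, predict, and publish results to an output topic
for message in consumer:
    prediction = model.predict([message.value["features"]])[0]
    producer.send("prediction-results", {"prediction": float(prediction)})
```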
Model monitoring
Resources:
- https://ml-ops.org/
- https://github.com/visenger/awesome-mlops?tab=readme-ov-file
- https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning