Description: The team constructed a Machine Learning Operations (MLOps) Pipeline.

 

Overview: MLOps sits at the intersection of machine learning, data engineering, and DevOps; it is the process of deploying a machine learning model into production.

 

Goals:

Develop MLOps Pipeline

Deploy tools in a multi-cloud environment

Align to current delivery efforts

 

Problem Statement: Given aggregated information through year N-1 of an officer’s Naval Medical Corps career, predict the probability that the officer will leave in year N.
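A minimal sketch of how this problem statement could translate into a supervised learning example is shown here. The record fields (`deployments`, `promoted`, `left_service`) and the helper `build_example` are hypothetical and are not drawn from the actual data set. The sketch only illustrates the key constraint: features are aggregated from years before N, and the label comes from year N.

```python
from dataclasses import dataclass


@dataclass
class YearRecord:
    officer_id: str     # anonymized identifier (hypothetical field)
    year: int           # year within the career, 1-based
    deployments: int    # hypothetical per-year feature
    promoted: bool      # hypothetical per-year feature
    left_service: bool  # True if the officer left during this year


def build_example(records, target_year):
    """Return (features, label) for predicting attrition in target_year.

    Features use only rows with year < target_year, so nothing from
    year N leaks into the prediction for year N.
    """
    history = [r for r in records if r.year < target_year]
    outcome = [r for r in records if r.year == target_year]
    if not history or not outcome:
        return None  # not enough data to form an example
    features = {
        "years_observed": len(history),
        "total_deployments": sum(r.deployments for r in history),
        "promotions": sum(1 for r in history if r.promoted),
    }
    label = 1 if outcome[0].left_service else 0
    return features, label


# Made-up career of three yearly rows for one officer.
career = [
    YearRecord("a1", 1, 0, False, False),
    YearRecord("a1", 2, 1, True, False),
    YearRecord("a1", 3, 2, False, True),
]
features, label = build_example(career, target_year=3)
print(features, label)
```

A model trained on such pairs would then output a probability for the label rather than a hard yes/no decision.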

 

The team performed the following activities:

·       Explored the BUMIS II data set.

·       Removed PII for cloud use through data anonymization.

·       Conducted extensive data cleaning to remove variables that were not populated adequately.

·       Combined the per-year rows for each individual into a single row summarizing that individual’s career (~10K records).

·       Split the data, using 90% as the training set and 10% as the test set.

·       Performed model sensitivity analysis by investigating how changes to individual parameters affected model output.

·       Incorporated Key Performance Indicators (KPIs) in user feedback loops to support approval/disapproval of model deployment in a production environment.
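Two of the activities above, the 90/10 split and the sensitivity analysis, can be sketched as follows. This is an illustration under stated assumptions, not the team's implementation: `predict_proba` is a stand-in logistic scorer with made-up weights, and the feature names are hypothetical. The sensitivity function changes one input at a time and records how far the predicted probability moves.

```python
import math
import random


def split_90_10(rows, seed=0):
    """Shuffle a copy of rows and return (train, test) with a 90/10 split."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 9 // 10
    return shuffled[:cut], shuffled[cut:]


def predict_proba(features, weights, bias):
    """Stand-in logistic scorer; a trained model would replace this."""
    z = bias + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def sensitivity(features, weights, bias, delta=1.0):
    """Change one feature at a time by delta and report the shift in output."""
    baseline = predict_proba(features, weights, bias)
    shifts = {}
    for name in features:
        perturbed = dict(features)
        perturbed[name] += delta
        shifts[name] = predict_proba(perturbed, weights, bias) - baseline
    return shifts


rows = list(range(10_000))  # placeholder for ~10K career summary rows
train, test = split_90_10(rows)
print(len(train), len(test))

weights = {"years_observed": -0.2, "total_deployments": 0.3}  # made-up values
officer = {"years_observed": 5.0, "total_deployments": 2.0}   # made-up input
shifts = sensitivity(officer, weights, bias=0.0)
print(shifts)
```

Fixing the shuffle seed keeps the split reproducible, which makes it easier to compare model versions against the same test set.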

The team used JDSAT’s AWS Dev Sandbox Account to train and test the ML model:

·       AWS Service: SageMaker, used to train the model

·       Model: Classic logistic regression

·       Algorithm: XGBoost
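The bullets above pair logistic regression with XGBoost. One plausible reading, which is an assumption here and not confirmed by the source, is that XGBoost was trained with a logistic objective so that its output is a probability. The sketch below shows what such a configuration might look like. The parameter names follow XGBoost's documented hyperparameters (`num_round` is the name used by SageMaker's built-in XGBoost algorithm), while every value is illustrative.

```python
import math

# Assumed hyperparameters: the names are real XGBoost parameters, but the
# values are illustrative and not taken from the project.
hyperparameters = {
    "objective": "binary:logistic",  # output a probability for the positive class
    "eval_metric": "auc",            # area under the ROC curve on validation data
    "num_round": 100,                # number of boosting rounds
    "max_depth": 5,                  # maximum depth of each tree
    "eta": 0.2,                      # learning rate
}


def margin_to_probability(margin):
    """Logistic link: map a raw margin (summed tree scores) to a probability."""
    return 1.0 / (1.0 + math.exp(-margin))


print(margin_to_probability(0.0))
```

Under this objective, the boosted trees produce a raw score and the logistic function converts it into the attrition probability asked for in the problem statement.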

 

Pros of SageMaker:

·       Multiple options available for modelling a variety of problems

·       Model versioning

·       Data/Model Drift Scenarios

·       Hyperparameter tuning

 

In addition to AWS SageMaker, the team explored similar technologies within the Google Cloud Platform service BigQuery.

 

Pros of BigQuery ML:

·       Serverless data warehouse with SQL-like querying and built-in ML capabilities

·       Great for ad-hoc requests

Cons of BigQuery ML:

·       Less robust

·       Lacks portability

 

Results: The team successfully created a functional and interactive MLOps pipeline showcasing ML capabilities across multiple cloud platforms (AWS and Google Cloud Platform). In addition, the team evaluated costs for ML services and considered data security when deploying confidential data to the cloud.