Converting Machine Learning Models to SAS using m2cgen (Python)

January 03, 2021

A hack to deploy your trained ML models such as XGBoost and LightGBM in SAS

m2cgen is a friendly package that converts many kinds of trained models into supported languages such as R and VBA. However, m2cgen does not yet support SAS. This article is for those who need to deploy trained models in a SAS environment. The approach introduced here is to convert the model to VBA code first and then turn the VBA code into a SAS script.

The scripts used in this tutorial are uploaded to my GitHub repo; feel free to clone the files.

Package

m2cgen

Functionality

m2cgen (Model 2 Code Generator) is a lightweight library that provides an easy way to transpile trained statistical models into native code (Python, C, Java, Go, JavaScript, Visual Basic, C#, PowerShell, R, PHP, Dart, Haskell, Ruby, F#).

Demonstration

Convert XGBoost model to VBA, then to SAS scripts
Convert XGBoost model to VBA, then to SAS scripts (with missing values)

Data

The Iris Dataset loaded from sklearn

Task 1: Convert XGBoost model to VBA

# import packages
import pandas as pd
import numpy as np
import os 
import re

from sklearn import datasets
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

import m2cgen as m2c

# import data
iris = datasets.load_iris()
X = iris.data
Y = iris.target

First of all, we import the packages and data needed for this task.

# split data into train and test sets
seed = 2020
test_size = 0.3
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)

# fit model on training data
model = XGBClassifier()
model.fit(X_train, y_train)

Then, let’s train a simple XGBoost model.

code = m2c.export_to_visual_basic(model, function_name = 'pred')

Next, convert the XGBoost model to VBA. The export_to_visual_basic function of m2cgen returns your trained XGBoost model as VBA code. Exporting to the other supported languages is just as simple.

Here comes the core of this tutorial. After converting the model to VBA, a few steps are needed to turn the VBA code into a SAS script: removing the lines that have no use in a SAS environment, such as “Module xxx”, “Function yyy” and “Dim var Z As Double”, and inserting “;” at the end of statements to follow the syntax rules of SAS.

# remove unnecessary things
code = re.sub(r'Dim var.* As Double', '', code)
code = re.sub(r'End If', '', code)

# change the script to sas scripts
# change the beginning (raw string so that '\(' reaches the regex engine unchanged)
code = re.sub(r'Module Model\nFunction pred\(ByRef inputVector\(\) As Double\) As Double\(\)\n', 
                'DATA pred_result;\nSET dataset_name;', code)

# change the ending
code = re.sub(r'End Function\nEnd Module\n', 'RUN;', code)

# insert ';' after every line that ends with a digit or with ')'
for pattern in (r'[0-9]+\n', r'\)\n'):
    for original_str in re.findall(pattern, code):
        code = code.replace(original_str, original_str[:-1] + ';\n')

# replace the 'inputVector' with var name
dictionary = {'inputVector(0)':'sepal_length',
              'inputVector(1)':'sepal_width',
              'inputVector(2)':'petal_length',
              'inputVector(3)':'petal_width'} 
for key in dictionary.keys():
    code = code.replace(key, dictionary[key])

# change the prediction labels
code = re.sub(r'Math\.Exp', 'Exp', code)
code = re.sub(r'pred = .*\n', '', code)
temp_var_list = re.findall(r'var[0-9]+\(\d\)', code)
for var_idx, temp_var in enumerate(temp_var_list):
    # plain string replacement, so the parentheses need no escaping
    code = code.replace(temp_var, iris.target_names[var_idx] + '_prob')

Step-by-step Explanation:

# remove unnecessary things
code = re.sub(r'Dim var.* As Double', '', code)
code = re.sub(r'End If', '', code)

# change the beginning (raw string so that '\(' reaches the regex engine unchanged)
code = re.sub(r'Module Model\nFunction pred\(ByRef inputVector\(\) As Double\) As Double\(\)\n', 
                'DATA pred_result;\nSET dataset_name;', code)

# change the ending
code = re.sub(r'End Function\nEnd Module\n', 'RUN;', code)

The first three parts are quite straightforward. We remove the unwanted lines with regular expressions, then change the beginning of the script to “DATA pred_result;\nSET dataset_name;”, where pred_result is the name of the output table produced by the SAS script and dataset_name is the name of the input table to score. The last part changes the ending of the script to “RUN;”.
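To see these three steps in isolation, here is a minimal sketch that applies them to a tiny hand-written snippet. The snippet only imitates the layout of m2cgen's Visual Basic output and is an assumption for illustration; real output is much longer.

```python
import re

# Hand-written stand-in for m2cgen's Visual Basic output (assumed layout).
vb_code = (
    "Module Model\n"
    "Function pred(ByRef inputVector() As Double) As Double()\n"
    "    Dim var0 As Double\n"
    "    If (inputVector(2)) >= (2.45) Then\n"
    "        var0 = -0.21827246\n"
    "    Else\n"
    "        var0 = 0.42043796\n"
    "    End If\n"
    "End Function\n"
    "End Module\n"
)

# Drop declarations and block terminators that SAS does not need.
sas_code = re.sub(r"Dim var.* As Double", "", vb_code)
sas_code = re.sub(r"End If", "", sas_code)

# Swap the VB header for a DATA step header and the footer for RUN.
sas_code = re.sub(
    r"Module Model\nFunction pred\(ByRef inputVector\(\) As Double\) As Double\(\)\n",
    "DATA pred_result;\nSET dataset_name;",
    sas_code,
)
sas_code = re.sub(r"End Function\nEnd Module\n", "RUN;", sas_code)

print(sas_code)
```

The result starts with the DATA step header and ends with “RUN;”, while the If/Else body is left for the later steps.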

# insert ';' after every line that ends with a digit or with ')'
for pattern in (r'[0-9]+\n', r'\)\n'):
    for original_str in re.findall(pattern, code):
        code = code.replace(original_str, original_str[:-1] + ';\n')

SAS requires “;” at the end of each statement, so we append one to every line that ends with a number or a closing parenthesis. Lines ending in “Then” or “Else” are left alone because the statement continues on the next line.
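The same effect can probably be had with a single substitution. The sketch below is an alternative under that assumption, run on a short made-up snippet rather than on real m2cgen output.

```python
import re

# Made-up lines in the shape the script has at this stage.
snippet = (
    "If (petal_length) >= (2.45) Then\n"
    "    var0 = -0.21827246\n"
    "Else\n"
    "    var0 = 0.42043796\n"
    "var3 = Exp(((0.5) + (var0)) + (var1))\n"
)

# Append ';' to every line that ends in a digit or a closing parenthesis.
with_semicolons = re.sub(r"([0-9)])\n", r"\1;\n", snippet)
print(with_semicolons)
```

Only the three assignment lines receive a semicolon; the “Then” and “Else” lines stay as they are.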

# replace the 'inputVector' with var name
dictionary = {'inputVector(0)':'sepal_length',
              'inputVector(1)':'sepal_width',
              'inputVector(2)':'petal_length',
              'inputVector(3)':'petal_width'} 
for key in dictionary.keys():
    code = code.replace(key, dictionary[key])

Using a dictionary, we can map each “inputVector” entry to the matching variable name in the input dataset and replace all of them in one go.
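When there are many features, typing the mapping by hand gets tedious. A minimal sketch, assuming the column names are available as an ordered list (the list below is written out by hand for illustration), builds the same dictionary programmatically:

```python
# Column names in the same order as the training matrix (assumed).
feature_names = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# Build the placeholder-to-name mapping from the list.
mapping = {f"inputVector({i})": name for i, name in enumerate(feature_names)}

line = "If (inputVector(2)) >= (2.45) Then"
for placeholder, name in mapping.items():
    line = line.replace(placeholder, name)

print(line)  # If (petal_length) >= (2.45) Then
```

Because each placeholder includes the closing parenthesis, “inputVector(1)” cannot accidentally match inside “inputVector(10)”.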

# change the prediction labels
code = re.sub(r'Math\.Exp', 'Exp', code)
code = re.sub(r'pred = .*\n', '', code)
temp_var_list = re.findall(r'var[0-9]+\(\d\)', code)
for var_idx, temp_var in enumerate(temp_var_list):
    # plain string replacement, so the parentheses need no escaping
    code = code.replace(temp_var, iris.target_names[var_idx] + '_prob')

The last part of the conversion changes the prediction labels: “Math.Exp” becomes the SAS function “Exp”, the “pred = …” return line is dropped, and each indexed output variable is renamed to the matching class name followed by “_prob”.
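As a rough illustration of the renaming, the sketch below runs the same idea on a simplified, hypothetical ending. The variable name var3 and the exact shape of these lines are assumptions; the real generated code differs in detail.

```python
import re

class_names = ["setosa", "versicolor", "virginica"]  # same order as iris.target_names

# Hypothetical, simplified ending of the generated script.
snippet = (
    "var3(0) = Math.Exp(var0)\n"
    "var3(1) = Math.Exp(var1)\n"
    "var3(2) = Math.Exp(var2)\n"
    "pred = var3\n"
)

snippet = re.sub(r"Math\.Exp", "Exp", snippet)
snippet = re.sub(r"pred = .*\n", "", snippet)

# Collect the indexed outputs once each, in order of first appearance.
indexed_vars = list(dict.fromkeys(re.findall(r"var[0-9]+\(\d\)", snippet)))
for var_name, class_name in zip(indexed_vars, class_names):
    snippet = snippet.replace(var_name, class_name + "_prob")

print(snippet)
```

Deduplicating with dict.fromkeys is a design choice of this sketch: it keeps the class order stable even if an indexed variable shows up more than once.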

# save output
with open('vb1.sas', 'w') as sas_file:
    sas_file.write(code)

Lastly, we save the output to a file with the “.sas” suffix.

That’s the end of the first task, and you should now be able to convert your trained models to SAS scripts. To double-check the generated SAS script, you can use the script below to compare the Python predictions with the SAS predictions. Please note that the predicted probabilities (Python vs SAS) differ slightly, but the difference should be negligible.

# python pred
python_pred = pd.DataFrame(model.predict_proba(X_test))
python_pred.columns = ['setosa_prob','versicolor_prob','virginica_prob']
python_pred

# sas pred
sas_pred = pd.read_csv('pred_result.csv')
sas_pred = sas_pred.iloc[:,-3:]
sas_pred

(abs(python_pred - sas_pred) > 0.00001).sum()
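The check above counts, per column, how many probabilities differ by more than 0.00001. The same idea can be sketched without pandas; the numbers below are made up purely for illustration and are not results from the model.

```python
import math

# Made-up probabilities for two rows; real values would come from
# model.predict_proba and from the table exported by SAS.
python_rows = [[0.9900000, 0.0070000, 0.0030000], [0.0100000, 0.9500000, 0.0400000]]
sas_rows = [[0.9900001, 0.0069999, 0.0030000], [0.0100000, 0.9500002, 0.0399998]]

tolerance = 1e-5

# Count the cells whose absolute difference exceeds the tolerance.
mismatches = sum(
    not math.isclose(p, s, rel_tol=0.0, abs_tol=tolerance)
    for p_row, s_row in zip(python_rows, sas_rows)
    for p, s in zip(p_row, s_row)
)
print(mismatches)
```

A count of zero means every probability agrees within the chosen tolerance.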

Task 2: Convert XGBoost model to VBA, then to SAS scripts (with missing values)

If your training data does not have missing values, XGBoost by default puts the “missing values” on the left node when generating the tree (as illustrated in the tree diagram below). In the script generated by m2cgen, the condition tested is then always whether a variable is greater than or equal to a given number. Thus, if there is a missing value in the testing or prediction dataset, the script sends it to the Else part. For example, in the SAS script generated in Task 1, the first test condition is “If (petal_length) >= (2.45) Then var0 = -0.21827246; Else var0 = 0.42043796;”, so if petal_length is missing, it is not greater than or equal to 2.45 and var0 is assigned 0.42043796. Another example is shown below.

What if your training data contains missing values? XGBoost will send the “missing values” to the left or the right node, depending on the training results. (Thus, the conditions shown in the SAS script are sometimes “<” and sometimes “>=”.)

You can create a dataset with missing values using the script below and repeat the steps in Task 1 to compare the prediction output of SAS and Python.
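In languages with IEEE floating-point semantics, a NaN fails both kinds of test, which is what lets the generated code route missing values to the Else branch whichever operator is used. A small Python sketch of that behaviour, with thresholds borrowed from the conditions quoted in this article:

```python
import math

def ge_split(x):
    # A ">=" condition: which branch does x take?
    return "then" if x >= 2.45 else "else"

def lt_split(x):
    # A "<" condition: which branch does x take?
    return "then" if x < 11.600001 else "else"

print(ge_split(5.1), lt_split(5.1))            # then then
print(ge_split(math.nan), lt_split(math.nan))  # else else
```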

from random import sample

import numpy as np
from sklearn import datasets

# import data
iris = datasets.load_iris()
X = iris.data
Y = iris.target

# assume that there are missing values in all four columns of the data
# (columns 0 and 2 share one set of rows, columns 1 and 3 share another)
sequence = [i for i in range(len(X))]
subset0 = sample(sequence, 30)
subset1 = sample(sequence, 50)
# subset2 and subset3 are sampled but not used below
subset2 = sample(sequence, 40)
subset3 = sample(sequence, 60)
X[subset0, 0] = np.nan
X[subset1, 1] = np.nan
X[subset0, 2] = np.nan
X[subset1, 3] = np.nan

I ran the test and found that some rows have the same prediction output while others show big differences (see the picture below).

I compared the intermediate variables generated in the SAS script. Take the second row shown below as an example. The first condition tested is “If (petal_length) >= (2.45) Then var0 = -0.217358515; Else …”. petal_length is missing, so the condition is not fulfilled and we go to the Else statement. The second condition tested is “If (petal_width) >= (0.84999996) Then var0 = -0.155172437; Else …”. petal_width is 0.2, so again the condition is not fulfilled and we go to the Else statement. The third condition is “If (sepal_length) < (11.600001) Then var0 = 0.411428601; Else …”. sepal_length is missing, so the condition should not be fulfilled, but SAS evaluates it as true and var0 becomes 0.411428601. This happens because SAS treats a missing numeric value as smaller than any number in comparisons, so a “<” test on a missing value passes.

Therefore, to cater for this scenario, I added a step that forces each condition to check first whether the value is missing.
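The mismatch comes from how the two environments order missing values: SAS sorts a missing numeric value below every number, while a NaN in Python fails every comparison. The sketch below emulates the SAS rule in Python, with None standing in for a SAS missing value; the helper name is hypothetical.

```python
import math

def sas_less_than(value, threshold):
    """Emulate SAS '<' on numerics, with None standing in for missing."""
    if value is None:
        return True  # SAS orders missing below every nonmissing number
    return value < threshold

print(sas_less_than(None, 11.600001))  # True: the Then branch is taken in SAS
print(math.nan < 11.600001)            # False: Python would take the Else branch
```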

# handle missing values
for original_str in re.findall(r'If.*Then', code):
    tokens = original_str.split()
    # tokens[1] is the tested variable, e.g. '(inputVector(2))'
    new_str = ' '.join(tokens[:-1] + ['and not missing', tokens[1], 'Then'])
    code = code.replace(original_str, new_str)

With this addition, the manual script that converts VBA to SAS becomes the one below. You can find the full version in my GitHub repo.
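To make the effect of this step concrete, here is a minimal sketch that applies the same rewrite to a single condition. It passes a function to re.sub instead of looping, which is a design choice of the sketch; the helper name is hypothetical.

```python
import re

sas_line = "If (sepal_length) < (11.600001) Then"

def add_missing_guard(match):
    # Reuse the tested variable, e.g. "(sepal_length)", in the guard.
    tokens = match.group(0).split()
    return " ".join(tokens[:-1] + ["and not missing", tokens[1], "Then"])

guarded = re.sub(r"If.*Then", add_missing_guard, sas_line)
print(guarded)
# If (sepal_length) < (11.600001) and not missing (sepal_length) Then
```

With the guard in place, a missing sepal_length makes the whole condition false, so the row falls through to the Else branch.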

# remove unnecessary things
code = re.sub(r'Dim var.* As Double', '', code)
code = re.sub(r'End If', '', code)

# change the beginning (raw string so that '\(' reaches the regex engine unchanged)
code = re.sub(r'Module Model\nFunction pred\(ByRef inputVector\(\) As Double\) As Double\(\)\n', 
                'DATA pred_result;\nSET dataset_name;', code)

# change the ending
code = re.sub(r'End Function\nEnd Module\n', 'RUN;', code)

# insert ';' after every line that ends with a digit or with ')'
for pattern in (r'[0-9]+\n', r'\)\n'):
    for original_str in re.findall(pattern, code):
        code = code.replace(original_str, original_str[:-1] + ';\n')

# handle missing values
for original_str in re.findall(r'If.*Then', code):
    tokens = original_str.split()
    # tokens[1] is the tested variable, e.g. '(inputVector(2))'
    new_str = ' '.join(tokens[:-1] + ['and not missing', tokens[1], 'Then'])
    code = code.replace(original_str, new_str)

# replace the 'inputVector' with var name
dictionary = {'inputVector(0)':'sepal_length',
              'inputVector(1)':'sepal_width',
              'inputVector(2)':'petal_length',
              'inputVector(3)':'petal_width'} 
for key in dictionary.keys():
    code = code.replace(key, dictionary[key])

# change the prediction labels
code = re.sub(r'Math\.Exp', 'Exp', code)
code = re.sub(r'pred = .*\n', '', code)
temp_var_list = re.findall(r'var[0-9]+\(\d\)', code)
for var_idx, temp_var in enumerate(temp_var_list):
    # plain string replacement, so the parentheses need no escaping
    code = code.replace(temp_var, iris.target_names[var_idx] + '_prob')

In this tutorial, I used the m2cgen package to convert an XGBoost model to VBA code and then to a SAS script. Going through VBA is not required; you can pick another language such as C or Java if you prefer. This tutorial simply demonstrates a hack for converting a script from one language to another.

To learn more about m2cgen, please see the official GitHub repository.

cyda