XGBClassifier on AWS Lambda

The intent of this blog post is to share how I got the XGBoost layer and our code under the AWS Lambda size limit so that others can benefit from it. In the process I also want to open it up for feedback, so that I can learn whether things could have been done differently. Please feel free to leave comments.

Download the required packages

To start, download XGBoost and scikit-learn along with their dependencies. The steps are listed below. (Note: sklearn is required for XGBClassifier. I have seen some XGBoost layers available which do not have sklearn packaged with them, but XGBClassifier doesn't work with those layers.)
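
A quick way to see whether a given environment has what XGBClassifier needs is a check along these lines (a minimal sketch; the exact exception and message depend on the xgboost version):

try:
    from xgboost import XGBClassifier
    XGBClassifier()  # with xgboost 0.90 this raises if scikit-learn is missing
    print("XGBClassifier is usable")
except Exception as exc:
    print("XGBClassifier unavailable:", exc)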

  • Create a requirements.txt file with the following content :
xgboost==0.90
scikit-learn==0.22.1
  • Create a get_layer_packages.sh script to download the packages along with their dependencies :
#!/bin/bash

export PKG_DIR="python"

rm -rf ${PKG_DIR} && mkdir -p ${PKG_DIR}

docker run --rm -v $(pwd):/foo1 -w /foo1 lambci/lambda:build-python3.6 \
    pip install -r requirements.txt -t ${PKG_DIR}

You can replace the content of requirements.txt with any other packages for your custom layers. Also, if the dependencies are already available in another layer and you don't want to include them again, you can pass the --no-deps flag to pip install, as in the sketch below. Please note that this step requires Docker to be installed on your system.
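
For example, if numpy and scipy were already provided by another layer, the pip line in get_layer_packages.sh could be changed to something like this (just a sketch; make sure the other layer really provides everything you skip):

docker run --rm -v $(pwd):/foo1 -w /foo1 lambci/lambda:build-python3.6 \
    pip install --no-deps -r requirements.txt -t ${PKG_DIR}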

  • Total package size at this point is approximately 347 MB :
$ du -c -m -d 1 python

1 python/joblib-0.14.1.dist-info
1 python/scikit_learn-0.22.1.dist-info
1 python/bin
148 python/xgboost
1 python/xgboost-0.90.dist-info
79 python/numpy
29 python/sklearn
2 python/joblib
1 python/scipy-1.4.1.dist-info
1 python/numpy-1.18.1.dist-info
90 python/scipy
347 python
347 total

Cleanup by removing unnecessary folders

  • Remove dist-info, __pycache__, test, tests folders and strip .so files :
$ rm -rf python/scipy-1.4.1.dist-info/
$ rm -rf python/xgboost-0.90.dist-info/
$ rm -rf python/joblib-0.14.1.dist-info/
$ rm -rf python/bin/
$ rm -rf python/scikit_learn-0.22.1.dist-info/
$ rm -rf python/numpy-1.18.1.dist-info/
$ find python/ -name __pycache__ | xargs rm -rf
$ find python/ -name "*.so" | xargs strip
$ find python/ -name test | xargs rm -rf
$ find python/ -name tests | xargs rm -rf

Size after this step is around 265 MB:

$ du -c -m -d 1 python

18      python/sklearn
66      python/scipy
1       python/joblib
41      python/numpy
141     python/xgboost
265     python/
265     total

Typically, this step would be enough for most layers. But in this case, as you can see, we are still about 15 MB over the 250 MB limit before we can deploy it.
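
For reference, once the python/ directory does fit, turning it into a layer is just a matter of zipping it and publishing it, roughly like this (a sketch; the layer name and runtime are placeholders, and archives above the direct-upload size have to be uploaded to S3 and referenced with --content instead of --zip-file):

$ zip -r9 xgboost-layer.zip python/
$ aws lambda publish-layer-version \
    --layer-name xgboost-sklearn \
    --zip-file fileb://xgboost-layer.zip \
    --compatible-runtimes python3.6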

Selective cleanup depending on the use-case

Now I was in unknown territory and still had to get rid of a little more than 15 MB. This is where the insanity started: first by randomly deleting things that seemed unnecessary and realizing that the modules wouldn't even load, and then, as a last resort, by thinking logically about what I actually needed. I think all of us go through that phase of going crazy over something, and when we are about to lose hope, a calm mind works better because there is nothing left to lose. I am glad I gave it my last shot in that state of mind.

Disclaimer: This is particular to the use-case that we needed. Use your discretion while deleting or modifying stuff henceforth mentioned.

  • XGBoost cleanup
$ ls python/xgboost/

build-python.sh  callback.py  compat.py  core.py  dmlc-core  include  __init__.py  lib  libpath.py  make  plotting.py  rabit  rabit.py  sklearn.py  src  training.py  VERSION

The directories in this listing are dmlc-core, include, lib, make, rabit and src. You'll have to read and explore what you do and don't need for your use-case. For running the XGBClassifier on AWS Lambda for predictions, since we don't need the interface to allreduce and broadcast for distributed XGBoost, I decided to remove the rabit and dmlc-core directories. I also deleted the src and include directories, which weren't required either.

$ rm -rf python/xgboost/rabit
$ rm -rf python/xgboost/dmlc-core/
$ rm -rf python/xgboost/src/
$ rm -rf python/xgboost/include/

This cleaned up around 3 MB.

  • sklearn cleanup

Next, I repeated the process for sklearn. I chose sklearn since it is not a true dependency of xgboost, and only a part of its functionality is required by XGBClassifier.

$ rm -rf python/sklearn/gaussian_process/
$ rm -rf python/sklearn/model_selection/
$ rm -rf python/sklearn/tree/
$ rm -rf python/sklearn/impute/
$ rm -rf python/sklearn/feature_extraction/
$ rm -rf python/sklearn/cluster/
$ rm -rf python/sklearn/cross_decomposition/
$ rm -rf python/sklearn/neighbors/
$ rm -rf python/sklearn/experimental/
$ rm -rf python/sklearn/neural_network/
$ rm -rf python/sklearn/covariance/
$ rm -rf python/sklearn/datasets/
$ rm -rf python/sklearn/decomposition/
$ rm -rf python/sklearn/inspection/
$ rm -rf python/sklearn/metrics/
$ rm -rf python/sklearn/feature_selection/
$ rm -rf python/sklearn/svm/
$ rm -rf python/sklearn/manifold/
$ rm -rf python/sklearn/linear_model/
$ rm -rf python/sklearn/mixture/
$ rm python/sklearn/isotonic.py
$ rm python/sklearn/_isotonic.cpython-36m-x86_64-linux-gnu.so

This cleaned up around 11 MB.

Also, these are the changes I had to make to the xgboost/compat.py file:

@@ -67,10 +67,6 @@
     from sklearn.base import BaseEstimator
     from sklearn.base import RegressorMixin, ClassifierMixin
     from sklearn.preprocessing import LabelEncoder
-    try:
-        from sklearn.model_selection import KFold, StratifiedKFold
-    except ImportError:
-        from sklearn.cross_validation import KFold, StratifiedKFold
     SKLEARN_INSTALLED = True
@@ -78,8 +74,8 @@
     XGBRegressorBase = RegressorMixin
     XGBClassifierBase = ClassifierMixin
-    XGBKFold = KFold
-    XGBStratifiedKFold = StratifiedKFold
+    XGBKFold = None
+    XGBStratifiedKFold = None
     XGBLabelEncoder = LabelEncoder
 except ImportError:
     SKLEARN_INSTALLED = False
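
After edits like these, it is worth checking that the trimmed packages still import inside the same Amazon Linux build image (a minimal smoke test; it assumes python3.6 is on the image's PATH):

$ docker run --rm -v $(pwd):/foo1 -w /foo1 -e PYTHONPATH=/foo1/python \
    lambci/lambda:build-python3.6 \
    python3.6 -c "from xgboost import XGBClassifier; XGBClassifier(); print('OK')"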

At this point, the layer is under the limit. However, there is barely any room left for code, so I tried to free up a couple of extra MBs for our code and its package requirements.

  • scipy cleanup
$ rm -rf python/scipy/ndimage/

Change in python/scipy/setup.py :
@@ -23,7 +23,6 @@
     config.add_subpackage('spatial')
     config.add_subpackage('special')
     config.add_subpackage('stats')
-    config.add_subpackage('ndimage')
     config.add_subpackage('_build_utils')
     config.add_subpackage('_lib')
     config.make_config_py()

Change in python/scipy/stats/stats.py :
@@ -177,7 +177,7 @@
from scipy._lib.six import callable, string_types
from scipy.spatial.distance import cdist
-from scipy.ndimage import measurements
+#from scipy.ndimage import measurements
from scipy._lib._version import NumpyVersion
from scipy._lib._util import _lazywhere, check_random_state, MapWrapper
import scipy.special as special
  • numpy cleanup
$ rm -rf python/numpy/random/_examples/
$ rm -rf python/numpy/distutils/fcompiler/
$ rm -rf python/numpy/f2py/

After all these cleanups, 2274588 bytes (the 250 MB limit of 262144000 bytes minus the 259869412 bytes used by the layer, as shown below), or about 2.17 MB, are available for your code and package deployment.

$ du -m -c -d 1 python/

7       python/sklearn
66      python/scipy
1       python/joblib
40      python/numpy
138     python/xgboost
249     python/
249     total

$ du -b -c -d 1 python/

6164864         python/sklearn
67757024        python/scipy
580588          python/joblib
41463817        python/numpy
143899023       python/xgboost
259869412       python/
259869412       total
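
To give an idea of what fits in the remaining ~2 MB, here is a sketch of the kind of handler that goes on top of this layer. The model path and the event format are placeholders; it assumes a pre-trained XGBClassifier serialized with joblib and packaged with the function code.

import json

import numpy as np
from joblib import load
from xgboost import XGBClassifier  # imported explicitly so a broken layer fails fast at cold start

MODEL_PATH = "model.joblib"  # hypothetical: shipped alongside the function code
_model = None


def _get_model():
    # Cache the model across warm invocations so it is only loaded once.
    global _model
    if _model is None:
        _model = load(MODEL_PATH)
    return _model


def lambda_handler(event, context):
    # Expects an event like {"features": [[f1, f2, ...], ...]}
    features = np.array(event["features"], dtype=float)
    predictions = _get_model().predict(features)
    return {
        "statusCode": 200,
        "body": json.dumps({"predictions": predictions.tolist()}),
    }

Keeping the model in a module-level variable means the joblib load only happens on cold starts, which keeps warm-invocation latency down.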

Though I did have to mess around with the code and delete some things that were not required for our use case, I still think it was worth the effort to get this working on Lambda rather than having to deploy it on an external machine. With this, I can take advantage of the many benefits of serverless computing without having to worry about scalability, server management and code deployments, to name a few.


5 thoughts on “XGBClassifier on AWS Lambda”

  1. Hello Swati,

    Great post, but… I still have some difficulties getting just XGBoost under 250 MB. Is it possible for you to explain the diff you did on xgboost/compat.py? I'm having difficulties there.

    Even better if you could share the complete layer 🙂

    Regards, Hans

    1. Hello Hans,

      I have fixed the formatting that made the diff unclear earlier. Also, please note that some of the things I removed / cleaned up were not required for how I was using XGBClassifier. So, in case you are using some of the removed components, you might have to update the layer accordingly. I have uploaded the xgboost layer that works for us on GitHub: https://github.com/swatiagarwal-s/xgboost-lambda-layer/releases/tag/v1.

      See if that works for you.

      Swati
