XGBClassifier on AWS Lambda

The intent of this blog post is to share how I got the XGBoost layer and our code under the AWS Lambda limit so that others can benefit from it. I also want to open it up for feedback so that I can learn whether things can be done differently, so please feel free to leave comments.

Download the required packages

To start, download XGBoost and scikit-learn along with their dependencies. The steps are listed below. (Note: sklearn is required for XGBClassifier. I have seen some XGBoost layers that do not have sklearn packaged with them, but XGBClassifier doesn’t work with those layers.)

  • Create a requirements.txt file with the following content :
  • Create a get_layer_packages.sh script to download the packages along with their dependencies :
#!/bin/bash
export PKG_DIR="python"
rm -rf ${PKG_DIR} && mkdir -p ${PKG_DIR}
docker run --rm -v $(pwd):/foo1 -w /foo1 lambci/lambda:build-python3.6 \
pip install -r requirements.txt -t ${PKG_DIR}

You can replace the content of requirements.txt with any other packages for your custom layers. Also, if the dependencies are already available in another layer and you don’t want to include them, you can pass the --no-deps flag to pip install. Please note that this script requires Docker to be installed on your system.

  • Total package size at this point is approximately 347 MB :
$ du -c -m -d 1 python
1 python/joblib-0.14.1.dist-info
1 python/scikit_learn-0.22.1.dist-info
1 python/bin
148 python/xgboost
1 python/xgboost-0.90.dist-info
79 python/numpy
29 python/sklearn
2 python/joblib
1 python/scipy-1.4.1.dist-info
1 python/numpy-1.18.1.dist-info
90 python/scipy
347 python
347 total

Cleanup by removing unnecessary folders

  • Remove dist-info, __pycache__, test, tests folders and strip .so files :
$ rm -rf python/scipy-1.4.1.dist-info/
$ rm -rf python/xgboost-0.90.dist-info/
$ rm -rf python/joblib-0.14.1.dist-info/
$ rm -rf python/bin/
$ rm -rf python/scikit_learn-0.22.1.dist-info/
$ rm -rf python/numpy-1.18.1.dist-info/
$ find python/ -name __pycache__ | xargs rm -rf
$ find python/ -name "*.so" | xargs strip
$ find python/ -name test | xargs rm -rf
$ find python/ -name tests | xargs rm -rf
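
The directory removals above can also be scripted. The following is only a minimal sketch, not the method used in this post: the function name `prune_layer` is hypothetical, it removes directories only (the `find` commands above would also match plain files named `test`), and it leaves `python/bin` and the stripping of `.so` files to the shell commands.

```python
import os
import shutil

# Directory names removed wherever they appear (same names as the commands above).
PRUNE_NAMES = {"__pycache__", "test", "tests"}


def prune_layer(root):
    """Delete *.dist-info, __pycache__, test and tests directories under root.

    Returns the sorted list of removed paths.
    """
    removed = []
    # topdown=True lets us edit dirnames in place so os.walk skips deleted dirs.
    for dirpath, dirnames, _filenames in os.walk(root, topdown=True):
        for name in list(dirnames):
            if name in PRUNE_NAMES or name.endswith(".dist-info"):
                target = os.path.join(dirpath, name)
                shutil.rmtree(target)
                dirnames.remove(name)
                removed.append(target)
    return sorted(removed)
```

Calling `prune_layer("python")` from the folder that contains the layer directory would return the list of paths it deleted, which makes it easy to review what was removed.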

Size after this step is around 265 MB:

$ du -c -m -d 1 python
18 python/sklearn
66 python/scipy
1 python/joblib
41 python/numpy
141 python/xgboost
265 python/
265 total

Typically this step would be enough for most layers. But in this case, as you can see, we are still around 15 MB over Lambda’s 250 MB limit on the unzipped deployment package (function code plus layers), so the layer cannot be deployed yet.
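
To keep track of how far over the limit the layer is after each cleanup round, a small helper can be handy. This is a rough sketch under assumptions: the function names are hypothetical, the limit is taken to be 250 MB (262144000 bytes, the figure used in the final calculation of this post), and the sizes are sums of file sizes, so they can differ slightly from what `du` reports in disk blocks.

```python
import os

# Assumed limit for the unzipped deployment package, layers included (250 MB).
LAMBDA_UNZIPPED_LIMIT = 250 * 1024 * 1024


def tree_size(path):
    """Sum of file sizes in bytes under path (symlinked files are skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def size_report(root):
    """Map each top-level entry under root to its size in bytes."""
    report = {}
    for entry in sorted(os.listdir(root)):
        full = os.path.join(root, entry)
        report[entry] = tree_size(full) if os.path.isdir(full) else os.path.getsize(full)
    return report


def bytes_over_limit(root, limit=LAMBDA_UNZIPPED_LIMIT):
    """Positive result: bytes still to remove; negative: headroom left."""
    return tree_size(root) - limit
```

For example, `size_report("python")` gives a per-package breakdown similar to the `du` output, and `bytes_over_limit("python")` says how much still has to go.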

Selective cleanup depending on the use-case

Now I was in unknown territory and I still had to get rid of a little more than 15 MB. This is where the insanity started: first randomly deleting stuff that seemed unnecessary, only to realize that the modules wouldn’t even load, and then, as a last chance, thinking logically about what I actually needed. I think all of us go through that phase of going crazy over something, and then, when we are about to lose hope, a calm mind takes over that works better because there’s nothing to lose. I am glad that I gave it my last shot in that state of mind.

Disclaimer: This is particular to the use-case that we needed. Use your discretion while deleting or modifying stuff henceforth mentioned.

  • XGBoost cleanup
$ ls python/xgboost/
build-python.sh  compat.py  dmlc-core  __init__.py  libpath.py  plotting.py  rabit.py    src          VERSION
callback.py      core.py    include    lib          make        rabit        sklearn.py  training.py

Of the entries above, dmlc-core, include, lib, make, rabit and src are directories. You’ll have to read and explore what you need and don’t need for your use-case. We only run XGBClassifier on AWS Lambda for predictions and don’t need the allreduce and broadcast interface for distributed XGBoost, so I decided to remove the rabit and dmlc-core directories. I also deleted the src and include directories, which weren’t required either.

$ rm -rf python/xgboost/rabit
$ rm -rf python/xgboost/dmlc-core/
$ rm -rf python/xgboost/src/
$ rm -rf python/xgboost/include/

This cleaned up around 3 MB.

  • sklearn cleanup

Next I repeated the process for sklearn. I chose sklearn since it is not a true dependency of xgboost, and only a part of its functionality is required by XGBClassifier.

$ rm -rf python/sklearn/gaussian_process/
$ rm -rf python/sklearn/model_selection/
$ rm -rf python/sklearn/tree/
$ rm -rf python/sklearn/impute/
$ rm -rf python/sklearn/feature_extraction/
$ rm -rf python/sklearn/cluster/
$ rm -rf python/sklearn/cross_decomposition/
$ rm -rf python/sklearn/neighbors/
$ rm -rf python/sklearn/experimental/
$ rm -rf python/sklearn/neural_network/
$ rm -rf python/sklearn/covariance/
$ rm -rf python/sklearn/datasets/
$ rm -rf python/sklearn/decomposition/
$ rm -rf python/sklearn/inspection/
$ rm -rf python/sklearn/metrics/
$ rm -rf python/sklearn/feature_selection/
$ rm -rf python/sklearn/svm/
$ rm -rf python/sklearn/manifold/
$ rm -rf python/sklearn/linear_model/
$ rm -rf python/sklearn/mixture/
$ rm python/sklearn/isotonic.py
$ rm python/sklearn/_isotonic.cpython-36m-x86_64-linux-gnu.so

This cleaned up around 11 MB.

Since sklearn/model_selection/ has been deleted, the KFold imports in xgboost/compat.py have to go as well. These are the changes I had to make to that file:

@@ -67,10 +67,6 @@
     from sklearn.base import BaseEstimator
     from sklearn.base import RegressorMixin, ClassifierMixin
     from sklearn.preprocessing import LabelEncoder
 
-    try:
-        from sklearn.model_selection import KFold, StratifiedKFold
-    except ImportError:
-        from sklearn.cross_validation import KFold, StratifiedKFold
 
     SKLEARN_INSTALLED = True
@@ -78,8 +74,8 @@
     XGBRegressorBase = RegressorMixin
     XGBClassifierBase = ClassifierMixin
 
-    XGBKFold = KFold
-    XGBStratifiedKFold = StratifiedKFold
+    XGBKFold = None
+    XGBStratifiedKFold = None
     XGBLabelEncoder = LabelEncoder
 except ImportError:
     SKLEARN_INSTALLED = False
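
Because deleting subpackages can easily leave a module that no longer loads, it helps to run a quick import check after every round of deletions. The helper below is a hedged sketch with hypothetical function names; it simply tries to import whatever module names you pass in and reports the failures.

```python
import importlib


def check_imports(module_names):
    """Try to import each module; return {name: None or an error string}."""
    results = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            results[name] = None
        except Exception as exc:  # pruning can surface errors other than ImportError
            results[name] = f"{type(exc).__name__}: {exc}"
    return results


def failed_imports(module_names):
    """Names from module_names that could not be imported."""
    return [name for name, err in check_imports(module_names).items() if err is not None]
```

With the pruned `python` directory on `sys.path` (for instance inside the same build container), something like `failed_imports(["xgboost", "sklearn.base", "sklearn.preprocessing"])` should come back empty; those two sklearn modules are the ones compat.py imports in the diff above.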

At this point, the layer is under the limit. However, there is only minimal space left for code, so I tried to free up a couple more MB for the code and its package requirements.

  • scipy cleanup
$ rm -rf python/scipy/ndimage/

Change in python/scipy/setup.py :
@@ -23,7 +23,6 @@
     config.add_subpackage('spatial')
     config.add_subpackage('special')
     config.add_subpackage('stats')
-    config.add_subpackage('ndimage')
     config.add_subpackage('_build_utils')
     config.add_subpackage('_lib')
     config.make_config_py()

Change in python/scipy/stats/stats.py :
@@ -177,7 +177,7 @@
 
 from scipy._lib.six import callable, string_types
 from scipy.spatial.distance import cdist
-from scipy.ndimage import measurements
+#from scipy.ndimage import measurements
 from scipy._lib._version import NumpyVersion
 from scipy._lib._util import _lazywhere, check_random_state, MapWrapper
 import scipy.special as special

  • numpy cleanup
$ rm -rf python/numpy/random/_examples/
$ rm -rf python/numpy/distutils/fcompiler/
$ rm -rf python/numpy/f2py/

After all these cleanups, 2274588 bytes (262144000 − 259869412), or about 2.17 MB, are available for your code and package deployment.

$ du -m -c -d 1 python/
7 python/sklearn
66 python/scipy
1 python/joblib
40 python/numpy
138 python/xgboost
249 python/
249 total

$ du -b -c -d 1 python/
6164864 python/sklearn
67757024 python/scipy
580588 python/joblib
41463817 python/numpy
143899023 python/xgboost
259869412 python/
259869412 total
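
The headroom figure quoted above can be re-derived from the `du -b` total. A tiny worked example, assuming the 250 MB limit is counted as 250 × 1024 × 1024 bytes (the variable names are just for illustration):

```python
# Assumed Lambda limit for the unzipped package: 250 MB expressed in bytes.
LAMBDA_UNZIPPED_LIMIT = 250 * 1024 * 1024  # 262144000

# "total" line of the du -b output above.
layer_bytes = 259869412

headroom_bytes = LAMBDA_UNZIPPED_LIMIT - layer_bytes
headroom_mb = round(headroom_bytes / (1024 * 1024), 2)

print(headroom_bytes, headroom_mb)  # 2274588 2.17
```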

Though I did have to mess around with the code and delete some stuff that was not required for our use case, I still think this was worth the effort to get it working on Lambda rather than having to deploy on some external machine, which just makes it cumbersome to manage.
