I am using svmlight files to store a sparse matrix.

In a test on a 31700108 x 54070 matrix with 570601944 entries,

import xgboost as xgb
dtrain = xgb.DMatrix(train_file)

took 21 seconds, far faster than

from sklearn.datasets import load_svmlight_file
x_train, y_train = load_svmlight_file(train_file)

which took 7 minutes.

Before I start hacking on the code, can anybody help me with this?

Do you have any suggestions for speeding up the load_svmlight_file function?

Thank you very much!

Vimos

1 Answer

XGBoost is written in C++, and its Python package wraps that code with ctypes. load_svmlight_file, by contrast, is implemented in Cython, which translates Python-like code to C. Ideally Cython would generate C as good as hand-written code, but in practice the generated code is sometimes slower than what a C programmer would write.

The scikit-learn developers themselves acknowledge that load_svmlight_file is not as efficient as it could be, and point to another library written in C++. From the documentation:

"This implementation is written in Cython and is reasonably fast. However, a faster API-compatible loader is also available at: https://github.com/mblondel/svmlight-loader"
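
If the same file is loaded repeatedly, another option is to pay the parsing cost only once. The scikit-learn documentation for load_svmlight_file suggests wrapping the loader with joblib.Memory.cache. Below is a minimal sketch of that idea; the temporary directory, the tiny stand-in dataset and the load_cached helper are assumptions made for illustration, not part of the original question.

```python
import os
import tempfile

import numpy as np
from joblib import Memory
from sklearn.datasets import dump_svmlight_file, load_svmlight_file

# Hypothetical locations for this sketch: one temp dir holds the sample
# file and the on-disk cache. In practice train_file is your real file.
work_dir = tempfile.mkdtemp()
train_file = os.path.join(work_dir, "train.svmlight")

# Write a tiny stand-in dataset so the example is self-contained.
X_small = np.array([[0.0, 1.5, 0.0], [2.0, 0.0, 3.0]])
y_small = np.array([0.0, 1.0])
dump_svmlight_file(X_small, y_small, train_file)

# joblib stores the function's return value on disk, keyed on its arguments.
memory = Memory(os.path.join(work_dir, "cache"), verbose=0)


@memory.cache
def load_cached(path):
    # The slow text parsing happens only on a cache miss.
    return load_svmlight_file(path)


x_train, y_train = load_cached(train_file)  # first call: parses the text file
x_again, y_again = load_cached(train_file)  # later calls: read from the cache
```

Note that the cache is keyed on the path string, not on the file contents, so it has to be cleared (memory.clear()) if the file changes under the same name. This does not make the first load any faster; it only helps on subsequent runs.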

Giannis Spiliopoulos