
I am using Python 2.7 with scikit-learn to store and read a very big file in svmlight format.

I am reading the file using

from sklearn.datasets import load_svmlight_file
rows, labels = load_svmlight_file(matrixPath, zero_based=True)

The file is too big to be stored in memory. I am looking for a way to iterate over the file in batches without the need to split the file in advance.

For now, the best way I have found is to split the svmlight file with the terminal command split and then read the partial files it creates, for example:
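(A rough sketch of that workaround; the file names and the n_features value are placeholders, and n_features has to be fixed so every chunk gets the same width:)

# Split into 100000-line chunks first (svmlight is one sample per line):
#   split -l 100000 matrix.svmlight part_
import glob
from sklearn.datasets import load_svmlight_file

for part in sorted(glob.glob('part_*')):
    X, y = load_svmlight_file(part, n_features=2**14, zero_based=True)
    # ... work on one chunk, then let it be garbage collected ...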

I also know that a good general way to read big files is to read them line by line in batches so the memory does not overflow, along these lines:
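(A minimal sketch of that generic pattern, not svmlight-aware:)

import itertools

def iter_line_batches(path, batch_size):
    # Yield lists of up to batch_size raw lines without loading the whole file
    with open(path) as f:
        while True:
            batch = list(itertools.islice(f, batch_size))
            if not batch:
                break
            yield batch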

How can I do this with svmlight-formatted files?

Thanks!

  • In order to do any meaningful operations on the data in batches, at least one pass over the whole file is needed to determine the total number of features in it. – Vivek Kumar Jul 17 '17 at 11:07
  • @VivekKumar No problem, I just can't keep the whole matrix in memory at one time; iterating over it is not a problem in any way. – thebeancounter Jul 17 '17 at 11:31

1 Answer


I came across the same problem; here is my solution:

Using the load_svmlight_file function from scikit-learn, you can specify the offset and length parameters. From the documentation:

offset : integer, optional, default 0

  • Ignore the offset first bytes by seeking forward, then discarding the following bytes up until the next new line character.

length : integer, optional, default -1

  • If strictly positive, stop reading any new line of data once the position in the file has reached the (offset + length) bytes threshold.

Here is an example of how to iterate over your svmlight file in batches:

from sklearn.datasets import load_svmlight_file

def load_svmlight_batched(filepath, n_features, batch_size):
    offset = 0
    with open(filepath, 'rb') as f:
        while True:
            # Parse only the lines inside the [offset, offset + batch_size)
            # byte window; offset seeks past any partial line first.
            X, y = load_svmlight_file(f, n_features=n_features,
                                      offset=offset, length=batch_size)
            if X.shape[0] == 0:  # past the end of the file
                break
            yield X, y
            offset += batch_size

def main(filepath):
    iterator = load_svmlight_batched(filepath,
                                     n_features=2**14,
                                     batch_size=10000)  # length is in bytes, not rows
    for X_batch, y_batch in iterator:
        # Do something with each batch
        pass
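One caveat, echoing the comment above: n_features must be known up front and kept identical for every batch, otherwise the batches come out with different widths. If you don't know it, a single cheap pass over the file is enough to find it (a sketch; it assumes plain label index:value lines with zero-based indices and no qid fields):

def find_n_features(filepath):
    # One pass over the file to find the largest feature index.
    max_index = -1
    with open(filepath, 'rb') as f:
        for line in f:
            tokens = line.split(b'#', 1)[0].split()  # drop trailing comment
            for pair in tokens[1:]:                  # skip the leading label
                index = int(pair.split(b':', 1)[0])
                if index > max_index:
                    max_index = index
    return max_index + 1  # zero_based=True indices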