
I am using Python 2.7 with scikit-learn to store and read a very big file in svmlight format.

I am reading the file using

from sklearn.datasets import load_svmlight_file
rows, labels = load_svmlight_file(matrixPath, zero_based=True)

The file is too big to be stored in memory. I am looking for a way to iterate over the file in batches without the need to split the file in advance.

For now, the best way I have found is to split the svmlight file with the terminal command split and then read the partial files it creates, for example:
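(A rough sketch of that workaround; the file names and the n_features value are placeholders, and n_features has to be fixed so every chunk gets the same width:)

# Split into 100000-line chunks first (svmlight is one sample per line):
#   split -l 100000 matrix.svmlight part_
import glob
from sklearn.datasets import load_svmlight_file

for part in sorted(glob.glob('part_*')):
    X, y = load_svmlight_file(part, n_features=2**14, zero_based=True)
    # ... work on one chunk, then let it be garbage collected ...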

I also know that a good general way to read big files is to read them line by line in batches so the memory does not overflow, along these lines:
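(A minimal sketch of that generic pattern, not svmlight-aware:)

import itertools

def iter_line_batches(path, batch_size):
    # Yield lists of up to batch_size raw lines without loading the whole file
    with open(path) as f:
        while True:
            batch = list(itertools.islice(f, batch_size))
            if not batch:
                break
            yield batch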

How can I do this with svmlight-formatted files?

Thanks!

  • In order to do any meaningful operations on the data in batches, at least one pass over the whole file is needed to determine the total number of features in it. – Vivek Kumar Jul 17 '17 at 11:07
  • @VivekKumar No problem, I just can't keep the whole matrix in memory at one time; iterating over it is not a problem in any way. – thebeancounter Jul 17 '17 at 11:31

1 Answer


I came across the same problem; here is my solution:

Using the load_svmlight_file function from scikit-learn, you can specify the offset and length parameters. From the documentation:

offset : integer, optional, default 0

  • Ignore the offset first bytes by seeking forward, then discarding the following bytes up until the next new line character.

length : integer, optional, default -1

  • If strictly positive, stop reading any new line of data once the position in the file has reached the (offset + length) bytes threshold.

Here is an example of how to iterate over your svmlight file in batches:

from sklearn.datasets import load_svmlight_file

def load_svmlight_batched(filepath, n_features, batch_size):
    offset = 0
    with open(filepath, 'rb') as f:
        while True:
            # Parse only the lines inside the [offset, offset + batch_size)
            # byte window; offset seeks past any partial line first.
            X, y = load_svmlight_file(f, n_features=n_features,
                                      offset=offset, length=batch_size)
            if X.shape[0] == 0:  # past the end of the file
                break
            yield X, y
            offset += batch_size

def main(filepath):
    iterator = load_svmlight_batched(filepath,
                                     n_features=2**14,
                                     batch_size=10000)  # length is in bytes, not rows
    for X_batch, y_batch in iterator:
        # Do something with each batch
        pass
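One caveat, echoing the comment above: n_features must be known up front and kept identical for every batch, otherwise the batches come out with different widths. If you don't know it, a single cheap pass over the file is enough to find it (a sketch; it assumes plain label index:value lines with zero-based indices and no qid fields):

def find_n_features(filepath):
    # One pass over the file to find the largest feature index.
    max_index = -1
    with open(filepath, 'rb') as f:
        for line in f:
            tokens = line.split(b'#', 1)[0].split()  # drop trailing comment
            for pair in tokens[1:]:                  # skip the leading label
                index = int(pair.split(b':', 1)[0])
                if index > max_index:
                    max_index = index
    return max_index + 1  # zero_based=True indices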