So the LogisticRegression classifier from Python's sklearn library has a .fit() method, which takes x_train (features) and y_train (labels) as arguments to train the classifier.

It seems that x_train.shape should be (number_of_samples, number_of_features).
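
For reference, here is a minimal sketch of that API with made-up toy data (the shapes are the only point here):

import numpy as np
from sklearn.linear_model import LogisticRegression

# toy data: 4 samples, 3 features each, with binary labels
x_train = np.array([[0.1, 0.2, 0.3],
                    [0.4, 0.5, 0.6],
                    [0.7, 0.8, 0.9],
                    [1.0, 1.1, 1.2]])
y_train = np.array([0, 0, 1, 1])

clf = LogisticRegression()
clf.fit(x_train, y_train)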

For x_train I should use the extracted xvector.scp file, which I am reading like so:

import kaldiio

b = kaldiio.load_scp('xvector.scp')

And I can print the content like so:

for file_id in b:
  xvector = b[file_id]  # a numpy array of shape (512,)
  print(xvector)

Right now the b variable behaves like a dictionary: you can look up the x-vector for a given id. I want to use sklearn's LogisticRegression to classify the x-vectors, and in order to use the .fit() method I have to pass an array as the argument.

My question is: how can I make an array that contains only the x-vector values?

PS: there are around 1 million file_ids and each x-vector has length 512, which is too big to hold in an in-memory array.

  • Does this answer your question? [Is it possible to train a sklearn model (eg SVM) incrementally?](https://stackoverflow.com/questions/54722861/is-it-possible-to-train-a-sklearn-model-eg-svm-incrementally) – desertnaut Feb 11 '21 at 11:29
  • It could be useful, but I am still not sure how to convert b into the format needed by the .fit() method. Right now it is like a file reader, not an array that can be passed as the 'x' argument to .fit(). – Petar Yakov Feb 11 '21 at 13:40
  • Please do not conflate questions; as is, you do not seem to ask anything specific to the file format used. If this is your actual question, please edit & update your post to clarify explicitly. – desertnaut Feb 11 '21 at 13:42
  • In any case, this would be a question of converting between file formats, and not anything to do with LR itself. – desertnaut Feb 11 '21 at 13:43

1 Answer


It seems you are trying to store the dictionary in a numpy array. If the dictionary is small, you can collect the values directly:

import numpy as np

x = np.array(list(b.values()))  # or np.array([b[k] for k in b]) if the loader lacks .values()
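
Once x is built, fitting is straightforward. A quick sketch, where y is a hypothetical label array that you would need to build in the same file_id order as x:

from sklearn.linear_model import LogisticRegression

# y: one label per row of x, aligned with the file_id order used to build x
clf = LogisticRegression(max_iter=1000)  # raise max_iter since the features are high-dimensional
clf.fit(x, y)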

However, this will run into OOM issues if the dictionary is large. In this case, you would need to use np.memmap as explained here: https://ipython-books.github.io/48-processing-large-numpy-arrays-with-memory-mapping/

Essentially, you write rows into the memory-mapped array one at a time and flush them to disk as you go. The array is backed by a file on disk, so it avoids OOM issues.
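
A minimal sketch of that approach, using the dimensions from your question (about 1 million keys, 512 floats each); xvectors.dat is a hypothetical output filename, and I am assuming the loader supports len():

import numpy as np
import kaldiio

b = kaldiio.load_scp('xvector.scp')

n_samples = len(b)  # assumption: the loader supports len(); otherwise count the keys first
dim = 512           # x-vector length from the question

# disk-backed array; mode 'w+' creates (or overwrites) the file
x = np.memmap('xvectors.dat', dtype='float32', mode='w+', shape=(n_samples, dim))

file_ids = []  # remember the order so labels can be aligned later
for i, file_id in enumerate(b):
    x[i] = b[file_id]
    file_ids.append(file_id)

x.flush()  # write pending changes out to disk

A memmap can be passed to .fit() like a regular array, but with ~1 million 512-dimensional rows the incremental route linked in the comments (e.g. SGDClassifier, which fits a logistic model with loss='log_loss' and supports partial_fit) may be more practical.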

Desh Raj