
I have 1000 texts, each with 200-1000 words. The size of the text CSV file is about 10 MB, but when I vectorize the texts with this code, the size of the output CSV is exceptionally big (2.5 GB). I am not sure what I did wrong. Your help is highly appreciated. Code:

import numpy as np
import pandas as pd
from copy import deepcopy
import glob
from sklearn.feature_extraction.text import TfidfVectorizer
from numpy import savetxt
df = pd.read_csv('data.csv')
#data has two columns: teks and groups
filtered_df = deepcopy(df)
vectorizer = TfidfVectorizer()
vectorizer.fit(filtered_df["teks"])
vector = vectorizer.transform(filtered_df["teks"])
print(vector.shape)     # shape (1000, 83000)
savetxt('dataVectorized1.csv', vector.toarray(), delimiter=',')
1 Answer


Sparse matrices (like your vector here) are not supposed to be converted to dense ones (as you do with .toarray()) and saved as CSV files; doing so makes no sense and defeats the very purpose of sparse matrices. Given that, the big file size is not a surprise.
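To see where the size comes from, here is a rough back-of-the-envelope estimate (assuming np.savetxt's default fmt='%.18e', i.e. roughly 25 text characters per value including the delimiter):

# 1000 rows x 83,000 columns, ~25 characters per value written as text
rows, cols, chars_per_value = 1000, 83000, 25
print(rows * cols * chars_per_value / 1e9)   # ~2.1 GB, the same ballpark as the observed 2.5 GB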

You should seriously consider saving your sparse vector to an appropriate format, e.g. using scipy.sparse:

import scipy.sparse
scipy.sparse.save_npz('dataVectorized1.npz', vector)
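The matrix can later be loaded back without ever densifying it; a minimal sketch using scipy.sparse.load_npz:

import scipy.sparse
vector = scipy.sparse.load_npz('dataVectorized1.npz')
print(vector.shape)   # (1000, 83000), still a sparse matrix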

See also Save / load scipy sparse csr_matrix in portable data format for possible other options.

If, for any reason, you must stick to a CSV file for storage, you could try compressing the output file by simply using the .gz extension in the file name; from the np.savetxt() documentation:

If the filename ends in .gz, the file is automatically saved in compressed gzip format. loadtxt understands gzipped files transparently.

So, this should do the job:

np.savetxt('dataVectorized1.csv.gz', vector.toarray(), delimiter=',')
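For completeness, the compressed file can be read back with np.loadtxt just as transparently (a sketch; note that this materializes the full dense array in memory):

arr = np.loadtxt('dataVectorized1.csv.gz', delimiter=',')
print(arr.shape)   # (1000, 83000), dense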

However, I would not really recommend this; keep in mind that:

  1. Beyond their convenience for tutorials and introductory demonstrations, CSV files do not really hold any "special" status as inputs to ML tasks, as you might seem to believe.
  2. There is absolutely no reason why the much more efficient .npz file cannot be used as input for further downstream tasks, like classification, visualization, and clustering; on the contrary, its use is very much justified and recommended in similar situations (see the sketch below).
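A minimal sketch of such downstream use (assuming scikit-learn; the labels here come from the groups column mentioned in the question, and the specific classifier is just an illustrative choice):

import scipy.sparse
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X = scipy.sparse.load_npz('dataVectorized1.npz')   # sparse TF-IDF features
y = pd.read_csv('data.csv')['groups']              # labels from the original data

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)   # sparse input works directly
print(clf.score(X_test, y_test))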
  • after your two lines (I'm not exactly sure): way one: `arr = dataVectorized.npz`, `arr.tofile('dataVectorized.csv', sep=',')`; way two: `arr = np.read(dataVectorized.npz)`, then convert the array into a dataframe with `DF = pd.DataFrame(arr)` and save the dataframe as a csv file with `DF.to_csv("dataVetorized.csv")` – tursunWali Apr 04 '21 at 21:30
  • @tursunWali not sure what you are trying to say here, or why you seem to insist on saving as CSV (*not* a good idea, or even necessary); better to leave the code out (it never looks good in comments) and explain? – desertnaut Apr 04 '21 at 21:53
  • Yes, I want to save to a CSV file. I want to give the CSV as input to other processes, like classification, clustering, and visualization, for example. – tursunWali Apr 05 '21 at 04:31
  • I tested your solution. It works; the gz file is about 17 MB, but when I extract the compressed file, its true size is revealed to be 2.78 GB, similar to what I got with the code at the top. I think this is not a proper solution. The normal size should be 14 MB; I tried TF-IDF with another Python module. However, I still want to reduce the size of our output file "dataVectorized1". This solution is, in my opinion, more transparent. – tursunWali Apr 06 '21 at 02:40
  • @tursunWali Extracting the compressed file was *not* part of the suggested solution, nor was it implied that extracting the zipped file would lead to a smaller size. It is also puzzling why you think that "*normal size should be 14 MB*" (it will never be, and there is a reason why TF-IDF output is sparse by default), why you cannot do your job with an `npz` or a `gz` file, or what you mean by "*transparent*" (to the human eye?). Anyway, good luck... – desertnaut Apr 06 '21 at 08:37