
I am working on calculating tf-idf for a large document collection. I have more than 80,000 words. I am trying to write a sparse matrix to a CSV file, using code similar to the answer here: How to add a new column to a CSV file using Python?

The output file is too big: it exceeds 700 MB for only about 30,000 words. So my question is: how can I write it efficiently? Thank you.

hshed
    If you're writing a sparse matrix to CSV, there's really not much you can do about the file size. Would compression solve your needs? You'd get an amazing compression ratio with a file that's mostly commas. – David Cain Mar 17 '13 at 18:29
    Additionally, are you just trying to save the information to disk, or are you set on using the .csv format? If the former is true, you have many more options. – David Cain Mar 17 '13 at 18:29
    @David think you've covered all the points I was going to make - This question definitely needs to be more clearly defined – Jon Clements Mar 17 '13 at 18:31
  • Have you evaluated existing software that computes and stores tf-idf for large documents? Sphinx for example is open source, written in C++ and very space + memory + speed efficient. There is an API for Python. http://sphinxsearch.com/ – Tobia Mar 17 '13 at 18:36
  • @David I have to use this matrix for further stuff. Any suggestions on how should I proceed? I guess saving it in csv is not the optimum solution. – hshed Mar 17 '13 at 18:46

2 Answers


You can easily write a gzip-compressed file directly by using the gzip module:

import gzip
import csv

# Open in text mode ("wt") with newline="" so csv.writer receives str, not bytes
f = gzip.open("myfile.csv.gz", "wt", newline="")
csv_w = csv.writer(f)
for row in to_write:
    csv_w.writerow(row)
f.close()

Don't forget to close the file, otherwise the resulting csv.gz file might be unreadable.

You can also do it in a more Pythonic style:

with gzip.open("myfile.csv.gz", "wt", newline="") as f:
    csv_w = csv.writer(f)
    ...

which guarantees that the file will be closed.
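To check that nothing is lost, gzip and csv compose the same way on the read side. A minimal round-trip sketch (the filename and the tiny coordinate-style rows are made up for illustration):

```python
import gzip
import csv

# Write a few sparse-style rows (row index, column index, value), then read them back.
rows = [[0, 0, 0.5], [1, 3, 0.25]]
with gzip.open("myfile.csv.gz", "wt", newline="") as f:
    csv.writer(f).writerows(rows)

# "rt" opens the compressed file in text mode for csv.reader
with gzip.open("myfile.csv.gz", "rt", newline="") as f:
    restored = [[float(x) for x in row] for row in csv.reader(f)]

print(restored)  # [[0.0, 0.0, 0.5], [1.0, 3.0, 0.25]]
```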

Hope this helps.

Dvx

CSV is CSV, and there is not much you can do about its size. You can simply gzip it if you really want to stick with CSV, or you can use a custom format that better fits your needs.

For example, you can use a dictionary and export it to JSON, or create a dedicated object that holds your data and pickle it.
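A sketch of the pickle route (the sparse TF dictionary and filename here are hypothetical, just to show the shape of the idea):

```python
import pickle

# Hypothetical sparse TF data: only nonzero entries, keyed by (doc_id, term_id)
tf = {(0, 17): 3, (0, 4021): 1, (2, 17): 5}

# Pickle files must be opened in binary mode
with open("tf.pkl", "wb") as f:
    pickle.dump(tf, f)

with open("tf.pkl", "rb") as f:
    restored = pickle.load(f)

print(restored == tf)  # True
```

Since only nonzero entries are stored, the file stays proportional to the number of actual values rather than to the full matrix.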

When I worked with TF-IDF, I used sqlite (via sqlalchemy) to store document information, with the TF data as a dictionary in JSON format. From that I computed the IDF statistics, and later did the rest of the TF-IDF calculation using numpy.
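A minimal sketch of that pipeline's TF/IDF step, with TF dictionaries stored as JSON strings the way a database column would hold them (the toy corpus and the plain log(N/df) formula are illustrative, not the original code):

```python
import json
import math

# Per-document term frequencies, serialized as JSON strings (as in a DB column)
docs = [json.dumps({"cat": 2, "dog": 1}),
        json.dumps({"cat": 1, "fish": 3})]

tfs = [json.loads(d) for d in docs]
n_docs = len(tfs)

# Document frequency: in how many documents each term appears
df = {}
for tf in tfs:
    for term in tf:
        df[term] = df.get(term, 0) + 1

# IDF: log(N / df) for each term seen in the corpus
idf = {term: math.log(n_docs / count) for term, count in df.items()}

# TF-IDF weights for document 0
tfidf0 = {term: freq * idf[term] for term, freq in tfs[0].items()}
```

Terms that appear in every document (like "cat" here) get an IDF of zero under this formula, which is why smoothed variants are often used in practice.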

Jakub M.
  • Thanks for letting me know about the pickle module. I am not using CSV now, and a pickle file seems to work great for me! – hshed Mar 18 '13 at 14:21