
I have a large text file (400 MB) containing data in a format like so:

805625228 linked to 670103907:0.981545
805829325 linked to 901909901:0.981545
803485795 linked to 1030404117:0.981545
805865780 linked to 811300706:0.981545

ID linked to ID:Probability_of_link

...

Each text file contains millions of such entries, and I have several such files. As part of analyzing the data, I parse it multiple times (each of the text files is in a different format). When parsing and working with the data in Python, I notice my memory usage shooting up to 3 GB at times.
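
For context, my parsing is essentially a line-by-line loop along the lines of the sketch below (the file name and regex are only illustrative of the sample format above, not my actual code):

    import re

    # Matches lines of the form "805625228 linked to 670103907:0.981545"
    LINK_RE = re.compile(r"^(\d+) linked to (\d+):([\d.]+)$")

    def iter_links(path):
        """Yield (source_id, target_id, probability) one line at a time,
        so only the current line needs to be held in memory."""
        with open(path) as f:
            for line in f:
                m = LINK_RE.match(line.strip())
                if m:
                    yield int(m.group(1)), int(m.group(2)), float(m.group(3))

    # Example usage: count entries without materialising the whole file.
    # total = sum(1 for _ in iter_links("links.txt"))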

What would be a better approach than dumping this data to text files? Could I store it in a JSON or SQL database, and how much of a performance boost would that give me? What kind of database would be best suited to this data?
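
For example, would something along these lines be a reasonable direction? (A minimal SQLite sketch reusing the iter_links generator from above; the database file, table, and column names are assumptions, not an existing schema.)

    import sqlite3

    conn = sqlite3.connect("links.db")
    conn.execute("CREATE TABLE IF NOT EXISTS links (src INTEGER, dst INTEGER, prob REAL)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_links_src ON links(src)")

    def load_file(path):
        """Bulk-insert parsed rows; executemany consumes the generator row by
        row, so memory stays bounded, and the `with conn:` block commits once."""
        with conn:
            conn.executemany(
                "INSERT INTO links (src, dst, prob) VALUES (?, ?, ?)",
                iter_links(path),
            )

    # Lookups could then be pushed to the database instead of Python, e.g.:
    # conn.execute("SELECT dst, prob FROM links WHERE src = ?", (805625228,)).fetchall()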

FYI, all the data shown above was produced from structured .csv files containing millions of rows.

  • You might be loading all the data into memory at once, which is why usage shoots up. What you should do is read limited amounts of data from the csv at a time. The rest depends on your code. – Nikhil Parmar Jan 29 '16 at 05:19
  • I'm reading the text file line by line, so it looks like the previous line is garbage collected immediately [http://stackoverflow.com/questions/6475328/read-large-text-files-in-python-line-by-line-without-loading-it-in-to-memory] – lostsoul29 Jan 29 '16 at 17:11

0 Answers