
I'm trying to import a large tab-separated text file (size = 3 GB) into Python with pandas, using `pd.read_csv("file.txt", sep="\t")`. The file was originally a ".tab" file whose extension I changed to ".txt" so I could import it with `read_csv()`. It has 305 columns and roughly 1,000,000 rows.

When I execute the code, Python returns a `MemoryError` after some time. From what I found, this basically means there is not enough RAM available. When I specify `nrows=20` in `read_csv()` it works fine.

The computer I'm using has 46 GB of RAM, of which roughly 20 GB is available to Python.

My question: how is it possible that a 3 GB file needs more than 20 GB of RAM to be imported into Python using pandas `read_csv()`? Am I doing anything wrong?

EDIT: When executing `df.dtypes`, the types are a mix of `object`, `float64`, and `int64`.

UPDATE: I used the following code to overcome the problem and perform my calculations:

import pandas as pd

summed_cols = pd.DataFrame(columns=["sample", "read sum"])
x = 0  # column index; the loop starts at column 1, skipping column 0
while x < 352:
    x = x + 1
    # read a single column at a time to keep memory usage low
    sample_col = pd.read_csv("file.txt", sep="\t", usecols=[x])
    summed_cols = summed_cols.append(pd.DataFrame(
        {"sample": [sample_col.columns[0]],
         "read sum": sum(sample_col[sample_col.columns[0]])}))
    del sample_col

It now loads one column at a time, computes its sum, stores the result in a dataframe, deletes the column from memory, and moves on to the next one.
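For reference, here is a minimal sketch of the same idea that reads just the header first, so the column count does not need to be hard-coded (it assumes, like the loop above, that the first column holds labels and is skipped):

import pandas as pd

cols = pd.read_csv("file.txt", sep="\t", nrows=0).columns  # read the header only
results = []
for name in cols[1:]:  # skip the first (label) column, as in the loop above
    col = pd.read_csv("file.txt", sep="\t", usecols=[name])
    results.append({"sample": name, "read sum": col[name].sum()})
    del col
summed_cols = pd.DataFrame(results, columns=["sample", "read sum"])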

Robvh
    did you check this? https://stackoverflow.com/questions/38487334/pandas-python-memory-spike-while-reading-3-2-gb-file – PV8 Jun 19 '19 at 06:47
  • Thank you for the link! This did not pop up when searching for information! – Robvh Jun 19 '19 at 06:56
  • One possible solution is to read the file in chunks. See [this other question](https://stackoverflow.com/a/52964494/8107620). – AlCorreia Jun 19 '19 at 13:26
  • @AlCorreia I also considered that solution but as I have to calculate the sum of an entire column, loading chunks of rows did not seem a good solution. Iterating over the columns (see my update) works perfectly. – Robvh Jun 19 '19 at 14:06

1 Answer


Pandas is cutting up the file, and storing the data individually. I don't know the data types, so I'll assume the worst: strings.

In Python (on my machine), an empty string needs 49 bytes, with an additional byte for each character if ASCII (or 74 bytes, with an extra 2 bytes per character, if Unicode). That's roughly 15 KB for a row of 305 empty fields. A million and a half such rows would take roughly 22 GB in memory, while they would take about 437 MB in a CSV file.
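To see where those numbers come from, a quick check (the exact sizes assume a 64-bit CPython; they differ on other builds):

import sys

print(sys.getsizeof(""))     # 49 bytes for an empty str on 64-bit CPython
print(sys.getsizeof("abc"))  # 52 bytes: 49 + 1 per ASCII character

per_row = 305 * sys.getsizeof("")   # ~15 KB for 305 empty-string objects
print(per_row * 1_500_000)          # ~22 GB for 1.5 million such rows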

Pandas/numpy are good with numbers, as they can represent a numerical series very compactly (like a C program would). As soon as you step away from C-compatible datatypes, it uses memory the way Python does, which is... not very frugal.
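To make the contrast concrete, a small comparison (the column names are made up for illustration; the printed sizes are approximate):

import numpy as np
import pandas as pd

n = 1_000_000
df = pd.DataFrame({
    "as_int64": np.arange(n, dtype=np.int64),   # packed C-style buffer, 8 bytes per value
    "as_object": np.arange(n).astype(str),      # becomes object dtype: one Python str per value
})

# deep=True counts the Python objects behind the object column
print(df.memory_usage(deep=True))
# as_int64  ->  ~8 MB
# as_object -> ~60 MB or more, depending on string lengths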

Amadan
  • Thank you for your reply! When I execute ```type(df[df.columns[1]][5])``` (the 1 and 5 are arbitrary numbers), Python returns a numpy numeric type. These are not strings, so does the memory spike still make sense? – Robvh Jun 19 '19 at 07:03
  • That just proves one column. Check `df.dtypes`. If any of them is `object`, that column is stored as Python objects. The other possibility is... you really do need more memory than you have. How many rows in the file? – Amadan Jun 19 '19 at 07:08
  • I edited my original post with the df.dtypes. What is the consequence when it is stored as ```object```? – Robvh Jun 19 '19 at 07:11
  • `object` dtype means it is stored as a Python object. `sys.getsizeof(1)` will tell you that (at least on my computer) Python represents an integer with 28 bytes of memory. `sys.getsizeof("")` will confirm what I said before about Python strings. However, both `np.float64` and `np.int64` use only 8 bytes (=64 bits) each. – Amadan Jun 19 '19 at 07:14
  • for me Python returns the following: ```sys.getsizeof("object")``` = 31 - ```sys.getsizeof("float64")``` = 32 - ```sys.getsizeof("int64")``` = 30. These numbers are way higher than your values? How is this possible? – Robvh Jun 19 '19 at 07:18
  • Yeah, you likely have a 32-bit version of Python, so the amounts are not the same as on mine, but *smaller*. You just confirmed that the *string* with the contents `"object"` takes up 25 bytes plus one byte per character (six characters here). The string `"float64"` has one character more, so it takes up one byte more. You can't measure the memory spending of `np.int64` directly, since Python will wrap it, but you can see the memory growth when you expand a numpy array: `sys.getsizeof(np.array([1, 2, 3])) - sys.getsizeof(np.array([1, 2]))` should give you 8 bytes of difference. – Amadan Jun 19 '19 at 07:23
  • Note that all values in a column need to be of the same numpy type, or numpy will make an `object` column - even if you just have *one single measly* value somewhere that doesn't fit. – Amadan Jun 19 '19 at 07:27
  • Ahh that makes sense. Thanks for the information. Do you have any suggestion what I could do? I am thinking of splitting up the file into columns, as I need to make calculations per column (e.g. ```df["col 1"].sum()```) and not between columns. – Robvh Jun 19 '19 at 07:28
  • You can use `usecols` parameter of `read_csv` to limit what you're reading, if you don't need all the columns. You can also try to see why a column is getting the `object` type, and try to fix it (e.g. by seeing if there is a non-numeric value in the column, and adding such a value to `na_values` parameter). – Amadan Jun 19 '19 at 07:31
  • Thank you for your help! Very much appreciated. – Robvh Jun 19 '19 at 07:43
  • for your interest, the number of rows is +/- 1 000 000 – Robvh Jun 19 '19 at 11:33
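
To tie the numbers in these comments together, a small check (again assuming a 64-bit CPython; on a 32-bit build the object overheads are smaller, as noted above):

import sys
import numpy as np

print(sys.getsizeof(1))          # 28 on 64-bit CPython: a boxed Python int
print(sys.getsizeof("object"))   # 55 on 64-bit CPython (49 + 6); 31 on a 32-bit build

# Each extra np.int64 element costs 8 bytes, visible as array growth
# (dtype pinned here so the result does not depend on the platform default).
a2 = np.array([1, 2], dtype=np.int64)
a3 = np.array([1, 2, 3], dtype=np.int64)
print(sys.getsizeof(a3) - sys.getsizeof(a2))   # 8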