16

I wanted to bring this up just because it's crazy weird; maybe Wes has some idea. The file is pretty regular: 1100 rows x ~3M columns of tab-separated data consisting solely of the integers 0, 1, and 2. Clearly this kind of memory usage is not expected.

If I prepopulate a dataframe as below, it consumes ~26GB of RAM.

import pandas as pd

# read just the header line to get the ~3M column names
h = open("ms.txt")
header = h.readline().split("\t")
h.close()

rows = 1100
df = pd.DataFrame(columns=header, index=range(rows), dtype=int)
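
A rough back-of-the-envelope check (the same arithmetic ely does in the comments below) shows why ~26GB is about what the prepopulated frame has to cost: whether each cell ends up as a float64 value or as an object pointer, it occupies 8 bytes.

import numpy as np

rows, cols = 1100, 3000000
print(np.dtype(np.float64).itemsize)  # 8 bytes per cell (an object pointer is also 8 bytes on a 64-bit build)
print(rows * cols * 8 / 1e9)          # ~26.4 GB for the whole frame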

System info:

  • python 2.7.9
  • ipython 2.3.1
  • numpy 1.9.1
  • pandas 0.15.2

Any ideas welcome.

Chris F.
  • What Python version is this? – Simeon Visser Jan 29 '15 at 16:42
  • does it act differently if you transpose the data? 10^3 rows and 10^6 seems...backwards. – Paul H Jan 29 '15 at 16:44
  • what does this mean: "data is all 0/1/2"? – Paul H Jan 29 '15 at 16:45
  • @PaulH: Probably means his data in the rows are all just 0s, 1s, and 2s. – WGS Jan 29 '15 at 16:46
  • @PaulH: sorry, exactly that. it's genotype data, literally the characters 0, 1, and 2. – Chris F. Jan 29 '15 at 16:46
  • I'm curious about what the contents of `ms.txt` are. You call `readline()` on it, which means it's a multi-line text file, but then you `split` it. Can you post maybe the first 10 rows just to be sure? – WGS Jan 29 '15 at 16:48
  • @SimeonVisser: python 2.7.9, ipython 2.3.1, numpy 1.9.1, pandas 0.15.2. – Chris F. Jan 29 '15 at 16:48
  • @TheLaughingMan: that's just to get the header information to specify the number of columns in the data. The first three data rows look like this (in numpy): array([[ 0., 0., 0., ..., 0., 0., 0.], [ 1., 1., 0., ..., 1., 0., 0.], [ 1., 0., 0., ..., 1., 0., 1.], – Chris F. Jan 29 '15 at 16:50
  • I see, so it's `readline` and not `readlines`. My bad. The first one of course only reads one line. By any chance, what's the file size of `ms.txt`? – WGS Jan 29 '15 at 16:51
  • Currently processing like this: we'll see what happens. https://gist.github.com/cfriedline/9b462b1f4696b2e6dcc3 – Chris F. Jan 29 '15 at 16:52
  • @TheLaughingMan 6.5GB ish. – Chris F. Jan 29 '15 at 16:52
  • This may be super naive, but why isn't that the right memory? For example if I do `np.zeros((1100, 3000000)).nbytes / 1e9` I get `26.4`. The dtype is `float64`. – ely Jan 29 '15 at 16:53
  • try telling `read_csv` that everything will be an integer. – Paul H Jan 29 '15 at 16:53
  • @prpl.mnky.dshwshr: 26.4GB i can deal with, per the title of this post 170GB is crazy and weird. – Chris F. Jan 29 '15 at 16:55
  • I'm assuming you've tried doing it like this: `with open("ms.txt") as f: header = [x.split("\t") for x in f.readline()]` ? – WGS Jan 29 '15 at 16:55
  • @TheLaughingMan: check out the gist link ;-) – Chris F. Jan 29 '15 at 16:55
  • This is insane. I've seen a `pandas` benchmark test before with 50GB of data used and I don't remember it using 170GB of RAM. – WGS Jan 29 '15 at 16:57
  • @TheLaughingMan Agreed. I've got a box with huge ram that I can run this on, but it didn't seem to want to stop. I killed it manually at 170GB. Who knows how big it would have gotten? – Chris F. Jan 29 '15 at 16:58
  • Well if I try `np.zeros((1100, 3000000), dtype=object)` it just hangs, but I'm guessing I'm going to see memory consumption much much higher. Perhaps `read_csv` is doing this, and making some copies of a few things, as it attempts to discern data types while reading? – ely Jan 29 '15 at 17:06
  • @prpl.mnky.dshwshr perhaps. once i get the work done I actually have to do, i'll experiment with the data type. – Chris F. Jan 29 '15 at 17:11
  • Digging through the stuff under `read_csv` it looks like in the generic case it bottoms out with `pandas.io.parsers.PythonParser.read`, which does appear to make copies during data conversion and in `_convert_data`, which calls `_convert_to_ndarrays`, which calls `_convert_types`, which then has further calls to functions like `maybe_convert_numeric`, etc. Anywhere along this trail of code you could be getting blowup from the `object` type and from inefficient copying. – ely Jan 29 '15 at 17:27
  • To come at it from the other side: by manually creating a DataFrame 1100x3M of dtype int8, the total memory usage after construction should be ~3.1G as expected. In the past, there have been corners of pandas which don't handle the many-columns-few-rows limits very well, so that could also be playing a role. – DSM Jan 29 '15 at 17:34 (a quick check of this arithmetic follows the comments)
  • Thanks everyone, adding a numpy int16 array is just under 7GB which works fine for my purposes. Glad I'm not totally crazy. Will def try out dtype in read_csv in a bit. – Chris F. Jan 29 '15 at 17:36
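
As a quick cross-check of DSM's and Chris F.'s numbers above, here is a minimal arithmetic sketch, assuming the values really are only 0, 1, and 2, so one or two bytes per cell is enough:

import numpy as np

rows, cols = 1100, 3000000

# 1 byte per cell with int8: ~3.3 GB (about the ~3.1G DSM quotes, measured in GiB)
print(rows * cols * np.dtype(np.int8).itemsize / 1e9)

# 2 bytes per cell with int16: ~6.6 GB, in line with Chris F.'s "just under 7GB"
print(rows * cols * np.dtype(np.int16).itemsize / 1e9)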

2 Answers

8

The problem with your example.

Trying your code on a small scale, I notice that even if you set dtype=int, you actually end up with dtype=object in the resulting dataframe.

import pandas as pd

header = ['a', 'b', 'c']
rows = 11
df = pd.DataFrame(columns=header, index=range(rows), dtype=int)

df.dtypes
a    object
b    object
c    object
dtype: object

This is because even though you tell the pd.DataFrame constructor that the columns should be dtype=int, that request cannot override the dtype that is ultimately determined by the data in the column.

This is because pandas is tightly coupled to numpy and numpy dtypes.

The problem is that there is no data in the dataframe you create, so every cell is filled with np.NaN, and NaN does not fit in an integer.

This means numpy cannot keep the requested integer dtype and falls back to object.
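
A tiny illustration of that constraint (plain numpy and Python, nothing pandas-specific):

import numpy as np

print(type(np.nan))              # NaN is just a float
print(np.array([np.nan]).dtype)  # float64: a float array can hold it

try:
    int(np.nan)
except ValueError as e:
    print(e)                     # "cannot convert float NaN to integer"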

The problem with the object dtype.

Having the dtype set to object means a big overhead in memory consumption and allocation time compared to having the dtype set to integer or float.
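
To make that overhead concrete, here is a small numpy-level sketch (pandas columns are backed by numpy arrays): an object column stores 8-byte pointers to separately allocated Python objects, while a numeric column stores the raw values directly.

import sys
import numpy as np

n = 1000000

flt = np.zeros(n, dtype=np.float64)
obj = np.zeros(n, dtype=object)

print(flt.nbytes)             # 8000000 bytes: the values themselves, nothing more
print(obj.nbytes)             # also 8000000 bytes, but these are only the pointers;
print(sys.getsizeof(obj[0]))  # each boxed Python object carries its own header on top (~24 bytes here)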

A workaround for your example.

df = pd.DataFrame(columns=header, index=range(rows), dtype=float)

This works just fine, since np.NaN can live in a float. This produces

a    float64
b    float64
c    float64
dtype: object

And it should take less memory, since a float64 column stores raw 8-byte values rather than references to Python objects.

More on how to work with dtypes

See this related post for details on dtype: Pandas read_csv low_memory and dtype options
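
Picking up Paul H's suggestion from the comments, a related sketch (untested against the real 6.5 GB file, and assuming every column really contains only 0, 1, and 2) is to hand read_csv an explicit compact dtype up front, so the columns never go through the object stage:

import numpy as np
import pandas as pd

# int8 comfortably holds 0, 1 and 2 at one byte per cell
df = pd.read_csv("ms.txt", sep="\t", dtype=np.int8)
print(df.dtypes.head())
print(df.shape)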

firelynx
0

I faced a similar problem with 3 GB of data today, and I just made a small change in my coding style: instead of the file.read() and file.readline() methods I used the code below, which loads only one line at a time into RAM.

import re

df_list = []

with open("ms.txt", 'r') as f:
    for line in f:
        line = line.strip()
        # adjust the delimiter and maxsplit here according to your own split criteria
        columns = re.split("\t", line, maxsplit=4)
        df_list.append(columns)

Here is the code to convert your data into a pandas dataframe.

import pandas as pd

df = pd.DataFrame(df_list)  # modify this according to your dataframe needs
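
Building on that, a possible refinement (my own sketch, not part of the original answer, and assuming every field really is one of 0, 1, 2 and the header has one name per data column) is to convert each line to a compact integer dtype while reading, so the final frame stores 1-byte integers instead of strings:

import numpy as np
import pandas as pd

rows = []
with open("ms.txt") as f:
    header = f.readline().rstrip("\n").split("\t")  # first line holds the column names
    for line in f:
        # parse the tab-separated fields straight into int8 (1 byte per value)
        rows.append(np.array(line.split(), dtype=np.int8))

df = pd.DataFrame(np.vstack(rows), columns=header)
print(df.dtypes.head())
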
Shubham Sharma