29

I have a large CSV file, about 600 MB with 11 million rows, and I want to create statistical data like pivots, histograms, graphs, etc. Obviously, trying to just read it normally:

df = pd.read_csv('Check400_900.csv', sep='\t')

Doesn't work, so I found iterator and chunksize in a similar post and used:

df = pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000)

All good; I can, for example, print df.get_chunk(5) and iterate over the whole file with just:

for chunk in df:
    print(chunk)

My problem is that I don't know how to use things like those below for the whole df and not for just one chunk.

plt.plot()
print(df.head())
print(df.describe())
print(df.dtypes)
customer_group3 = df.groupby('UserID')
y3 = customer_group3.size()
Thodoris P

4 Answers

36

Solution, if you need to create one big DataFrame and process all the data at once (which is possible, but not recommended):

Use concat on all the chunks to build df, because the type of the output of:

df = pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000)

isn't a DataFrame, but a pandas.io.parsers.TextFileReader - source.

tp = pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000)
print(tp)
#<pandas.io.parsers.TextFileReader object at 0x00000000150E0048>
df = pd.concat(tp, ignore_index=True)

I think it is necessary to add the parameter ignore_index to concat, to avoid duplicate indices.

EDIT:

But if you want to work with large data and do things like aggregating, it is much better to use dask, because it provides advanced parallelism.
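
For example, a minimal sketch of the same per-user count with dask.dataframe, assuming the same tab-separated file and UserID column as in the question:

import dask.dataframe as dd

# read lazily, in partitions, instead of loading everything into memory
ddf = dd.read_csv('Check1_900.csv', sep='\t')

# the groupby/size is planned lazily and executed out-of-core, in parallel,
# only when .compute() is called
y3 = ddf.groupby('UserID').size().compute()
print(y3)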

jezrael
  • He can just do `tp.read()` to get a DataFrame, but it probably won't fit into his memory, so he is asking whether you can work without loading it fully. – hellpanderr Nov 11 '15 at 09:38
  • I have tried that, but I run out of memory again: 'MemoryError'. – Thodoris P Nov 11 '15 at 10:58
  • The size of my memory is 6 GB. – Thodoris P Nov 11 '15 at 11:03
  • I get the error NotImplementedError: 'nrows' and 'chunksize' can not be used together yet. Should I remove chunksize? – Thodoris P Nov 11 '15 at 11:11
  • Well (for some reason I don't know) I have 3 Unnamed columns with just a counter, and then I have 4 columns which I need. – Thodoris P Nov 11 '15 at 11:31
  • Yes, my CSV file is separated with tabs, and you can take a look at how my df looks here [link](http://prntscr.com/91imra). P.S. Thanks for the help. – Thodoris P Nov 11 '15 at 11:47
  • This is a terrible idea, for exactly the reason @hellpanderr suggested in the first comment. This has the same effect as just calling `read_csv` without using `chunksize`, except that it takes twice as much memory (because you now have to hold not only the giant DataFrame, but also all the chunks that add up to that DataFrame at the same time). This cannot possibly solve the OP's problem, it only makes it worse. – abarnert May 17 '18 at 06:29
  • It's _not_ working. He says it works "only for 10000 rows". His actual data is 11 million rows. You apparently managed to convince him, incorrectly, that what he wanted to do was impossible. And that's not going to change two and a half years later. So there's no point in discussing it further. – abarnert May 17 '18 at 06:58
21

You do not need concat here. It's exactly like writing sum(map(list, grouper(tup, 1000))) instead of list(tup). The only thing iterator and chunksize=1000 do is give you a reader object that iterates over 1000-row DataFrames instead of reading the whole thing. If you want the whole thing at once, just don't use those parameters.

But if reading the whole file into memory at once is too expensive (e.g., takes so much memory that you get a MemoryError, or slows your system to a crawl by throwing it into swap hell), that's exactly what chunksize is for.

The problem is that you named the resulting iterator df, and then tried to use it as a DataFrame. It's not a DataFrame; it's an iterator that gives you 1000-row DataFrames one by one.

When you say this:

My problem is that I don't know how to use things like those below for the whole df and not for just one chunk

The answer is that you can't. If you can't load the whole thing into one giant DataFrame, you can't use one giant DataFrame. You have to rewrite your code around chunks.

Instead of this:

df = pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000)
print(df.dtypes)
customer_group3 = df.groupby('UserID')

… you have to do things like this:

for df in pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000):
    print(df.dtypes)
    customer_group3 = df.groupby('UserID')

Often, what you need to do is aggregate some data—reduce each chunk down to something much smaller with only the parts you need. For example, if you want to sum the entire file by groups, you can groupby each chunk, then sum the chunk by groups, and store a series/array/list/dict of running totals for each group.
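
For instance, here is a minimal sketch of that pattern for the per-user counts from the question, assuming the same tab-separated file and a UserID column; each chunk is reduced to a small Series of counts and folded into a running total:

import pandas as pd

reader = pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000)

# running per-UserID counts, built up one chunk at a time
totals = pd.Series(dtype='int64')

for chunk in reader:
    counts = chunk.groupby('UserID').size()    # aggregate this chunk only
    totals = totals.add(counts, fill_value=0)  # fold it into the running totals

totals = totals.astype('int64')
print(totals.sort_values(ascending=False).head())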

Of course it's slightly more complicated than just summing a giant series all at once, but there's no way around that. (Except to buy more RAM and/or switch to 64 bits.) That's how iterator and chunksize solve the problem: by allowing you to make this tradeoff when you need to.

abarnert
  • Slow compared to what? Calling `groupby` on a 1000-row chunk takes exactly as much time as calling `groupby` on a 1000-row whole-file DataFrame. Which takes under 1ms on my laptop. If you have to do that 1000 times, it takes almost a second. Is that too slow to process a file that you otherwise couldn't process at all? – abarnert May 17 '18 at 06:33
6

You need to concatenate the chunks. For example:

df2 = pd.concat([chunk for chunk in df])

Then run your commands on df2.

user29791
2

This might not answer the question directly, but when you have to load a big dataset it is good practice to convert the dtypes of your columns while reading the dataset. Also, if you know which columns you need, use the usecols argument to load only those.

df = pd.read_csv("data.csv",
                 usecols=['A', 'B', 'C', 'Date'],
                 dtype={'A': 'uint32',
                        'B': 'uint8',
                        'C': 'uint8'},
                 parse_dates=['Date'],  # convert to datetime64
                 sep='\t')
KostasM
  • If this doesn't answer the question, you can add this information as a comment. – Erwol Feb 12 '22 at 19:46
  • I must have at least 50 reputation to comment :/. I thought this suggestion would be useful here, for anyone working on "big" datasets. (I can only comment on my own answer.) – KostasM Feb 13 '22 at 09:18