You do not need concat
here. It's exactly like writing sum(map(list, grouper(tup, 1000)))
instead of list(tup)
. The only thing iterator
and chunksize=1000
does is to give you a reader object that iterates 1000-row DataFrames instead of reading the whole thing. If you want the whole thing at once, just don't use those parameters.
But if reading the whole file into memory at once is too expensive (e.g., takes so much memory that you get a MemoryError
, or slow your system to a crawl by throwing it into swap hell), that's exactly what chunksize
is for.
The problem is that you named the resulting iterator df
, and then tried to use it as a DataFrame. It's not a DataFrame; it's an iterator that gives you 1000-row DataFrames one by one.
When you say this:
My problem is I don't know how to use stuff like these below for the whole df and not for just one chunk
The answer is that you can't. If you can't load the whole thing into one giant DataFrame, you can't use one giant DataFrame. You have to rewrite your code around chunks.
Instead of this:
df = pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000)
print df.dtypes
customer_group3 = df.groupby('UserID')
… you have to do things like this:
for df in pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000):
print df.dtypes
customer_group3 = df.groupby('UserID')
Often, what you need to do is aggregate some data—reduce each chunk down to something much smaller with only the parts you need. For example, if you want to sum the entire file by groups, you can groupby
each chunk, then sum the chunk by groups, and store a series/array/list/dict of running totals for each group.
Of course it's slightly more complicated than just summing a giant series all at once, but there's no way around that. (Except to buy more RAM and/or switch to 64 bits.) That's how iterator
and chunksize
solve the problem: by allowing you to make this tradeoff when you need to.