
I have the following code:

for state in state_list:
    state_df = pd.DataFrame()
    for df in pd.read_csv(tax_sample,sep='\|\|', engine='python', dtype = tax_column_types, chunksize = 10, nrows = 100):
        state_df = pd.concat(state_df,df[df['state'] == state])
    state_df.to_csv('property' + state + '.csv')

My dataset is quite big, so I'm breaking it into chunks (in reality these would be much bigger than 10 rows). For each chunk I check whether the state matches a particular state in a list and, if so, store those rows in a dataframe and save it to disk. In short, I'm trying to take a dataframe with many different states in it and break it into several dataframes, each with only one state, and save each to CSV.

However, the code above gives the error:

TypeError: first argument must be an iterable of pandas objects, you passed an object of type "DataFrame"

Any idea why?

Thanks,

Mike

  • You may want to read how `pd.concat()` works. You call it incorrectly. – DYZ May 18 '18 at 22:21
  • 1
    `pd.concat(state_df,df[df['state'] == state])` should be `pd.concat([state_df,df[df['state'] == state]])`, notice the argumnt passed in `[]` so that it is a list of dataframes. – harvpan May 18 '18 at 22:52
  • 1
    Somewhere it is written on Pandas Stone Tablets: [Never call DataFrame.append or pd.concat inside a for-loop. It leads to quadratic copying.](https://stackoverflow.com/a/36489724/1422451) – Parfait May 19 '18 at 16:11
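
For reference, here is a minimal sketch of the question's loop with the fix suggested in the comments applied (the two frames wrapped in a list); it runs, but as the last comment notes, concatenating inside a loop still copies quadratically, so the answer below avoids it:

import pandas as pd

for state in state_list:
    state_df = pd.DataFrame()
    for df in pd.read_csv(tax_sample, sep='\|\|', engine='python',
                          dtype=tax_column_types, chunksize=10, nrows=100):
        # pd.concat expects an iterable of DataFrames, hence the list
        state_df = pd.concat([state_df, df[df['state'] == state]])
    state_df.to_csv('property' + state + '.csv')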

1 Answer


Consider iterating over the chunks and, for each one, filtering with .isin() on state_list, saving the result in a container like a dict or list. As commented, avoid the overhead of expanding dataframes inside a loop.

Afterwards, bind the container together with pd.concat and then loop over a groupby on the state field to output each file individually.

import pandas as pd

df_list = []

# read the large file lazily in chunks (chunksize/nrows as in the question)
reader = pd.read_csv(tax_sample, sep='\|\|', engine='python',
                     dtype=tax_column_types, chunksize=10, nrows=100)

# filter each chunk to the states of interest and collect the pieces
for chunk in reader:
    tmp = chunk[chunk['state'].isin(state_list)]
    df_list.append(tmp)

# concatenate once, outside the loop, to avoid quadratic copying
master_df = pd.concat(df_list)

# write one CSV per state
for state, group in master_df.groupby('state'):
    group.to_csv('property' + state + '.csv')
Parfait
  • thanks for the response. I don't quite understand what is happening in the above. I see that we are reading the file in chunks, such that "reader" contains 10 chunks. But how does this reduce the amount of memory required? Isn't it all still stored in memory, just in multiple dataframes? – Mike May 21 '18 at 13:17
  • Did you attempt the code and test performance? If you read the link, you avoid the memory footprint of reading the entire data file all at once. It is akin to Python's [readlines() vs looping over the read object](https://stackoverflow.com/a/17246300/1422451) for text files. The memory-efficiency issue is not the size of data held in RAM but when/how the OS allocates RAM for data. – Parfait May 21 '18 at 13:48
  • Yes, I've tried running it and it still throws a memory issue. I'm breaking into chunks of 1,000, and there are a couple of million rows. – Mike May 21 '18 at 14:35
  • At what line does the memory issue arise? Check by commenting out lines. What if you just ran `reader=`? Then the `for chunk` loop? I wonder if `pd.concat()` is the issue or even `groupby()`. Try also running from the command line and not in an IDE (Spyder/PyCharm). – Parfait May 21 '18 at 14:56
  • Busy running it now, thanks, will let you know where I run into issues – Mike May 21 '18 at 15:25
  • just finished running it now. It is the `tmp = chunk...` line that is crashing it – Mike May 21 '18 at 17:29
  • If I understand the code you provided correctly, isn't every row still being stored in the master_df dataframe, rather than separate dataframes being created for each state? – Mike May 21 '18 at 17:53
  • No, that `tmp` line filters rows with `isin` and appends the filtered dataframe to a list. Otherwise, I advise storing the data in a database like open-source SQLite. Import a state table, index various columns for quick table scans, and run a join query on the large table, then import the result into pandas. See its `read_sql` method. – Parfait May 21 '18 at 18:04
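
As a rough sketch of the SQLite route suggested in the last comment (simplified to a WHERE filter rather than the join described there; the database file and table name are made up for illustration):

import sqlite3
import pandas as pd

conn = sqlite3.connect('property.db')  # hypothetical database file

# load the big file into SQLite in chunks so it never sits in RAM all at once
for chunk in pd.read_csv(tax_sample, sep='\|\|', engine='python',
                         dtype=tax_column_types, chunksize=10000):
    chunk.to_sql('tax', conn, if_exists='append', index=False)

# index the filter column for quick table scans
conn.execute('CREATE INDEX IF NOT EXISTS idx_tax_state ON tax (state)')

# pull one state at a time back into pandas and write it out
for state in state_list:
    state_df = pd.read_sql('SELECT * FROM tax WHERE state = ?',
                           conn, params=(state,))
    state_df.to_csv('property' + state + '.csv')

conn.close()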