2

The csv file is way to big, so I am reading it chunk by chunk. Therefore, I use read_csv with chunksize.

I want to store all rows, where the last entry has the value 1 in one file and all the other rows where the last entry is 0 in another file.

Suppose it looks like this:

ID   A    B   C
0   0.0  0.1  1
1   0.1  0.2  0
2   0.1  0.0  1

So, I want to store row with ID 0 and 2 in one file and the row with ID 1 in another file.

How do I do that with pandas?

2 Answers2

1

From this post:

reader = pd.read_csv('big_table.txt', sep='\t', header=0, 
                     index_col=0, usecols=the_columns_i_want_to_use, 
                     chunksize=10000)

df = pd.concat([ chunk.ix[rows_that_I_want_] for chunk in reader ])

But instead make 2 data frames:

df0 = pd.concat([ chunk[chunk["C"] == 0] for chunk in reader ])
df1 = pd.concat([ chunk[chunk["C"] == 1] for chunk in reader ])

Then save each data frame independently

Community
  • 1
  • 1
Kyrubas
  • 877
  • 8
  • 23
  • But I read it chunk by chunk and according to the docs I get a TextFileReader Object and not a data frame –  Sep 21 '16 at 21:36
  • Updated it for considering chunk size. The TextFileReader object is a generator, so you need to iterate through it. You can make a bunch of smaller dataframes with a a list comprehension then concatenate them together. – Kyrubas Sep 21 '16 at 21:50
  • What do you mean by generator? –  Sep 21 '16 at 22:04
  • The data is produced on demand so that it takes up only the memory for each chunk when that chunk is called upon: http://www.python-course.eu/generators.php – Kyrubas Sep 21 '16 at 22:07
  • Something else just came up. If I use chunk["C"] == 1 in combination with an if like if chunk["C"] == 1 I get ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). Do you know how to overcome this? –  Sep 21 '16 at 22:20
  • To do that you're going to need to use `map` in combination with `lambda`. `chunk["C"] == 1` already returns `True` or `False` for each element in the series. This series of `True` or `False` gets applied back to `chunk` such that you're left only with the `True` rows. Ex `chunk["C"] == 1` applied to your original data frame would return `[True, False, True]' which in turn when used against the dataframe itself gives just row IDs 0 and 2 – Kyrubas Sep 21 '16 at 23:09
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/123902/discussion-between-kyrubas-and-tumbler). – Kyrubas Sep 21 '16 at 23:36
  • This won't work, because there is too much to store in memory, correct? Did it work? – Kartik Sep 22 '16 at 04:55
0

I would simply do it like this:

first = True
df = pd.read_csv('file.csv', chunksize=1e5)
for chunk in df:
    if first:
        chunk[chunk['C'] == 1].to_csv('ones.csv', header=True)
        chunk[chunk['C'] == 0].to_csv('zero.csv', header=True)
        first = False
    chunk[chunk['C'] == 1].to_csv('ones.csv', header=False)
    chunk[chunk['C'] == 0].to_csv('zero.csv', header=False)
Kartik
  • 8,347
  • 39
  • 73