Python Pandas: Write certain rows in file

Question

The csv file is way to big, so I am reading it chunk by chunk. Therefore, I use read_csv with chunksize.

I want to store all rows, where the last entry has the value 1 in one file and all the other rows where the last entry is 0 in another file.

Suppose it looks like this:

ID   A    B   C
0   0.0  0.1  1
1   0.1  0.2  0
2   0.1  0.0  1

So, I want to store row with ID 0 and 2 in one file and the row with ID 1 in another file.

How do I do that with pandas?

score 1 · Answer 1 · edited May 23 '17 at 12:16

1

From this post:

reader = pd.read_csv('big_table.txt', sep='\t', header=0, 
                     index_col=0, usecols=the_columns_i_want_to_use, 
                     chunksize=10000)

df = pd.concat([ chunk.ix[rows_that_I_want_] for chunk in reader ])

But instead make 2 data frames:

df0 = pd.concat([ chunk[chunk["C"] == 0] for chunk in reader ])
df1 = pd.concat([ chunk[chunk["C"] == 1] for chunk in reader ])

Then save each data frame independently

edited May 23 '17 at 12:16

Community

1
1

answered Sep 21 '16 at 21:28

Kyrubas

877
8
23

But I read it chunk by chunk and according to the docs I get a TextFileReader Object and not a data frame – Sep 21 '16 at 21:36
Updated it for considering chunk size. The TextFileReader object is a generator, so you need to iterate through it. You can make a bunch of smaller dataframes with a a list comprehension then concatenate them together. – Kyrubas Sep 21 '16 at 21:50
What do you mean by generator? – Sep 21 '16 at 22:04
The data is produced on demand so that it takes up only the memory for each chunk when that chunk is called upon: http://www.python-course.eu/generators.php – Kyrubas Sep 21 '16 at 22:07
Something else just came up. If I use chunk["C"] == 1 in combination with an if like if chunk["C"] == 1 I get ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). Do you know how to overcome this? – Sep 21 '16 at 22:20
To do that you're going to need to use `map` in combination with `lambda`. `chunk["C"] == 1` already returns `True` or `False` for each element in the series. This series of `True` or `False` gets applied back to `chunk` such that you're left only with the `True` rows. Ex `chunk["C"] == 1` applied to your original data frame would return `[True, False, True]' which in turn when used against the dataframe itself gives just row IDs 0 and 2 – Kyrubas Sep 21 '16 at 23:09
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/123902/discussion-between-kyrubas-and-tumbler). – Kyrubas Sep 21 '16 at 23:36
This won't work, because there is too much to store in memory, correct? Did it work? – Kartik Sep 22 '16 at 04:55

score 0 · Answer 2 · answered Sep 22 '16 at 04:59

I would simply do it like this:

first = True
df = pd.read_csv('file.csv', chunksize=1e5)
for chunk in df:
    if first:
        chunk[chunk['C'] == 1].to_csv('ones.csv', header=True)
        chunk[chunk['C'] == 0].to_csv('zero.csv', header=True)
        first = False
    chunk[chunk['C'] == 1].to_csv('ones.csv', header=False)
    chunk[chunk['C'] == 0].to_csv('zero.csv', header=False)

Python Pandas: Write certain rows in file

2 Answers2