
This question was a little harder for me to phrase, so please feel free to edit it for clarity (if necessary).

Problem Statement: I want all rows that share a specific column value to be saved to the same file.

Example Code: I want to do something like this. Say I have a dataframe:

import pandas as pd

d = {'col1': [1, 2, 6, 3, 4], 'col2': [3, 4, 2, 5, 6], 'col3': ['a', 'b', 'c', 'a', 'b'], 'col4': ['2', '3', '2', '2', '2']}
df = pd.DataFrame(data=d)


I want to create CSV files such that:

  • all rows where col3 is a get saved in a.csv
  • all rows where col3 is b get saved in b.csv
  • all rows where col3 is c get saved in c.csv

Hypothesized Solution: The only way I can think of creating the CSV files is iterating through the dataframe row by row and checking whether a CSV already exists for that row's column value (e.g. the col3 value); if not, create it and write the row, otherwise append to the existing CSV file.

Issue: The sample code above is just a representation; I have a very large dataframe. If it helps, I already have the unique values of the column in question (col3 in this example) as a list. However, one of the most popular answers on iterating over a dataframe, How to iterate over rows in a DataFrame in Pandas, says (in its second answer) DON'T. I may have to use it as a last resort if there is no other way, but if there is one, can someone help me find a better solution to this problem?

Sulphur

3 Answers


If your file (here all.csv) is large and you want to process the CSV in chunks, you can try this strategy: open a file when a group value is met for the first time and save the handle in a dict. The next time you meet the same value, look up the handle and use it to write the data, and so on.

import pandas as pd
import pathlib

DIRPATH = "/tmp/csv_folder"

# create folder if it doesn't exist
dirpath = pathlib.Path(DIRPATH)
dirpath.mkdir(parents=True, exist_ok=True)

# chunksize=2 for demo purpose only...
reader = pd.read_csv("all.csv", chunksize=2)
streams = {}

for df in reader:
    for grp, dfg in df.groupby("col3"):
        try:
            buffer = streams[grp]
            dfg.to_csv(buffer, index=False, header=False)
        except KeyError:
            # grp is met for the first time
            buffer = open(dirpath / f"{grp}.csv", "w")
            streams[grp] = buffer
            dfg.to_csv(buffer, index=False)

for fp in streams.values():
    fp.close()

$ cat /tmp/csv_folder/a.csv
col1,col2,col3,col4
1,3,a,2
3,5,a,2

$ cat /tmp/csv_folder/b.csv
col1,col2,col3,col4
2,4,b,3
4,6,b,2

$ cat /tmp/csv_folder/c.csv
col1,col2,col3,col4
6,2,c,2
Corralien
  • Thanks. May I now what are `grp` and `dfg` representing? I saw the documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html but couldn't understand. – Sulphur May 17 '21 at 07:35
  • `grp` is the value of the group ("a", "b", "c"), i.e. one of the unique values of `col3` (try: `df["col3"].unique()`). `dfg` is the subset of `df` where all values of `col3` equal `grp` (try: `df.loc[df["col3"] == "a"]`) – Corralien May 17 '21 at 07:39
  • Thank you! Great. Quick question. Currently files (here csv) are being stored in the same directory where the notebook is run. if I want to save all csv files in another directory which is a variable (like `/csv_folder`) how does `f"{grp}.csv"` change? – Sulphur May 17 '21 at 07:44
  • 1
    @Sulphur. I updated my answer according to your request. – Corralien May 17 '21 at 07:56
  • One more question, please. Does this method only work when we do one pass of the dataframe and see all unique values and create CSVs? I ask this because I will be working with chunks of the dataframe (like here: https://stackoverflow.com/questions/67155322/how-to-read-chunk-from-middle-of-a-long-csv-file-using-python-200-gb). So, in the first pass one chunk (~1000 rows) of the df gets its rows saved into the respective CSV files. When I do a second pass (i.e. the next ~1000 rows) in the next iteration and do the same as above, will the values be appended to the existing CSV files if they exist, or will they be overwritten? – Sulphur May 17 '21 at 08:09
  • 1
    @Sulphur, try my solution to read csv in chunks. – Corralien May 17 '21 at 09:22
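To make the chunked-append question in the comments concrete: a different strategy (a sketch, not the code from the answer above) is to skip keeping file handles open and instead let pandas append with `mode='a'`, writing the header only when the file does not exist yet. The two "chunks" below simulate successive passes over the question's demo data:

```python
import os
import pandas as pd

# The question's demo data, standing in for chunks read from a large CSV.
d = {'col1': [1, 2, 6, 3, 4], 'col2': [3, 4, 2, 5, 6],
     'col3': ['a', 'b', 'c', 'a', 'b'], 'col4': ['2', '3', '2', '2', '2']}
df = pd.DataFrame(data=d)

# Start clean so reruns don't keep appending to old files.
for path in ("a.csv", "b.csv", "c.csv"):
    if os.path.exists(path):
        os.remove(path)

# Simulate two chunked passes over the data.
for chunk in (df.iloc[:3], df.iloc[3:]):
    for grp, dfg in chunk.groupby("col3"):
        path = f"{grp}.csv"
        # Append if the file already exists; write the header only on creation.
        dfg.to_csv(path, mode="a", index=False, header=not os.path.exists(path))
```

With `mode="a"` each later chunk appends to the existing per-group files, so nothing is overwritten between passes; the cost is an `os.path.exists` check per group per chunk instead of a dict of open handles.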

You can TRY:

vals = ['a', 'b', 'c']
for i in vals:
    df.loc[df.col3.eq(i)].to_csv(f"{i}.csv", index=False)

NOTE: if you want to do this for every distinct value in col3, you can use unique:

for i in df.col3.unique():
    df.loc[df.col3.eq(i)].to_csv(f"{i}.csv", index=False)
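A related variant (my own sketch, not part of the answer above): `groupby` yields each unique col3 value together with its sub-frame in a single pass, so the explicit `.loc` filter per value isn't needed:

```python
import pandas as pd

d = {'col1': [1, 2, 6, 3, 4], 'col2': [3, 4, 2, 5, 6],
     'col3': ['a', 'b', 'c', 'a', 'b'], 'col4': ['2', '3', '2', '2', '2']}
df = pd.DataFrame(data=d)

# One file per unique value of col3, named after the value.
for val, sub in df.groupby("col3"):
    sub.to_csv(f"{val}.csv", index=False)
```

This produces the same a.csv, b.csv, and c.csv, but scans the dataframe once instead of once per unique value.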
Nk03

I have one idea. First, get the list of values in col3. Then, using that list, split the data and write each subset to CSV. I hope it does what you want it to do.

col_3list = sorted(set(df['col3']))
for i in range(len(col_3list)):
    new_df = df[df['col3'] == col_3list[i]]
    new_df.to_csv('data_sample' + str(i) + '.csv')
sugi3