
This question was a little harder for me to phrase, so please feel free to edit it for clarity (if necessary).

Problem Statement: I want all rows that share a specific column value to be saved to the same file.

Example Code: I want to do something like this. Say I have a dataframe:

import pandas as pd

d = {'col1': [1, 2, 6, 3, 4], 'col2': [3, 4, 2, 5, 6], 'col3': ['a', 'b', 'c', 'a', 'b'], 'col4': ['2', '3', '2', '2', '2']}
df = pd.DataFrame(data=d)


I want to create CSV files such that:

  • all rows where col3 is a get saved in a.csv
  • all rows where col3 is b get saved in b.csv
  • all rows where col3 is c get saved in c.csv

Hypothesized Solution: The only way I can think of creating the CSV files is iterating through the dataframe row by row and checking whether a CSV already exists for that row's column value (e.g. the col3 value); if not, create it and write the row, otherwise append to the existing CSV file.

Issue: The sample code above is just a representation; I have a very large dataframe. If it helps, I already have the unique values of the column in question (col3 in this example) as a list. However, one of the most popular answers on iterating over a dataframe, How to iterate over rows in a DataFrame in Pandas, says (in its second answer) DON'T. I may have to use it as a last resort if there is no other way, but if there is one, can someone help me find a better solution to this problem?

Sulphur

3 Answers


If your file (here all.csv) is large and you want to process the CSV in chunks, you can try this strategy: open a file when a group value is met for the first time and save the handle in a dict. The next time you meet the same value, look up the handle and use it to write the data, and so on.

import pandas as pd
import pathlib

DIRPATH = "/tmp/csv_folder"

# create folder if it doesn't exist
dirpath = pathlib.Path(DIRPATH)
dirpath.mkdir(parents=True, exist_ok=True)

# chunksize=2 for demo purpose only...
reader = pd.read_csv("all.csv", chunksize=2)
streams = {}

for df in reader:
    for grp, dfg in df.groupby("col3"):
        try:
            buffer = streams[grp]
            dfg.to_csv(buffer, index=False, header=False)
        except KeyError:
            # grp is met for the first time
            buffer = open(dirpath / f"{grp}.csv", "w")
            streams[grp] = buffer
            dfg.to_csv(buffer, index=False)

for fp in streams.values():
    fp.close()

$ cat /tmp/csv_folder/a.csv
col1,col2,col3,col4
1,3,a,2
3,5,a,2

$ cat /tmp/csv_folder/b.csv
col1,col2,col3,col4
2,4,b,3
4,6,b,2

$ cat /tmp/csv_folder/c.csv
col1,col2,col3,col4
6,2,c,2
Corralien
  • Thanks. May I now what are `grp` and `dfg` representing? I saw the documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html but couldn't understand. – Sulphur May 17 '21 at 07:35
  • `grp` is the value of the group ("a", "b", "c"), i.e. one of the unique values of `col3` (try: `df["col3"].unique()`). `dfg` is the subset of `df` where all values of `col3` equal `grp` (try: `df.loc[df["col3"] == "a"]`) – Corralien May 17 '21 at 07:39
  • Thank you! Great. Quick question. Currently files (here csv) are being stored in the same directory where the notebook is run. if I want to save all csv files in another directory which is a variable (like `/csv_folder`) how does `f"{grp}.csv"` change? – Sulphur May 17 '21 at 07:44
  • 1
    @Sulphur. I updated my answer according to your request. – Corralien May 17 '21 at 07:56
  • One more question, please. Does this method only work when we do one pass of the dataframe and see all unique values and create CSVs? I ask this because I will be working with chunks of the dataframe (like here: https://stackoverflow.com/questions/67155322/how-to-read-chunk-from-middle-of-a-long-csv-file-using-python-200-gb). So, in the first pass one chunk (~1000 rows) of the df gets its rows saved into the respective CSV files. When I do a second pass (i.e. the next ~1000 rows) in the next iteration and do the same as above, will the values be appended to the existing CSV files if they exist, or will they be overwritten? – Sulphur May 17 '21 at 08:09
  • 1
    @Sulphur, try my solution to read csv in chunks. – Corralien May 17 '21 at 09:22
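To make the chunked-append question in the comments concrete: a different strategy (a sketch, not the code from the answer above) is to skip keeping file handles open and instead let pandas append with `mode='a'`, writing the header only when the file does not exist yet. The two "chunks" below simulate successive passes over the question's demo data:

```python
import os
import pandas as pd

# The question's demo data, standing in for chunks read from a large CSV.
d = {'col1': [1, 2, 6, 3, 4], 'col2': [3, 4, 2, 5, 6],
     'col3': ['a', 'b', 'c', 'a', 'b'], 'col4': ['2', '3', '2', '2', '2']}
df = pd.DataFrame(data=d)

# Start clean so reruns don't keep appending to old files.
for path in ("a.csv", "b.csv", "c.csv"):
    if os.path.exists(path):
        os.remove(path)

# Simulate two chunked passes over the data.
for chunk in (df.iloc[:3], df.iloc[3:]):
    for grp, dfg in chunk.groupby("col3"):
        path = f"{grp}.csv"
        # Append if the file already exists; write the header only on creation.
        dfg.to_csv(path, mode="a", index=False, header=not os.path.exists(path))
```

With `mode="a"` each later chunk appends to the existing per-group files, so nothing is overwritten between passes; the cost is an `os.path.exists` check per group per chunk instead of a dict of open handles.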

You can TRY:

vals = ['a', 'b', 'c']
for i in vals:
    df.loc[df.col3.eq(i)].to_csv(f"{i}.csv", index=False)

NOTE: if you want to do this for every distinct value in col3, you can use unique:

for i in df.col3.unique():
    df.loc[df.col3.eq(i)].to_csv(f"{i}.csv", index=False)
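A related variant (my own sketch, not part of the answer above): `groupby` yields each unique col3 value together with its sub-frame in a single pass, so the explicit `.loc` filter per value isn't needed:

```python
import pandas as pd

d = {'col1': [1, 2, 6, 3, 4], 'col2': [3, 4, 2, 5, 6],
     'col3': ['a', 'b', 'c', 'a', 'b'], 'col4': ['2', '3', '2', '2', '2']}
df = pd.DataFrame(data=d)

# One file per unique value of col3, named after the value.
for val, sub in df.groupby("col3"):
    sub.to_csv(f"{val}.csv", index=False)
```

This produces the same a.csv, b.csv, and c.csv, but scans the dataframe once instead of once per unique value.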
Nk03

I have one idea. First, get the list of values in col3. Then, using that list, split the data and write each subset to CSV. I hope it does what you want it to do.

col_3list = sorted(set(df['col3']))
for i in range(len(col_3list)):
    new_df = df[df['col3'] == col_3list[i]]
    new_df.to_csv('data_sample' + str(i) + '.csv')
sugi3