Splitting csv file based on a particular column using Python

Question

I'm a Python beginner and have written a few basic scripts. My latest challenge is to take a very large CSV file (10 GB+) and split it into a number of smaller files, based on the value of a particular column in each row.

For example, the file may look like this:

Category,Title,Sales
"Books","Harry Potter",1441556
"Books","Lord of the Rings",14251154
"Series","Breaking Bad",6246234
"Books","The Alchemist",12562166
"Movie","Inception",1573437

I would like to split the file into separate files: Books.csv, Series.csv and Movie.csv.

In reality there will be hundreds of categories, and they will not be sorted. In this case they are in the first column, but in future they may not be.

I've found a few solutions online but nothing in Python. There is a really simple AWK command that can do this in one line, but I don't have access to AWK at work.

I've written the following code which works, but I think it is probably very inefficient. Can anybody suggest how to speed it up?

import csv

#Creates empty set - this will be used to store the values that have already been used
filelist = set()

#Opens the large csv file in "read" mode
with open('//directory/largefile', 'r') as csvfile:
    
    #Read the first row of the large file and store the whole row as a string (headerstring)
    read_rows = csv.reader(csvfile)
    headerrow = next(read_rows)
    headerstring=','.join(headerrow) 
    
    for row in read_rows:
        
        #Store the whole row as a string (rowstring)
        rowstring=','.join(row)
        
        #Defines filename as the first entry in the row - This could be made dynamic so that the user inputs a column name to use
        filename = (row[0])
        
        #This basically makes sure it is not looking at the header row.
        if filename != "Category":
            
            #If the filename is not in the filelist set, add it to the list and create new csv file with header row.
            if filename not in filelist:    
                filelist.add(filename)
                with open('//directory/subfiles/' +str(filename)+'.csv','a') as f:
                    f.write(headerstring)
                    f.write("\n")
                    f.write(rowstring)
                    f.write("\n")
                    f.close()    
            #If the filename is in the filelist set, append the current row to the existing csv file.     
            else:
                with open('//directory/subfiles/' +str(filename)+'.csv','a') as f:
                    f.write(rowstring)
                    f.write("\n")
                    f.close()

Thanks!

Jon Clements · Answer 1 · 2017-10-20T11:42:00.807

7

A memory-efficient approach that also avoids repeatedly reopening files to append (as long as you won't end up with a huge number of open file handles) is to use a dict that maps each category to a file object. If the file for a category isn't open yet, create it and write the header; then always write the row to the corresponding file, e.g.:

import csv

with open('somefile.csv') as fin:    
    csvin = csv.DictReader(fin)
    # Category -> open file lookup
    outputs = {}
    for row in csvin:
        cat = row['Category']
        # Open a new file and write the header
        if cat not in outputs:
            fout = open('{}.csv'.format(cat), 'w')
            dw = csv.DictWriter(fout, fieldnames=csvin.fieldnames)
            dw.writeheader()
            outputs[cat] = fout, dw
        # Always write the row
        outputs[cat][1].writerow(row)
    # Close all the files
    for fout, _ in outputs.values():
        fout.close()

edited Oct 20 '17 at 11:42

answered Oct 20 '17 at 11:35


Thank you. Before I saw your solution I managed to come up with something (see original post, I've corrected my code so that it now works). Is your method of checking if it's a new category or not more efficient than mine? – Actuary Oct 20 '17 at 13:11
@Actuary the check isn't necessarily quicker, but not opening, closing and reopening the file will cut a lot of IO overhead – Jon Clements Oct 20 '17 at 14:07
Hi @JonClements, when I tried the above code I got a blank record after every data record in the split files – Smart003 Nov 04 '19 at 10:02
@Smart003 I'm guessing you're on Windows then? Try changing the file mode from `'w'` to `'wb'`... – Jon Clements Nov 04 '19 at 10:05
Just a small addition: if you're using Python 3, you can avoid the blank lines by changing the above code to: fout = open('{}.csv'.format(cat), 'w', newline='') – Idan P Mar 16 '21 at 14:37
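
Pulling the comments above together, here is a minimal Python 3 sketch of the same dict-of-writers idea. It is not the answer's original code: the function name `split_csv_by_column` and the `out_dir` and `column` parameters are made up for illustration, and `contextlib.ExitStack` is swapped in for the manual close loop so the output files are closed even if a row raises an error. It still keeps one handle open per category, so the caveat about very large numbers of open files applies.

```python
import csv
import os
from contextlib import ExitStack


def split_csv_by_column(src_path, out_dir, column):
    """Write one CSV per distinct value of `column`; return the values seen."""
    os.makedirs(out_dir, exist_ok=True)
    writers = {}  # column value -> csv.DictWriter for that value's file
    # ExitStack closes every output file on exit, even if an error occurs.
    with ExitStack() as stack, open(src_path, newline='') as fin:
        reader = csv.DictReader(fin)
        for row in reader:
            key = row[column]
            if key not in writers:
                # newline='' stops the csv module's '\r\n' line endings from
                # being translated again on Windows (the blank-line symptom).
                path = os.path.join(out_dir, '{}.csv'.format(key))
                fout = stack.enter_context(open(path, 'w', newline=''))
                writers[key] = csv.DictWriter(fout, fieldnames=reader.fieldnames)
                writers[key].writeheader()
            writers[key].writerow(row)
    return sorted(writers)
```

Passing the column name as an argument also covers the case from the question where the category is not in the first column, e.g. `split_csv_by_column('somefile.csv', 'subfiles', 'Category')`.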

score 6 · Answer 2 · answered Oct 07 '19 at 10:24

I faced the same problem, which is how I landed on this question, and I was able to solve it with pandas.

Logic:

Extract all the unique items from the column you want to split on.
Convert the array to a list.
Iterate over the list using the enumerate function. https://www.w3schools.com/python/ref_func_enumerate.asp

Please check whether this works in your case:

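    import pandas as pd

    # filename is a placeholder: set it to the path of your CSV file
    data = pd.read_csv(filename)

    # Unique values of the column to split on, as a plain list
    data_category_range = data['Category'].unique()
    data_category_range = data_category_range.tolist()

    # Write the rows for each category to their own file
    for i, value in enumerate(data_category_range):
        data[data['Category'] == value].to_csv(r'Category_'+str(value)+r'.csv', index=False, na_rep='N/A')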

What needs to be added or changed if someone wants all the output CSV files written to a particular destination folder? – c_bfx Jan 03 '22 at 11:26
@c_bfx `r'./desired/file/path/category_'+str(value)+r'.csv'` – Display name Feb 07 '22 at 00:24
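
One caveat for the 10 GB+ file in the question: `pd.read_csv` without `chunksize` loads the whole file into memory. Below is a rough sketch of a chunked variant, assuming you are happy to append to the output files as you go. The function name `split_csv_in_chunks`, the `out_dir` parameter and the default chunk size are made up for illustration and are not part of the answer above.

```python
import os

import pandas as pd


def split_csv_in_chunks(src_path, out_dir, column, chunksize=100_000):
    """Append each chunk's rows to <out_dir>/<value>.csv, one file per value."""
    os.makedirs(out_dir, exist_ok=True)
    seen = set()  # values whose file already has a header row
    # With chunksize set, read_csv yields DataFrames of at most that many rows.
    for chunk in pd.read_csv(src_path, chunksize=chunksize):
        for value, group in chunk.groupby(column):
            path = os.path.join(out_dir, '{}.csv'.format(value))
            # Write the header only the first time a value is seen in this run.
            group.to_csv(path, mode='a', index=False, header=value not in seen)
            seen.add(value)
    return sorted(seen)
```

Because the files are opened in append mode, start with an empty output folder, otherwise rows from an earlier run stay in the files. Also note that `groupby` skips rows whose split column is missing by default.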

Display name · Answer 3 · 2022-02-07T15:31:45.150

Another option is to use the groupby function provided by Pandas for DataFrames.

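# Group on a plain column name (not a one-item list) so that n is the bare
# value; recent pandas versions return a 1-tuple for a one-item list.
for n, group in df.groupby('category'):
    # index=False keeps the DataFrame's row index out of the output file
    group.to_csv(f'../desired/dir/split/{n}.csv', index=False)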

Here it is as a full script, ready to go:

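import os

import pandas as pd

def split_csv(csvFilePath, newDirPath, splitKey):
    # newDirPath must already exist
    df = pd.read_csv(csvFilePath)
    # Group on a plain column name (not a one-item list) so n is the bare value
    for n, group in df.groupby(splitKey):
        # index=False keeps the DataFrame's row index out of the output file
        group.to_csv(os.path.join(newDirPath, f'{n}.csv'), index=False)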

# https://stackoverflow.com/questions/419163/what-does-if-name-main-do
if __name__ == "__main__":
    split_csv(
        csvFilePath='../desired/dir/myfile.csv', # CHANGE ME
        newDirPath='../desired/dir/myfile/', # CHANGE ME
        splitKey='category'  # CHANGE ME
    )
