How to shuffle and split a large csv with headers?

Question

I am trying to find a way to shuffle the lines of a large CSV file in Python and then split it into multiple CSV files (with a set number of rows in each file), but I can't find a way to shuffle the large dataset and keep the header in each CSV. It would help a lot if someone knew how to do this.

Here's the code I found useful for splitting a csv file:

number_of_rows = 100

def write_splitted_csvs(part, lines):
    with open('mycsvhere.csv'+ str(part) +'.csv', 'w') as f_out:
        f_out.write(header)
        f_out.writelines(lines)

with open("mycsvhere.csv", "r") as f:
    count = 0
    header = f.readline()
    lines = []
    for line in f:
        count += 1
        lines.append(line)
        if count % number_of_rows == 0:
            write_splitted_csvs(count // number_of_rows, lines)
            lines = []
    
    if len(lines) > 0:
        write_splitted_csvs((count // number_of_rows) + 1, lines)

If anyone knows how to shuffle the rows before they are split into these CSV files, this would help a lot! Thank you very much.

you can use [pandas](https://stackoverflow.com/questions/29576430/shuffle-dataframe-rows) to shuffle, and also, to write it to csv. — DanielTuzes, Feb 09 '22 at 16:59
Does this answer your question? [Shuffle DataFrame rows](https://stackoverflow.com/questions/29576430/shuffle-dataframe-rows) — DanielTuzes, Feb 09 '22 at 16:59

score 2 · Accepted Answer · answered Feb 09 '22 at 16:58

I would suggest using Pandas if possible.

Shuffle the rows and reset the index (sample() returns a new DataFrame rather than working in place, so assign the result back):

import pandas as pd
df = pd.read_csv('mycsvhere.csv')
df = df.sample(frac=1).reset_index(drop=True)

Then you can split it into a list of smaller DataFrames:

number_of_rows = 100
sub_dfs = [df[i:i + number_of_rows] for i in range(0, df.shape[0], number_of_rows)]

Then if you want to save the csvs locally:

for idx, sub_df in enumerate(sub_dfs):
    sub_df.to_csv(f'csv_{idx}.csv', index=False)
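The three steps above can be combined into one helper. This is only a sketch under assumptions: the function name shuffle_and_split and the parameters out_prefix and seed are made up for illustration, and the whole file is assumed to fit in memory. Passing a seed to random_state makes the shuffle reproducible.

```python
import pandas as pd


def shuffle_and_split(src_path, rows_per_file, out_prefix="csv_", seed=None):
    """Shuffle the rows of src_path and write them out in chunks of rows_per_file.

    Hypothetical helper: the name and parameters are illustrative only.
    Returns the list of file paths that were written.
    """
    df = pd.read_csv(src_path)
    # sample(frac=1) returns a new, shuffled DataFrame; it does not modify df.
    shuffled = df.sample(frac=1, random_state=seed).reset_index(drop=True)

    paths = []
    for part, start in enumerate(range(0, len(shuffled), rows_per_file)):
        path = f"{out_prefix}{part}.csv"
        # to_csv writes the column names as the first line by default,
        # so every output file gets the header.
        shuffled.iloc[start:start + rows_per_file].to_csv(path, index=False)
        paths.append(path)
    return paths
```

For example, shuffle_and_split('mycsvhere.csv', 100) would write csv_0.csv, csv_1.csv, and so on, with the last file holding whatever rows are left over.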

answered Feb 09 '22 at 16:58

Interested Developer


Your answer is way better than mine ^^ But it may not match the OP's need for custom code, given the mentioned "large" data, where custom code was already written to be optimized :) – Gwendal Yviquel Feb 09 '22 at 17:02
Can the OP define "large csv" please ^^ - but agree if it's large GB csv files, you may want to use another method. – Interested Developer Feb 09 '22 at 17:07
Hello, I meant a data file that could have thousands of rows, around 40 MB, not GBs :) so the Pandas solution works great!! Thank you very much – yoopiyo Feb 10 '22 at 09:24
Great to hear - please could you mark as the correct answer if it solved your query :) – Interested Developer Feb 10 '22 at 10:10

score 1 · Answer 2 · answered Feb 09 '22 at 16:58

There are 3 needs here:

Shuffle your dataset
Split your dataset
Formatting

For the first 2 steps, there are some nice tools in scikit-learn. You can try the stratified shuffle splitter (StratifiedShuffleSplit in sklearn.model_selection). You did not mention stratification, but you may need it without knowing it yet ;)
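As a rough illustration of the stratified splitter, here is a minimal sketch. It assumes the data has a label column to stratify on (the OP did not mention one, so label_col is hypothetical, as is the function name), and it produces two shuffled parts by proportion rather than files with a fixed number of rows.

```python
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit


def stratified_two_way_split(df, label_col, test_size=0.2, seed=0):
    """Return (train_df, test_df), shuffled, with label proportions preserved.

    Hypothetical helper: label_col must name an existing column in df.
    """
    splitter = StratifiedShuffleSplit(
        n_splits=1, test_size=test_size, random_state=seed
    )
    # split() yields arrays of positional row indices for each part.
    train_idx, test_idx = next(splitter.split(df, df[label_col]))
    return df.iloc[train_idx], df.iloc[test_idx]
```

Each part can then be written with to_csv(..., index=False) so it keeps the header. If there is nothing to stratify on, a plain shuffle is enough.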

For the last part, formatting, it is all up to you. You can check the pandas to_csv() function, where you can specify your headers; you can (and need to) specify your headers in the data object as well (the DataFrame). Nothing hard here: just spend a bit of time specifying what you want, and it is easy to implement :)

Side comment: you can drop pandas depending on what 'big' is for you; pandas is not 'good' on big data.
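If you do drop pandas, something along these lines could work with only the standard library. It is a sketch, not a tuned solution: the function name and parameters are made up, it assumes one record per line (no quoted fields containing line breaks), and it still reads all data lines into memory, so it does not help with files larger than RAM.

```python
import random


def shuffle_and_split_lines(src_path, rows_per_file, out_prefix="part_", seed=None):
    """Shuffle the data lines of a CSV and write them in chunks, header first.

    Hypothetical helper: assumes every record sits on exactly one line.
    Returns the list of file paths that were written.
    """
    # newline="" keeps the original line endings untouched.
    with open(src_path, "r", newline="") as f:
        header = f.readline()
        lines = f.readlines()

    # Make sure the last record ends with a newline before it gets moved around.
    if lines and not lines[-1].endswith("\n"):
        lines[-1] += "\n"

    # A dedicated Random instance makes the shuffle reproducible for a given seed.
    random.Random(seed).shuffle(lines)

    paths = []
    for part, start in enumerate(range(0, len(lines), rows_per_file), start=1):
        path = f"{out_prefix}{part}.csv"
        with open(path, "w", newline="") as f_out:
            f_out.write(header)
            f_out.writelines(lines[start:start + rows_per_file])
        paths.append(path)
    return paths
```

This keeps the structure of the OP's loop: the only real change is shuffling the list of lines before slicing it into files.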
