48

I have a huge CSV with many tables, each with many rows. I would like to simply split each dataframe into two if it contains more than 10 rows.

If so, I would like the first dataframe to contain the first 10 rows and the second to contain the rest.

Is there a convenient function for this? I've looked around but found nothing useful...

i.e. split_dataframe(df, 2(if > 10))?

Boosted_d16
  • I'm building a PowerPoint deck and pasting the data from the df to a chart on each slide, and charts with more than 10 rows become unreadable. – Boosted_d16 Aug 13 '14 at 15:56

9 Answers

44

I used a list comprehension to cut a huge DataFrame into blocks of 100,000 rows:

size = 100000
list_of_dfs = [df.loc[i:i + size - 1, :] for i in range(0, len(df), size)]

or as a generator:

list_of_dfs = (df.loc[i:i + size - 1, :] for i in range(0, len(df), size))
agittarius
  • I like this solution. However, I think you want "df.iloc[i:i+size]" as otherwise you are missing one row per block and also it will work with non-integer indices. – Scott Talbert Aug 09 '19 at 18:02
  • @ScottTalbert: No, this works fine; no need to change to i+size. In .loc both endpoints are inclusive, unlike subsetting of lists etc. – MNA Oct 10 '19 at 09:41
  • one-line.. love it! – Nikhil VJ Jun 05 '20 at 12:53
  • Great solution. Hint for potential users: run ``df.reset_index(inplace=True)`` prior to running it; otherwise you can hit non-unique label errors. – Quetzalcoatl Jun 15 '22 at 23:07
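To make the .loc/.iloc discussion above concrete, here is a minimal sketch (the toy frame and chunk size are purely illustrative). On a default RangeIndex starting at 0, the label-based df.loc[i:i + size - 1] and the position-based df.iloc[i:i + size] produce the same chunks; only the .iloc form is independent of the index labels.

import pandas as pd

# Toy frame with the default RangeIndex (0, 1, 2, ...) that the
# .loc-based slicing above relies on.
df = pd.DataFrame({'A': range(7)})
size = 3

# Label-based: .loc includes both endpoints, hence the "- 1".
by_loc = [df.loc[i:i + size - 1, :] for i in range(0, len(df), size)]

# Position-based: .iloc excludes the stop position, so no "- 1" is
# needed, and it keeps working when the index is not 0, 1, 2, ...
by_iloc = [df.iloc[i:i + size] for i in range(0, len(df), size)]

print([len(c) for c in by_loc])   # [3, 3, 1]
print([len(c) for c in by_iloc])  # [3, 3, 1]
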
28

This will return the split DataFrames if the condition is met, and otherwise return the original and None (which you would then need to handle separately). Note that this assumes the splitting only has to happen once per df, and that it is fine for the second part of the split to itself be longer than 10 rows (which happens whenever the original is longer than 20 rows).

df_new1, df_new2 = (df.iloc[:10, :], df.iloc[10:, :]) if len(df) > 10 else (df, None)

Note that you can also use df.head(10) and df.tail(len(df) - 10) to get the front and back according to your needs. You can also use various indexing approaches: with .iloc you can provide just the first dimension's index if you want, such as df.iloc[:10] instead of df.iloc[:10, :] (though I like to be explicit about the dimensions being taken), and plain df[:10] slices rows as well. df.ix used to offer similar indexing, but it has since been deprecated and removed from pandas.

Be careful about using df.loc, however, since it is label-based and the input will never be interpreted as an integer position; .loc would only work "accidentally" when you happen to have index labels that are integers starting at 0 with no gaps.

But you should also consider the various options that pandas provides for dumping the contents of the DataFrame into HTML and possibly also LaTeX to make better designed tables for the presentation (instead of just copying and pasting). Simply Googling how to convert the DataFrame to these formats turns up lots of tutorials and advice for exactly this application.
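For instance, a minimal sketch of the built-in exporters (the toy frame is illustrative; in recent pandas versions to_latex goes through the Styler machinery and may need the optional Jinja2 dependency):

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4.0, 5.0, 6.0]})

# HTML table markup, e.g. for a report or a slide tool that accepts HTML.
print(df.to_html(index=False))

# LaTeX tabular markup for typeset documents.
print(df.to_latex(index=False))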

ely
  • df[:10] works fine but when I try df[:10, :] I got this error: TypeError: unhashable type – Luis Ramon Ramirez Rodriguez Jan 19 '16 at 03:29
  • @LuisRamonRamirezRodriguez That sounds like an unrelated error to this question. You should post a new question and show how the DataFrame you are using is created and what some of the data within it looks like. – ely Jan 19 '16 at 13:23
  • @LuisRamonRamirezRodriguez: use df.loc[:10,:] instead – nbeuchat Oct 23 '16 at 16:55
  • @nbeuchat That is not actually good advice. [loc](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html) is intentionally a *label*-based indexing, so what is passed to it is *never* interpreted as an integer position. This would only work "accidentally" if your index label happened to be a range of integers without gaps and starting from 0. Rather, using `iloc` is preferred, possibly also `.ix` if careful to force the integer position behavior. Since regular `getitem` syntax does this already, that's why it's OK. – ely Oct 25 '16 at 17:22
21

There is no specific convenience function.

You'd have to do something like:

first_ten = pd.DataFrame()
rest = pd.DataFrame()

if df.shape[0] > 10: # len(df) > 10 would also work
    first_ten = df[:10]
    rest = df[10:]
EdChum
6

A method based on np.split:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [2, 4, 6, 8, 10, 2, 4, 6, 8, 10],
                   'B': [10, -10, 0, 20, -10, 10, -10, 0, 20, -10],
                   'C': [4, 12, 8, 0, 0, 4, 12, 8, 0, 0],
                   'D': [9, 10, 0, 1, 3, np.nan, np.nan, np.nan, np.nan, np.nan]})

listOfDfs = [df.loc[idx] for idx in np.split(df.index, 5)]

A small function that uses a modulo could take care of cases where the split is not even (e.g. np.split(df.index,4) will throw an error).

(Yes, I am aware that the original question was somewhat more specific than this. However, this is supposed to answer the question in the title.)
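If you would rather not hand-roll the modulo logic, np.array_split (unlike np.split) tolerates sections of unequal size; a small sketch with an illustrative 10-row frame:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': range(10)})

# 10 rows into 4 sections: np.array_split absorbs the remainder and
# yields chunks of sizes 3, 3, 2, 2 instead of raising like np.split.
listOfDfs = [df.loc[idx] for idx in np.array_split(df.index, 4)]
print([len(d) for d in listOfDfs])  # [3, 3, 2, 2]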

webelo
4

Below is a simple function implementation which splits a DataFrame into chunks, along with a few code examples:

import pandas as pd

def split_dataframe_to_chunks(df, n):
    df_len = len(df)
    count = 0
    dfs = []

    while True:
        if count > df_len-1:
            break

        start = count
        count += n
        #print("%s : %s" % (start, count))
        dfs.append(df.iloc[start : count])
    return dfs


# Create a DataFrame with 10 rows
df = pd.DataFrame([i for i in range(10)])

# Split the DataFrame to chunks of maximum size 2
split_df_to_chunks_of_2 = split_dataframe_to_chunks(df, 2)
print([len(i) for i in split_df_to_chunks_of_2])
# prints: [2, 2, 2, 2, 2]

# Split the DataFrame to chunks of maximum size 3
split_df_to_chunks_of_3 = split_dataframe_to_chunks(df, 3)
print([len(i) for i in split_df_to_chunks_of_3])
# prints [3, 3, 3, 1]
Roei Bahumi
3

If you have a large data frame and need to divide it into a variable number of sub data frames, where for example each sub dataframe has at most 4500 rows, this script could help:

max_rows = 4500
dataframes = []
while len(df) > max_rows:
    top = df[:max_rows]
    dataframes.append(top)
    df = df[max_rows:]
dataframes.append(df)  # whatever is left over (at most max_rows rows)

You could then save out these data frames:

for i, frame in enumerate(dataframes):
    frame.to_csv(str(i) + '.csv', index=False)

Hope this helps someone!

cheevahagadog
3
import os

def split_and_save_df(df, name, size, output_dir):
    """
    Split a df and save each chunk in a different csv file.

    Parameters:
        df : pandas df to be split
        name : name to give to the output file
        size : chunk size
        output_dir : directory where to write the divided df
    """
    # The .loc slicing below assumes the default RangeIndex (integer
    # labels 0..len(df)-1); .loc includes both slice endpoints.
    for i in range(0, df.shape[0], size):
        start = i
        end = min(i + size - 1, df.shape[0])
        subset = df.loc[start:end]
        output_path = os.path.join(output_dir, f"{name}_{start}_{end}.csv")
        print(f"Going to write into {output_path}")
        subset.to_csv(output_path)
        output_size = os.stat(output_path).st_size
        print(f"Wrote {output_size} bytes")
Romain Jouin
2

You can use the DataFrame head and tail methods as syntactic sugar instead of slicing/loc here. I use a split size of 3; for your example, use headSize=10.

import numpy as np
import pandas as pd

def split(df, headSize):
    hd = df.head(headSize)
    tl = df.tail(len(df) - headSize)
    return hd, tl

df = pd.DataFrame({'A': [2, 4, 6, 8, 10, 2, 4, 6, 8, 10],
                   'B': [10, -10, 0, 20, -10, 10, -10, 0, 20, -10],
                   'C': [4, 12, 8, 0, 0, 4, 12, 8, 0, 0],
                   'D': [9, 10, 0, 1, 3, np.nan, np.nan, np.nan, np.nan, np.nan]})

# Split dataframe into top 3 rows (first) and the rest (second)
first, second = split(df, 3)
Tom Walker
0

A method based on a list comprehension and groupby, which stores all the split dataframes in a list variable; each one can then be accessed using its index.

Example:

ans = [pd.DataFrame(y) for x, y in DF.groupby('column_name', as_index=False)]
ans[0]
ans[0].column_name
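
For example, with a made-up frame (the column name and values are purely illustrative), each distinct value of column_name becomes its own dataframe:

import pandas as pd

DF = pd.DataFrame({'column_name': ['x', 'x', 'y'], 'value': [1, 2, 3]})

# One dataframe per distinct value of 'column_name'.
ans = [pd.DataFrame(y) for x, y in DF.groupby('column_name', as_index=False)]
print(len(ans))                     # 2
print(ans[0].column_name.tolist())  # ['x', 'x']
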
Ram Prajapati