
Is there a limit on the number of rows when using the NumPy array_split method?

I have a DataFrame with over 6 million rows and would like to split it into 20 or so chunks.

My attempt followed the approach described in Split a large pandas dataframe, using NumPy and the array_split function; however, with such a large DataFrame it just runs forever.

My DataFrame is df, which has 8 columns and 6.6 million rows.

df_split = np.array_split(df, 20)

Any ideas on an alternative method to split this? Alternatively, tips to improve DataFrame performance are also welcome.

Jo Costa
  • How should the arrays be? How many rows per array? Maybe you could do it inside a loop or list comprehension. Also try passing `df.values` instead of `df`, i.e. `np.array_split(df.values, 20)`, and see if it is faster – mrbTT Jan 29 '19 at 11:44
  • Have you also tried [this](https://stackoverflow.com/a/28882020/10960882)? – markuscosinus Jan 29 '19 at 11:47
  • Given that it is 6.6 million rows, the n = 20 split would mean circa 330k rows per array. However, that split is a bit arbitrary; it could be 10, 20 or 100. My focus is on performance. – Jo Costa Jan 29 '19 at 11:47
  • Thank you @markuscosinus. Yes, I tried it, but it didn't seem to work either, so I reverted to the original method. – Jo Costa Jan 29 '19 at 11:48
  • Have you tried just using plain old `.iloc`, then maybe saving each view to a file and working with each file individually? (See the sketch after these comments.) – Tarifazo Jan 29 '19 at 12:49
  • What do you do after your split? If you are performing similar calculations on each section, consider `groupby` with `dask.dataframe`. – jpp Jan 29 '19 at 12:55
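For reference, a minimal sketch of the plain `.iloc` slicing suggested in the comments above, assuming df is the 6.6-million-row DataFrame from the question and chunk_XX.csv is a hypothetical output name:

import numpy as np

# 21 boundary indices that cut df into 20 roughly equal row ranges
n_chunks = 20
bounds = np.linspace(0, len(df), n_chunks + 1, dtype=int)

for i, (start, stop) in enumerate(zip(bounds[:-1], bounds[1:])):
    chunk = df.iloc[start:stop]                               # slice, no full split up front
    chunk.to_csv('chunk_{:02d}.csv'.format(i), index=False)   # hypothetical file name

Slicing this way avoids building all 20 sub-frames at once and lets each piece be written out (or uploaded) before the next one is taken.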

3 Answers


I do not have a general solution; however, there are two things you could consider:

  1. You could try loading the data in chunks, instead of loading it all and then splitting it. If you use pandas.read_csv, the skiprows argument would be the way to go.
  2. You could reshape your data with df.values.reshape((20, -1, 8)). However, this requires the number of rows to be divisible by 20, so you could drop the last few rows (at most 19) to make it fit. This would of course be the fastest solution. (See the sketch after this list.)
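
A rough sketch of both suggestions; data.csv is a hypothetical file name standing in for the real source, and df in option 2 is the already-loaded 6.6-million-row frame:

import numpy as np
import pandas as pd

# Option 1: read the CSV in 20 pieces using skiprows/nrows
chunk_size = 330_000                               # roughly 6.6M rows / 20
chunks = []
for i in range(20):
    part = pd.read_csv(
        'data.csv',                                # hypothetical file name
        skiprows=range(1, 1 + i * chunk_size),     # skip rows already read, keep the header row
        nrows=chunk_size,                          # read one chunk's worth of rows
    )
    if part.empty:
        break
    chunks.append(part)

# Option 2: reshape the underlying NumPy array; the row count must divide
# evenly by 20, so trim up to 19 trailing rows first
n_chunks = 20
usable = (len(df) // n_chunks) * n_chunks
blocks = df.values[:usable].reshape((n_chunks, -1, df.shape[1]))

Note that pandas.read_csv also accepts a chunksize argument that returns an iterator of DataFrames, which avoids re-scanning the skipped rows on every call.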
markuscosinus

Maybe this resolves your problem: split the DataFrame into chunks, as in this example:

import numpy as np
import pandas as pd

# Example DataFrame with 14 rows and 4 columns
df = pd.DataFrame(np.random.rand(14, 4), columns=['a', 'b', 'c', 'd'])

def chunker(seq, size):
    # Yield successive slices of `size` rows from the DataFrame
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))

for df_split in chunker(df, 5):
    print(df_split)
Kurollo
  • Thank you, so far it looks like this solution will help. Part of my problem was generating these chunks in order to import them into BigQuery. Looks like this solution will let me break the data up properly and insert it. – Jo Costa Jan 29 '19 at 14:27
  • If you mean inserting the data into a table, you can use sqlalchemy and concatenate all the values of a chunk into a single SQL query, then loop over the chunks; this is the best solution to improve performance, for example: sqlinsert = "INSERT INTO table (col_1, col_2, col_3) VALUES ("+your_bid_data_implode+")" (see the sketch after these comments) – Kurollo Jan 29 '19 at 15:01
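
If the target is a relational database reachable through SQLAlchemy, a minimal sketch of that batching idea using pandas' to_sql (the connection URL and the table name my_table are hypothetical, and df stands for the full DataFrame):

from sqlalchemy import create_engine

# Hypothetical connection string and table name
engine = create_engine('postgresql://user:password@host/dbname')

def chunker(df, size):
    # Same generator as in the answer above: yield successive size-row slices
    return (df[pos:pos + size] for pos in range(0, len(df), size))

for chunk in chunker(df, 330_000):
    # method='multi' packs many rows into each INSERT statement,
    # which is much faster than one INSERT per row
    chunk.to_sql('my_table', engine, if_exists='append', index=False,
                 method='multi', chunksize=10_000)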

With a few modifications to Houssem Maamria's code, this snippet could help someone trying to export each chunk to an Excel file.


import pandas as pd

# Read the source workbook, keeping the first column as the index
dfLista_90 = pd.read_excel('my_excel.xlsx', index_col=0)

limit = 200  # rows per output file

def chunker(df, size):
    # Yield successive slices of `size` rows from the DataFrame
    return (df[pos:pos + size] for pos in range(0, len(df), size))

# Write each chunk to its own numbered Excel file
for count, chunk in enumerate(chunker(dfLista_90, limit), start=1):
    chunk.to_excel('file_{:02d}.xlsx'.format(count), index=True)
Yes C.
  • Here are some guidelines for [How do I write a good answer?](https://stackoverflow.com/help/how-to-answer). This provided answer may be correct, but it could benefit from an explanation. Code-only answers are not considered "good" answers. From [review](https://stackoverflow.com/review). – Trenton McKinney Sep 26 '19 at 18:18