12

I have an 8 GB CSV file and I am not able to run the code below because it raises a memory error.

file = "./data.csv"
df = pd.read_csv(file, sep="/", header=0, dtype=str)

I would like to split the file into 8 smaller files (sorted by id) using Python, and finally have a loop so that the output file contains the output of all 8 files.

Alternatively, I would like to try parallel computing. The main goal is to process the 8 GB of data in Python with pandas. Thank you.

My CSV file contains many columns and uses '/' as the field separator:

id    venue           time             code    value ......
AAA   Paris      28/05/2016 09:10      PAR      45   ......
111   Budapest   14/08/2016 19:00      BUD      62   ......
AAA   Tokyo      05/11/2016 23:20      TYO      56   ......
111   LA         12/12/2016 05:55      LAX      05   ......
111   New York   08/01/2016 04:25      NYC      14   ......
AAA   Sydney     04/05/2016 21:40      SYD      2    ......
ABX   HongKong   28/03/2016 17:10      HKG      5    ......
ABX   London     25/07/2016 13:02      LON      22   ......
AAA   Dubai      01/04/2016 18:45      DXB      19   ......
...

5 Answers

9
import numpy as np
from multiprocessing import Pool

def processor(df):

    # Some work

    df.sort_values('id', inplace=True)
    return df

# df is the DataFrame loaded in the question; split it into 8 pieces,
# process them in parallel, and write each piece to its own file.
size = 8
df_split = np.array_split(df, size)

cores = 8
pool = Pool(cores)
for n, frame in enumerate(pool.imap(processor, df_split), start=1):
    frame.to_csv('{}'.format(n))
pool.close()
pool.join()
SayPy
  • Hey! This is cool! I was looking for something similar. But I get this error: frame.to_csv(output, sep="^", index=False.format(n)) AttributeError: 'bool' object has no attribute 'format' –  Jul 10 '17 at 14:39
  • where, output = "/file.csv" –  Jul 10 '17 at 14:39
  • frame.to_csv(output, sep="^", index=False) – SayPy Jul 10 '17 at 14:48
  • File "/usr/lib/python2.7/multiprocessing/pool.py", line 659, in next raise value IndexError: positional indexers are out-of-bounds –  Jul 10 '17 at 15:11
  • What is inside your processor function? – SayPy Jul 10 '17 at 15:12
  • I have this code: https://stackoverflow.com/questions/44966816/how-to-resolve-positional-index-error-in-python-while-solving-a-condition-in-pyt –  Jul 10 '17 at 15:12
  • frame.to_csv(output, sep="^", index=False) AttributeError: 'NoneType' object has no attribute 'to_csv' –  Jul 11 '17 at 11:25
  • @Iris: This approach will never work in the case of "I am not able to run the code as it shows memory error." Every `Process` is passed `df_split`, which is a list of `pandas.DataFrame`, and every `Process` needs its **own** memory, so the memory footprint is multiplied by 8. – stovfl Jul 11 '17 at 15:02
  • You are a savior. – ak3191 Dec 17 '18 at 20:28
  • You are a savior, but it is iterating over the first column again and again; how can I iterate over all the rows in parallel? I am trying to achieve something like this: https://stackoverflow.com/questions/53561794/iteration-over-a-pandas-df-in-parallel – ak3191 Dec 17 '18 at 20:48
6

Use the chunksize parameter to read one chunk at a time and save the files to disk. This will split the original file into equal parts of 100000 rows each:

file = "./data.csv"
chunks = pd.read_csv(file, sep="/", header=0, dtype=str, chunksize = 100000)

for it, chunk in enumerate(chunks):
    chunk.to_csv('chunk_{}.csv'.format(it), sep="/") 

If you know the number of rows in the original file, you can calculate the exact chunksize to split the file into 8 equal parts (nrows/8).
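
For illustration, here is one way that calculation could look; the line-counting loop and the per-chunk sort are my assumptions about how you might do it, not part of the original answer:

import math
import pandas as pd

file = "./data.csv"

# Count the data rows without loading the file into memory
# (assumes one record per line; subtract 1 for the header).
with open(file) as f:
    nrows = sum(1 for _ in f) - 1

chunksize = int(math.ceil(nrows / 8.0))

chunks = pd.read_csv(file, sep="/", header=0, dtype=str, chunksize=chunksize)
for it, chunk in enumerate(chunks):
    chunk.sort_values('id').to_csv('chunk_{}.csv'.format(it), sep="/", index=False)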

VinceP
  • Won't this still consume too much memory though, since the entire dataframe is loaded before iterating and saving? – DJK Jul 14 '17 at 00:11
  • No. The whole point of chunking is that it does not load the entire dataframe into memory. The variable `chunks` in my answer is an iterable object which occupies virtually no memory (read more here: http://pandas.pydata.org/pandas-docs/stable/io.html#io-chunking). Only when you iterate through `chunks` do you actually read a chunk-sized portion of the file into memory. – VinceP Jul 14 '17 at 08:31
5

pandas read_csv has two arguments that you can use to do what you want:

nrows : the number of rows to read
skiprows : the rows to skip (or the number of rows to skip) at the start of the file

Refer to documentation at: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
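
A minimal sketch combining the two parameters; the chunk size of 1000000 rows and the output file names are placeholder assumptions, not part of the original answer:

import pandas as pd

file = "./data.csv"
chunk_rows = 1000000  # number of rows to read per pass

part = 0
while True:
    # skiprows keeps the header (row 0) and skips the rows already read
    df = pd.read_csv(file, sep="/", header=0, dtype=str,
                     nrows=chunk_rows,
                     skiprows=range(1, 1 + part * chunk_rows))
    if df.empty:
        break
    df.to_csv('part_{}.csv'.format(part), sep="/", index=False)
    part += 1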

nitin
0

You might also want to use the dask framework and its built-in dask.dataframe. Essentially, the CSV file is transformed into multiple pandas DataFrames, each read in only when necessary. However, not every pandas command is available within dask.
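
A minimal sketch, assuming the same '/' separator and id column from the question:

import dask.dataframe as dd

# dask reads the CSV lazily, in partitions that fit in memory
ddf = dd.read_csv("./data.csv", sep="/", dtype=str)

# Apply a pandas-style operation to each partition
ddf = ddf.map_partitions(lambda part: part.sort_values('id'))

# Write one CSV file per partition ('*' is replaced by the partition number)
ddf.to_csv("out-*.csv", sep="/", index=False)

Note that this sorts each partition independently, like the multiprocessing answer above; a global sort by id would need something like set_index('id') or a later merge step.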

Quickbeam2k1
  • 5,287
  • 2
  • 26
  • 42
0

If you don't need all columns, you may also use the usecols parameter:

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

usecols : array-like or callable, default None

Return a subset of the columns. [...] 
Using this parameter results in much faster parsing time and lower memory usage.
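
For example, assuming only the id, time, and value columns from the sample are needed (the column list is just an illustration):

import pandas as pd

# Only the listed columns are parsed and kept in memory
df = pd.read_csv("./data.csv", sep="/", header=0, dtype=str,
                 usecols=["id", "time", "value"])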
gbajson