
I am trying to parse a huge csv file (around 50 million rows) using pandas' read_csv method.

Below is the code snippet I am using:

df_chunk = pd.read_csv(db_export_file, delimiter='~!#', engine='python', header=None, keep_default_na=False, na_values=[''], chunksize=10 ** 6, iterator=True)

Thereafter, using the pd.concat method, I combine the chunks into a single dataframe, which is used for further processing (roughly as sketched below).
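For reference, the concatenation step looks roughly like this (a minimal sketch; df is just an illustrative name, and the chunk reader returned by read_csv with chunksize and iterator=True is itself iterable):

# pd.concat accepts any iterable of dataframes, including the chunk reader above
df = pd.concat(df_chunk, ignore_index=True)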

Everything works fine, except that the read operation from that csv file takes almost 6 minutes to create the dataframe.

My question is: is there any other way to make this process faster using the same module and method?

Below is the sample data from the csv file:

155487~!#-64721487465~!#A1_NUM~!#1.000
155487~!#-45875722734~!#A32_ENG~!#This is a good facility
458448~!#-14588001153~!#T12_Timing~!#23-02-2015 14:50:30
458448~!#-10741214586~!#Q2_56!#

Thanks in advance

  • We would need more information, e.g. what types are the columns, do you have repeated string data, are they properly typed (e.g. no mixed numeric & non-numeric data), etc. Showing us a sample would help. – jpp Jul 30 '18 at 09:21
  • Hi, I have updated the question with sample file data; it is actually of mixed types. – Sourajit Roy Chowdhury Jul 30 '18 at 09:33
  • I think your best choice is [split the csv](http://man7.org/linux/man-pages/man1/split.1.html) and then read all chunks using multiprocessing: https://stackoverflow.com/questions/36587211/easiest-way-to-read-csv-files-with-multiprocessing-in-pandas – BFajardo Jul 30 '18 at 09:28

2 Answers


I think your best choice is to split the csv:

split -l LINES_PER_FILE YOUR.CSV OUTPUT_NAMES
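For example, with a million lines per piece (the part_ prefix is just illustrative; GNU split's --additional-suffix option gives the pieces a .csv extension so the script below picks them up):

split -l 1000000 --additional-suffix=.csv db_export_file part_
# produces part_aa.csv, part_ab.csv, ...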

and then read all chunks using multiprocessing. You have an example here:

import os
import pandas as pd 
from multiprocessing import Pool

# wrap your csv importer in a function that can be mapped
def read_csv(filename):
    'converts a filename to a pandas dataframe'
    return pd.read_csv(filename)


def main():
    # set up your pool
    pool = Pool(processes=8) # or whatever your hardware can support

    # get a list of file names
    files = os.listdir('.')
    file_list = [filename for filename in files if filename.endswith('.csv')]

    # have your pool map the file names to dataframes
    df_list = pool.map(read_csv, file_list)
    pool.close()
    pool.join()

    # reduce the list of dataframes to a single dataframe
    combined_df = pd.concat(df_list, ignore_index=True)
    return combined_df

if __name__ == '__main__':
    main()
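Note that the read_csv wrapper above uses the pandas defaults; for the '~!#'-delimited, headerless file in the question you would presumably carry over the same parsing options, roughly like this:

# hypothetical adaptation: reuse the parsing options from the question in each worker
def read_csv(filename):
    return pd.read_csv(filename, delimiter='~!#', engine='python',
                       header=None, keep_default_na=False, na_values=[''])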
BFajardo

My case and how it was solved

I had a similarly huge dataset, and implementing a custom converter was mandatory. pandas.read_csv() was taking ages because of the custom converter.

The solution for me was to use modin. It was simple: I just had to change the import at the top, and everything else was handled automatically.

Take a look at the page: https://github.com/modin-project/modin
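For reference, the switch is roughly just the import (a sketch, not tested against the question's exact call; modin may fall back to plain pandas for options it does not accelerate, such as the python engine needed for a multi-character delimiter):

import modin.pandas as pd  # instead of: import pandas as pd

# the rest of the code stays the same, e.g. (filename is illustrative):
df = pd.read_csv('db_export.csv', delimiter='~!#', engine='python',
                 header=None, keep_default_na=False, na_values=[''])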

Kots