I would like to read multiple CSV files (hundreds of files, each with hundreds of lines but the same number of columns) from a target directory into a single Python Pandas DataFrame.

The code I wrote below works, but it is far too slow: it takes minutes to read just 30 files, so loading all of my files would take much longer. What can I change to make it faster?

Also, in the replace call I want to replace a "_" (I don't know its encoding, but it is not a normal one) with a normal utf-8 "-". How can I do that? I use coding=latin-1 because the files contain French accents.

#coding=latin-1

import pandas as pd
import glob

pd.set_option('expand_frame_repr', False)

path = r'D:\Python27\mypfe\data_test'
allFiles = glob.glob(path + "/*.csv")
frame = pd.DataFrame()
list_ = []
for file_ in allFiles:
    df = pd.read_csv(file_, index_col=None, header=0, sep=';', dayfirst=True,
                     parse_dates=['HeurePrevue', 'HeureDebutTrajet', 'HeureArriveeSurSite', 'HeureEffective'])
    df.drop(labels=['aPaye', 'MethodePaiement', 'ArgentPercu'], axis=1, inplace=True)
    df['Sens'].replace("\n", "-", inplace=True, regex=True)
    list_.append(df)

    print "fichier lu:", file_

frame = pd.concat(list_)

print frame
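
For the replace, I don't know how to find out what the character actually is. One way I could imagine hunting it down (a sketch; u'\u2013' is only a placeholder for whatever codepoint repr() reveals):

# inspect one raw value to discover the real codepoint of the strange character
print repr(frame['Sens'].iloc[0])

# then swap it for a normal hyphen; u'\u2013' (en dash) is just a guess
frame['Sens'] = frame['Sens'].str.replace(u'\u2013', u'-')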
ch36r5s
  • Can you just concatenate all files together before you ever get to Python? That way just one file gets read? An alternative would be to `zip` the files together, which will give you random access but with the benefits of the operating system's file cache. – chrisaycock Jul 12 '16 at 13:41
  • I don't really understand how to concatenate before getting to Python. Do you mean making all the files into one CSV file? And for `zip`, at which step? – ch36r5s Jul 12 '16 at 13:47
  • Yes, make one big CSV file or make a `zip` file. These have nothing to do with Python; you'll have to do this externally to Python. Once you have one file with all of the data, then have your Python script read that file directly. – chrisaycock Jul 12 '16 at 13:54
  • It sounds like a solution, but how do I do that automatically? And how do I make a `zip` file? I still prefer to do it inside Python, since I don't know exactly how many files I have to read. – ch36r5s Jul 12 '16 at 14:04
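
For reference, the "one big file" idea from the comments can also be done inside Python with plain file I/O; a minimal sketch, assuming every file shares the same header line (the combined.csv path is illustrative):

import glob

# concatenate the raw CSV files into one, keeping only the first file's header line
with open(r'D:\Python27\mypfe\combined.csv', 'wb') as out:
    for i, name in enumerate(glob.glob(r'D:\Python27\mypfe\data_test\*.csv')):
        with open(name, 'rb') as f:
            header = f.readline()
            if i == 0:
                out.write(header)
            out.write(f.read())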

1 Answer

You may try the following: read only the columns you really need, build the per-file frames with a list comprehension, and call pd.concat([ ... ], ignore_index=True) only once, because repeated concatenation is pretty slow:

# there is no sense to read columns that you don't need
# specify the column list (EXCLUDING: 'aPaye','MethodePaiement','ArgentPercu')
cols = ['col1', 'col2', 'etc.']
date_cols = ['HeurePrevue','HeureDebutTrajet','HeureArriveeSurSite','HeureEffective']

df = pd.concat(
    [pd.read_csv(f, sep=';', dayfirst=True, usecols=cols,
                 parse_dates=date_cols)
     for f in allFiles],
    ignore_index=True
)

This should work if you have enough memory to hold roughly two copies of the resulting DataFrame at once (the per-file frames plus the concatenated result)...
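
The "\n" -> "-" replacement from your loop can then be applied once to the combined frame instead of once per file:

# do the newline -> dash replacement in a single pass over the whole column
df['Sens'] = df['Sens'].replace("\n", "-", regex=True)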

MaxU - stand with Ukraine
  • It works for the `date_cols`, but when it comes to `usecols` it returns an error: `ValueError: Usecols do not match names`. And if I use column numbers instead, it gives: `IndexError: list index out of range` – ch36r5s Jul 19 '16 at 08:55
  • Well, you have to specify the exact column names you need in the `cols` list - I think the error message `ValueError: Usecols do not match names` is pretty self-explanatory... – MaxU - stand with Ukraine Jul 19 '16 at 10:35
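
A quick way to check the exact header names pandas sees (a sketch, reusing allFiles from the question) is to read a single row of one file and print its parsed columns:

# peek at the column names of the first file as pandas parses them
print pd.read_csv(allFiles[0], sep=';', nrows=1).columns.tolist()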