
I have multiple CSV files saved in one folder, all with the same column layout, and I want to load them into Python as a single pandas DataFrame.

The question is really similar to this thread.

I am using the following code:

import glob
import pandas as pd
salesdata = pd.DataFrame()
for f in glob.glob("TransactionData/Promorelevant/*.csv"):
    appenddata = pd.read_csv(f, header=None, sep=";")
    salesdata = salesdata.append(appenddata, ignore_index=True)

Is there a better solution for this, perhaps with another package?

This is taking too much time.

Thanks

Hari Krishnan
PV8

4 Answers


I suggest using a list comprehension with `concat`:

import glob
import pandas as pd

files = glob.glob("TransactionData/Promorelevant/*.csv")
dfs = [pd.read_csv(f, header=None, sep=";") for f in files]

salesdata = pd.concat(dfs, ignore_index=True)
jezrael
  • `pd.read_csv` can load data using a file path, any specific reason for using `glob`? – Shiva Sep 12 '18 at 07:14
  • @Shiva - Yes, `glob` returns all file paths, so it is necessary. – jezrael Sep 12 '18 at 07:14
  • This is barely any different from the question? Though I guess `concat` can be faster than `append` (or does `append` use `concat` behind the scenes?). It could be better optimized as it is one operation, I guess. – PascalVKooten Sep 12 '18 at 07:20
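
As the comment thread hints, the cost difference is algorithmic: growing a DataFrame inside the loop copies all previously accumulated rows on every iteration (quadratic overall), while a single `concat` over a list copies each row once. A minimal sketch with toy data (the sizes are made up for illustration):

```python
import pandas as pd

# 100 small DataFrames standing in for the per-file reads.
parts = [pd.DataFrame({"a": range(3)}) for _ in range(100)]

# Quadratic pattern: re-concatenate the accumulator on every iteration
# (this is essentially what repeated .append did).
slow = parts[0]
for part in parts[1:]:
    slow = pd.concat([slow, part], ignore_index=True)

# Linear pattern: collect everything first, concatenate once.
fast = pd.concat(parts, ignore_index=True)

# Same result either way; only the cost at scale differs.
assert slow.equals(fast)
print(len(fast))  # 300
```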

With help from the link to the actual answer, this seems to be the best one-liner:

import glob, os
import pandas as pd

df = pd.concat(map(pd.read_csv, glob.glob(os.path.join('', "*.csv"))))
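
For what it's worth, the same pattern can be written with `pathlib` instead of `glob`/`os.path`. The sketch below is self-contained: it first writes two made-up `;`-separated sample files to a temporary folder, since the real folder path is specific to the question:

```python
import tempfile
from pathlib import Path
import pandas as pd

# Create two tiny sample files standing in for the real CSVs.
tmp = Path(tempfile.mkdtemp())
(tmp / "0.csv").write_text("1;2\n3;4\n")
(tmp / "1.csv").write_text("5;6\n7;8\n")

# Path.glob replaces glob.glob + os.path.join; sorted() fixes the order.
df = pd.concat(
    (pd.read_csv(p, header=None, sep=";") for p in sorted(tmp.glob("*.csv"))),
    ignore_index=True,
)
print(df.shape)  # (4, 2)
```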
Muhammad Haseeb

Maybe using bash will be faster:

head -n 1 "TransactionData/Promorelevant/0.csv" > merged.csv
tail -q -n +2 TransactionData/Promorelevant/*.csv >> merged.csv

Or if using from within a jupyter notebook

!head -n 1 "TransactionData/Promorelevant/0.csv" > merged.csv
!tail -q -n +2 TransactionData/Promorelevant/*.csv >> merged.csv

The idea being that you won't need to parse anything.

The first command copies the header of one of the files. You can skip this line if you don't have a header. `tail` skips the header of every file and appends the remaining lines to the CSV.

Appending in Python is probably more expensive.

Of course, make sure your parse is still valid using pandas.

pd.read_csv("merged.csv")
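
A quick way to validate the shell merge is to check that the row count and columns come out as expected. The sketch below is self-contained (it fabricates two small sample files in a temporary folder, and assumes POSIX `head`/`tail` are available, as the answer already does):

```python
import subprocess
import tempfile
from pathlib import Path
import pandas as pd

# Two made-up sample files, each with the same "a;b" header.
tmp = Path(tempfile.mkdtemp())
(tmp / "0.csv").write_text("a;b\n1;2\n3;4\n")
(tmp / "1.csv").write_text("a;b\n5;6\n")

# Same head/tail merge as in the answer, on the sample files.
merged = tmp / "merged.csv"
subprocess.run(f"head -n 1 {tmp}/0.csv > {merged}", shell=True, check=True)
subprocess.run(
    f"tail -q -n +2 {tmp}/0.csv {tmp}/1.csv >> {merged}", shell=True, check=True
)

# The header appears exactly once; all data rows survive.
df = pd.read_csv(merged, sep=";")
print(len(df))  # 3
```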

Curious to see your benchmark.

gelonida
PascalVKooten
  • @PV8 What do you not understand? The first line copies the header of one of the files to `merged.csv`; the second line appends all CSVs while omitting their headers. Since it does not have to parse anything, it will be lightning fast. – PascalVKooten Nov 14 '19 at 14:43
  • Assuming my files in the folder are named `0.csv`, `1.csv` and so on, and the folder path is still `TransactionData\Promorelevant`, what do I have to write to use your code? – PV8 Nov 14 '19 at 14:54
  • @PV8 Updated the example with the folder, does it work? – PascalVKooten Nov 14 '19 at 15:05
  • The 1 is identified as invalid syntax; I have to run this in a Jupyter notebook, right? – PV8 Nov 14 '19 at 15:07
  • @PV8 Updated to add quotes and the better slashes, and you have to have the `!` in front if using Jupyter. The server the notebook is running on is Linux-based, right? – PascalVKooten Nov 14 '19 at 15:12
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/202381/discussion-between-pv8-and-pascalvkooten). – PV8 Nov 15 '19 at 07:34

I checked all of these approaches except the bash one with the time function (only one run; also note that the files are on a shared drive).

Here are the results (in seconds):

My approach: 1220.49

List comprehension + concat: 1135.53

concat + map + join: 1116.31

I will go for list comprehension + concat, which will save me some minutes and which I feel quite familiar with.

Thanks for your ideas.

PV8