
About two years ago someone posted a very elegant way of reading multiple CSV files into one DataFrame: Import multiple csv files into pandas and concatenate into one DataFrame

import pandas as pd
from os import listdir

filepaths = ["./data/" + f for f in listdir("./data") if f.endswith('.csv')]
df = pd.concat(map(pd.read_csv, filepaths))

But what if you want a different separator, or your CSV files don't have headers? Where do you put arguments like `header=None` in the statements above?

  • You could use `functools.partial` or wrap `read_csv` in a lambda function, like so: `lambda x: pd.read_csv(x, header=None)` (see the sketch after these comments). Also worth understanding what `map` does – NomadMonad Apr 20 '20 at 19:51
  • list comprehension also works: `df = pd.concat([pd.read_csv(f, header=None) for f in filepaths])` – It_is_Chris Apr 20 '20 at 20:11
  • @Yo_Chris OK, but I read somewhere that using a list has a big impact on memory usage. Is that true or am I mistaken? – SBurggraaff Apr 21 '20 at 08:16
  • @SBurggraaff Lists can have a big impact on memory but that is not always the case especially with list comprehension: https://stackoverflow.com/questions/1247486/list-comprehension-vs-map. Also I just timed the difference between `map` and list comprehension on 6000 csv files and list comprehension actually was faster. List comprehension: `16 s ± 532 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)` Map: `16.2 s ± 329 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)` – It_is_Chris Apr 21 '20 at 13:00
  • @Yo_Chris Ok, again thanks for all the info. It's back to using lists again. I looked up some two year old code I made from examples I found. Turns out I was adding dataframes to a list and then using that list in the pd.concat function. In hindsight that looks pretty stupid when your dataframes get really big. List comprehension looks really nice and clean by the way. – SBurggraaff Apr 21 '20 at 14:08
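For concreteness, here is a minimal sketch of both approaches suggested in the comments above (the `header=None` argument and the `./data` directory are just placeholders):

import functools
import pandas as pd
from os import listdir

filepaths = ["./data/" + f for f in listdir("./data") if f.endswith('.csv')]

# Option 1: functools.partial pre-binds keyword arguments to read_csv,
# so the result can still be passed to map
read_headerless = functools.partial(pd.read_csv, header=None)
df = pd.concat(map(read_headerless, filepaths))

# Option 2: a list comprehension makes the extra arguments explicit
df = pd.concat([pd.read_csv(f, header=None) for f in filepaths])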

1 Answer


You can use `itertools.starmap`.

This function takes:

  • a function as its first argument,
  • and an iterable of argument tuples as its second argument; it calls the function once per tuple, unpacking each tuple into positional arguments (a quick illustration follows).
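For intuition, a toy illustration with a built-in function (nothing pandas-specific here):

import itertools as it

# Each tuple is unpacked into positional arguments: pow(2, 3), pow(10, 2)
print(list(it.starmap(pow, [(2, 3), (10, 2)])))  # [8, 100]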

I ran the following example:

import itertools as it
import pandas as pd

# read_csv wrapper
def rd_csv(fn, separ, hdr):
    print(f'File: {fn} / sep: {separ} / header: {hdr}')
    if hdr is None:
        return pd.read_csv(fn, sep=separ,
            names=['ind', 'POLL_X', 'POLL_Y', 'POLL_Z', 'POLL_DNW', 'AVal', 'ZVal'])
    else:
        return pd.read_csv(fn, sep=separ, header=hdr)

# Parameters for consecutive calls (filename, separator, header)
inputs = [ ('Input_1.csv', ',', 0), ('Input_2.csv', ';', None) ]

# Read all files
res = pd.concat(it.starmap(rd_csv, inputs), ignore_index=True)

The test printouts during execution are:

File: Input_1.csv / sep: , / header: 0
File: Input_2.csv / sep: ; / header: None

(drop the `print` call in the final version).

Note the special treatment of the case when `header` is `None`.

The reason is that:

  • DataFrames read with `header=0` take their column names from the first line of the file,
  • DataFrames read with `header=None` get consecutive integers (0, 1, 2, …) as column names,

so on concatenation these columns will not line up properly.

To take this into account, when `header` is `None` the wrapper function must specify the column names on its own (the `names` parameter), so it must include hard-coded column names.

The rationale behind this combination is that if all the source files are to be concatenated into one DataFrame, they should share a common set of column names, even if some input files do not actually contain them in the first row.
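A minimal sketch of why the `names` parameter matters (the column names here are made up for illustration): without it, the headerless frame gets integer column labels, and `pd.concat` treats `0` and `'a'` as different columns, filling the mismatch with NaN:

import pandas as pd

df1 = pd.DataFrame({'a': [1], 'b': [2]})  # as if read with header=0
df2 = pd.DataFrame([[3, 4]])              # as if read with header=None: columns are 0, 1

print(pd.concat([df1, df2]))              # 4 columns (a, b, 0, 1), half NaN

df2.columns = ['a', 'b']                  # what the names parameter achieves
print(pd.concat([df1, df2], ignore_index=True))  # 2 columns, no NaN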

Valdi_Bo
  • Ok, but that kind of defeats the original point: two clean lines of code. You end up with way more code if you start using a wrapper. – SBurggraaff Apr 21 '20 at 08:19
  • Map is an elegant solution, but here there are more parameters. Another important detail is that with *header=None* column names are given as integers, so the resulting DataFrames cannot be concatenated with corresponding columns lining up. To get around this, the wrapper must pass the list of column names on its own. Maybe someone else will come up with a better solution. – Valdi_Bo Apr 21 '20 at 08:53
  • Ok, but then isn't it pointless to use the map function? Wouldn't it be better to go back to using a list? Only, I thought I read somewhere that using a list to concatenate dataframes uses a lot of memory. So I was wondering about avoiding the list. – SBurggraaff Apr 21 '20 at 09:40