
I have the following code, which recursively iterates over a directory containing thousands of CSV files and attempts to read and append them all to one DataFrame:

import glob

import pandas as pd

# symbol and numToLetter are defined earlier in the script
df = pd.DataFrame()
symbol = symbol.upper()

for filepath in glob.iglob(r'W:\data\{0}\option\**\**\**.csv'.format(188), recursive=True):

    optionNameCSI = filepath.split("\\")[-1].split('.')[0]
    try:
        tmp = pd.read_csv(filepath, engine='c')
        strike = tmp['Strike'].iloc[-1]
        expiry = pd.to_datetime(tmp['Option Expiration Date'].iloc[-1])
        m = expiry.month
        y = expiry.year
        PutCall = tmp['PutCall'].iloc[-1]
        future = symbol + numToLetter[m] + str(y)
    except (IndexError, KeyError):
        continue

    if df.empty:
        df = tmp
    else:
        df = df.append(tmp)

    print(optionNameCSI, 'loaded')

However, this code starts off iterating very quickly, then slows down exponentially and never completes. Is there something I'm doing wrong? I know that the file paths are all acquired correctly, so it's the growing of the DataFrame that is the issue.

Évariste Galois
  • You are appending to a DataFrame within a loop, which needlessly copies data and is extremely inefficient (which is why it starts out fine, but then slows to a halt). Append to a list within the loop and concatenate once after (see the sketch after these comments). – ALollz Nov 19 '18 at 14:40
  • [Alexander's Solution](https://stackoverflow.com/a/37009561/4333359) illustrates this. – ALollz Nov 19 '18 at 14:42
  • Another illustration from @unutbu with wise words: [Never call DataFrame.append or pd.concat inside a for-loop. It leads to quadratic copying.](https://stackoverflow.com/a/36489724/1422451) – Parfait Nov 19 '18 at 14:48
  • Nice! I was not aware of this. If we're trying to be as efficient as possible, is there any significant difference in the performance of Alexander's solution compared to the concatenation? – Évariste Galois Nov 19 '18 at 14:53
  • I'd just use the `csv` module tbh. Every attempt I've made at growing a DF like this has been crippling in speed and memory. I have not found a hack to get around it. The odd join or concat, maybe, but the overhead is gross once you put it in a loop. – roganjosh Nov 19 '18 at 14:55
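In sketch form, the fix the comments describe looks like this (a minimal sketch: the `frames` name is arbitrary, and the per-file column handling and error checks from the original loop are omitted):

import glob

import pandas as pd

frames = []  # collect each file's DataFrame here instead of growing df in place

for filepath in glob.iglob(r'W:\data\{0}\option\**\**\**.csv'.format(188), recursive=True):
    frames.append(pd.read_csv(filepath, engine='c'))

# a single concat copies each frame once, instead of re-copying the
# ever-growing DataFrame on every iteration
df = pd.concat(frames, ignore_index=True)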

1 Answer


Consider separating your code into separate functions, like so:

def get_data_from_csv(filepath):
    optionNameCSI = filepath.split("\\")[-1].split('.')[0]
    try:
        df = pd.read_csv(filepath, engine='c')
        # do stuff ...
        return df
    except (IndexError, KeyError):
        # skip files that are missing the expected columns or rows
        return None

Then you can use a list comprehension to gather all the data into a list, as the commenters above suggested:

filepaths = glob.iglob(r'W:\data\{0}\option\**\**\**.csv'.format(188), recursive=True)
result = [get_data_from_csv(filepath) for filepath in filepaths]
result = [r for r in result if r is not None] # remove 'None' values

Then join the data with a single call to `pd.concat`:

df = pd.concat(result)
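
A single `pd.concat` call copies each frame once, rather than re-copying the accumulated DataFrame on every iteration, which is what made the original loop grind to a halt. If the per-file row indices don't matter, you can also pass `ignore_index=True` (optional, not required for correctness) to get a clean sequential index on the combined frame:

df = pd.concat(result, ignore_index=True)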
robertwest