
I'm reading a CSV in chunks of 1,000 rows and merging each chunk into a pandas DataFrame, but the merged result isn't saved; a new DataFrame is produced each iteration instead. How can I keep the merge result across iterations?

def mergeDFs():
    def merge(x):
        df = df.merge(x, left_on="id", right_on="id")

    reader = pd.read_csv("train_lag.csv", chunksize=1000)

    for r in reader:
        merged = merge(r)
    return merged

1 Answer


Consider concat via a list comprehension:

def proc_merge(x):
    return df.merge(x, on="id")

reader = pd.read_csv("train_lag.csv", chunksize=1000)

final_df = pd.concat([proc_merge(r) for r in reader])
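As a minimal self-contained sketch of this pattern (with small inline frames standing in for your `df` and for the chunks that `pd.read_csv(..., chunksize=1000)` would yield):

```python
import pandas as pd

# Hypothetical stand-in for the existing df being merged against
df = pd.DataFrame({"id": [1, 2, 3], "val": ["a", "b", "c"]})

# Simulated chunks, as a chunked CSV reader would yield them
chunks = [
    pd.DataFrame({"id": [1, 2], "lag": [10, 20]}),
    pd.DataFrame({"id": [3], "lag": [30]}),
]

def proc_merge(x):
    return df.merge(x, on="id")

# Merge each chunk, then stack all the merged pieces into one frame
final_df = pd.concat([proc_merge(c) for c in chunks], ignore_index=True)
print(final_df)
```

Each chunk is merged independently against `df`, and `pd.concat` stacks the results; `ignore_index=True` gives the combined frame a clean sequential index.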
– Parfait
  • This worked, but returning final_df returned more rows than either had originally. – John Milton Oct 26 '19 at 10:31
  • That might mean your IDs are not unique in either data frame so many-to-many merge is occurring. Any additional field to consider? Try also concatenating first then merging entire `train_lag` dataframe with `df`. – Parfait Oct 26 '19 at 16:32
  • You were absolutely right, thank you lots. Rearranging the Id field fixed my issue – John Milton Oct 26 '19 at 18:41
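To illustrate the many-to-many issue raised in the comments, here is a small sketch: when `id` is duplicated on both sides, `merge` produces the Cartesian product per key, which is why the result can have more rows than either input. The `validate=` parameter can be used to catch this instead of letting the rows silently multiply:

```python
import pandas as pd

# "id" is duplicated on BOTH sides
left = pd.DataFrame({"id": [1, 1], "x": ["a", "b"]})
right = pd.DataFrame({"id": [1, 1], "y": ["c", "d"]})

# Many-to-many merge: 2 matching rows x 2 matching rows = 4 output rows
m = left.merge(right, on="id")
print(len(m))  # 4

# validate= raises MergeError rather than silently exploding the row count
try:
    left.merge(right, on="id", validate="one_to_many")
except pd.errors.MergeError:
    print("left keys are not unique")
```

Passing `validate="one_to_many"` asserts that the left frame's keys are unique, so duplicated IDs surface as an explicit error at merge time.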