
I'm reading a CSV in chunks of 1,000 rows and merging each chunk into a pandas DataFrame, but the merged result isn't saved; a new DataFrame is produced each iteration instead. How can I keep the merge result across iterations?

def mergeDFs():
    def merge(x):
        df = df.merge(x, left_on="id", right_on="id")

    reader = pd.read_csv("train_lag.csv", chunksize=1000)

    for r in reader:
        merged = merge(r)
    return merged

1 Answer


Consider concat via a list comprehension:

def proc_merge(x):
    return df.merge(x, on="id")

reader = pd.read_csv("train_lag.csv", chunksize=1000)

final_df = pd.concat([proc_merge(r) for r in reader])
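As a minimal self-contained sketch of this pattern (with small inline frames standing in for your `df` and for the chunks that `pd.read_csv(..., chunksize=1000)` would yield):

```python
import pandas as pd

# Hypothetical stand-in for the existing df being merged against
df = pd.DataFrame({"id": [1, 2, 3], "val": ["a", "b", "c"]})

# Simulated chunks, as a chunked CSV reader would yield them
chunks = [
    pd.DataFrame({"id": [1, 2], "lag": [10, 20]}),
    pd.DataFrame({"id": [3], "lag": [30]}),
]

def proc_merge(x):
    return df.merge(x, on="id")

# Merge each chunk, then stack all the merged pieces into one frame
final_df = pd.concat([proc_merge(c) for c in chunks], ignore_index=True)
print(final_df)
```

Each chunk is merged independently against `df`, and `pd.concat` stacks the results; `ignore_index=True` gives the combined frame a clean sequential index.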
– Parfait
  • This worked, but returning final_df returned more rows than either had originally. – John Milton Oct 26 '19 at 10:31
  • That might mean your IDs are not unique in either data frame so many-to-many merge is occurring. Any additional field to consider? Try also concatenating first then merging entire `train_lag` dataframe with `df`. – Parfait Oct 26 '19 at 16:32
  • You were absolutely right, thank you lots. Rearranging the Id field fixed my issue – John Milton Oct 26 '19 at 18:41
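To illustrate the many-to-many issue raised in the comments, here is a small sketch: when `id` is duplicated on both sides, `merge` produces the Cartesian product per key, which is why the result can have more rows than either input. The `validate=` parameter can be used to catch this instead of letting the rows silently multiply:

```python
import pandas as pd

# "id" is duplicated on BOTH sides
left = pd.DataFrame({"id": [1, 1], "x": ["a", "b"]})
right = pd.DataFrame({"id": [1, 1], "y": ["c", "d"]})

# Many-to-many merge: 2 matching rows x 2 matching rows = 4 output rows
m = left.merge(right, on="id")
print(len(m))  # 4

# validate= raises MergeError rather than silently exploding the row count
try:
    left.merge(right, on="id", validate="one_to_many")
except pd.errors.MergeError:
    print("left keys are not unique")
```

Passing `validate="one_to_many"` asserts that the left frame's keys are unique, so duplicated IDs surface as an explicit error at merge time.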