I'm trying to read, merge, and append a large amount of data from CSV files. The basics are all working correctly. However, my result set is being overwritten on each loop iteration and I haven't been able to correct it.

Data in both files is very straightforward:

# Small data set
A,B,C
1,2,101
3,4,102
9,10,103

# Large data set (read in chunks below)
A,B,C
1,2,1000
3,4,2000
9,10,3000

Sample Code

import pandas as pd

# Read CSVs
inventory_1 = pd.read_csv("file1.csv")

# Create new DF to hold the merge results
bucket = pd.DataFrame(columns=list("ABC"))

# Use chunk to read in the large file, merge and append the data
for chunk in pd.read_csv("file2.csv",chunksize=2):
    chunk_merge = pd.merge(
        inventory_1, chunk,
        left_on=['A'],
        right_on=['A'],
        how='left')
    result = bucket.append(chunk_merge)
    print(result)

What happens is the merge works correctly on the data in each chunk, but the previous chunk's results are overwritten in the result. So, in the example above I get:

# 1st Loop
  A    B   B_x  B_y    C     C_x    C_y
0  1  NaN   2.0  2.0  NaN  1000.0  101.0
1  3  NaN   4.0  4.0  NaN  2000.0  102.0
2  9  NaN  10.0  NaN  NaN  3000.0    NaN

# 2nd Loop
   A    B   B_x   B_y    C     C_x    C_y
0  1  NaN   2.0   NaN  NaN  1000.0    NaN
1  3  NaN   4.0   NaN  NaN  2000.0    NaN
2  9  NaN  10.0  10.0  NaN  3000.0  103.0

The answer I need is:

   A  B_x   C_x  B_y  C_y
0  1    2  1000    2  101
1  3    4  2000    4  102
2  9   10  3000   10  103

I feel like the answer is staring me in the face but I can't see it. Any help would be appreciated.

FunnyChef
  • `df1.merge(df2, on='A')` – piRSquared Jun 18 '18 at 20:19
  • For the overwriting problem: you use `append` on `bucket`, which is an empty dataframe and always stays empty. `append` on a DataFrame does not work like it does on a `list`, so the `chunk_merge` from the first loop is lost when you reassign `result` in the second loop, because `bucket` never contains the value of `chunk_merge` – Ben.T Jun 18 '18 at 20:24
  • [**Never call DataFrame.append or pd.concat inside a for-loop. It leads to quadratic copying**](https://stackoverflow.com/a/36489724/1422451) – Parfait Jun 18 '18 at 20:45
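The behaviour the comments describe can be demonstrated with a minimal sketch: the concatenation returns a *new* DataFrame rather than modifying `bucket` in place, so `bucket` stays empty and each loop's `result` only ever contains the latest chunk. (The sketch uses `pd.concat`, since `DataFrame.append` was removed in pandas 2.0; the behaviour it illustrates is the same.)

```python
import pandas as pd

bucket = pd.DataFrame(columns=list("ABC"))
chunk = pd.DataFrame({"A": [1], "B": [2], "C": [101]})

# concat (like the old DataFrame.append) returns a NEW frame;
# bucket itself is never modified
result = pd.concat([bucket, chunk])

print(len(bucket))  # still 0 -- bucket was never updated
print(len(result))  # 1
```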

2 Answers


As I said in a comment, the overwriting problem comes from using `append` on a DataFrame the way you did: the data is lost when you reassign `result`. With the sample you give, you can append each `chunk_merge` to a list inside the loop and then use `pd.concat` once at the end.

import pandas as pd

inventory_1 = pd.read_csv("file1.csv")
list_to_concat = []  # empty list that collects each chunk_merge
for chunk in pd.read_csv("file2.csv", chunksize=2):
    list_to_concat.append(pd.merge(
        inventory_1, chunk,
        on='A',        # a simple on= works since both columns share the name
        how='inner'))  # 'inner' helps for concat; to keep 'left', a dropna is necessary
result = pd.concat(list_to_concat)  # add .dropna() if using 'left' above

With your data, I artificially split your "large data set" into one df of 2 rows and another of 1 row to recreate the chunking, and at the end I get:

result
Out[366]: 
   A  B_x  C_x  B_y   C_y
0  1    2  101    2  1000
1  3    4  102    4  2000
0  9   10  103   10  3000

Note that C_x and C_y are swapped (B_x and B_y too, though you can't see it with your data) because you merge with inventory_1 as the left frame, but otherwise it's what you want.
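If you want the large file's columns under the _x suffix, exactly as in the question's desired output, swap the merge order so the chunk is the left frame. A self-contained sketch of that variation, using in-memory CSVs via `io.StringIO` in place of the actual files:

```python
import io
import pandas as pd

# in-memory stand-ins for file1.csv / file2.csv
file1 = io.StringIO("A,B,C\n1,2,101\n3,4,102\n9,10,103\n")
file2 = io.StringIO("A,B,C\n1,2,1000\n3,4,2000\n9,10,3000\n")

inventory_1 = pd.read_csv(file1)

# put the chunk on the LEFT so the large file's B and C get the _x suffix
list_to_concat = []
for chunk in pd.read_csv(file2, chunksize=2):
    list_to_concat.append(pd.merge(chunk, inventory_1, on='A', how='inner'))

result = pd.concat(list_to_concat, ignore_index=True)
print(result)  # columns: A, B_x, C_x, B_y, C_y
```

With `ignore_index=True` the result also gets a clean 0..n-1 index instead of the repeated chunk indices shown above.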

Ben.T
>>> df1=pd.DataFrame({'A': [1,3,9], 'B': [2,4,10], 'C': [101,102,103]})
>>> df2=pd.DataFrame({'A': [1,3,9], 'B': [2,4,10], 'C': [1000, 2000, 3000]})
>>> 
>>> df2.merge(df1, on='A')
   A  B_x   C_x  B_y  C_y
0  1    2  1000    2  101
1  3    4  2000    4  102
2  9   10  3000   10  103
>>> 
Sunitha