2

Trying to update each row from df1 to df2 if an unique value is matched. If not, append the row to df2 and assign new ID column.

df1 (NO ID COLUMN):

        unique_value        Status        Price
0       xyz123              bad           6.67
1       eff987              bad           1.75
2       efg125              okay          5.77

df2:

        unique_value        Status        Price        ID
0       xyz123              good          1.25         1000
1       xyz123              good          1.25         1000
2       xyz123              good          1.25         1000
3       xyz123              good          1.25         1000
4       xyz985              bad           1.31         1001
5       abc987              okay          4.56         1002
6       eff987              good          9.85         1003
7       asd541              excellent     8.85         1004

Desired output for updated df2:

        unique_value        Status        Price        ID
0       xyz123              bad           6.67         1000     <-updated
1       xyz123              bad           6.67         1000     <-updated
2       xyz123              bad           6.67         1000     <-updated
3       xyz123              bad           6.67         1000     <-updated
4       xyz985              bad           1.31         1001
5       abc987              okay          4.56         1002
6       eff987              bad           1.75         1003     <-updated
7       asd541              excellent     8.85         1004
8       efg125              okay          5.77         1005     <-appended  

Here is what I have done so far:

for i in range(0, len(df1)):
    if df1['unique_value'].isin(df2['unique_value'])[i] == True:
        ... update row in df2
    else:
        df2 = df2.append(i)
        ... assign row with new ID using pd.factorize and ID value at df2['ID'].max()+1

Note that I initial used pd.factorize to assign ID based on unique_value for df2 with values starting at 1000, 1001 (and so on) using this code: df2['ID'] = pd.factorize(df2['unique_value'])[0] + 1000

I tried using this solution (Updating a dataframe rows based on another dataframe rows), however it indexes my unique_value column, which prevents me from iterating another dataset moving forward.

Any way we can script this?

Thanks!

Mick
  • 265
  • 2
  • 10

1 Answers1

1

My strategies of implementing the two parts is explained as follows.

  1. Update existing rows: df2 can be updated via broadcasting, provided that the shape of the row from df1 is correctly reshaped into (1, 3). The broadcasting concept in pandas is identical to that of numpy.
  2. Append new rows: Assuming a consecutive index counting up from 0, a new row can be easily appended by directly calling df2.loc[len(df2), :] = ..., where len(df2) is the next unused natural number for the index column. Example: this answer.

In addition, 2 additional state variables are constructed in my solution, as I think they would be more efficient than having to search through the entire df2 every time. They can of course be discarded if this is not a problem.

Code:

# additional state variables
# 1. for the ID to be added
current_max_id = df2["ID"].max()
# 2. for matching unique_values, avoiding searching df2["unique_value"] every time
current_value_set = set(df2["unique_value"].values)

# match unique_value's using the state variable instead of `df2`
mask = df1["unique_value"].isin(current_value_set)

for i in range(len(df1)):
    
    # current unique_value from df1
    uv1 = df1["unique_value"][i]
    
    # 1. update existing
    if mask[i]:
        
        # broadcast df1 into the matched rows in df2 (mind the shape)
        df2.loc[df2["unique_value"] == uv1, ["unique_value", "Status", "Price"]] = df1.iloc[i, :].values.reshape((1, 3))
        
    # 2. append new
    else:
        # update state variables
        current_max_id += 1
        current_value_set.add(uv1)
        # append the row (assumes df2.index=[0,1,2,3,...])
        df2.loc[len(df2), :] = [df1.iloc[i, 0], df1.iloc[i, 1], df1.iloc[i, 2], current_max_id]

Output:

df2
Out[45]: 
  unique_value     Status  Price      ID
0       xyz123        bad   6.67  1000.0
1       xyz123        bad   6.67  1000.0
2       xyz123        bad   6.67  1000.0
3       xyz123        bad   6.67  1000.0
4       xyz985        bad   1.31  1001.0
5       abc987       okay   4.56  1002.0
6       eff987        bad   1.75  1003.0
7       asd541  excellent   8.85  1004.0
8       efg125       okay   5.77  1005.0

Tested with python 3.7, pandas 1.1.2, OS=debian 10 64-bit

Bill Huang
  • 4,491
  • 2
  • 13
  • 31
  • Thanks @Bill! Quick question, why is .reshape necessary when broadcasting df1 into the matched row in df2? – Mick Oct 05 '20 at 02:07
  • My guess: If the main array happens to have shape (3, 3), and you provide an array of shape (3, ). Then which dimension are you going to broadcast? As (3, 1) or (1, 3)? Therefore, I myself always assign singleton dimension explicitly whenever there could be such a pitfall. – Bill Huang Oct 05 '20 at 02:43
  • Thanks again @Bill. Any idea why the ID column returned as a ```float```? – Mick Oct 05 '20 at 03:19
  • That's just because I forgot to re-format the input ID as I copy-paste. `df2["ID"] = df2["ID"].astype(int)` in the beginning solves the problem, but you don't need this in your actual workflow because it is caused by re-importing of the text data on SO. – Bill Huang Oct 05 '20 at 03:52