Updating/updating a data table using python

Question

I would like some advice on how to update/insert new data into an already existing data table using Python/Databricks:


# Inserting and updating already existing data

# Original data

import pandas as pd

source_data = {'Customer Number':  ['1', '2', '3'],
        'Colour': ['Red', 'Blue', 'Green'],
         'Flow': ['Good', 'Bad', "Good"]
        }

df1 = pd.DataFrame (source_data, columns = ['Customer Number','Colour', 'Flow'])

print(df1)

# New data

new_data = {'Customer Number':  ['1', '4',],
        'Colour': ['Blue', 'Blue'],
         'Flow': ['Bad', 'Bad']
        }

df2 = pd.DataFrame (new_data, columns = ['Customer Number','Colour', 'Flow'])

print(df2)

# What the updated table will look like

updated_data = {'Customer Number':  ['1', '2', '3', '4',],
        'Colour': ['Blue', 'Blue', 'Green', 'Blue',],
         'Flow': ['Bad', 'Bad', "Good", 'Bad']
        }

df3 = pd.DataFrame (updated_data, columns = ['Customer Number','Colour', 'Flow'])

print(df3)

What you can see here is that the original data has three customers. I then get 'new_data' which contains an update of customer 1's data and new data for 'customer 4', who was not already in the original data. Then if you look at 'updated_data' you can see what the final data should look like. Here 'Customer 1's data has been updated and customer 4s data has been inserted.

Does anyone know where I should start with this? Which module I could use?

I’m not expecting someone to solve this in terms of developing, just need a nudge in the right direction.

Edit: the data source is .txt or CSV, the output is JSON, but as I load the data to Cosmos DB it’ll automatically convert so don’t worry too much about that.

Thanks

You will have a lot more luck with a small example demonstrating the problem. — tdelaney, Mar 15 '20 at 15:21
@tdelaney I have updated with a working example, thanks for the advice — Mrmoleje, Mar 15 '20 at 16:52
personally i can't tell what you are asking. you already have updating and inserting examples, but that's what you seem to be asking for? — Dave Kielpinski, Mar 17 '20 at 19:59

score 1 · Answer 1 · answered Mar 17 '20 at 20:19

Current data frame structure and 'pd.update'

With some preparation, you can use the pandas 'update' function. First, the data frames must be indexed (this is often useful anyway). Second, the source data frame must be extended by the new indices with dummy/NaN data so that it can be updated.

# set indices of original data frames
col = 'Customer Number'
df1.set_index(col, inplace=True)
df2.set_index(col, inplace=True)
df3.set_index(col, inplace=True)

# extend source data frame by new customer indices
df4 = df1.copy().reindex(index=df1.index.union(df2.index))

# update data
df4.update(df2)

# verify that new approach yields correct results
assert all(df3 == df4)

Current data frame structure and 'pd.concat'

A slightly easier approach joins the data frames and removes duplicate rows (and sorts by index if wanted). However, the temporary concatenation requires more memory which may limit the size of the data frames.

df5 = pd.concat([df1, df2])
df5 = df5.loc[~df5.index.duplicated(keep='last')].sort_index()
assert all(df3 == df5)

Alternative data structure

Given that 'Customer Number' is the crucial attribute of your data, you may also consider restructuring your original dictionaries like that:

{'1': ['Red', 'Good'], '2': ['Blue', 'Bad'], '3': ['Green', 'Good']}

Then updating your data simply corresponds to (re)setting the key of the source data with the new data. Typically, working directly on dictionaries is faster than using data frames.

# define function to restructure data, for demonstration purposes only
def restructure(data):
    # transpose original data
    # https://stackoverflow.com/a/6473724/5350621
    vals = data.values()
    rows = list(map(list, zip(*vals)))
    # create new restructured dictionary with customers as keys
    restructured = dict()
    for row in rows:
        restructured[row[0]] = row[1:]
    return restructured

# restructure data
source_restructured = restructure(source_data)
new_restructured = restructure(new_data)

# simply (re)set new keys
final_restructured = source_restructured.copy()
for key, val in new_restructured.items():
    final_restructured[key] = val

# convert to data frame and check results
df6 = pd.DataFrame(final_restructured, index=['Colour', 'Flow']).T
assert all(df3 == df6)

PS: When setting 'df1 = pd.DataFrame(source_data, columns=[...])' you do not need the 'columns' argument because your dictionaries are nicely named and the keys are automatically taken as column names.

score 1 · Answer 2 · answered Mar 17 '20 at 20:22

You can use set intersection to find the Customer Numbers to update and set difference to find new Customer Number to add.

Then you can first update the initial data frame rows iterating through the intersection of Costumer Number and then merge the initial data frame only with the new rows of the data frame with the new values.

# same name column for clarity  
cn = 'Customer Number'

# convert Consumer Number values into integer to use set
CusNum_df1 = [int(x) for x in df1[cn].values]
CusNum_df2 = [int(x) for x in df2[cn].values]

# find Customer Numbers to update and to add
CusNum_to_update = list(set(CusNum_df1).intersection(set(CusNum_df2)))
CusNum_to_add = list(set(CusNum_df2) - set(CusNum_df1))

# update rows in initial data frame 
for num in CusNum_to_update:
    index_initial = df1.loc[df1[cn]==str(num)].index[0]
    index_new = df2.loc[df2[cn]==str(num)].index[0]
    for col in df1.columns:
        df1.at[index_initial,col]= df2.loc[index_new,col]

# concatenate new rows to initial data frame 
for num in CusNum_to_add:
    df1 = pd.concat([df1, df2.loc[df2[cn]==str(num)]]).reset_index(drop=True)


out:
      Customer Number Colour  Flow
0               1   Blue   Bad
1               2   Blue   Bad
2               3  Green  Good
3               4   Blue   Bad

score 1 · Answer 3 · answered Mar 24 '20 at 02:50

There are many ways, but in terms of readability, I would prefer to do this.

import pandas as pd

dict_source = {'Customer Number': ['1', '2', '3'],
               'Colour': ['Red', 'Blue', 'Green'],
               'Flow': ['Good', 'Bad', "Good"]
               }

df_origin = pd.DataFrame.from_dict(dict_source)

dict_new = {'Customer Number': ['1', '4', ],
            'Colour': ['Blue', 'Blue'],
            'Flow': ['Bad', 'Bad']
            }

df_new = pd.DataFrame.from_dict(dict_new)

df_result = df_origin.copy()
df_result.set_index(['Customer Number', ], inplace=True)
df_new.set_index(['Customer Number', ], inplace=True)
df_result.update(df_new)  # update number 1

# handle number 4
df_result.reset_index(['Customer Number', ], inplace=True)
df_new.reset_index(['Customer Number', ], inplace=True)
df_result = df_result.merge(df_new, on=list(df_result), how='outer')
print(df_result)

  Customer Number Colour  Flow
0               1   Blue   Bad
1               2   Blue   Bad
2               3  Green  Good
3               4   Blue   Bad

score 0 · Answer 4 · answered Mar 17 '20 at 20:14

You can use 'Customer Number' as index and use update method:

import pandas as pd

source_data = {'Customer Number':  ['1', '2', '3'],
        'Colour': ['Red', 'Blue', 'Green'],
         'Flow': ['Good', 'Bad', "Good"]
        }

df1 = pd.DataFrame (source_data, index=source_data['Customer Number'], columns=['Colour', 'Flow'])

print(df1)

# New data

new_data = {'Customer Number':  ['1', '4',],
        'Colour': ['Blue', 'Blue'],
         'Flow': ['Bad', 'Bad']
        }

df2 = pd.DataFrame (new_data, index=new_data['Customer Number'], columns=['Colour', 'Flow'])

print(df2)

df3 = df1.reindex(index=df1.index.union(df2.index))

df3.update(df2)

print(df3)

  Colour  Flow
1   Blue   Bad
2   Blue   Bad
3  Green  Good
4   Blue   Bad

Updating/updating a data table using python

4 Answers4

Current data frame structure and 'pd.update'

Current data frame structure and 'pd.concat'

Alternative data structure