Hei,
I want to combine data frames (incidentally with a multi-index), into a larger data frame. Sometimes the data needs to be appended (new rows or columns are being added), sometime existing data needs to be updated. Somehow, I cant find a way to do both. It is either append (using .append()) or some sort of updating (.merge(), .update()) I tried to search for this and read the documentation, but cant really figure it out.
This is a test code
import pandas as pd
import numpy as np
zones = ['A', 'B', 'C']
# input data frames
dates0 = pd.date_range('20180101', '20180102', freq='D')
dates1 = pd.date_range('20180103', '20180104', freq='D')
idx00 = pd.MultiIndex.from_product(iterables=[dates0, [zones[0]]], names= ['UTC', 'zone'])
df00 = pd.DataFrame(index=idx00, columns=['a', 'b'], data=[[1, 2], [3, 4]])
idx01 = pd.MultiIndex.from_product(iterables=[dates1, [zones[0]]], names=['UTC', 'zone'])
df01 = pd.DataFrame(index=idx01, columns=['a', 'b'], data=[[5, 6], [7, 8]])
idx10 = pd.MultiIndex.from_product(iterables=[dates0, [zones[1]]], names=['UTC', 'zone'])
df10 = pd.DataFrame(index=idx10, columns=['b', 'c'], data=np.random.rand(2, 2))
idx11 = pd.MultiIndex.from_product(iterables=[dates1, [zones[1]]], names=['UTC', 'zone'])
df11 = pd.DataFrame(index=idx11, columns=['b', 'c'], data=np.random.rand(2, 2))
# append - works, but only if the data is not yet there
df_append = df00.append(df01)
df_append = df_append.append(df10)
df_append = df_append.append(df11)
df_append.sort_index(inplace=True)
df_append
# append adds a second data point, where there should only be one
df00b = pd.DataFrame(index=idx00, columns=['a', 'b'], data=[[10, 20], [30, 40]])
df_append2 = df_append.append(df00b)
df_append2.sort_index(inplace=True)
df_append2.loc[('2018-01-01', 'A'), :]
# merge - does not what I want, changes column names
df_merge = df00.merge(df01, how='outer', left_index=True, right_index=True)
df_merge
# update - does not what I want, does not add new columns
df_update = df00
df_update.update(df01)
df_update
# join - gives an error, as no suffix defined and join wants to create a new column
df_join = df00
df00.join(df01)
** My issues ** .append() only works if the area (index+column) in the right data frame is not yet in the left data frame. Otherwise it just adds a second data point to the same index/column
.merge() changes the column names, if they exist in both the left and right data frame. But I want the column names to stay the same, and the data to be updated if it already exists
.update() does not append data, if the column / row does not yet exist
.join() gives an error..
What I need is a "update + append if not exist". Any idea how to go about it?
Thanks in advance, Theo
pS: Output from above
df_append
a b c
UTC zone
2018-01-01 A 1.0 2.000000 NaN
B NaN 0.100551 0.271616
2018-01-02 A 3.0 4.000000 NaN
B NaN 0.489322 0.606215
2018-01-03 A 5.0 6.000000 NaN
B NaN 0.245451 0.242021
2018-01-04 A 7.0 8.000000 NaN
B NaN 0.047900 0.642140
df_append2.loc[('2018-01-01', 'A'), :]
a b c
UTC zone
2018-01-01 A 1.0 2.0 NaN
A 10.0 20.0 NaN
df_merge
Out[4]:
a_x b_x a_y b_y
UTC zone
2018-01-01 A 1.0 2.0 NaN NaN
2018-01-02 A 3.0 4.0 NaN NaN
2018-01-03 A NaN NaN 5.0 6.0
2018-01-04 A NaN NaN 7.0 8.0