
Hi,

I want to combine data frames (incidentally with a multi-index) into a larger data frame. Sometimes the data needs to be appended (new rows or columns are added), and sometimes existing data needs to be updated. Somehow, I can't find a way to do both: it is either append (using .append()) or some sort of update (.merge(), .update()). I tried searching for this and reading the documentation, but I can't really figure it out.

Here is some test code:

import pandas as pd
import numpy as np

zones = ['A', 'B', 'C']

# input data frames
dates0 = pd.date_range('20180101', '20180102', freq='D')
dates1 = pd.date_range('20180103', '20180104', freq='D')

idx00 = pd.MultiIndex.from_product(iterables=[dates0, [zones[0]]], names=['UTC', 'zone'])
df00 = pd.DataFrame(index=idx00, columns=['a', 'b'], data=[[1, 2], [3, 4]])

idx01 = pd.MultiIndex.from_product(iterables=[dates1, [zones[0]]], names=['UTC', 'zone'])
df01 = pd.DataFrame(index=idx01, columns=['a', 'b'], data=[[5, 6], [7, 8]])

idx10 = pd.MultiIndex.from_product(iterables=[dates0, [zones[1]]], names=['UTC', 'zone'])
df10 = pd.DataFrame(index=idx10, columns=['b', 'c'], data=np.random.rand(2, 2))

idx11 = pd.MultiIndex.from_product(iterables=[dates1, [zones[1]]], names=['UTC', 'zone'])
df11 = pd.DataFrame(index=idx11, columns=['b', 'c'], data=np.random.rand(2, 2))

# append - works, but only if the data is not yet there
df_append = df00.append(df01)
df_append = df_append.append(df10)
df_append = df_append.append(df11)
df_append.sort_index(inplace=True)
df_append

# append adds a second data point, where there should only be one
df00b = pd.DataFrame(index=idx00, columns=['a', 'b'], data=[[10, 20], [30, 40]])
df_append2 = df_append.append(df00b)
df_append2.sort_index(inplace=True)
df_append2.loc[('2018-01-01', 'A'), :]

# merge - not what I want: it changes the column names
df_merge = df00.merge(df01, how='outer', left_index=True, right_index=True)
df_merge

# update - not what I want: it does not add new rows or columns
df_update = df00.copy()  # copy, so the original df00 is not modified in place
df_update.update(df01)
df_update

# join - raises an error: no suffixes are defined, and join would have to create duplicate columns
df_join = df00.join(df01)

**My issues:** .append() only works if the area (index + column) of the right data frame is not yet present in the left data frame. Otherwise it simply adds a second data point at the same index/column.

.merge() changes the column names if they exist in both the left and the right data frame. But I want the column names to stay the same, and the data to be updated where it already exists.

.update() does not append data if the column or row does not yet exist.

.join() raises an error, as shown above.

What I need is an "update, and append if it does not exist yet". Any idea how to go about it?

Thanks in advance, Theo

PS: Output from the code above:

df_append

                   a         b         c
UTC        zone                         
2018-01-01 A     1.0  2.000000       NaN
           B     NaN  0.100551  0.271616
2018-01-02 A     3.0  4.000000       NaN
           B     NaN  0.489322  0.606215
2018-01-03 A     5.0  6.000000       NaN
           B     NaN  0.245451  0.242021
2018-01-04 A     7.0  8.000000       NaN
           B     NaN  0.047900  0.642140

df_append2.loc[('2018-01-01', 'A'), :]

                    a     b   c
UTC        zone                
2018-01-01 A      1.0   2.0 NaN
           A     10.0  20.0 NaN

df_merge

                 a_x  b_x  a_y  b_y
UTC        zone    
2018-01-01 A     1.0  2.0  NaN  NaN
2018-01-02 A     3.0  4.0  NaN  NaN
2018-01-03 A     NaN  NaN  5.0  6.0
2018-01-04 A     NaN  NaN  7.0  8.0

1 Answer


It looks like you can use pd.concat() or df00.append(); both do the same thing here. (Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so pd.concat is the safer long-term choice.) Using your sample data, we can combine like so:

pd.concat([df00, df01])

You can pass verify_integrity=True to either one to raise an error when duplicate index entries exist. Alternatively, you can concat/append and then call .drop_duplicates() if there will be overlapping values, to avoid errors:

df_concat = pd.concat([df00, df01]).drop_duplicates(keep='last')
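
For reference, a minimal sketch of the verify_integrity behaviour, reusing the overlapping df_append and df00b from the question:

# verify_integrity=True makes concat raise instead of silently keeping both rows
try:
    pd.concat([df_append, df00b], verify_integrity=True)
except ValueError as err:
    print(err)  # e.g. "Indexes have overlapping values: ..."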


Since drop_duplicates() compares row values without taking the index into consideration, you can try this approach instead:

sample df, where the 2018-01-01 and 2018-01-03 rows hold duplicate values under different index entries:
                 a  b
UTC        zone      
2018-01-01 A     1  2
2018-01-02 A     3  4
2018-01-03 A     1  2
2018-01-04 A     7  8
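
To make this concrete, the sketch below rebuilds that sample frame (df_dup and idx_dup are names I've made up) and shows what drop_duplicates would lose:

# rebuild the sample frame: 2018-01-01 and 2018-01-03 carry identical values
idx_dup = pd.MultiIndex.from_product(
    [pd.date_range('20180101', '20180104', freq='D'), ['A']],
    names=['UTC', 'zone'])
df_dup = pd.DataFrame(index=idx_dup, columns=['a', 'b'],
                      data=[[1, 2], [3, 4], [1, 2], [7, 8]])

# drop_duplicates looks at row values only, so the 2018-01-01 row silently disappears
df_dup.drop_duplicates(keep='last')

Grouping on the index levels, by contrast, keeps both rows: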

df_concat = df_dup.groupby(level=[0, 1]).last()
                 a  b
UTC        zone      
2018-01-01 A     1  2
2018-01-02 A     3  4
2018-01-03 A     1  2
2018-01-04 A     7  8
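
Applied to the question's frames, including the updated df00b, a sketch of the full "update + append" flow (df_all is a name I've made up) might look like:

# stack every piece, newest last; for each (UTC, zone) entry,
# last() then keeps the most recent non-null value per column
df_all = pd.concat([df00, df01, df10, df11, df00b])
df_all = df_all.groupby(level=[0, 1]).last()

df_all.loc[('2018-01-01', 'A')]  # a=10.0, b=20.0, c=NaN - updated, not duplicated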
  • Thanks for the fast reply. I had to use keep='last' to make it work. Also, inplace=True seems to fail if used with the concat before (because concat does not support inplace?). This is what I use now; if you adjust your answer I'll accept it =) `df_concat = pd.concat([df00, df01, df10, df11])` `df_concat = pd.concat([df_concat, df00b]).drop_duplicates(keep='last')` `df_concat.sort_index()` Throwing an error on duplicates was only half the answer to my problem, as it doesn't tell me whether all values are duplicates, or some are duplicates and some are new. – Theo Jan 22 '18 at 13:20
  • Actually, it still has an issue: it also drops rows with the same values but different indexes. I tried the subset keyword, but that fails with an error if I specify all the index names: df_concat.drop_duplicates(keep='last', subset=['UTC', 'zone']) – Theo Jan 22 '18 at 13:34
  • @Theo - Glad this helped. I must not have been thinking when I passed `inplace=True`; this will work if you use `drop_duplicates()` on the newly created dataframe on a separate line. If you're expecting frequent duplicates, intentionally throwing an error might not be the best approach. Either way, dropping duplicates seems the way to go. Thanks again. – Andrew L Jan 22 '18 at 13:34
  • @Theo - Interesting, checking this now. – Andrew L Jan 22 '18 at 13:37
  • @Theo - It looks like you have a couple of options: 1) reset the index (it will then appear as a column) and apply drop_duplicates; 2) group by levels [0, 1] and use the `last()` groupby method. It doesn't seem `drop_duplicates()` takes the index into consideration. I'll post an example above. – Andrew L Jan 22 '18 at 13:44
  • A solution is here (and in a way my question is a duplicate of that one): [link](https://stackoverflow.com/questions/13035764/remove-rows-with-duplicate-indices-pandas-dataframe-and-timeseries/34297689#34297689) `df_concat = pd.concat([df00, df01, df10, df11])` `df_concat = df_concat[~df_concat.index.duplicated(keep='last')]  # doesn't drop any rows yet` `df_concat = df_concat.append(df00b)` `df_concat = df_concat[~df_concat.index.duplicated(keep='last')]  # drops the correct (first) rows with a duplicated index` – Theo Jan 22 '18 at 13:48
  • That won't work because your index isn't actually duplicated. You'll need to look at index + rows holistically (that approach is index only, drop_duplicates is rows only). – Andrew L Jan 22 '18 at 13:53
  • Well, df00 and df00b use the same index. Also, that is what would happen in my actual data (same timestamp and location as index, but possibly updated data), so that should work, no? Sorry for my quite verbose example. – Theo Jan 22 '18 at 13:58
  • Basically the grouping piece will allow you to de-duplicate records if everything (including index) exists in the concat dataframe. I think this is what you'll need. – Andrew L Jan 22 '18 at 14:46