
I'm struggling to figure out how to take the rows that are present in one dataframe and use them to fill in the rows missing from another, matching on a timestamp column.

I have one dataframe with minute data that has some gaps in it (spanning almost a day), and another with hourly data that has no gaps. I want to fill in the missing rows in the minute data with the hourly data, without duplicating the hours that I already have in the minute data.

import pandas as pd
df1 = pd.DataFrame({'Unix Timestamp': [1444311660, 1444311720, 1444311780,
                                       1444311840, 1444311900, 1444312140],
                    'price': [242.5, 242.5, 243.7, 290.0, 293.0, 287.0]})

df2 = pd.DataFrame({'Unix Timestamp': [1444311780, 1444311840, 1444311900,
                                       1444311960, 1444312020],
                    'price': [243.7, 290.0, 293.0, 295.0, 294.0]})

print(df1.head())
print(df2.head())
df1
Unix Timestamp  price  
1444311660      242.5  
1444311720      242.5
1444311780      243.7
1444311840      290
1444311900      293
1444312140      287

df2
Unix Timestamp  price
1444311780      243.7
1444311840      290
1444311900      293
1444311960      295
1444312020      294

I've tried finding the rows in df2 whose Unix Timestamp isn't in the list of Unix Timestamps in df1, then adding them and re-sorting by Unix Timestamp, but it gives me an empty dataframe:

missing = df1.loc[~df1['Unix Timestamp'].isin(df2['Unix Timestamp'])]
df1 = pd.concat([df1, missing], ignore_index=True, sort=False)
df1 = df1.sort_values(by='Unix Timestamp')
df1 = df1.reset_index(drop=True)
print(df1.head(10))

Expected Output:

df1
Unix Timestamp  price  
1444311660      242.5  
1444311720      242.5
1444311780      243.7
1444311840      290
1444311900      293
1444311960      295   ^
1444312020      294   ^
1444312140      287

Carets are added to draw attention to the rows that were added. I also need to use the entire row, because the real data has more columns than just price.

Any help?

David Erickson
markstaa
  • Hi @Markstaaa, your dataframe is not reproducible. You have spelled `Dataframe` instead of `DataFrame`, and the shape of the dataframe raises the error `ValueError: Shape of passed values is (6, 1), indices imply (1, 1)`. You should be able to run this in your Jupyter notebook before posting a question here. Also, please post the expected output. Please see https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples for reference. – David Erickson Oct 06 '20 at 21:46
  • Thank you for updating your question. I have also helped edit the data at the beginning of your question and have posted a solution. Please click the checkmark next to the solution if it has solved your question. – David Erickson Oct 06 '20 at 23:08

1 Answer


You can combine the two dataframes with pd.concat and remove the duplicated rows with drop_duplicates. I have also added a column to the second dataframe with .assign(), so you can see which rows were brought into the first dataframe from the second.

df = (pd.concat([df1,df2.assign(added='^')])
      .drop_duplicates(subset=['Unix Timestamp', 'price'])
      .sort_values('Unix Timestamp')
      .fillna(''))
df
Out[1]: 
   Unix Timestamp  price added
0      1444311660  242.5      
1      1444311720  242.5      
2      1444311780  243.7      
3      1444311840  290.0      
4      1444311900  293.0      
3      1444311960  295.0     ^
4      1444312020  294.0     ^
5      1444312140  287.0      
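
If the real data has more columns than price, a variation worth considering is to match on the timestamp alone, which is close to what the attempt in the question was aiming for. The sketch below assumes the same sample frames as in the question and swaps the roles of df1 and df2 in the isin filter, so that rows are selected from df2 rather than from df1; the name filled is only illustrative.

```python
import pandas as pd

# Sample frames copied from the question.
df1 = pd.DataFrame({
    'Unix Timestamp': [1444311660, 1444311720, 1444311780,
                       1444311840, 1444311900, 1444312140],
    'price': [242.5, 242.5, 243.7, 290.0, 293.0, 287.0],
})
df2 = pd.DataFrame({
    'Unix Timestamp': [1444311780, 1444311840, 1444311900,
                       1444311960, 1444312020],
    'price': [243.7, 290.0, 293.0, 295.0, 294.0],
})

# Rows of df2 whose timestamp does not already appear in df1.
# Note the filter is applied to df2, not df1.
missing = df2.loc[~df2['Unix Timestamp'].isin(df1['Unix Timestamp'])]

# Append only those rows (all columns), then restore chronological order.
filled = (pd.concat([df1, missing], ignore_index=True)
          .sort_values('Unix Timestamp')
          .reset_index(drop=True))
print(filled)
```

Because the filter looks only at Unix Timestamp, a row from df2 is skipped whenever df1 already has that timestamp, even if the other columns differ. Whether that is the desired behaviour depends on the real data.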
David Erickson