
I'm struggling to correctly merge a few datasets in pandas. Let's say I've measured variables A, B, and C at different times. Sometimes I've got A and B at the same time, and sometimes not. I have three dataframes, each indexed by the time of measurement, with a single column holding the measured value. If I concatenate these dataframes, I get a bunch of NaNs where I have no measurements, maybe something like

 idx |  A  |  B  |  C
-----|-----|-----|----
  0  |  1  | NaN | NaN
  0  | NaN |  2  |  3
  1  |  5  |  3  | NaN
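
For concreteness, a minimal sketch of the setup (the frames and values here are just illustrative):

import pandas as pd

# one single-column frame per measured variable, indexed by measurement time
df_a = pd.DataFrame({'A': [1, 5]}, index=[0, 1])
df_b = pd.DataFrame({'B': [2, 3]}, index=[0, 1])
df_c = pd.DataFrame({'C': [3]}, index=[0])

# stacking them keeps every original row, so most cells come out as NaN
combined = pd.concat([df_a, df_b, df_c])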

After concatenating, I have non-unique time indices. What I'd like is to sort this by time and collapse together rows that share the same time index. The ideal result here is

 idx |  A  |  B  |  C
-----|-----|-----|----
  0  |  1  |  2  |  3
  1  |  5  |  3  | NaN
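
For this simple case, grouping on the time index and taking the first non-null value per column (continuing from the sketch above) seems like it would give this result, but I'm not sure it's the right tool, or how it extends once locations enter the picture:

# collapse rows sharing a time index; first() takes the first non-null value per column
collapsed = combined.groupby(level=0).first()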

That would be the first scenario. To further complicate things, I may have a column, D, which specifies the location where the measurement was taken. I'd then need the collapsing to keep non-unique time indices as long as the entries in D differ for that time. Maybe we have

 idx |  A  |  B  |  C  |  D
-----|-----|-----|-----|-----
  0  |  1  | NaN | NaN | Paris
  0  | NaN |  2  |  3  |  NYC
  1  |  5  |  3  | NaN |  NYC
  1  | NaN | NaN |  0  | Paris

This dataframe cannot be collapsed any further: conditioned on D, the time indices are already unique, so the information is as collapsed as possible.
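
I imagine this means grouping on the time index and D together, roughly like the sketch below (assuming df is shaped like the table above, with time as the index and D as a column), but I don't know whether that's the idiomatic approach:

# only collapse rows that share both the time index and the location
collapsed = df.groupby([df.index, 'D']).first().reset_index(level='D')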

I'm still trying to get my head around the various join / merge / concat operations and how they work, but I'd love a pointer or two.

Thank you!

Quentin
  • Isn't this similar: http://stackoverflow.com/questions/32378308/drop-nan-cells-and-move-non-null-values-according-to-datetime-index? So `df.groupby('idx').sum()` should do what you want? – EdChum Sep 03 '15 at 18:05
  • @EdChum: if I group by location, and then sum(), I lose the time series. Grouping by location results in a dataframe whose index is location. If I make the time of measurement a column and group by that instead, I can't sum over locations (as they're strings), so I lose that column. – Quentin Sep 03 '15 at 18:08
  • Create a dataframe with index 'timestamp' and columns 'var_name', 'val' for each of A, B, C, and D. From there, concat, sort, and unstack. – acushner Sep 03 '15 at 18:31
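
A rough sketch of the long-format idea from the last comment (names and values here are purely illustrative), pivoting a (timestamp, variable, value) table back out to one column per variable:

import pandas as pd

# long/tidy form: one row per (timestamp, variable, value) triple
long_df = pd.DataFrame({'timestamp': [0, 0, 0, 1, 1],
                        'var_name':  ['A', 'B', 'C', 'A', 'B'],
                        'val':       [1, 2, 3, 5, 3]})

# index on (timestamp, var_name) and unstack the variable level into columns
wide = long_df.set_index(['timestamp', 'var_name'])['val'].unstack()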

1 Answer


Assuming that your index is a Timestamp, try resampling it at your desired frequency (hourly, daily, weekly, etc.). You can take the mean measurement in case multiple samples are observed during the same window.

import pandas as pd
from pandas import Timestamp
from numpy import nan

df = pd.DataFrame({'A': {Timestamp('2015-01-01 11:30:00'): 1.0,
  Timestamp('2015-01-01 12:30:00'): nan,
  Timestamp('2015-01-02 11:15:00'): 5.0,
  Timestamp('2015-01-02 12:15:00'): nan},
 'B': {Timestamp('2015-01-01 11:30:00'): nan,
  Timestamp('2015-01-01 12:30:00'): 2.0,
  Timestamp('2015-01-02 11:15:00'): 3.0,
  Timestamp('2015-01-02 12:15:00'): nan},
 'C': {Timestamp('2015-01-01 11:30:00'): nan,
  Timestamp('2015-01-01 12:30:00'): 3.0,
  Timestamp('2015-01-02 11:15:00'): nan,
  Timestamp('2015-01-02 12:15:00'): 0.0},
 'D': {Timestamp('2015-01-01 11:30:00'): 'Paris',
  Timestamp('2015-01-01 12:30:00'): 'NYC',
  Timestamp('2015-01-02 11:15:00'): 'NYC',
  Timestamp('2015-01-02 12:15:00'): 'Paris'}})

>>> df
                      A   B   C      D
2015-01-01 11:30:00   1 NaN NaN  Paris
2015-01-01 12:30:00 NaN   2   3    NYC
2015-01-02 11:15:00   5   3 NaN    NYC
2015-01-02 12:15:00 NaN NaN   0  Paris

>>> df.resample('1D', how='mean')
            A  B  C
2015-01-01  1  2  3
2015-01-02  5  3  0

To account for the point of observation, you need to include it as a column level in a MultiIndex. An easy way to do this is to group on timestamp and location (column D) and then unstack.

>>> df.reset_index().groupby(['index', 'D']).mean().unstack().resample('1D', how='mean')
              A           B           C       
D           NYC  Paris  NYC  Paris  NYC  Paris
index                                         
2015-01-01  NaN      1    2    NaN    3    NaN
2015-01-02    5    NaN    3    NaN  NaN      0
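
Note: the how= keyword has since been removed from resample in newer pandas versions, so the equivalents today should be the chained forms below (with numeric_only=True in the first case so the string column D is ignored):

df.resample('1D').mean(numeric_only=True)
df.reset_index().groupby(['index', 'D']).mean().unstack().resample('1D').mean()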
Alexander