          id source                 Date        CVI
67092  10365    sat  2016-11-20 00:00:00   1.900000
67093  10365  nel-2  2016-11-20 00:00:00   1.500000
67094  10365    sat  2016-12-06 00:00:00   1.436964
67095  10365    sat  2016-12-22 00:00:00        NaN
67096  10365    sat  2017-01-07 00:00:00   4.835949
67097  10365    sat  2017-01-23 00:00:00  10.033607
...      ...    ...                  ...        ...
723    84650  nel-2  2020-04-03 00:00:00   5.085851
724    84650  nel-2  2020-04-08 00:00:00   4.375207
725    84650  nel-2  2020-04-13 00:00:00   0.040688
726    84650  nel-2  2020-04-18 00:00:00   3.151000

From the dataframe above, I would like to create a new dataframe where rows with the same `id` and `Date` but a different `source` are combined into a single row, e.g. the output should look like:

      id                 Date  sat_CVI  nel-2_CVI
0  10365  2016-11-20 00:00:00      1.9        1.5

I tried using `groupby` but it does not do what I want. Please note that the resulting table should contain only those rows where the same `id` has a `CVI` value for the same `Date` for both sources.
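
For reference, a small reproducible subset of the data above can be rebuilt like this (column names and values taken from the printout; the index values are just the row labels shown there):

import pandas as pd
import numpy as np

# small reproducible subset of the DataFrame printed above
df = pd.DataFrame({
    'id':     [10365, 10365, 10365, 10365],
    'source': ['sat', 'nel-2', 'sat', 'sat'],
    'Date':   ['2016-11-20 00:00:00', '2016-11-20 00:00:00',
               '2016-12-06 00:00:00', '2016-12-22 00:00:00'],
    'CVI':    [1.9, 1.5, 1.436964, np.nan],
}, index=[67092, 67093, 67094, 67095])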

  • @jezrael, this is different from using just a pivot table, since I only want those rows where the same `Date` is present for both sources for the same `id` – user308827 Apr 29 '20 at 04:37
  • Sorry, I don't understand, could you create a [minimal, complete, and verifiable example](http://stackoverflow.com/help/mcve)? Because `1.5` from the output is not in the input DataFrame – jezrael Apr 29 '20 at 04:39
  • thanks @jezrael, I updated the example to be MCVE – user308827 Apr 29 '20 at 04:40
  • Super, so `df.pivot_table(index=['id','Date'],columns='source', values='CVI')` is not working for you? – jezrael Apr 29 '20 at 04:43
  • It is close, but it seems to also include rows with a `Date` where one `source` has data and the other one does not. I do not want such rows to be included. E.g. if on `Date` `2020-04-03 00:00:00` only `nel-2` has data in the `CVI` column, then that row should not be included in the output – user308827 Apr 29 '20 at 04:50
  • I guess I could drop rows where either of the resulting `source` columns has NaNs. Thanks! – user308827 Apr 29 '20 at 04:53
  • hmmm, it seems more complicated, because `pivot_table` creates new missing values. So you need to filter first, checking whether each `Date` has both sources `sat` and `nel-2`, then remove all other Dates that have only one row (from the sample data only the first 2 rows stay), and the last step is pivoting. What do you think? – jezrael Apr 29 '20 at 04:56
  • @jezrael, I meant dropping rows `after` the `pivot_table` solution you specified. That should work I think, or am I missing something? – user308827 Apr 29 '20 at 04:57
  • There is a problem if you want to distinguish missing values already present in the source data (those values are needed in the output) from missing values created by `pivot_table` - the latter need to be removed from the output data. – jezrael Apr 29 '20 at 05:04
  • thanks @jezrael, we can safely ignore any values that are already missing. Your current solution works for me. I will be happy to accept. – user308827 Apr 29 '20 at 05:30
  • hmm, so `df.pivot_table(index=['id','Date'],columns='source', values='CVI').dropna()` is what you need? – jezrael Apr 29 '20 at 05:34
  • yes, that works – user308827 Apr 29 '20 at 05:37

1 Answer


If there are no duplicated triples of `id`, `Date` and `source` values, use DataFrame.set_index with Series.unstack for reshaping, then change the column names with DataFrame.add_suffix and remove rows with missing values with DataFrame.dropna:

df1 = (df.set_index(['id','Date','source'])['CVI']   # one CVI value per (id, Date, source)
         .unstack()                                   # source values become columns
         .add_suffix('_CVI')                          # sat -> sat_CVI, nel-2 -> nel-2_CVI
         .dropna()                                    # keep only rows where both sources have a value
         .reset_index()
         .rename_axis(None, axis=1))
print (df1)
      id                 Date  nel-2_CVI  sat_CVI
0  10365  2016-11-20 00:00:00        1.5      1.9

If duplicates are possible, the first solution fails, so DataFrame.pivot_table with some aggregate function like `sum` or `mean` is necessary instead of `set_index` with `unstack` (it also works for non-duplicated triples, there is then simply no aggregation):

print (df)
          id source                 Date        CVI
67092  10365    sat  2016-11-20 00:00:00   1.900000 <- duplicated triple
67092  10365    sat  2016-11-20 00:00:00   1.000000 <- duplicated triple
67093  10365  nel-2  2016-11-20 00:00:00   1.500000
67094  10365    sat  2016-12-06 00:00:00   1.436964
67095  10365    sat  2016-12-22 00:00:00        NaN
67096  10365    sat  2017-01-07 00:00:00   4.835949
67097  10365    sat  2017-01-23 00:00:00  10.033607
723    84650  nel-2  2020-04-03 00:00:00   5.085851
724    84650  nel-2  2020-04-08 00:00:00   4.375207
725    84650  nel-2  2020-04-13 00:00:00   0.040688
726    84650  nel-2  2020-04-18 00:00:00   3.151000

df1 = (df.pivot_table(index=['id','Date'], columns='source',
                      values='CVI', aggfunc='sum')    # duplicated triples are summed
         .add_suffix('_CVI')
         .dropna()
         .reset_index()
         .rename_axis(None, axis=1))
print (df1)
      id                 Date  nel-2_CVI  sat_CVI
0  10365  2016-11-20 00:00:00        1.5      2.9
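
For the case discussed in the comments where NaN values already present in the source data should survive into the output (so a blanket `dropna` is not an option), a possible sketch is to filter first for `id`/`Date` groups that contain both sources and only then reshape. This assumes `sat` and `nel-2` are the only two `source` values and that there are no duplicated triples, as in the first solution:

# keep only (id, Date) groups where both sources are present
# (assumes 'sat' and 'nel-2' are the only two source values in the data)
mask = df.groupby(['id', 'Date'])['source'].transform('nunique') == 2

df1 = (df[mask]
         .set_index(['id', 'Date', 'source'])['CVI']
         .unstack()                      # NaNs already in the source data are kept
         .add_suffix('_CVI')
         .reset_index()
         .rename_axis(None, axis=1))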
jezrael