          id source                 Date        CVI
67092  10365    sat  2016-11-20 00:00:00   1.900000
67093  10365  nel-2  2016-11-20 00:00:00   1.500000
67094  10365    sat  2016-12-06 00:00:00   1.436964
67095  10365    sat  2016-12-22 00:00:00        NaN
67096  10365    sat  2017-01-07 00:00:00   4.835949
67097  10365    sat  2017-01-23 00:00:00  10.033607
...      ...    ...                  ...        ...
723    84650  nel-2  2020-04-03 00:00:00   5.085851
724    84650  nel-2  2020-04-08 00:00:00   4.375207
725    84650  nel-2  2020-04-13 00:00:00   0.040688
726    84650  nel-2  2020-04-18 00:00:00   3.151000

From the dataframe above, I would like to create a new dataframe where rows with the same `id` and `Date` but a different `source` are combined into a single row, e.g. the output should look like:

      id                 Date  sat_CVI  nel-2_CVI
0  10365  2016-11-20 00:00:00      1.9        1.5

I tried using `groupby` but it does not do what I want. Please note that the resulting table should contain only those rows where the same `id` has a `CVI` value for the same `Date` for both sources.
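
For reference, a small reproducible subset of the data above can be rebuilt like this (column names and values taken from the printout; the index values are just the row labels shown there):

import pandas as pd
import numpy as np

# small reproducible subset of the DataFrame printed above
df = pd.DataFrame({
    'id':     [10365, 10365, 10365, 10365],
    'source': ['sat', 'nel-2', 'sat', 'sat'],
    'Date':   ['2016-11-20 00:00:00', '2016-11-20 00:00:00',
               '2016-12-06 00:00:00', '2016-12-22 00:00:00'],
    'CVI':    [1.9, 1.5, 1.436964, np.nan],
}, index=[67092, 67093, 67094, 67095])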

  • @jezrael, this is different from using just a pivot table, since I only want those rows where the same `Date` is present for both sources for the same `id` – user308827 Apr 29 '20 at 04:37
  • Sorry, I don't understand, could you create a [minimal, complete, and verifiable example](http://stackoverflow.com/help/mcve)? Because `1.5` from the output is not in the input DataFrame – jezrael Apr 29 '20 at 04:39
  • thanks @jezrael, I updated the example to be MCVE – user308827 Apr 29 '20 at 04:40
  • Super, so `df.pivot_table(index=['id','Date'],columns='source', values='CVI')` is not working for you? – jezrael Apr 29 '20 at 04:43
  • It is close, but it seems to also include rows with a `Date` where one `source` has data and the other one does not. I do not want such rows to be included. E.g. if on `Date` `2020-04-03 00:00:00` only `nel-2` has data in the `CVI` column, then that row should not be included in the output – user308827 Apr 29 '20 at 04:50
  • I guess I could drop rows where either of the resulting `source` columns has NaNs. Thanks! – user308827 Apr 29 '20 at 04:53
  • hmmm, it seems more complicated, because `pivot_table` creates new missing values. So you need to filter first, checking whether each `Date` has both sources `sat` and `nel-2`, then remove all other Dates that have only one row (from the sample data only the first 2 rows stay), and the last step is pivoting. What do you think? – jezrael Apr 29 '20 at 04:56
  • @jezrael, I meant dropping rows `after` the `pivot_table` solution you specified. That should work I think, or am I missing something? – user308827 Apr 29 '20 at 04:57
  • There is a problem if you want to distinguish missing values already present in the source data (those values are needed in the output) from missing values created by `pivot_table` - the latter need to be removed from the output data. – jezrael Apr 29 '20 at 05:04
  • thanks @jezrael, we can safely ignore any values that are already missing. Your current solution works for me. I will be happy to accept. – user308827 Apr 29 '20 at 05:30
  • hmm, so `df.pivot_table(index=['id','Date'],columns='source', values='CVI').dropna()` is what you need? – jezrael Apr 29 '20 at 05:34
  • yes, that works – user308827 Apr 29 '20 at 05:37

1 Answer


If there are no duplicated triples of `id`, `Date` and `source` values, use DataFrame.set_index with Series.unstack for reshaping, then change the column names with DataFrame.add_suffix and remove rows with missing values with DataFrame.dropna:

df1 = (df.set_index(['id','Date','source'])['CVI']   # one CVI value per (id, Date, source)
         .unstack()                                   # source values become columns
         .add_suffix('_CVI')                          # sat -> sat_CVI, nel-2 -> nel-2_CVI
         .dropna()                                    # keep only rows where both sources have a value
         .reset_index()
         .rename_axis(None, axis=1))
print (df1)
      id                 Date  nel-2_CVI  sat_CVI
0  10365  2016-11-20 00:00:00        1.5      1.9

If duplicates are possible, the first solution fails, so DataFrame.pivot_table with some aggregate function like `sum` or `mean` is necessary instead of `set_index` with `unstack` (it also works for non-duplicated triples, there is then simply no aggregation):

print (df)
          id source                 Date        CVI
67092  10365    sat  2016-11-20 00:00:00   1.900000 <- duplicated triple
67092  10365    sat  2016-11-20 00:00:00   1.000000 <- duplicated triple
67093  10365  nel-2  2016-11-20 00:00:00   1.500000
67094  10365    sat  2016-12-06 00:00:00   1.436964
67095  10365    sat  2016-12-22 00:00:00        NaN
67096  10365    sat  2017-01-07 00:00:00   4.835949
67097  10365    sat  2017-01-23 00:00:00  10.033607
723    84650  nel-2  2020-04-03 00:00:00   5.085851
724    84650  nel-2  2020-04-08 00:00:00   4.375207
725    84650  nel-2  2020-04-13 00:00:00   0.040688
726    84650  nel-2  2020-04-18 00:00:00   3.151000

df1 = (df.pivot_table(index=['id','Date'], columns='source',
                      values='CVI', aggfunc='sum')    # duplicated triples are summed
         .add_suffix('_CVI')
         .dropna()
         .reset_index()
         .rename_axis(None, axis=1))
print (df1)
      id                 Date  nel-2_CVI  sat_CVI
0  10365  2016-11-20 00:00:00        1.5      2.9
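
For the case discussed in the comments where NaN values already present in the source data should survive into the output (so a blanket `dropna` is not an option), a possible sketch is to filter first for `id`/`Date` groups that contain both sources and only then reshape. This assumes `sat` and `nel-2` are the only two `source` values and that there are no duplicated triples, as in the first solution:

# keep only (id, Date) groups where both sources are present
# (assumes 'sat' and 'nel-2' are the only two source values in the data)
mask = df.groupby(['id', 'Date'])['source'].transform('nunique') == 2

df1 = (df[mask]
         .set_index(['id', 'Date', 'source'])['CVI']
         .unstack()                      # NaNs already in the source data are kept
         .add_suffix('_CVI')
         .reset_index()
         .rename_axis(None, axis=1))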
jezrael