
I have a data frame with a row for each event: a user joining my site, or a user making a purchase.

+---+-----+--------------------+---------+--------+-----+
|   | uid |        msg         |  _time  | gender | age |
+---+-----+--------------------+---------+--------+-----+
| 0 |   1 | confirmed_settings | 1/29/15 | M      |  37 |
| 1 |   1 | sale               | 4/13/15 | M      |  37 |
| 2 |   3 | confirmed_settings | 4/19/15 | M      |  35 |
| 3 |   4 | confirmed_settings | 2/21/15 | M      |  21 |
| 4 |   5 | confirmed_settings | 3/28/15 | M      |  18 |
| 5 |   4 | sale               | 3/15/15 | M      |  21 |
+---+-----+--------------------+---------+--------+-----+

I would like to change the dataframe so that each row is unique for a uid, with columns called sale and confirmed_settings holding the timestamp of each action. Note that not every user has a sale, but every user has a confirmed_settings. Like below:

+---+-----+--------------------+---------+---------+--------+-----+
|   | uid | confirmed_settings |  sale   |  _time  | gender | age |
+---+-----+--------------------+---------+---------+--------+-----+
| 0 |   1 | 1/29/15            | 4/13/15 | 1/29/15 | M      |  37 |
| 1 |   3 | 4/19/15            | null    | 4/19/15 | M      |  35 |
| 2 |   4 | 2/21/15            | 3/15/15 | 2/21/15 | M      |  21 |
| 3 |   5 | 3/28/15            | null    | 3/28/15 | M      |  18 |
+---+-----+--------------------+---------+---------+--------+-----+

To do this, I am trying:

df1 = df.pivot(index='uid', columns='msg', values='_time').reset_index()
df1 = df1.merge(df[['uid', 'gender', 'age']].drop_duplicates(), on='uid')

But I get this error: ValueError: Index contains duplicate entries, cannot reshape

How can I pivot a df with duplicate index values to transform my dataframe?


Edit: df1 = df.pivot_table(index='uid', columns='msg', values='_time').reset_index()

gives this error: `DataError: No numeric types to aggregate`, but I'm not even sure that is the right path to go down.
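For reference, `pivot_table`'s default `aggfunc` is the mean, which cannot aggregate string timestamps; that is exactly where the `DataError` comes from. A minimal sketch (assuming a reasonably recent pandas, using the sample data from the question) that passes an aggfunc which works on non-numeric values:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'uid':    [1, 1, 3, 4, 5, 4],
    'msg':    ['confirmed_settings', 'sale', 'confirmed_settings',
               'confirmed_settings', 'confirmed_settings', 'sale'],
    '_time':  ['1/29/15', '4/13/15', '4/19/15', '2/21/15', '3/28/15', '3/15/15'],
    'gender': ['M', 'M', 'M', 'M', 'M', 'M'],
    'age':    [37, 37, 35, 21, 18, 21],
})

# pivot_table's default aggfunc (mean) cannot aggregate the string
# timestamps in _time, which raises "No numeric types to aggregate".
# An aggfunc that works on objects, e.g. 'first', avoids that.
df1 = df.pivot_table(index='uid', columns='msg', values='_time',
                     aggfunc='first').reset_index()
df1 = df1.merge(df[['uid', 'gender', 'age']].drop_duplicates(), on='uid')
print(df1)
```

Note that `'first'` silently keeps one timestamp per uid/msg pair, so if the real data has duplicate keys this hides them rather than surfacing them.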

metersk
    There was an answer in [this](https://stackoverflow.com/questions/11232275/pandas-pivot-warning-about-repeated-entries-on-index) post where someone used pivot_table() instead of pivot and it resolved the same error, might be worth trying? – Benjamin Rowell Apr 28 '15 at 18:06
  • @benjamin I saw that post and tried earlier, but got this error - `DataError: No numeric types to aggregate` and decided not continue down that path, because I understand that error even less. Any thoughts? – metersk Apr 28 '15 at 18:07
  • 1
    Try doing it again, only this time coerce your 'uid' column into a numeric type by running this first: `df['uid'] = df['uid'].astype(int)` – Benjamin Rowell Apr 28 '15 at 18:18
  • @benjamin still getting the same error – metersk Apr 28 '15 at 18:25
  • 1
    Sorry, that's as best as I know for the moment. Hope someone has an answer for you. – Benjamin Rowell Apr 28 '15 at 18:27

3 Answers


I suspect there are indeed duplicate uid-msg entries/keys (e.g. uid 2 has 2 confirmed_settings entries under msg), which you alluded to in the comments for fixxxer's answer. If there are, you can't use pivot, because you can't tell it how to treat the different values encountered during aggregation (count? max? mean? sum?). Note that the Index error is an error on the Index of the resulting pivoted table df1, not the original DataFrame df.

pivot_table, however, lets you handle duplicates via the aggfunc argument. How about something like this?

df1 = df.pivot_table(index='uid', columns='msg', values='_time', aggfunc=len)

This will help you figure out which user-msg records have duplicate entries (anything with over 1), and after cleaning them out, you can use pivot on df to successfully pivot _time.
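As a concrete sketch of that diagnosis (using made-up data in which uid 2 has two confirmed_settings rows, exactly the situation that makes pivot raise the ValueError):

```python
import pandas as pd

# Made-up data: uid 2 has two confirmed_settings entries,
# so pivot on uid/msg would fail with the duplicate-index ValueError.
df = pd.DataFrame({
    'uid':   [1, 1, 2, 2],
    'msg':   ['confirmed_settings', 'sale',
              'confirmed_settings', 'confirmed_settings'],
    '_time': ['1/29/15', '4/13/15', '2/01/15', '2/02/15'],
})

# Count entries per uid/msg key; anything over 1 breaks pivot
counts = df.pivot_table(index='uid', columns='msg', values='_time', aggfunc=len)
dupes = counts[counts.gt(1).any(axis=1)]   # the offending uids

# After dropping the extra rows, pivot succeeds
clean = df.drop_duplicates(subset=['uid', 'msg'], keep='first')
wide = clean.pivot(index='uid', columns='msg', values='_time')
```

Whether `keep='first'` is the right cleanup depends on the data; the point of the `aggfunc=len` table is to find the duplicate keys so you can decide.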

selwyth
  • In my case I had a timeseries-indexed dataframe. With this pivot_table I discovered (what's obvious in retrospect) that in November "fall back", there are two hours with the same timestamps, same `"columns"` (here `'hub'`), but different `'values'`. In that case I used `df = df[~df.duplicated(subset = ['Date', 'hub'], keep='first')]` to just throw away the second hour. – GrayOnGray Feb 04 '21 at 20:05

x is the data frame that you have as input :

   uid                 msg    _time gender  age
0    1  confirmed_settings  1/29/15      M   37
1    1                sale  4/13/15      M   37
2    3  confirmed_settings  4/19/15      M   35
3    4  confirmed_settings  2/21/15      M   21
4    5  confirmed_settings  3/28/15      M   18
5    4                sale  3/15/15      M   21

y = x.pivot(index='uid', columns='msg', values='_time')
x.join(y, on='uid').drop('msg', axis=1)

(the `on='uid'` is needed so the pivoted frame is aligned on uid rather than on the positional row index)

gives you:

   uid    _time gender  age confirmed_settings     sale
0    1  1/29/15      M   37            1/29/15  4/13/15
1    1  4/13/15      M   37            1/29/15  4/13/15
2    3  4/19/15      M   35            4/19/15      NaN
3    4  2/21/15      M   21            2/21/15  3/15/15
4    5  3/28/15      M   18            3/28/15      NaN
5    4  3/15/15      M   21            2/21/15  3/15/15
fixxxer
  • 1
    Still getting this error `ValueError: Index contains duplicate entries, cannot reshape` on just executing this bit of code `y = x.pivot(index='uid', columns='msg', values='_time')` – metersk Apr 28 '15 at 18:25
  • I'm on pandas version 0.16. Which one are you on ? – fixxxer Apr 28 '15 at 18:26
  • my version is `0.15.2` – metersk Apr 28 '15 at 18:28
  • Check if the other answer works ? Or upgrade to 0.16 ? – fixxxer Apr 28 '15 at 18:32
  • Going to try an upgrade – metersk Apr 28 '15 at 18:32
  • 2
    Ok - so I upgraded to 16.0 and it didn't solve the problem, but I think I found my issue. My data set is much larger than my example, and there must be some data error where a user has two sales or two confirm_settings. I'm going to try and cleanse the data better and your solution should work. – metersk Apr 28 '15 at 18:46
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/76479/discussion-between-fixxxer-and-meepl). – fixxxer Apr 28 '15 at 18:47

You can use groupby to aggregate by the common factors, take the max of `_time` to get the most recent dates, and then unstack the msg level to view confirmed_settings and sale side by side:

df.groupby(['uid', 'msg', 'gender', 'age'])['_time'].max().unstack('msg')

msg            confirmed_settings     sale
uid gender age                            
1   M      37             1/29/15  4/13/15
3   M      35             4/19/15      NaN
4   M      21             2/21/15  3/15/15
5   M      18             3/28/15      NaN
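One caveat worth noting: with `_time` stored as strings, `max()` compares them lexicographically, which only coincidentally gives the right answer for these dates. A sketch that parses the timestamps first and flattens the result back to ordinary columns (same sample data as the question):

```python
import pandas as pd

df = pd.DataFrame({
    'uid':    [1, 1, 3, 4, 5, 4],
    'msg':    ['confirmed_settings', 'sale', 'confirmed_settings',
               'confirmed_settings', 'confirmed_settings', 'sale'],
    '_time':  ['1/29/15', '4/13/15', '4/19/15', '2/21/15', '3/28/15', '3/15/15'],
    'gender': ['M', 'M', 'M', 'M', 'M', 'M'],
    'age':    [37, 37, 35, 21, 18, 21],
})

# Parse the timestamps so max() compares real dates, not strings
df['_time'] = pd.to_datetime(df['_time'], format='%m/%d/%y')

out = (df.groupby(['uid', 'gender', 'age', 'msg'])['_time']
         .max()              # most recent event per uid/msg pair
         .unstack('msg')     # confirmed_settings and sale side by side
         .reset_index())     # flatten the group keys back into columns
```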
Alexander