1

Context: I'm trying to pivot a long format dataframe to a wide format dataframe, however, I'm noticing a weird pattern on the wide format dataframe. It seems that if we have repeated values for the index (in my case, a date), it's almost like it's giving me an average instead of repeating each index value and keeping the original values?

Here's a minimal reproducible example:

    import datetime
    import pandas as pd
long_dataframe = pd.DataFrame({"Date": [
    datetime.datetime.strptime("01-01-2020", '%m-%d-%Y').date(),
    datetime.datetime.strptime("01-01-2020", '%m-%d-%Y').date(),
    datetime.datetime.strptime("01-02-2020", '%m-%d-%Y').date(),
    datetime.datetime.strptime("01-02-2020", '%m-%d-%Y').date(),
    datetime.datetime.strptime("01-03-2020", '%m-%d-%Y').date(),
    datetime.datetime.strptime("01-04-2020", '%m-%d-%Y').date(),
    datetime.datetime.strptime("01-04-2020", '%m-%d-%Y').date(),
    datetime.datetime.strptime("01-01-2020", '%m-%d-%Y').date(),
    datetime.datetime.strptime("01-01-2020", '%m-%d-%Y').date(),
    datetime.datetime.strptime("01-02-2020", '%m-%d-%Y').date(),
    datetime.datetime.strptime("01-02-2020", '%m-%d-%Y').date(),
    datetime.datetime.strptime("01-03-2020", '%m-%d-%Y').date(),
    datetime.datetime.strptime("01-04-2020", '%m-%d-%Y').date(),
    datetime.datetime.strptime("01-04-2020", '%m-%d-%Y').date()
], "A": [
    "category_X", "category_X", "category_X", "category_X", "category_X", "category_X", "category_X",
    "category_Y", "category_Y", "category_Y", "category_Y", "category_Y", "category_Y", "category_Y"], "Values": [30, 40, 20, 30, 40, 50, 60,25,30,42,54,21,23,30]})

wide_dataframe = long_dataframe.reset_index().pivot_table(
    index="Date", columns="A", values="Values")

wide_dataframe

Which gives me this:

A           category_X  category_Y
Date
2020-01-01        35.0        27.5
2020-01-02        25.0        48.0
2020-01-03        40.0        21.0
2020-01-04        55.0        26.5

How can I make it so that I see the repeated dates with their original values? Why is it that for 2020-01-01 its giving the value in between this date (30 and 40)?

Desired output would look something like this:

A           category_X    category_Y
Date
2020-01-01          30       ...
2020-01-01          40
2020-01-02          20
2020-01-02          30
2020-01-03          40
2020-01-04          50
2020-01-04          60

How can I do this while keeping duplicated indices?

I was thinking of giving each row a unique ID, but I'd really like to do this directly using the dates if possible (without creting any additional IDs)

Thank you!

Chronicles
  • 436
  • 1
  • 11

1 Answers1

2

pivot_table automatically aggregates. Since you cannot have duplicate values of the index after pivoting, you need to create a unique index. You can do this with groupby() and cumcount().

long_dataframe['count'] = long_dataframe.groupby('A').cumcount()

wide_dataframe = long_dataframe.pivot(index=['Date', 'count'], columns='A', values='Values') \
                               .reset_index() \
                               .drop('count', axis=1)

Output:

A        Date  category_X  category_Y
0  2020-01-01          30          25
1  2020-01-01          40          30
2  2020-01-02          20          42
3  2020-01-02          30          54
4  2020-01-03          40          21
5  2020-01-04          50          23
6  2020-01-04          60          30
Stu Sztukowski
  • 10,597
  • 1
  • 12
  • 21
  • Hey, thank your your answer! It does work in the small example I gave, but I was trying this on a bigger dataset and I'm now getting an error "Length of values (3382) does not match length of index(1)). I'm aware its a bit difficult to see where the error is with just this, but do you have any idea to why this is happening? Thanks! – Chronicles Feb 02 '23 at 18:35
  • Is this happening on the `cumcount` or `pivot` step? – Stu Sztukowski Feb 02 '23 at 18:36
  • in the pivot step (note that I have more columns thatn the "A" column in my actual dataset). It seems that the groupby and index have different sizes? – Chronicles Feb 02 '23 at 18:38
  • Which version of Pandas are you on? Mine is running 1.4.3. The question linked below has a similar error on `pivot` (though not exact). I'm unsure of what the issue could be on your end without a reproducible example. https://stackoverflow.com/questions/65858973/valueerror-on-pandas-dataframe-pivot-example – Stu Sztukowski Feb 02 '23 at 18:41
  • Using 1.4.4 pandas – Chronicles Feb 02 '23 at 18:43
  • Hmmm, unfortunately I can't seem to figure this one out as I've never run into this error when pivoting before. We might need to wait for someone else to check into it. In the meantime, check this link out to see if there are any other methods you can use. Check out the answers to Question 10: https://stackoverflow.com/questions/47152691/how-can-i-pivot-a-dataframe – Stu Sztukowski Feb 02 '23 at 18:47
  • 1
    Thanks, no worries! I'll accept the answer since it did answer the question, I just need to think of a better example next time! Thank you :) – Chronicles Feb 02 '23 at 18:54
  • 1
    Okay found the error (was a code error!). It just so happens that when doing the index = ['Date','count'], I was actually passing a variable as a list as such: [date,'count']. `date` is a list so it would be the same as doing `[['Date'],'count']`. Doing `[date[0],'count']` fixed the issue – Chronicles Feb 02 '23 at 19:10