1

I have a dataframe that looks like this:

Person  Day  Value
 1      Mon   32
 1      Tue   30
 1      Wed   34
 1      Thu   22
 1      Fri   43
 2      Mon   22
 2      Wed   14
 2      Fri   11
 3      Tue   13
 3      Wed   22
 3      Thu   23

The dataset looks at value for every weekday (Mon-Fri) for a group of people. Some people have the full 5 days (e.g. person 1), some people only have a few days. I would like to create new empty rows so that everyone has the full 5 days even if they don't have a value.

This is the output I would like:

Person  Day  Value
 1      Mon   32
 1      Tue   30
 1      Wed   34
 1      Thu   22
 1      Fri   43
 2      Mon   22
 2      Tue   Nan
 2      Wed   14
 2      Thu   Nan
 2      Fri   11
 3      Mon   Nan
 3      Tue   13
 3      Wed   22
 3      Thu   23
 3      Fri   Nan

I've tried to write something, but I can't figure out how to have it iterate through each person properly:

days = ['mon','tue','wed','thu','fri']


def normalise(person):
    newperson = pd.DataFrame()
    for day in days:
        if day in person:
            newperson[day] = days
        else:
            newperson[day] = np.nan
    return newperson

normalised = normalise(df)

This just generates the days but with no other value contained. The actual dataset has around 200,000 people. Any ideas would be great, thanks!

Jameson
  • 167
  • 2
  • 11
  • Please go through the [intro tour](https://stackoverflow.com/tour), the [help center](https://stackoverflow.com/help) and [how to ask a good question](https://stackoverflow.com/help/how-to-ask) to see how this site works and to help you improve your current and future questions, which can help you get better answers. "Show me how to solve this coding problem?" is off-topic for Stack Overflow. You have to make an honest attempt at the solution, and then ask a *specific* question about your implementation. Stack Overflow is not intended to replace existing tutorials and documentation. – Prune Apr 03 '21 at 20:10
  • Where are you stuck with this? You seem to know how to recognize that you're missing a day, and you know what value to fill in. – Prune Apr 03 '21 at 20:10
  • Please provide the expected see [MRE - Minimal, Reproducible Example](https://stackoverflow.com/help/minimal-reproducible-example). Show where the intermediate results deviate from the ones you expect. We should be able to paste a single block of your code into file, run it, and reproduce your problem. This also lets us test any suggestions in your context. [Include your minimal data frame](https://stackoverflow.com/questions/52413246/how-to-provide-a-reproducible-copy-of-your-dataframe-with-to-clipboard) as part of the example. – Prune Apr 03 '21 at 20:10
  • Sorry @Prune. I wasn't sure to post my attempt as I believed it wasn't helpful, but did so anyway just as you left your comments. As for the other issues, I was just looking for a general direction of things and did not have a specific error to ask a question about. Sorry that you still thought my question was inappropriate. – Jameson Apr 03 '21 at 20:36

2 Answers2

5

unstack and stack back with dropna=False:

days = ['Mon','Tue','Wed','Thu','Fri']

out = (df.set_index(['Person','Day'])['Value'].unstack()
         .reindex(days ,axis=1)
         .stack(dropna=False).reset_index(name='Value'))

    Person  Day  Value
0        1  Mon   32.0
1        1  Tue   30.0
2        1  Wed   34.0
3        1  Thu   22.0
4        1  Fri   43.0
5        2  Mon   22.0
6        2  Tue    NaN
7        2  Wed   14.0
8        2  Thu    NaN
9        2  Fri   11.0
10       3  Mon    NaN
11       3  Tue   13.0
12       3  Wed   22.0
13       3  Thu   23.0
14       3  Fri    NaN
anky
  • 74,114
  • 11
  • 41
  • 70
  • 1
    .unstack() and stack() seems to be all I needed. I kept getting an error "TypeError: Expected tuple, got str” with reindex because of 'axis=1'. I ended up just using: out = (df.set_index(['Person','Day'])['Value'].unstack().stack(dropna=False).reset_index()). This worked perfectly, thank you. – Jameson Apr 03 '21 at 20:33
2

@anky's solution is great; I'll suggest an alternative, that could be handy in abstracting the solution (especially for non-unique indices and nulls).

You can use the complete function from pyjanitor to expose the explicitly missing values:

# pip install janitor
import pandas as pd
import numpy as np
import janitor

In [5]: df.complete(['Person', 'Day'])
Out[5]: 
    Person  Day  Value
0        1  Fri   43.0
1        1  Mon   32.0
2        1  Thu   22.0
3        1  Tue   30.0
4        1  Wed   34.0
5        2  Fri   11.0
6        2  Mon   22.0
7        2  Thu    NaN
8        2  Tue    NaN
9        2  Wed   14.0
10       3  Fri    NaN
11       3  Mon    NaN
12       3  Thu   23.0
13       3  Tue   13.0
14       3  Wed   22.0

If you want to maintain the order of the days, you could convert the 'Day' column to a categorical type:

In [7]: (df.astype({"Day":pd.api.types.CategoricalDtype(categories=df.Day.unique(), 
                                                        ordered=True)})
           .complete(['Person', 'Day']))
Out[7]: 
    Person  Day  Value
0        1  Mon   32.0
1        1  Tue   30.0
2        1  Wed   34.0
3        1  Thu   22.0
4        1  Fri   43.0
5        2  Mon   22.0
6        2  Tue    NaN
7        2  Wed   14.0
8        2  Thu    NaN
9        2  Fri   11.0
10       3  Mon    NaN
11       3  Tue   13.0
12       3  Wed   22.0
13       3  Thu   23.0
14       3  Fri    NaN
sammywemmy
  • 27,093
  • 4
  • 17
  • 31
  • 1
    This is a great solution, and is what I ended up using. pyjanitor is a great package, so thanks so much for introducing me to it. – Jameson Apr 05 '21 at 20:23