
This may sound like a strange question at first, but I found it hard to find "standard" terms for the elements of data in long format. So I thought I might as well use the same terms as Hadley Wickham uses in one of the first examples in his article on Tidy Data:

[Image: example table from the Tidy Data article, with the elements labelled `row`, `column`, and `value`]

In a sample of my real-world data, `row` contains dates, `column` contains categories, and `value` contains prices, like this:

Input

    row         column  value
0   21.08.2020  A       43
1   21.08.2020  A       36
2   21.08.2020  B       36
3   21.08.2020  C       28
4   22.08.2020  A       16
5   22.08.2020  B       40
6   22.08.2020  B       34

Here, the column values are not as regular as in the image above; some column values are missing for some row values. How can I include those missing combinations in the same dataset with value set to 0? In the sample dataframe above, column C only occurs for row = 21.08.2020.

Is there a pandas function that can take this into consideration and include 22.08.2020 C 0?

Desired output

    row         column  value
0   21.08.2020  A       43
1   21.08.2020  A       36
2   21.08.2020  B       36
3   21.08.2020  C       28
4   22.08.2020  A       16
5   22.08.2020  B       40
6   22.08.2020  B       34
7   22.08.2020  C       0

I've tried retrieving all unique column values (['A', 'B', 'C']) and then looping through all row values, inserting the missing columns with value = 0, but that turned into a real mess really fast. So any other suggestions would be great!

Edit: From long to wide using pd.pivot_table

Using pd.pivot_table(df, index='row', columns='column', values='value') will turn the Input dataframe above into:

column      A       B       C
row         
21.08.2020  39.5    36.0    28.0
22.08.2020  16.0    37.0    NaN

Here, NaN is included by default for column=C and row=22.08.2020. So what remains is to melt or stack this dataframe back into the desired output without dropping the NaN.
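As it turned out (see the comments below), a pivot-and-stack round trip does exactly this, with one caveat: pd.pivot_table defaults to aggfunc='mean', so the duplicate observations for (21.08.2020, A) are collapsed into their mean. A minimal sketch of that idea:

df1 = (pd.pivot_table(df, index='row', columns='column', values='value')
         .stack(dropna=False)          # keep the NaN for (22.08.2020, C)
         .reset_index(name='value')
         .fillna({'value': 0}))        # replace the NaN with 0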

Edit 2: Sample dataframe

import pandas as pd
df=pd.DataFrame({'row': {0: '21.08.2020',
  1: '21.08.2020',
  2: '21.08.2020',
  3: '21.08.2020',
  4: '22.08.2020',
  5: '22.08.2020',
  6: '22.08.2020'},
 'column': {0: 'A', 1: 'A', 2: 'B', 3: 'C', 4: 'A', 5: 'B', 6: 'B'},
 'value': {0: 43, 1: 36, 2: 36, 3: 28, 4: 16, 5: 40, 6: 34}})
vestland
  • Missing how? The `value` column is empty and you want to insert 0 for those? Or each date needs to have a value/row for all three categories? – wwii Aug 22 '20 at 22:35
  • @wwii There are three unique "columns": `A, B, C`. But C only occurs with a value on `21.08.2020`. That does not necessarily mean that C does not *exist* on `22.08.2020`. It just has not been recorded. And I'd like to include it as an observation on `22.08.2020` with value set to `0`. This is easy enough in this particular sample, but I'm working with many many more observations. Anyway, I would assume that pandas had some built-in function to handle this, but I haven't found anything yet. – vestland Aug 22 '20 at 22:40
  • Do you need to derive the categories (column values) dynamically or is it a known set? – wwii Aug 22 '20 at 22:42
  • @wwii Unknown set. But that's easily retrievable anyway using `df1.column.unique()`? – vestland Aug 22 '20 at 22:46
  • @wwii With your comment, you set me on the right track. Or *hack*. I'll write it up as an answer, but please contribute if you find a more elegant approach. – vestland Aug 22 '20 at 22:52
  • In the pivot part, fillna(0), then stack? – BENY Aug 22 '20 at 22:55
  • Or Pivot then stack(dropna=False) ? – BENY Aug 22 '20 at 22:57
  • Yep, @BEN_YO got it: ``pd.pivot_table(df, index="row", columns="column", values="value").stack(dropna=False).reset_index(name="value")`` – sammywemmy Aug 22 '20 at 22:58
  • @sammywemmy Cool! Would you like to write that up as an answer? I somewhat unexpectedly found a way with `unstack` too. I'm going to write it up as an answer too. But of course, *anyone* else but me with a working suggestion will get the acceptance mark... – vestland Aug 22 '20 at 23:06
  • For me though, the props goes to @BEN_YO. Ben, go ahead, post the solution. I think I even saw something similar in one of your previous posts. – sammywemmy Aug 22 '20 at 23:08
  • @sammywemmy no worries, I think OP already posted an answer ~ :-) – BENY Aug 22 '20 at 23:10
  • @BEN_YO I did, but it turned out to be a bit flawed. Some observations are missing because `aggfunc` in `pd.pivot_table` defaults to `np.mean`. Setting it to `None` or `False` doesn't work either – vestland Aug 22 '20 at 23:29
  • @vestland I have post my thought ~ – BENY Aug 22 '20 at 23:37

4 Answers


This is different from the previous approaches since we have multiple values for the same (row, column) pair:

# Number duplicate (row, column) pairs so that pivoting does not aggregate them
df['key'] = df.groupby(['row', 'column']).cumcount()

# Pivot with the duplicate counter as an extra column level
df1 = pd.pivot_table(df, index='row', columns=['key', 'column'], values='value')

# Stack both column levels back into long format, keeping the NaN
df1 = df1.stack(level=[0, 1], dropna=False).to_frame('value').reset_index()

# Keep the first occurrence of every combination plus any recorded duplicates
df1 = df1[df1.key.eq(0) | df1['value'].notna()]
df1
Out[97]: 
           row  key column  value
0   21.08.2020    0      A   43.0
1   21.08.2020    0      B   36.0
2   21.08.2020    0      C   28.0
3   21.08.2020    1      A   36.0
6   22.08.2020    0      A   16.0
7   22.08.2020    0      B   40.0
8   22.08.2020    0      C    NaN
10  22.08.2020    1      B   34.0
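To get the value column in the shape of the desired output, the remaining NaN can be filled with 0 and the helper key dropped afterwards, for example:

df1['value'] = df1['value'].fillna(0).astype(int)
df1 = df1.drop(columns='key').reset_index(drop=True)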
BENY

I found an approach with pd.pivot_table() in combination with unstack():

import pandas as pd
df=pd.DataFrame({'row': {0: '21.08.2020',
  1: '21.08.2020',
  2: '21.08.2020',
  3: '21.08.2020',
  4: '22.08.2020',
  5: '22.08.2020',
  6: '22.08.2020'},
 'column': {0: 'A', 1: 'A', 2: 'B', 3: 'C', 4: 'A', 5: 'B', 6: 'B'},
 'value': {0: 43, 1: 36, 2: 36, 3: 28, 4: 16, 5: 40, 6: 34}})

df1 = pd.pivot_table(df, index='row', columns='column', values='value').unstack().reset_index()
print(df1)

Output

    column  row         0
0   A       21.08.2020  39.5
1   A       22.08.2020  16.0
2   B       21.08.2020  36.0
3   B       22.08.2020  37.0
4   C       21.08.2020  28.0
5   C       22.08.2020  NaN

The order of the dataframe columns is arguably messed up, though...
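If the order matters, the columns can be renamed and reordered afterwards, and the NaN filled with 0, for example:

df1 = df1.rename(columns={0: 'value'})[['row', 'column', 'value']]
df1['value'] = df1['value'].fillna(0)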

vestland
  • I just realised that some rows are lost with this method. No 34. – sammywemmy Aug 22 '20 at 23:18
  • Aggregating `value` with mean fits your needs? – wwii Aug 22 '20 at 23:20
  • @sammywemmy `pd.pivot_table(df, index="row", columns="column", values="value").stack( dropna=False).reset_index(name="value")` and `pd.pivot_table(df,index='row',columns='column',values='value').unstack(dropna=False).reset_index() ` seem to produce the same error... – vestland Aug 22 '20 at 23:24
  • @sammywemmy The reasons seems to be that `aggfunc` is unspecified and defaults to `np.mean` – vestland Aug 22 '20 at 23:27
  • Hmmm. I guess other answers suffice. As a side note, why are there different values for the same day and variable? – sammywemmy Aug 23 '20 at 00:00
  • @sammywemmy Not different, just *more*. We can think of it as the same type of product sold twice, but for different prices. Or we could just stop talking about price altogether, and regard it as, for example, selling product A to two different clients in two different quantities. – vestland Aug 23 '20 at 00:03
  • Cool. I was in church so could not fully think about it. Ben's answer solves it pretty well. Currently testing some code to see if I can come up with something different. – sammywemmy Aug 23 '20 at 03:36

Here is a naive approach that uses a for loop.

import collections
import pandas as pd

data = {'row': {0: '21.08.2020', 1: '21.08.2020', 2: '21.08.2020',
                3: '21.08.2020', 4: '22.08.2020', 5: '22.08.2020',
                6: '22.08.2020'},
        'column': {0: 'A', 1: 'A', 2: 'B', 3: 'C', 4: 'A', 5: 'B', 6: 'B'},
        'value': {0: 43, 1: 36, 2: 36, 3: 28, 4: 16, 5: 40, 6: 34}}

df = pd.DataFrame(data)

# The full set of categories that should be present for every date
categories = set(df.column.unique())

# For each date, the set of categories actually observed
tbl = pd.pivot_table(df[['row', 'column']], values='column', index='row', aggfunc=set)

# For each date, the categories that are missing
missing = tbl.column.apply(categories.difference)
missing = filter(lambda x: x[1], missing.items())

# Build the missing rows with value = 0 and append them
d = collections.defaultdict(list)
for row, col in missing:
    for cat in col:
        d['row'].append(row)
        d['column'].append(cat)
        d['value'].append(0)

df2 = df.append(pd.DataFrame(d)).reset_index()

Of course, all the new values will be at the end, so the result needs to be sorted if that is an issue.
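For example, a final sort restores the grouping by date:

df2 = df2.sort_values(['row', 'column']).reset_index(drop=True)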


Intermediate objects:

>>> tbl
               column
row                  
21.08.2020  {A, B, C}
22.08.2020     {A, B}
>>> missing
row
21.08.2020     {}
22.08.2020    {C}
Name: column, dtype: object
>>>
wwii

Here is an alternative. It sets the row and column columns as the new index, builds all possible combinations of the values in row and column, and joins (how='outer') against an empty dataframe that has those combinations as its index:

from itertools import product

# All possible (row, column) combinations
new_index = product(set(df.row.array), set(df.column.array))

# Use (row, column) as the index on both sides of the join
df = df.set_index(["row", "column"])
new_index = pd.DataFrame([], index=pd.Index(new_index, names=["row", "column"]))

# The outer join introduces <NA> for the missing combinations
df.join(new_index, how="outer").reset_index().astype({"value": "Int8"})  # Int8 if you are keen on nullable integers

    row         column  value
0   21.08.2020  A       43
1   21.08.2020  A       36
2   21.08.2020  B       36
3   21.08.2020  C       28
4   22.08.2020  A       16
5   22.08.2020  B       40
6   22.08.2020  B       34
7   22.08.2020  C       <NA>
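If a plain 0 is preferred over the nullable <NA>, the missing value can be filled before the cast, for example:

result = (df.join(new_index, how="outer")
            .reset_index()
            .fillna({"value": 0})
            .astype({"value": "int64"}))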
sammywemmy