0

I have a large dataset with about 50 columns, but the columns of concern are listed below:

ID category price code
11 A 10 1
12 B 5 22
12 A 9 26
25 B 13 17
26 A 8 19
26 B 11 6

As you can see, some of my IDs have both categories A and B and some don’t. In my output table I want each ID to have both categories A and B and “blank” cells for missing prices and codes in the newly added rows. My desired output looks like this:

ID category price code
11 A 10 1
11 B
12 A 9 26
12 B 5 22
25 A
25 B 13 17
26 A 8 19
26 B 11 6

Any ideas would be appreciated.

Dina
  • 9
  • 1
  • Markdown tables are nice to look at but difficult to copy. Could you paste the result of applying `.head().to_dict()` to your table? – Ben Grossmann Feb 17 '22 at 02:08
  • {'ID': {0: 13, 1: 22, 2: 22, 3: 32, 4: 33}, 'Category': {0: 4020, 1: 4020, 2: 4020, 3: 4020, 4: 4020}, 'Price': {0: 1, 1: 1, 2: 2, 3: 2, 4: 2}, 'Code': {0: 14, 1: 39, 2: 1, 3: 53, 4: 1}, 'Ranking: {0: 14.0, 1: 15.0, 2: 1.0, 3: 31.0, 4: 1.0}, . . . . – Dina Feb 17 '22 at 02:24
  • Using `df.set_index(['ID','category']).reindex(pd.MultiIndex.from_tuples(itertools.product(df.ID, df.category)))` as in [this post](https://stackoverflow.com/a/53872957/2476977) gets close to what you want, but it results in repeated rows for some reason I don't understand – Ben Grossmann Feb 17 '22 at 02:29
  • yeah it is not working well. But thanks anyway! – Dina Feb 17 '22 at 02:36
  • see my latest edit – Ben Grossmann Feb 17 '22 at 02:37
  • I ran the code but it says "None of ['Category'] are in the columns"! – Dina Feb 17 '22 at 02:46
  • I suspect that's a capitalization issue. I've edited my answer so that the script is self contained and can run with a copy and paste. – Ben Grossmann Feb 17 '22 at 03:17
  • Yes it is working! Thanks a lot. – Dina Feb 17 '22 at 04:04

1 Answers1

0

As in this post, you could do the following.

import itertools
import pandas as pd

data = {'ID': {0: 11, 1: 12, 2: 12, 3: 25, 4: 26, 5: 26},
    'Category': {0: 'A', 1: 'B', 2: 'A', 3: 'B', 4: 'A', 5: 'B'},
    'Price': {0: 10, 1: 5, 2: 9, 3: 13, 4: 8, 5: 11},
    'Code': {0: 1, 1: 22, 2: 26, 3: 17, 4: 19, 5: 6}}
df = pd.DataFrame(data)
df.set_index(['ID','Category']).reindex(
    pd.MultiIndex.from_tuples(
        itertools.product(df.ID.unique(), df.Category.unique())))

The result:

                price  code
11  A          10      1   
    B             NaN   NaN
12  A          9       26  
    B          5       22  
25  A             NaN   NaN
    B          13      17  
26  A          8       19  
    B          11      6   
Ben Grossmann
  • 4,387
  • 1
  • 12
  • 16