1

I have the following pd.DataFrame representing constraints of an optimisation problem.

          FEATURE     COLOR        CLASS       CONSTRAINTS
0         1.0         NaN          NaN         0.20
1         3.0         NaN          NaN         0.20
2         1.0         1.0          NaN         0.15
3         1.0         NaN            b        -0.05
4         1.0         1.0            a        -0.07
5         1.0         1.0            b        -0.10
6         3.0         1.0          NaN         0.10
7         NaN         NaN          NaN         0.20

Here FEATURE represents a categorical variable with possible values [1,2,3], COLOR represents a categorical variable with possible values [1,2] and CLASS is another categorical variable with possible values [a,b,c,d].

Missing values here have the meaning "all other values". In this sense the dataframe is a compressed version of a larger dataframe encompassing all or some of the combinations of the columns categories.

What I would like to do here is to "expand" the NaN values to all possible values each individual column can represent. For example row 0 would expand to 8 total rows, being the product of "free" features, namely COLOR with possible values [1,2] and CLASS with possible values `[a,b,c,d]``

new     FEATURE   COLOR    CLASS  CONSTRAINTS
0       1         1        a      0.2
1       1         1        b      0.2
2       1         2        a      0.2
3       1         2        b      0.2
4       1         1        a      0.2
5       1         1        b      0.2
6       1         2        a      0.2
7       1         2        b      0.2

How can I efficiently perform this transformation in Pandas?

linello
  • 8,451
  • 18
  • 63
  • 109
  • 1
    kindly share the expected output for the complete dataframe. if @MustafaAdyin solution is what you are after, then kindly accept it – sammywemmy Apr 14 '21 at 21:36

1 Answers1

1

Here is one way:

mapping = {'FEATURE': [1, 2, 3], 'COLOR': [1, 2], 'CLASS': ['a', 'b', 'c', 'd']}
cols = mapping.keys()

for col in cols:
    df[col] = df[col].apply(lambda x: mapping[col] if pd.isna(x) else x)
    df = df.explode(col)

For each column of interest, we fill the NaNs with the possible values (note that fillna doesn't fill with lists, hence the apply). Then we explode the so-filled column and repeat this for all three of the columns.

We get the df at the end as:

  FEATURE COLOR CLASS  CONSTRAINTS
0       1     1     a         0.20
0       1     1     b         0.20
0       1     1     c         0.20
0       1     1     d         0.20
0       1     2     a         0.20
0       1     2     b         0.20
0       1     2     c         0.20
0       1     2     d         0.20
1       3     1     a         0.20
1       3     1     b         0.20
1       3     1     c         0.20
1       3     1     d         0.20
1       3     2     a         0.20
1       3     2     b         0.20
1       3     2     c         0.20
1       3     2     d         0.20
2       1     1     a         0.15
2       1     1     b         0.15
2       1     1     c         0.15
2       1     1     d         0.15
3       1     1     b        -0.05
3       1     2     b        -0.05
4       1     1     a        -0.07
5       1     1     b        -0.10
6       3     1     a         0.10
6       3     1     b         0.10
6       3     1     c         0.10
6       3     1     d         0.10
7       1     1     a         0.20
7       1     1     b         0.20
7       1     1     c         0.20
7       1     1     d         0.20
7       1     2     a         0.20
7       1     2     b         0.20
7       1     2     c         0.20
7       1     2     d         0.20
7       2     1     a         0.20
7       2     1     b         0.20
7       2     1     c         0.20
7       2     1     d         0.20
7       2     2     a         0.20
7       2     2     b         0.20
7       2     2     c         0.20
7       2     2     d         0.20
7       3     1     a         0.20
7       3     1     b         0.20
7       3     1     c         0.20
7       3     1     d         0.20
7       3     2     a         0.20
7       3     2     b         0.20
7       3     2     c         0.20
7       3     2     d         0.20

which has 52 rows as expected.

Mustafa Aydın
  • 17,645
  • 4
  • 15
  • 38