Suppose I have df_in
below:
df_in = pd.DataFrame({'X': ['a', 'b', 'c'], 'A': [1, 0, 0], 'B': [1, 1, 0]})
df_in
:
+---+---+---+---+
| | X | A | B |
+---+---+---+---+
| 0 | a | 1 | 1 |
| 1 | b | 0 | 1 |
| 2 | c | 0 | 0 |
+---+---+---+---+
I want to achieve something like the following:
df_out = pd.DataFrame({'X': ['a', 'a', 'b'], 'Y': ['A', 'B', 'B']})
df_out
:
+---+---+---+
| | X | Y |
+---+---+---+
| 0 | a | A |
| 1 | a | B |
| 2 | b | B |
+---+---+---+
I also have a list containing the columns: l = list(['A', 'B'])
. The logic is, for each column in df_in
that is in l
, repeat those observations where the column value == 1
, and add the column name to a new column in df_out
, this is Y
in the example. In reality there are more columns in df_in
and not all of them are in l
, which is why I want to solve this without explicit references to columns A
, B
and X
.
NOTE: This is not entirely covered by this answer since, as stated above, there are many columns in reality, and these can be of any type and data, so the solution, df_out
, needs to take into account all of the original columns (X
in this case). In theory, X
can also be a binary 0/1
column, but should only affect the outcome in the same way as A
and B
if it's included in l
. I hope this helps clarify.