0

I have a data frame that looks like this:

Statement        Standard A       Standard B      Standard C
Statement 1      A1               B-1.2.3, 1.2.4
Statement 1      A1                               C2, 3, 4 
Statement 2      A2                               C5  
Statement 3      A3               B-1.2.5
Statement 3      A3                               C6,7
{
    "Statement": ["Statement 1", "Statement 1", "Statement 2",
                  "Statement 3", "Statement 3",],
    "Standard A": ["A1", "A1", "A2", "A3", "A3"],
    "Standard B": ["B-1.2.3, 1.2.4", np.nan, np.nan, "B-1.2.5", np.nan],
    "Standard C": [np.nan, "C2, 3, 4 ", "C5  ", np.nan, "C6,7"],
}

Basically what I need to do is convert it to this:

Statement        Standard A       Standard B      Standard C
Statement 1      A1               B-1.2.3         C2
Statement 1      A1               B-1.2.4         C3
Statement 1      A1                               C4
Statement 2      A2                               C5  
Statement 3      A3               B-1.2.5         C6
Statement 3      A3                               C7

To help clarify, there are 3 Standards. Each standard has a number of standard "references" that map to the "statements".

So standards B-1.2.3 and B-1.2.4 (from Standard B) and C2, C3, and C4 (from Standard C) map to Statement 1. A1 (from Standard A) does as well, but Standard A is already fully mapped.

Right now I have three issues:

  1. The standard prefixes ("B-" and "C") are only applied to the first references in the line. Each standard reference needs the prefix. (Optional)

  2. The individual standard references (within the Standard) are listed on the same row (with ',' delimiter), and each references needs to be on its own row

  3. The standards are staggered (i.e. B-1.2.3 and C2 can be in the same row, but they are not)

I've tried applying the solution here: Split cell into multiple rows in pandas dataframe, but my columns have different lengths and so I get the error stating that fact.

I also tried to adapt this solution, also without success: Pandas dataframe: how do I split one row into multiple rows by multi-value column?

I'm not convinced that either of these solutions is even applicable. I may be barking up the wrong tree.

I'm pretty lost at this point.

clines
  • 725
  • 1
  • 6
  • 10
  • You should edit your question to show what you have tried and where you get stuck. You should also describe how `Statement C` is distributed into the output dataframe. – Alex Sep 09 '21 at 16:42
  • Should have been ```Standard C```, my mistake. I've edited to correct that, and add more information. – clines Sep 09 '21 at 18:02

1 Answers1

1

Here is a working pipeline:

import re
(df.assign(**{'Standard B': df['Standard B'].str.split(',\s*')})
   .explode('Standard B')
   .assign(**{'Standard B': lambda d: d['Standard B'].str.replace('^(\d)', r'B-\1', regex=True),
              'Standard C': lambda d: re.split('\s*,\s*', ', '.join(df['Standard C'].dropna())),
             })
   .assign(**{'Standard C': lambda d: d['Standard C'].str.replace('^(\d)', r'C\1', regex=True),
             })
)

output:

     Statement Standard A Standard B Standard C
0  Statement 1         A1    B-1.2.3         C2
0  Statement 1         A1    B-1.2.4         C3
1  Statement 1         A1        NaN         C4
2  Statement 2         A2        NaN         C5
3  Statement 3         A3    B-1.2.5         C6
4  Statement 3         A3        NaN         C7
mozway
  • 194,879
  • 13
  • 39
  • 75