0

I am trying to split a dataframe column into multiple columns as under:

  1. There are three columns overall. Two should rename in the new dataframe, while the third one to be split into new columns.
  2. Split is to be done using a specific character (say ":")
  3. The column that requires split can have varied number of ":" split. So new columns can be different for different rows, leaving some column values as NULL for some rows. That is okay.
  4. Each subsequently formed column has a specific name. Max number of columns that can be formed is known.
  5. There are four dataframes. Each one has this same formatted column that has to be split.

I came across following solutions but they don't work for the reasons mentioned:

  1. Link

    pd.concat([df[[0]], df[1].str.split(', ', expand=True)], axis=1)

    • This creates columns with names as 0,1,2... I need the new columns to have specific names.
  2. Link

    df = df.apply(lambda x:pd.Series(x))

    • This does no change to the dataframe. Couldn't understand why.
  3. Link

    df['command'], df['value'] = df[0].str.split().str

    • Here the column names are renamed properly, but this requires knowing beforehand how many columns will be formed. In my case, it is dynamic for each dataframe. For rows, the split successfully puts NULL value in extra columns. But using the same code for another dataframe generates an error saying number of keys should be same.

I couldn't post comments on these answers as I am new here on this community. I would appreciate if someone can help me understand how I can achieve my objective - which is: Dynamically use same code to split one column into many for different dataframes on multiple occasions while renaming the newly generated columns to predefined name.

For example:

Dataframe 1:

    Col1    Col2            Col3
0   A       A:B:C           A
1   A       A:B:C:D:E       A
2   A       A:B             A

Dataframe 2:

    Col1    Col2            Col3
0   A       A:B:C           A
1   A       A:B:C:D         A
2   A       A:B             A

Output should be: New dataframe 1:

    Col1   ColA   ColB   ColC   ColD   ColE   Col3
0   A      A      B      C      NaN    NaN    A
1   A      A      B      C      D      E      A
2   A      A      B      NaN    NaN    NaN    A

New dataframe 2:

    Col1   ColA   ColB   ColC   ColD   ColE   Col3
0   A      A      B      C      NaN    NaN    A
1   A      A      B      C      D      NaN    A
2   A      A      B      NaN    NaN    NaN    A

(If ColE is not there, then also it is fine.)

After this, I will be concatenating these dataframes into one, where I will need counts of all ColA to ColE for individual dataframes against Col1 and Col3 combinations. So, we need to keep this in mind.

Meet
  • 461
  • 4
  • 19
  • are the output examples in the wrong order? – Pablo C Jan 29 '21 at 14:45
  • Yes, it was. Sorry for that. Updated question. – Meet Jan 29 '21 at 15:44
  • Did my answer help you? – Pablo C Jan 29 '21 at 16:27
  • @PabloC Hey. Thanks for the response. Yes part of it. The actual data and column names are not as per the example. So we do not need the initial max split to get column names. I still have to get the max number of columns from all the dataframes manually to decide on the column names. Post that, I do this: `df = pd.concat([df[old_columns], df[column_for_split].str.split(':', expand=True)], axis=1)` and rename the columns either way: `df = df.rename(columns= {0: name1, 1: name2, 2: name3, 3: name4...})` or `df.columns = [old_columns,new_columns]` – Meet Jan 30 '21 at 07:39

1 Answers1

0

You can do it this way:

columns = df.Col2.max().split(':')
#['A', 'B', 'C', 'D', 'E']

new = df.Col2.str.split(":", expand = True)
new.columns = columns
new = new.add_prefix("Col")

df.join(new).drop("Col2", 1)
#  Col1 Col3 ColA ColB  ColC  ColD  ColE
#0    A    A    A    B     C  None  None
#1    A    A    A    B     C     D     E
#2    A    A    A    B  None  None  None

Pablo C
  • 4,661
  • 2
  • 8
  • 24