Split single Dataframe Column into multiple columns

Question

I have data along the lines below (although many more rows than the example of course). The data can appear in diferent order.

df = pd.DataFrame({'SmVariant': ['1xFBBC', float('nan'), '2xFBBA', '5xABIA', \
                                '2xFBBC, 1xFBBA', '1xFBBA', '4xABIA', \
                                '1xFBBA, 1xFBBC', float('nan'), '1xFBBA', \
                                '3xFBBA, 1xFBBC']})

I want to split into numerical columns like this: (ultimately to get totals)

FBBA  FBBC  ABIA
1           
          
2            
              5
1       2
1
              4
1       1

1
3       1

Welcome to stack overflow! What have you tried so far, and what went wrong with your attempts? Please [edit] your question to include a [mcve] — G. Anderson, Oct 15 '20 at 19:12

Tom · Answer 1 · 2020-10-15T19:32:09.657

I assume you mean a pandas DataFrame. I also assume that you know the different types of element upfront and can put them into a dictionary like so (to map the elements into the final columns:

cols={'AAAA':0, 'BBBB': 1, 'CCCC': 2}

Next write a function that converts a specific element into the multiple columns:

def expand_element(el):
  res = [0]*len(cols)
  for item in el.split(','):
    q, name = item.split('x')
    res[cols[name]]=int(q)
  return res

Finally, use that function and apply it to each element in your dataframe as follows:

df.apply(lambda x: expand_element(x[0]), axis=1, result_type='expand')

Here is my interactive session showing input and outputs:

>>> df=pd.DataFrame({"c1": ["1xAAAA,2xBBBB", "1xAAAA,2xBBBB"]})
>>> df.apply(lambda x: expand_element(x[0]), axis=1, result_type='expand')
   0  1  2
0  1  2  0
1  1  2  0

That looks interesting. I do want to maintain the "AAAA", "BBBB" etc as headings. — Wayne_AB, Oct 15 '20 at 22:30

CypherX · Accepted Answer · 2020-10-18T23:56:49.237

Solution

You can do it in a single line using regex + pandas method-chaining as follows. I have broken it into multiple lines for better readability. For more details please see Section C below. ⭐

Note: Sections A and B use the data from section D, which the OP had shared earlier. Later the data in the question was changed and Section C gives the solution to this use-case.

Regex Examples:

In the interest of explaining the how regex-pattern works, please take a look at the following three examples:

example-1: I have used this regex pattern in Section-A. This will help work given we have the data as given in Section-D.
example-2: This is an improved version of the regex in example-1. I have used this regex pattern in Section-B and C.
example-3: Finally, this example illustrates how the pattern used in example-2 could skip unwanted text and only pick intended text-parrens.

# without alphabetically ordering the columns
(df[COLUMN_NAME]                                                   ## access the "data"-column
    .fillna('0xUNKN')                                              ## replace nan values with 0xUNKN 
    .str.findall(pat)                                              ## use regex to extract patterns
    .apply(lambda x: dict((k, v) for v, k in x if (int(v)!=0)))    ## row-wise create dict to construct final {column: count} structure
    .apply(pd.Series)                                              ## use dict to create columns
    .fillna(0)                                                     ## replace NaN values with 0
)

A. Each operation with explanation

Here I explain what each operation does and finally reorder the columns alphabetically.

Regex Explanation: example-1

Find a detailed explanation of how the regex (\d+)x(\w+)\s*,\s*(\d+)x(\w+) works to extract various intended parts from the input text here: example-1.

# NOTE: I am using the dataframe that I created in 
# the Dummy Data section "below"

df2 = (df.data                                      # access the "data"-column
    .str.findall('(\d+)x(\w+)\s*,\s*(\d+)x(\w+)')   # use regex to extract patterns
    .explode()                                      # explode each rows' list into columns
    .apply(lambda x: {x[1]: x[0], x[3]: x[2]})      # row-wise create dict to construct final {column: count} structure
    .apply(pd.Series)                               # expand each cell into columns
    .fillna(0)                                      # replace NaN values with 0
)
df2 = df2.reindex(sorted(df2.columns), axis=1)      # alphabetically reorder columns
print(df2)

Output:

  AAAA BBBB CCCC
0    1    1    0
1    1    2    0
2    1    0    1

B. A more generic solution

If you have more than two types per line (say, AAAA, BBBB, CCCC), the following solution will work in that case as well.

Regex Explanation: example-2

Find a detailed explanation of how the regex (?:\s*(\d+)x(\w+)\s*)+ works to extract various intended parts from the input text here: example-2.

Code

import pandas as pd

## Dummy Data
data = [
    '1xAAAA,2xBBBB,3xDDDD', 
    '1xBBBB,1xAAAA,6xEEEE', 
    '1xAAAA,1xCCCC,3xDDDD', 
]
df = pd.DataFrame(data, columns=['data'])
print('\n Input:')
print(df)
## Output:
#                    data
# 0  1xAAAA,2xBBBB,3xDDDD
# 1  1xBBBB,1xAAAA,6xEEEE
# 2  1xAAAA,1xCCCC,3xDDDD

## Process DataFrame
# define regex pattern    
pat = '(?:\s*(\d+)x(\w+)\s*)+' # regex search pattern
# create dataframe in the expected format
df2 = (df.data                                      ## access the "data"-column
    .str.findall(pat)                               ## use regex to extract patterns
    .apply(lambda x: dict((k, v) for v, k in x))    ## row-wise create dict to construct final {column: count} structure
    .apply(pd.Series)                               ## use dict to create columns
    .fillna(0)                                      ## replace NaN values with 0
)
df2 = df2.reindex(sorted(df2.columns), axis=1)      ## alphabetically reorder columns
print('\n Output:')
print(df2)

## Output:
#   AAAA BBBB CCCC DDDD EEEE
# 0    1    2    0    3    0
# 1    1    1    0    0    6
# 2    1    0    1    3    0

C. Specific Example for Data Shared by the OP ⭐

Here is an example on the specific sample-data shared by the OP. This specific use-case shows the presence of nan values in the dataframe. As a strategy to use the previously suggested solution with minimal modification, you could just replace those nan values with a string 0xUNKN and then filter the results that do not start with a 0.

import pandas as pd

COLUMN_NAME = 'SmVariant'

## Dummy Data
data = [
    '1xFBBC', float('nan'), 
    '2xFBBA', '5xABIA', 
    '2xFBBC, 1xFBBA', 
    '1xFBBA', '4xABIA', 
    '1xFBBA, 1xFBBC', 
    float('nan'), '1xFBBA', 
    '3xFBBA, 1xFBBC', 
]
df = pd.DataFrame({COLUMN_NAME: data})
print('\n Input:')
print(df)
## Output:
#          SmVariant
# 0           1xFBBC
# 1              NaN
# 2           2xFBBA
# 3           5xABIA
# 4   2xFBBC, 1xFBBA
# 5           1xFBBA
# 6           4xABIA
# 7   1xFBBA, 1xFBBC
# 8              NaN
# 9           1xFBBA
# 10  3xFBBA, 1xFBBC

## Process DataFrame
# define regex pattern    
pat = '(?:\s*(\d+)x(\w+)\s*)+' # regex search pattern
# create dataframe in the expected format
df2 = (df[COLUMN_NAME]                                             ## access the "data"-column
    .fillna('0xUNKN')                                              ## replace nan values with 0xUNKN 
    .str.findall(pat)                                              ## use regex to extract patterns
    .apply(lambda x: dict((k, v) for v, k in x if (int(v)!=0)))    ## row-wise create dict to construct final {column: count} structure
    .apply(pd.Series)                                              ## use dict to create columns
    .fillna(0)                                                     ## replace NaN values with 0
)
df2 = df2.reindex(sorted(df2.columns), axis=1)                     ## alphabetically reorder columns
print('\n Output:')
print(df2)

## Output:
#    ABIA FBBA FBBC
# 0     0    0    1
# 1     0    0    0
# 2     0    2    0
# 3     5    0    0
# 4     0    1    2
# 5     0    1    0
# 6     4    0    0
# 7     0    1    1
# 8     0    0    0
# 9     0    1    0
# 10    0    3    1

D. Dummy Data

import pandas as pd

data = {
    '1xAAAA,2xBBBB', 
    '1xBBBB,1xAAAA', 
    '1xAAAA,1xCCCC', 
}
df = pd.DataFrame(data, columns=['data'])
print(df)
## Output:
#             data
# 0  1xBBBB,1xAAAA
# 1  1xAAAA,2xBBBB
# 2  1xAAAA,1xCCCC

References

@Wayne_AB Let me know if you have any questions. I hope this helps. — CypherX, Oct 15 '20 at 20:51
Thank you. I will look into the solution when I have some time. Could you please explain the regex a bit? — Wayne_AB, Oct 15 '20 at 22:29
@Wayne_AB I have added two examples in the solution: the first one will explain the first pattern step by step and the second one (which is a more concise and generic version of the first one), will explain you each steps in it. [example-1](https://regex101.com/r/UXhHRl/1) and [example-2](https://regex101.com/r/Ek90Mu/1/). — CypherX, Oct 17 '20 at 06:31
Here is a better sample data: df = pd.DataFrame({'SmVariant': ['1xFBBC', float('nan'), '2xFBBA', '5xABIA', \ '2xFBBC, 1xFBBA', '1xFBBA', '4xABIA', \ '1xFBBA, 1xFBBC', float('nan'), '1xFBBA', \ '3xFBBA, 1xFBBC']}) — Wayne_AB, Oct 17 '20 at 06:55
CypherX, if you have added ex1-3, I am very grateful. I think it is the right solution to handle variable length data. But I am getting an error I believe is related to NaN data. — Wayne_AB, Oct 17 '20 at 08:23
Thank you. I am 110% blown away by the response. I am an electronic engineer who has not done any coding for many years. I started with Pascal many moons ago, and more recently have used VisualBasic which is too weak and outdated for my purposes. This is a fantastic resource and I am extremely grateful. Thanks CypherX. — Wayne_AB, Oct 17 '20 at 09:07
@Wayne_AB I am glad the answer helped you. Please consider **`voting-up`** and **`accepting`** the answer. Thank you. — CypherX, Oct 18 '20 at 00:45