0

I have a dataframe like this:

                               Id Column  Val1   Val2
0      Cust=abc,Region Info=xyz,Data=123   0.0    NaN
1      Cust=abd,Region Info=xyz,Data=124   1.0  750.0
2                  Cust=acc hit,Data=125   3.0  400.0
3      Cust=abc,Region Info=xyz,Data=126   NaN  200.0
4  Cust=abg nss,Region Info=xaz,Data=127  -1.0  420.0
5               Cust=evc,Region Info=atz   2.0    NaN

I want to convert the dataframe to this:

                               Id Column  Val1   Val2     Cust Region Info   Data
0      Cust=abc,Region Info=xyz,Data=123   0.0    NaN      abc         xyz  123.0
1      Cust=abd,Region Info=xyz,Data=124   1.0  750.0      abd         xyz  124.0
2                  Cust=acc hit,Data=125   3.0  400.0  acc hit         NaN  125.0
3      Cust=abc,Region Info=xyz,Data=126   NaN  200.0      abc         xyz  126.0
4  Cust=abg nss,Region Info=xaz,Data=127  -1.0  420.0  abg nss         xaz  127.0
5               Cust=evc,Region Info=atz   2.0    NaN      evc         atz    NaN

From this other question, I got a partial answer.

But how can I handle the spaces in the key and value?

Edit: There may be multiple key-value pairs (other than the ones shown in the example). So I need to handle cases for any 'n' number of columns.

Asif Iqbal
  • 421
  • 3
  • 13
  • `There may be multiple key-value pairs (other than the ones shown in the example). So I need to handle cases for any 'n' number of columns.` I could see your edited message here. Its always recommended to post all question's requirement in your base question itself, because if you later change it users/future readers will be confused by it. – RavinderSingh13 Apr 15 '21 at 15:32

4 Answers4

3

Series.str.findall

We can use str.findall with regex capture groups to extract key-value pairs from the Id Column column

df.join(pd.DataFrame(map(dict, df['Id Column'].str.findall(r'([^=,]+)=([^,]+)'))))

                               Id Column  Val1   Val2     Cust Region Info Data
0      Cust=abc,Region Info=xyz,Data=123   0.0    NaN      abc         xyz  123
1      Cust=abd,Region Info=xyz,Data=124   1.0  750.0      abd         xyz  124
2                  Cust=acc hit,Data=125   3.0  400.0  acc hit         NaN  125
3      Cust=abc,Region Info=xyz,Data=126   NaN  200.0      abc         xyz  126
4  Cust=abg nss,Region Info=xaz,Data=127  -1.0  420.0  abg nss         xaz  127
5               Cust=evc,Region Info=atz   2.0    NaN      evc         atz  NaN

Regex details

  • ([^=,]+): first capturing group
    • [^=,]+ : Matches any character not present in the list [=,] one or more times
  • = : Matches the = character literally
  • ([^,]+) : Second capturing group
    • [^,]+ : Matches any character not present in the list [,] one or more times

See the online regex demo

Shubham Sharma
  • 68,127
  • 6
  • 24
  • 53
2

With your shown samples only, could please try following.

import pandas as pd
df[["Cust Region","Info","Data"]] = df["IdColumn"].str.extract(r'^Cust=([^,]+)(?:,Region Info=([^,]*))?(?:,Data=(.*))?$', expand=True)
df

Here is the Online demo for used regex

Output will be as follows:

                                IdColumn  Val1   Val2 Cust Region Info Data
0      Cust=abc,Region Info=xyz,Data=123   0.0    NaN         abc  xyz  123
1      Cust=abd,Region Info=xyz,Data=124   1.0  750.0         abd  xyz  124
2                  Cust=acc hit,Data=125   3.0  400.0     acc hit  NaN  125
3      Cust=abc,Region Info=xyz,Data=126   NaN  200.0         abc  xyz  126
4  Cust=abg nss,Region Info=xaz,Data=127  -1.0  420.0     abg nss  xaz  127
5               Cust=evc,Region Info=atz   2.0    NaN         evc  atz  NaN

Explanation: Adding detailed explanation for above regex.

^Cust=              ##Checking if value starts from Cust= here.
([^,]+)             ##Creating 1st capturing group which has all values till , here.
(?:,Region Info=    ##Starting a non-capturing group , Region Info= here.
  ([^,]*)           ##Creating 2nd capturing group which has all values till , here.
)?                  ##Closing non-capturing group here.
(?:,Data=           ##Creating non-capturing group which has ,Data= here.
  (.*)              ##Creating 3rd capturing group which has all values till end of value here.
)?$                 ##Closing non-capturing group here at the end of line.
RavinderSingh13
  • 130,504
  • 14
  • 57
  • 93
1

Use apply() on column Id Column and get the value by splitting.

df['Cust Region'] = df['Id Column'].apply(lambda x: x.split(',')[0].split('=')[-1])
# print(df)

                               Id Column  Val1   Val2 Cust Region
0      Cust=abc,Region Info=xyz,Data=123  0.0     NaN         abc
1      Cust=abd,Region Info=xyz,Data=124  1.0   750.0         abd
2                  Cust=acc hit,Data=125  3.0   400.0     acc hit
3      Cust=abc,Region Info=xyz,Data=126  NaN   200.0         abc
4  Cust=abg nss,Region Info=xaz,Data=127  -1.0  420.0     abg nss
5               Cust=evc,Region Info=atz  2.0     NaN         evc
Ynjxsjmh
  • 28,441
  • 6
  • 34
  • 52
1

Use list comprehension with split by , and then by = for list of dictionaries, so possible pass to DataFrame constructor:

L = [dict([y.split('=') for y in x.split(',')]) for x in df['Id Column']]
df = df.join(pd.DataFrame(L, index=df.index))
print (df)
                               Id Column  Val1   Val2     Cust Region Info  \
0      Cust=abc,Region Info=xyz,Data=123   0.0    NaN      abc         xyz   
1      Cust=abd,Region Info=xyz,Data=124   1.0  750.0      abd         xyz   
2                  Cust=acc hit,Data=125   3.0  400.0  acc hit         NaN   
3      Cust=abc,Region Info=xyz,Data=126   NaN  200.0      abc         xyz   
4  Cust=abg nss,Region Info=xaz,Data=127  -1.0  420.0  abg nss         xaz   
5               Cust=evc,Region Info=atz   2.0    NaN      evc         atz   

  Data  
0  123  
1  124  
2  125  
3  126  
4  127  
5  NaN  
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252