1

I have a dataframe column which contains product and technical details merged. I just want to split them separately into 2 columns like actual product name in one column and other technical details in one column.

I tried to solve the problem using regex and splitted the technical details separately, but the product name was going null wherever the technical details get splitted. not sure what went wrong.

This is the dataframe I tried
df = pd.DataFrame({'Description': ['WASHER tey DIN6340 10.5 C 35;', 'CABINET EL', 'CYLINDER SCREW', 'M12x N15']})

Code:
df['Technical Data'] = df['Description'].str.extract(r'^.*?(\s\w*\d+\w*\s.*)$')
df['Product Description'] = df['Description'].apply(lambda x: re.sub(r'^.*?(\w*\d+\w*\s.*)$', '', x))

The result I'm getting is enter image description here

So I want the output to be like this

enter image description here

Any suggestions on how to do that??

Pravin
  • 241
  • 2
  • 14

1 Answers1

2

You may capture any zero or more chars as few as possible ("Technical Data" column) and then optional whitespaces followed with an alphanumeric string and then anything till the end of the string (the "Product Description" column):

df[['Technical Data','Product Description']] = df['Description'].str.extract(r'^(.*?)(?:\s*((?:[a-zA-Z]+[0-9]|[0-9]+[a-zA-Z]).*))?$', expand=True)

See the regex demo.

Details:

  • ^ - start of string
  • (.*?) - Group 1: any zero or more chars, other than line break chars, as few as possible
  • (?:\s*((?:[a-zA-Z]+[0-9]|[0-9]+[a-zA-Z]).*))? - an optional group matching
    • \s* - zero or more whitespaces
    • ((?:[a-zA-Z]+[0-9]|[0-9]+[a-zA-Z]).*) - Group 2: either one or more letters and then a digit, or one or more digits and then a letter, and then any zero or more chars, other than line break chars, as few as possible
  • $ - end of string.

If you have Unicode letters, a common [^\W\d_] construct can help (you will need to replace [a-zA-Z] with [^\W\d_]).

Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563