Extract a text out of a column using pattern in python

Question

I'm trying to extract text from a column into another column using a regex pattern in Python, but my pattern misses some of the rows. I also need to keep the strings that are not extracted as they are.

My code is:

import pandas as pd
df = pd.DataFrame({
    'col': ['item1 (30-10)', 'item2 (200-100)', 'item3 (100 FS)', 'item4 (100+)', 'item1 (1000-2000)' ]
})

pattern = r'(\d+(\,[0-9]+)?\-\d+(\,[a-zA-Z])?\d+)'

df['result'] = df['col'].str.extract(pattern)[0]
print(df)

My output is:

                 col     result
0      item1 (30-10)      30-10
1    item2 (200-100)    200-100
2     item3 (100 FS)        NaN
3       item4 (100+)        NaN
4  item1 (1000-2000)  1000-2000

My output should be:

col     result        newcolumn
0       item1         (30-10)
1       item2         (200-100)
2       item3         (100 FS)
3       item4         (100+)
4       item1         (1000-2000)

Check this out https://stackoverflow.com/questions/9989334/create-nice-column-output-in-python — giorgos, Dec 19 '20 at 19:41

Cainã Max Couto-Silva · Answer 1 · 2020-12-19T19:43:17.097

You can use this:

df['newcolumn'] = df.col.str.extract(r'(\(.+\))')
df['result'] = df['col'].str.extract(r'(\w+)')

Output:

                 col    newcolumn result
0      item1 (30-10)      (30-10)  item1
1    item2 (200-100)    (200-100)  item2
2     item3 (100 FS)     (100 FS)  item3
3       item4 (100+)       (100+)  item4
4  item1 (1000-2000)  (1000-2000)  item1

Explanation:

The first expression gets the content within the parentheses (including the parentheses themselves). The second gets the first word.
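A minimal sketch of how the two expressions behave, in case it helps. The row 'item5 (1-2) (3-4)' is a hypothetical input that is not in the question; it is only there to show that .+ is greedy and runs to the last closing parenthesis, while a negated character class stops at the first one.

```python
import pandas as pd

# 'item5 (1-2) (3-4)' is a made-up row, not part of the original data.
df = pd.DataFrame({'col': ['item1 (30-10)', 'item3 (100 FS)', 'item5 (1-2) (3-4)']})

# Greedy: .+ extends to the last ')' in the string.
greedy = df['col'].str.extract(r'(\(.+\))')[0]

# Negated character class: stops at the first ')'.
first_only = df['col'].str.extract(r'(\([^)]*\))')[0]

# First run of word characters (letters, digits, underscore).
first_word = df['col'].str.extract(r'(\w+)')[0]

print(pd.DataFrame({'greedy': greedy, 'first_only': first_only, 'first_word': first_word}))
```

For the data in the question every string has a single pair of parentheses, so both variants give the same result.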

score 1 · Answer 2 · answered Dec 19 '20 at 19:42

You can extract both parts of interest with a single regular expression by putting each in its own capture group. The pattern below matches item\d* as the first group and the parenthesized part, \(.*\), as the second one.

import pandas as pd
df = pd.DataFrame({
    'col': ['item1 (30-10)', 'item2 (200-100)', 'item3 (100 FS)', 'item4 (100+)', 'item1 (1000-2000)' ]
})

pattern = r"(item\d*)\s(\(.*\))"  # raw string, so \d and \s reach the regex engine unchanged

df['items'] = df['col'].str.extract(pattern)[0]
df['result'] = df['col'].str.extract(pattern)[1]

print(df)

Output:

                 col  items      result
0      item1 (30-10)  item1      (30-10)
1    item2 (200-100)  item2    (200-100)
2     item3 (100 FS)  item3     (100 FS)
3       item4 (100+)  item4       (100+)
4  item1 (1000-2000)  item1  (1000-2000)
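As a possible variation on the same idea, the pattern can be run once with named groups, so that .str.extract returns both columns already labeled. This is only a sketch; the group names items and result are taken from the column names used above.

```python
import pandas as pd

df = pd.DataFrame({'col': ['item1 (30-10)', 'item3 (100 FS)', 'item4 (100+)']})

# Named groups become the column names of the DataFrame returned by extract.
pattern = r'(?P<items>item\d*)\s(?P<result>\(.*\))'
parts = df['col'].str.extract(pattern)

# Attach the extracted columns to the original frame (aligned on the index).
df = df.join(parts)
print(df)
```

Rows that do not match the pattern get NaN in both extracted columns.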

buddemat · Accepted Answer · 2021-03-13T09:47:45.310

You can also do this with .str.split in a single line:

df[['result', 'newcolumn']] = df['col'].str.split(' ', n=1, expand=True)

Output:

                 col result    newcolumn
0      item1 (30-10)  item1      (30-10)
1    item2 (200-100)  item2    (200-100)
2     item3 (100 FS)  item3     (100 FS)
3       item4 (100+)  item4       (100+)
4  item1 (1000-2000)  item1  (1000-2000)

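expand=True makes .str.split return a DataFrame with one column per piece, which is what allows the assignment to two columns at once. It also copes with strings that have a non-uniform number of splits by filling the missing pieces with a missing value (see also How to split a dataframe string column into two columns?).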
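A small sketch of the difference, assuming a hypothetical row 'item6' with no space in it (not part of the question's data):

```python
import pandas as pd

# 'item6' is a made-up value used only to show a string with no split point.
s = pd.Series(['item1 (30-10)', 'item3 (100 FS)', 'item6'])

# Without expand: one Series whose values are lists of varying length.
as_lists = s.str.split(' ', n=1)

# With expand=True: a DataFrame with two columns; the missing piece is filled in.
as_columns = s.str.split(' ', n=1, expand=True)

print(as_lists)
print(as_columns)
```

Note that n=1 limits the split to the first space, so '(100 FS)' stays in one piece.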

EDIT: If you want to 'drop' the old column, you can also overwrite it and rename it:

df[['col', 'newcolumn']] = df['col'].str.split(' ', n=1, expand=True)
df = df.rename(columns={"col": "result"})

which gives exactly the output you asked for:

  result    newcolumn
0  item1      (30-10)
1  item2    (200-100)
2  item3     (100 FS)
3  item4       (100+)
4  item1  (1000-2000)
