1

I'm trying to extract a text out of a column so I can move to another column using a pattern in python but I miss some results at the same time I need to keep the unextracted strings as they are>

My code is:

import pandas as pd
df = pd.DataFrame({
    'col': ['item1 (30-10)', 'item2 (200-100)', 'item3 (100 FS)', 'item4 (100+)', 'item1 (1000-2000)' ]
})

pattern = r'(\d+(\,[0-9]+)?\-\d+(\,[a-zA-Z])?\d+)'

df['result'] = df['col'].str.extract(pattern)[0]
print(df)

My output is:

col     result
0     item1 (30-10)      30-10
1   item2 (200-100)    200-100
2    item3 (100 FS)        NaN
3      item4 (100+)        NaN
4  item1 (1000-2000)  1000-2000

My output should be:

col     result        newcolumn
0       item1         (30-10)
1       item2         (200-100)
2       item3         (100 FS)
3       item4         (100+)
4       item1         (1000-2000)
Cainã Max Couto-Silva
  • 4,839
  • 1
  • 11
  • 35
Khan
  • 39
  • 6

3 Answers3

2

You can use this:

df['newcolumn'] = df.col.str.extract(r'(\(.+\))')
df['result'] = df['col'].str.extract(r'(\w+)')

Output:

                 col    newcolumn result
0      item1 (30-10)      (30-10)  item1
1    item2 (200-100)    (200-100)  item2
2     item3 (100 FS)     (100 FS)  item3
3       item4 (100+)       (100+)  item4
4  item1 (1000-2000)  (1000-2000)  item1

Explanation:

The first expression gets the content within parenthesis (including the parenthesis themselves). The second gets the first word.

Cainã Max Couto-Silva
  • 4,839
  • 1
  • 11
  • 35
1

You can extract the parts of interest by grouping them within one regular expression. The regex pattern now matches item\d as first group and anything inside the brackets with \(.*\) as the second one.

import pandas as pd
df = pd.DataFrame({
    'col': ['item1 (30-10)', 'item2 (200-100)', 'item3 (100 FS)', 'item4 (100+)', 'item1 (1000-2000)' ]
})

pattern = "(item\d*)\s(\(.*\))"

df['items'] = df['col'].str.extract(pattern)[0]
df['result'] = df['col'].str.extract(pattern)[1]

print(df)

Output:

                 col  items      result
0      item1 (30-10)  item1      (30-10)
1    item2 (200-100)  item2    (200-100)
2     item3 (100 FS)  item3     (100 FS)
3       item4 (100+)  item4       (100+)
4  item1 (1000-2000)  item1  (1000-2000)
Mig B
  • 637
  • 1
  • 11
  • 19
1

You can also do this with .str.split in a single line:

 df[['result', 'newcolumn']] = df['col'].str.split(' ', 1, expand=True)

Output:

                 col result    newcolumn
0      item1 (30-10)  item1      (30-10)
1    item2 (200-100)  item2    (200-100)
2     item3 (100 FS)  item3     (100 FS)
3       item4 (100+)  item4       (100+)
4  item1 (1000-2000)  item1  (1000-2000)

You must use expand=True if your strings have a non-uniform number of splits (see also How to split a dataframe string column into two columns?).

EDIT: If you want to 'drop' the old column, you can also overwrite it and rename it:

 df[['col', 'newcolumn']] = df['col'].str.split(' ', 1, expand=True)
 df = df.rename(columns={"col": "result"})

which exactly gives you the result you specified was intended:

  result    newcolumn
0  item1      (30-10)
1  item2    (200-100)
2  item3     (100 FS)
3  item4       (100+)
4  item1  (1000-2000)
buddemat
  • 4,552
  • 14
  • 29
  • 49