regex in string pandas (split)

Question

Hello, I have strings such as:

liste_to_split=['NW_011625257.1_0','scaffold1_3','scaffold3']

and I would like to split them at the underscore between two numbers (Number_Number). I tried:

for i in liste_to_split:
 i.split(r'(?<=[0-9])_')

and I got:

['NW_011625257.1_0']
['scaffold1_3']
['scaffold3']

instead of:

['NW_011625257.1'] ['0']
['scaffold1'] ['3']
['scaffold3']

Does someone know where the issue is?

**A duplicate of https://stackoverflow.com/questions/48919003/pandas-split-on-regex?rq=1 and https://stackoverflow.com/questions/13209288/split-string-based-on-regex** — Wiktor Stribiżew, Oct 28 '20 at 10:47

anubhava · Accepted Answer · 2020-10-28T10:44:02.267

2

You may use:

>>> import re
>>> liste_to_split=['NW_011625257.1_0','scaffold1_3','scaffold3']
>>> 
>>> for i in liste_to_split:
...     re.split(r'(?<=[0-9])_', i)
...
['NW_011625257.1', '0']
['scaffold1', '3']
['scaffold3']

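Note the use of re.split instead of str.split: str.split treats its argument as a literal separator, not as a regular expression, so the pattern never matches. The _ is placed outside the lookbehind assertion so that the match consumes the underscore and we are not splitting on a zero-width match.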
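To make the difference concrete, here is a minimal sketch comparing the two calls on the same input. The variable names `items`, `pattern`, `literal_results` and `regex_results` are chosen for illustration only.

```python
import re

items = ["NW_011625257.1_0", "scaffold1_3", "scaffold3"]
pattern = r"(?<=[0-9])_"

# str.split searches for the literal text "(?<=[0-9])_", which never
# occurs in these strings, so each string comes back unsplit.
literal_results = [s.split(pattern) for s in items]

# re.split interprets the pattern as a regex: an underscore that is
# immediately preceded by a digit.
regex_results = [re.split(pattern, s) for s in items]

print(literal_results)
print(regex_results)
```

The first list reproduces the unsplit strings from the question; the second matches the output shown above.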

Based on OP's comment below, it seems OP wants to apply this split to a dataframe column.

Assuming this is your dataframe:

>>> print (df)
             column
0  NW_011625257.1_0
1       scaffold1_3
2         scaffold3

Then you can use:

>>> print (df['column'].str.split(r'(?<=[0-9])_', expand=True))
                0     1
0  NW_011625257.1     0
1       scaffold1     3
2       scaffold3  None
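For anyone who wants to reproduce this from scratch, a self-contained sketch follows. It assumes pandas is installed and relies on `Series.str.split` treating a multi-character pattern as a regular expression by default; the column names `name` and `suffix` are hypothetical and not part of the original question.

```python
import pandas as pd

# Rebuild the example dataframe from the question.
df = pd.DataFrame({"column": ["NW_011625257.1_0", "scaffold1_3", "scaffold3"]})

# Split on an underscore that follows a digit; expand=True spreads the
# pieces across columns, padding rows that have no match with a missing value.
parts = df["column"].str.split(r"(?<=[0-9])_", expand=True)

# Hypothetical, more descriptive names for the two resulting columns.
parts.columns = ["name", "suffix"]

print(df.join(parts))
```

Rows without a matching underscore, such as `scaffold3`, end up with a missing value in the second column, so downstream code may need `fillna` or `dropna` depending on what it expects.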

edited Oct 28 '20 at 10:44

answered Oct 28 '20 at 10:27

anubhava

And what if it is on a dataframe? Can I still use re? Example: `df['column'].re.split(i.split(r'(?<=[0-9])_')` – chippycentra Oct 28 '20 at 10:29
Since you haven't shown a dataframe, the answer is based on the code provided in the question. But yes, a dataframe also has a split, in a different form. Please update your question so that I can help further. – anubhava Oct 28 '20 at 10:33
You can use: `df['column'].str.split(r'(?<=[0-9])_')` or check updated answer. – anubhava Oct 28 '20 at 10:41
Please reclose the question, it is still a dupe of https://stackoverflow.com/questions/48919003/pandas-split-on-regex?rq=1 and as written now, still a dupe of https://stackoverflow.com/questions/13209288/split-string-based-on-regex – Wiktor Stribiżew Oct 28 '20 at 10:45
I rarely reopen dupe questions, but this was one of them due to the mismatch between the title and the question body. I have already asked OP to edit the question. Once the question is edited I will revisit to check the dupe against this link. – anubhava Oct 28 '20 at 10:50

score 1 · Answer 2 · answered Oct 28 '20 at 10:39

l=['NW_011625257.1_0','scaffold1_3','scaffold3']

for i in l:
  f = i.split('_')
  print(f)

Output:

['NW', '011625257.1', '0']
['scaffold1', '3']
['scaffold3']
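Note that the first result above is also split at the underscore after `NW`, which differs from the output requested in the question. If a regex-free variant is preferred, one possible sketch is to split only at the last underscore with `str.rsplit`. This assumes the numeric suffix always follows the final underscore, which holds for the three examples here but is not guaranteed for other identifiers; the names `items` and `results` are illustrative.

```python
items = ["NW_011625257.1_0", "scaffold1_3", "scaffold3"]

# maxsplit=1 with rsplit splits once, starting from the right-hand side,
# so earlier underscores (such as the one after "NW") are left alone.
results = [s.rsplit("_", 1) for s in items]

for r in results:
    print(r)
```

Unlike the lookbehind pattern, this does not check that a digit precedes the underscore, so a string such as `NW_011625257` would still be split.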
