
Hello, I have strings such as:

liste_to_split=['NW_011625257.1_0','scaffold1_3','scaffold3']

and I would like to split them at the underscore in each Number_Number sequence. I tried:

for i in liste_to_split:
 i.split(r'(?<=[0-9])_')

and I got

['NW_011625257.1_0']
['scaffold1_3']
['scaffold3']

instead of

['NW_011625257.1'] ['0']
['scaffold1'] ['3']
['scaffold3']

Does anyone know where the issue is?

chippycentra

2 Answers

2

You may use:

>>> import re
>>> liste_to_split=['NW_011625257.1_0','scaffold1_3','scaffold3']
>>> 
>>> for i in liste_to_split:
...     re.split(r'(?<=[0-9])_', i)
...
['NW_011625257.1', '0']
['scaffold1', '3']
['scaffold3']

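Note the use of re.split instead of str.split: str.split treats its argument as a literal separator, not as a regular expression, so the pattern never matches. The _ is placed outside the lookbehind assertion so that the match consumes the underscore, which makes sure we are not splitting on a zero-width match.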
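To make the difference concrete, here is a minimal sketch that runs both calls on the list from the question. The helper name `split_after_digit` and the variable names `literal_results` and `regex_results` are made up for illustration; only the pattern and the input list come from the post.

```python
import re

# Pattern from the answer: an underscore that directly follows a digit.
SPLIT_PATTERN = re.compile(r'(?<=[0-9])_')


def split_after_digit(text):
    """Hypothetical helper: split text at every underscore preceded by a digit."""
    return SPLIT_PATTERN.split(text)


liste_to_split = ['NW_011625257.1_0', 'scaffold1_3', 'scaffold3']

# str.split searches for the pattern text literally, finds nothing,
# and returns each string unchanged inside a one-element list.
literal_results = [s.split(r'(?<=[0-9])_') for s in liste_to_split]

# re.split interprets the same text as a regular expression.
regex_results = [split_after_digit(s) for s in liste_to_split]

print(literal_results)
print(regex_results)
```

The underscore after `NW` is left alone because the character before it is a letter, so the lookbehind fails at that position.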


Based on the OP's comment below, it seems the OP wants to do this splitting on a dataframe column.

Assuming this is your dataframe:

>>> print (df)
             column
0  NW_011625257.1_0
1       scaffold1_3
2         scaffold3

Then you can use:

>>> print (df['column'].str.split(r'(?<=[0-9])_', expand=True))
                0     1
0  NW_011625257.1     0
1       scaffold1     3
2       scaffold3  None
anubhava
  • And what if it is in a dataframe? Can I still use re? For example: ```df['column'].re.split(i.split(r'(?<=[0-9])_') ``` – chippycentra Oct 28 '20 at 10:29
  • Since you haven't shown a dataframe, the answer is based on the code provided in the question. But yes, a dataframe also has a split, in a different form. Please update your question so that I can help further. – anubhava Oct 28 '20 at 10:33
  • You can use: `df['column'].str.split(r'(?<=[0-9])_')` or check updated answer. – anubhava Oct 28 '20 at 10:41
  • Please reclose the question, it is still a dupe of https://stackoverflow.com/questions/48919003/pandas-split-on-regex?rq=1 and as written now, still a dupe of https://stackoverflow.com/questions/13209288/split-string-based-on-regex – Wiktor Stribiżew Oct 28 '20 at 10:45
  • I rarely reopen dupe questions, but this was one of those cases due to a mismatch between the title and the question body. I have already asked the OP to edit the question. Once the question is edited, I will revisit it to check the dupe against this link. – anubhava Oct 28 '20 at 10:50
1
l=['NW_011625257.1_0','scaffold1_3','scaffold3']

for i in l:
  f = i.split('_')
  print(f) 

Output:

['NW', '011625257.1', '0']
['scaffold1', '3']
['scaffold3']
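Note that a plain `split('_')` also cuts the first string at the underscore after `NW`, which differs from the output requested in the question. As a rough alternative sketch (not part of this answer), and assuming the part to separate always follows the last underscore, `str.rsplit` with a `maxsplit` of 1 keeps the prefix intact. The variable names `names` and `parts` are illustrative only.

```python
names = ['NW_011625257.1_0', 'scaffold1_3', 'scaffold3']

# rsplit scans from the right; maxsplit=1 means at most one split,
# so only the last underscore (if any) is used as a separator.
parts = [name.rsplit('_', 1) for name in names]

for p in parts:
    print(p)
```

Unlike the regex version, this does not check that the underscore follows a digit, so a name such as `scaffold_x` would also be split.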