How to extract strings between two markers for each object of a list in Python

Question

I have a list of strings, and every string contains both markers. I would like to extract the text between the two markers for each string in the list.

Example:

With the markers 'XXX' and 'YYY', I want to extract 78665786 and 6866 from the following list:

['XXX78665786YYYjajk', 'XXX6866YYYz6767'....]

You can apply string slicing: `[s[s.index('XXX') + 3: s.index('YYY')] for s in lst]`. Or using [`re`](https://docs.python.org/3/library/re.html#re.search): `[re.search("XXX(.*)YYY", s).group(1) for s in lst]`. — Olvin Roght, Jul 27 '20 at 10:09
`3` is the length of the `'XXX'` string, so we need to add it to avoid including `XXX` in the result. — Olvin Roght, Jul 27 '20 at 10:15
Yes, `str.index()` returns the index of the **first char of the substring**. — Olvin Roght, Jul 27 '20 at 10:20
I used that code for my example and got the following error: ValueError: substring not found. — derpaminontas_1992, Jul 27 '20 at 10:27
It means that at least one of the strings doesn't contain either `"XXX"` or `"YYY"`. — Olvin Roght, Jul 27 '20 at 10:28
import re
x = ['ATGCCAGCTTATTCAACCTCCGTATAATAGTGCTGTACTAAGCAAATTTATAGTTCTCTAGAAAGTGCCCGCGGTTATTCGGTGCAGTCTGGATCGGAAAG', 'ATGCCAGCTTATTCAACCACAACCACCATCAATGACAACAATCTCCAAGCACACTAGACGATCGCTTTCTGGGGTTATTCGGTGCAGTTAGATCGGAAGAG']
output = []
for item in x:
    output.append(re.search('ATGCCAGCTTATTCAACC(.*)GGTTATTCGGTGCAGTCT', item).group(1))
print(output)
— derpaminontas_1992, Jul 27 '20 at 10:31
Both of my strings contain the two markers 'ATGCCAGC....' and 'GGTTATTC...'. — derpaminontas_1992, Jul 27 '20 at 10:34
No, the error message says that at least one of the strings doesn't match. — Olvin Roght, Jul 27 '20 at 10:58
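
The slicing approach from the comments raises `ValueError` as soon as one string lacks a marker. Below is a minimal sketch of a more forgiving variant, assuming you would rather get `None` for such strings. It uses `str.find()`, which returns -1 instead of raising, and `len(start)` in place of the hard-coded `3`. The helper name `between` and the third sample string are made up for illustration.

```python
def between(s, start, end):
    """Return the text between the first `start` and the next `end`, or None."""
    i = s.find(start)
    if i == -1:
        return None          # start marker is missing
    i += len(start)          # skip past the start marker itself
    j = s.find(end, i)       # search for the end marker only after the start marker
    if j == -1:
        return None          # end marker is missing (or only appears before start)
    return s[i:j]


# The third string is a hypothetical entry without markers.
lst = ['XXX78665786YYYjajk', 'XXX6866YYYz6767', 'no markers here']
result = [between(s, 'XXX', 'YYY') for s in lst]
print(result)
```

Any `None` in `result` points at a string that is missing a marker, which is usually easier to track down than an exception raised in the middle of a list comprehension.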

score 2 · Answer 1 · answered Jul 27 '20 at 10:12


You can loop over your list and grab each substring with a regular expression, for example:

import re

my_list = ['XXX78665786YYYjajk', 'XXX6866YYYz6767']
output = []
for item in my_list:
    output.append(re.search('XXX(.*)YYY', item).group(1))

print(output)

Output:

['78665786', '6866']
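
If some strings might not contain both markers, `re.search()` returns `None` for them and `.group(1)` fails. Here is a hedged sketch of one way to handle that; it is not part of the original answer. The function name `extract_all`, the `found`/`skipped` lists and the third sample string are assumptions for illustration. `re.escape()` only matters if a marker contains regex special characters, and the non-greedy `(.*?)` stops at the first end marker, whereas `(.*)` runs to the last one.

```python
import re


def extract_all(strings, start, end):
    """Split `strings` into extracted substrings and entries that did not match."""
    # Escape the markers so characters such as '.' or '+' are taken literally.
    pattern = re.compile(re.escape(start) + r"(.*?)" + re.escape(end))
    found, skipped = [], []
    for s in strings:
        m = pattern.search(s)
        if m is None:
            skipped.append(s)        # keep the offending string for inspection
        else:
            found.append(m.group(1))
    return found, skipped


# 'XXX12ZZZ' is a hypothetical entry that lacks the end marker.
my_list = ['XXX78665786YYYjajk', 'XXX6866YYYz6767', 'XXX12ZZZ']
found, skipped = extract_all(my_list, 'XXX', 'YYY')
print(found)
print(skipped)
```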

answered Jul 27 '20 at 10:12

Codesidian


I got the following --> 'NoneType' object has no attribute 'group' – derpaminontas_1992 Jul 27 '20 at 10:29
@derpaminontas_1992, it means that the pattern did not match the string, so `re.search()` returned `None`. – Olvin Roght Jul 27 '20 at 10:31
@derpaminontas_1992 Could you post what you've written? As long as your expression and your strings contain the same pattern, it should find your substring. – Codesidian Jul 27 '20 at 10:51
I found the mistake. After the following code, my list contains two strings that don't have forw_primer and rev_primer in them: `match=[p for p in fasta_file if forw_primer and rev_primer in p]`. The problem is that I don't know why I got those two extra sequences that don't contain the markers. – derpaminontas_1992 Jul 27 '20 at 11:54
Use `try` and `except`. Make sure to log exactly what wasn't as expected, and check that against your dataset and wherever you're getting the data from. – Codesidian Jul 28 '20 at 08:09
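
A likely explanation for the extra sequences, offered as a sketch rather than a confirmed diagnosis of the asker's data: Python parses `forw_primer and rev_primer in p` as `forw_primer and (rev_primer in p)`. A non-empty `forw_primer` is always truthy, so only `rev_primer` is actually tested. The sample values below are hypothetical stand-ins for the real primers and FASTA sequences.

```python
# Hypothetical stand-ins for the asker's primers and sequences.
forw_primer = "XXX"
rev_primer = "YYY"
fasta_file = ["XXX123YYYabc", "abcYYYdef", "XXXonly"]

# Original filter: only checks that rev_primer occurs in p.
loose = [p for p in fasta_file if forw_primer and rev_primer in p]

# Each marker needs its own membership test.
strict = [p for p in fasta_file if forw_primer in p and rev_primer in p]

print(loose)
print(strict)
```

Note that the stricter filter still does not check that the forward marker comes before the reverse one, so a pattern search can still fail on strings where the order is reversed.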

Himanshu Jagtap · Answer 2 · 2020-07-27T10:16:03.200

0

import re
l = ['XXX78665786YYYjajk', 'XXX6866YYYz6767']  # .... (rest of the list elided in the question)
l = [re.search(r'XXX(.*)YYY', i).group(1) for i in l]

This should work.

edited Jul 27 '20 at 10:16

answered Jul 27 '20 at 10:15


I got this for my example --> AttributeError: 'NoneType' object has no attribute 'group' – derpaminontas_1992 Jul 27 '20 at 10:37
Can you paste the list you are operating on here? This error results from the absence of the given pattern, i.e. 'XXX{some string}YYY'. – Himanshu Jagtap Jul 27 '20 at 10:58
I found the mistake. After the following code, my list contains two strings that don't have forw_primer and rev_primer in them: `match=[p for p in fasta_file if forw_primer and rev_primer in p]`. The problem is that I don't know why I got those two extra sequences that don't contain the markers. – derpaminontas_1992 Jul 27 '20 at 12:18

mpx · Answer 3 · 2020-07-27T10:20:12.557

0

Another solution would be:

import re
test_string=['XXX78665786YYYjajk','XXX78665783336YYYjajk']
int_val=[int(re.search(r'\d+', x).group()) for x in test_string]

edited Jul 27 '20 at 10:20

answered Jul 27 '20 at 10:17


It's definitely not the best regex pattern you can use for this; there are too many cases where it can fail. – Olvin Roght Jul 27 '20 at 10:18
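
To make that objection concrete, here is a small sketch with two made-up inputs. `\d+` ignores the markers entirely, so it picks up the first run of digits anywhere in the string, and `int()` drops leading zeros. The sample strings are assumptions chosen to show the difference.

```python
import re

# Hypothetical inputs: digits before the start marker, and a value with leading zeros.
samples = ['a12XXX345YYY', 'XXX007YYY']

# First run of digits anywhere in the string, converted to int.
by_digits = [int(re.search(r'\d+', s).group()) for s in samples]

# Text between the markers, kept as a string.
by_markers = [re.search(r'XXX(.*?)YYY', s).group(1) for s in samples]

print(by_digits)
print(by_markers)
```

The digit-only pattern also fails outright when the text between the markers contains no digits at all, because `re.search()` then returns `None`.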

P.Rauser · Answer 4 · 2020-07-27T10:23:57.813

-1

The `split()` method splits a string into separate parts.

list1 = ['XXX78665786YYYjajk', 'XXX6866YYYz6767']
list2 = []

for i in list1:
    d = i.split("XXX")
    for g in d:
        d = g.split("YYY")
        list2.append(d)

print(list2)

The resulting parts are saved into a list.
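
As written, the loop above appends every piece produced by both splits, so `list2` ends up as nested lists that also contain the empty string before `XXX` and the text after `YYY`. Below is a minimal sketch of a splitting-based variant that keeps only the text between the markers. It swaps in `str.partition()`, which splits at the first occurrence only and reports whether the separator was found. The helper name `between_markers` is an assumption for illustration.

```python
def between_markers(s, start, end):
    """Return the text between the first `start` and the following `end`, or None."""
    # partition() returns (before, separator, after); separator is '' if not found.
    _, found_start, rest = s.partition(start)
    middle, found_end, _ = rest.partition(end)
    if not found_start or not found_end:
        return None
    return middle


list1 = ['XXX78665786YYYjajk', 'XXX6866YYYz6767']
list2 = [between_markers(i, "XXX", "YYY") for i in list1]

print(list2)
```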

edited Jul 27 '20 at 10:23

answered Jul 27 '20 at 10:15

