I am hoping someone can help me with a regular expression in Python 3 (3.6.2).

Question

I have records being read from a file. Each record is a string of data that I'd like to break into sections. A new section always begins with <xxx>, where xxx is any three alphabetic characters. Each section can be a different length.

Listed below is a sample snippet of the data:

<AAA>q2w *dc<BBB>12sd<CCC>wer(4rf) q w ddcd<DDD> w erdfWED#2w

Regardless of the pattern I use, I can't get the string to break as I'd like. I either get the entire string, or just the section identifier (<xxx>) and the very next character.

Listed below are a few patterns that I've tried, with the results immediately following:

matchLn1 = re.findall('(<\w{3}>.*)','<AAA>q2w *dc<BBB>12sd<CCC>wer(4rf) q w ddcd<DDD> w erdfWED#2w')
['<AAA>q2w *dc<BBB>12sd<CCC>wer(4rf) q w ddcd<DDD> w erdfWED#2w']

matchLn1 = re.findall('(<\w{3}>.*?)','<AAA>q2w *dc<BBB>12sd<CCC>wer(4rf) q w ddcd<DDD> w erdfWED#2w')
['<AAA>', '<BBB>', '<CCC>', '<DDD>']

matchLn1 = re.findall('(<\w{3}>.+?)','<AAA>q2w *dc<BBB>12sd<CCC>wer(4rf) q w ddcd<DDD> w erdfWED#2w')
['<AAA>q', '<BBB>1', '<CCC>w', '<DDD> ']

matchLn1 = re.findall('(<\w{3}>.?)','<AAA>q2w *dc<BBB>12sd<CCC>wer(4rf) q w ddcd<DDD> w erdfWED#2w')
['<AAA>q', '<BBB>1', '<CCC>w', '<DDD> ']

I tried a few other patterns as well, but the outcome was always the same. Any and all thoughts would be most welcome.

Thank you.

Do you want to *include* the "section separator"? Or not? Please show the *expected* output. – Willem Van Onsem, Jan 05 '18 at 21:58
Sorry, good point. What I'd like (what I hoped for) is: ['q2w *dc', '12sd', 'wer(4rf) q w ddcd', ' w erdfWED#2w'] – Bill Morgan, Jan 05 '18 at 22:11

Dan-Dev · Accepted Answer · 2018-01-05T22:18:41.450

You can use the split() method of a compiled pattern, with the section identifier as the separator, like this:

import re
text = '<AAA>q2w *dc<BBB>12sd<CCC>wer(4rf) q w ddcd<DDD> w erdfWED#2w'
p = re.compile(r"<\w{3}>")  # raw string so \w reaches the regex engine unchanged
print(p.split(text))

['', 'q2w *dc', '12sd', 'wer(4rf) q w ddcd', ' w erdfWED#2w']
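
The leading '' appears because the text starts with a section identifier, so split() reports an empty piece in front of it. A minimal sketch of dropping it follows; the names `pattern` and `sections` are only illustrative, and `[a-zA-Z]` is used instead of `\w` on the assumption that identifiers are strictly alphabetic, as the question states.

```python
import re

text = '<AAA>q2w *dc<BBB>12sd<CCC>wer(4rf) q w ddcd<DDD> w erdfWED#2w'

# [a-zA-Z] restricts identifiers to letters; \w would also accept digits and '_'.
pattern = re.compile(r"<[a-zA-Z]{3}>")

# Keep only the non-empty pieces produced by split().
sections = [part for part in pattern.split(text) if part]
print(sections)
# ['q2w *dc', '12sd', 'wer(4rf) q w ddcd', ' w erdfWED#2w']
```

Note that filtering on truthiness would also drop a genuinely empty section (two identifiers back to back). If every record is known to start with an identifier, slicing with `pattern.split(text)[1:]` removes only the leading empty string.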

Updated in response to comments: You can capture the separators as well like this:

import re
text = '<AAA>q2w *dc<BBB>12sd<CCC>wer(4rf) q w ddcd<DDD> w erdfWED#2w'
p = re.compile(r"(<[a-zA-Z]{3}>)")
print(p.split(text))

Outputs:

['', '<AAA>', 'q2w *dc', '<BBB>', '12sd', '<CCC>', 'wer(4rf) q w ddcd', '<DDD>', ' w erdfWED#2w']
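
If each list entry should hold the identifier together with its text, the pieces above can be paired up, or findall() can be used with a lookahead. The lazy patterns in the question stop early because nothing after `.*?` forces it to expand; a lookahead for the next identifier (or the end of the string) supplies that stop condition. This is only a sketch, and the names `tag`, `paired` and `found` are illustrative.

```python
import re

text = '<AAA>q2w *dc<BBB>12sd<CCC>wer(4rf) q w ddcd<DDD> w erdfWED#2w'
tag = r"<[a-zA-Z]{3}>"

# Option 1: split with a capturing group, then join each identifier to its text.
parts = re.split("(" + tag + ")", text)
# parts[0] is whatever precedes the first identifier ('' here); after that,
# identifiers sit at odd indexes and their text at the following even indexes.
paired = [ident + body for ident, body in zip(parts[1::2], parts[2::2])]

# Option 2: match an identifier, then lazily consume text until the lookahead
# sees the next identifier or the end of the string.
found = re.findall(tag + r".*?(?=" + tag + r"|$)", text)

print(paired)
print(found)
# Both print:
# ['<AAA>q2w *dc', '<BBB>12sd', '<CCC>wer(4rf) q w ddcd', '<DDD> w erdfWED#2w']
```

Both options assume a section contains no newline, since `.` does not match one unless re.DOTALL is passed.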

edited Jan 05 '18 at 22:18

answered Jan 05 '18 at 22:06


Add [this](https://stackoverflow.com/questions/3845423/remove-empty-strings-from-a-list-of-strings) logic. Also, put an `r` before the regex string. Nice logic though. You may also want to make `\w` `[a-zA-Z]` or `[a-z]` with `re.I` (I know OP wrote it that way, but they did specify `alphabetic`) – ctwheels Jan 05 '18 at 22:07
Thanks Dan, but I'd also like the section identifier in each list entry as well, if that's possible. – Bill Morgan Jan 05 '18 at 22:10
Thanks again. I guess I can concatenate the appropriate entries to get the section identifier and the associated text into a unique list item. – Bill Morgan Jan 05 '18 at 22:24
