
I have records being read from a file. Each record is a string of data that I'd like to break into sections. A new section always begins with <xxx>, where xxx is any three alphabetic characters. Each section can be a different length.

Here is a sample snippet of the data:

<AAA>q2w *dc<BBB>12sd<CCC>wer(4rf) q w ddcd<DDD> w erdfWED#2w

Regardless of the pattern I use, I can't get the string to break the way I'd like. I get either the entire string, or just the section identifier (<xxx>) followed by at most one character.

Listed below are a few patterns that I've tried, each with its result immediately following:

matchLn1 = re.findall('(<\w{3}>.*)','<AAA>q2w *dc<BBB>12sd<CCC>wer(4rf) q w ddcd<DDD> w erdfWED#2w')
['<AAA>q2w *dc<BBB>12sd<CCC>wer(4rf) q w ddcd<DDD> w erdfWED#2w']

matchLn1 = re.findall('(<\w{3}>.*?)','<AAA>q2w *dc<BBB>12sd<CCC>wer(4rf) q w ddcd<DDD> w erdfWED#2w')
['<AAA>', '<BBB>', '<CCC>', '<DDD>']

matchLn1 = re.findall('(<\w{3}>.+?)','<AAA>q2w *dc<BBB>12sd<CCC>wer(4rf) q w ddcd<DDD> w erdfWED#2w')
['<AAA>q', '<BBB>1', '<CCC>w', '<DDD> ']

matchLn1 = re.findall('(<\w{3}>.?)','<AAA>q2w *dc<BBB>12sd<CCC>wer(4rf) q w ddcd<DDD> w erdfWED#2w')
['<AAA>q', '<BBB>1', '<CCC>w', '<DDD> ']

I tried a few other patterns as well, but the outcome was always the same. Any and all thoughts would be most welcome.

Thank you.

Willem Van Onsem
Bill Morgan

1 Answer


You can use re.split() like this:

import re
text = '<AAA>q2w *dc<BBB>12sd<CCC>wer(4rf) q w ddcd<DDD> w erdfWED#2w'
# Raw string so that \w reaches the regex engine unchanged
p = re.compile(r"<\w{3}>")
print(p.split(text))

['', 'q2w *dc', '12sd', 'wer(4rf) q w ddcd', ' w erdfWED#2w']
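
The leading empty string appears because the text starts with a separator, so there is nothing before the first match. A minimal sketch of dropping it, assuming empty sections are never wanted (the names `pattern` and `sections` are just illustrative, and `[a-zA-Z]` is used instead of `\w` to accept only letters):

```python
import re

text = '<AAA>q2w *dc<BBB>12sd<CCC>wer(4rf) q w ddcd<DDD> w erdfWED#2w'
pattern = re.compile(r"<[a-zA-Z]{3}>")

# split() yields '' before the first separator; keep only non-empty pieces
sections = [part for part in pattern.split(text) if part]
print(sections)
# ['q2w *dc', '12sd', 'wer(4rf) q w ddcd', ' w erdfWED#2w']
```

Note that this also discards a section whose body is genuinely empty (two identifiers back to back), which may or may not be what you want.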

Updated in response to comments: you can keep the separators as well by wrapping the pattern in a capturing group, like this:

import re
text = '<AAA>q2w *dc<BBB>12sd<CCC>wer(4rf) q w ddcd<DDD> w erdfWED#2w'
p = re.compile(r"(<[a-zA-Z]{3}>)")
print(p.split(text))

Outputs:

['', '<AAA>', 'q2w *dc', '<BBB>', '12sd', '<CCC>', 'wer(4rf) q w ddcd', '<DDD>', ' w erdfWED#2w']
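
If each list entry should hold the section identifier together with its text, one way is to pair up the alternating items of that split result. This is only a sketch: it relies on the pattern having exactly one capturing group, and the names `parts` and `joined` are illustrative.

```python
import re

text = '<AAA>q2w *dc<BBB>12sd<CCC>wer(4rf) q w ddcd<DDD> w erdfWED#2w'
pattern = re.compile(r"(<[a-zA-Z]{3}>)")

parts = pattern.split(text)
# parts[0] is whatever precedes the first identifier (empty here).
# After that, identifiers sit at odd indexes and their text at even indexes.
joined = [tag + body for tag, body in zip(parts[1::2], parts[2::2])]
print(joined)
# ['<AAA>q2w *dc', '<BBB>12sd', '<CCC>wer(4rf) q w ddcd', '<DDD> w erdfWED#2w']
```

Any text before the first identifier is ignored by this pairing; check `parts[0]` separately if a record might not start with `<xxx>`.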
Dan-Dev
  • Add [this](https://stackoverflow.com/questions/3845423/remove-empty-strings-from-a-list-of-strings) logic. Also, put an `r` before the regex string. Nice logic though. You may also want to make `\w` `[a-zA-Z]` or `[a-z]` with `re.I` (I know OP wrote it that way, but they did specify `alphabetic`) – ctwheels Jan 05 '18 at 22:07
  • Thanks Dan, but I'd also like the section identifier in each list entry as well, if that's possible – Bill Morgan Jan 05 '18 at 22:10
  • Thanks again. I guess I can concatenate the appropriate entries to get the section identifier and the associated text into a unique list item. – Bill Morgan Jan 05 '18 at 22:24
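
An alternative that avoids the concatenation step is to stay with re.findall() and give the lazy quantifier something to stop at. In the patterns from the question, nothing follows `.*?` or `.+?`, so the lazy match is satisfied with as few characters as possible. Adding a lookahead for the next identifier (or the end of the string) forces it to extend that far. This is a sketch, not the answer's method; it assumes each record is a single line (`.` does not match a newline by default), and `tag` and `section_re` are hypothetical names.

```python
import re

text = '<AAA>q2w *dc<BBB>12sd<CCC>wer(4rf) q w ddcd<DDD> w erdfWED#2w'

tag = r"<[a-zA-Z]{3}>"
# Identifier, then as little text as possible up to (but not including)
# the next identifier or the end of the string.
section_re = re.compile(tag + r".*?(?=" + tag + r"|$)")

matches = section_re.findall(text)
print(matches)
# ['<AAA>q2w *dc', '<BBB>12sd', '<CCC>wer(4rf) q w ddcd', '<DDD> w erdfWED#2w']
```

The lookahead is zero-width, so the next identifier is not consumed and is still available as the start of the following match.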