Regex in Python

Question

So, I am trying to create a simple regex that matches the following string:

<PRE>><A HREF="../cgi-bin/hgTracks?hgsid=160564920&db=hg18&position=chrX:33267175-33267784&hgPcrResult=pack">chrX:33267175-33267784</A> 610bp TGATGTTTGGCGAGGAACTC GCAGAGTTTGAAGAGCTCGG
TGATGTTTGGCGAGGAACTCtactattgttacacttaggaaaataatcta
atccaaaggctttgcatctgtacagaagagcgagtagatactgaaagaga
tttgcagatccactgttttttaggcaggaagaatgctcgttaaatgcaaa
cgctgctctggctcatgtgtttgctccgaggtataggttttgttcgactg
acgtatcagatagtcagagtggttaccacaccgacgttgtagcagctgca
taataaatgactgaaagaatcatgttaggcatgcccacctaacctaactt
gaatcatgcgaaaggggagctgttggaattcaaatagactttctggttcc
cagcagtcggcagtaatagaatgctttcaggaagatgacagaatcaggag
aaagatgctgttttgcactatcttgatttgttacagcagccaacttattg
gcatgatggagtgacaggaaaaacagctggcatggaaggtaggattatta
aagctattacatcattacaaatacaattagaagctggccatgacaaagca
tatgtttgaacaagcagctgttggtagctggggtttgttgCCGAGCTCTT
CAAACTCTGC
</PRE>

I have created the following regex:

<PRE>[.|[\n]]*</PRE>

yet it won't match the string above. Does anyone have a solution to this conundrum, and perhaps an explanation of why this doesn't work?

Sorry about the formatting of this question.

Are you trying to just match that exact string type, or do you want to pull pieces of the string out? — ABach, Jun 02 '10 at 19:05
You have newlines in your string, so don't you need some "match across multiple lines" flag? — , Jun 02 '10 at 19:06
What don't you understand? He just wants the string between and including the <PRE> tags — Alex Gordon, Jun 02 '10 at 19:07
Were you attempting to match the string with the <PRE> tag in it, or was that only meant to be used for formatting? — John Rasch, Jun 02 '10 at 19:08
I tried to reformat this as best I could... The original was really confusingly formatted. Hope I didn't destroy the original meaning. — John Kugelman, Jun 02 '10 at 19:10
@every_answer: is there really a need to be so snarky? I was clarifying the OP's question; that doesn't make me an idiot. — ABach, Jun 02 '10 at 19:11
So here is the clarification: I need it to match the string that includes the <PRE> and ends with the </PRE>. Doesn't my regex expression make sense? — newToProgramming, Jun 02 '10 at 19:17
there is random junk before the pre and after the pre that isn't important — newToProgramming, Jun 02 '10 at 19:18

score 2 · Answer 1 · edited May 23 '17 at 12:01

Stop trying to parse HTML using regexes. You can't do it (robustly). There's a reason there's this famous SO answer. Use lxml instead.
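
As a rough illustration of the parser-based approach, here is a minimal sketch. It deliberately uses the standard library's html.parser instead of lxml so that it runs without extra installs; the names PreTextExtractor and extract_pre_text, and the shortened sample string, are made up for this example.

```python
from html.parser import HTMLParser


class PreTextExtractor(HTMLParser):
    """Collect the text found inside <pre> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # greater than 0 while inside a <pre> element
        self.chunks = []  # text pieces seen inside <pre>

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag names in lowercase, so <PRE> arrives as "pre".
        if tag == "pre":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def extract_pre_text(html):
    """Return the concatenated text of all <pre> elements in html."""
    parser = PreTextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.chunks)


# Shortened stand-in for the page in the question, with junk on both sides.
sample = 'junk before<PRE><A HREF="x?a=1">chrX:1-2</A> 610bp\nTGAT\ncaaa\n</PRE>junk after'
pre_text = extract_pre_text(sample)
print(repr(pre_text))
```

Unlike the regex, this returns only the text content, so the <A> markup inside the <PRE> is dropped.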

answered Jun 02 '10 at 19:08

Hank Gay

score 1 · Answer 2 · answered Jun 02 '10 at 22:38

If you're going to parse HTML, please use lxml, as Hank proposed.

But for this regex to work, you need to change the [] to (), giving (.|\n)*. A | inside square brackets is interpreted as the literal character '|', not as the OR operator.
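
A small sketch of the difference, assuming a toy two-line string in place of the real page (the variable names are only for illustration):

```python
import re

text = "ab\ncd"

# Inside [...] the '.', '|' and '[' are literal characters, and the first ']'
# closes the class. The pattern is the class [.|[\n] followed by ']*'.
broken = re.compile(r'[.|[\n]]*')

# With parentheses, '.' is a wildcard again and '|' means "or".
fixed = re.compile(r'(.|\n)*')

print(broken.fullmatch(text))          # no match: 'a' is not in the class
print(broken.fullmatch("|]]"))         # matches: one '|' followed by ']' characters
print(fixed.fullmatch(text).group(0))  # the whole two-line string
```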

Another option is to use the flag that's called DOTALL, which makes the dot operator match anything, including a newline. This way the regex becomes very simple:

m = re.match(r'<PRE>(.*)</PRE>', input_string, re.DOTALL)
m.group(1)

outputs the string inside the PRE, without the <PRE> and </PRE> tags themselves.
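
To see what the flag changes, here is a short sketch with a made-up input_string that starts directly at the <PRE> tag:

```python
import re

# Hypothetical input that begins with the tag, so re.match can be used.
input_string = "<PRE>first line\nsecond line\n</PRE>"

# Without the flag, '.' stops at the first newline, so </PRE> is never reached.
without_flag = re.match(r'<PRE>(.*)</PRE>', input_string)

# With re.DOTALL, '.' also matches '\n' and the group spans both lines.
with_flag = re.match(r'<PRE>(.*)</PRE>', input_string, re.DOTALL)

print(without_flag)
print(repr(with_flag.group(1)))
```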

score 0 · Answer 3 · answered Sep 07 '11 at 12:53

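The issue is that inside []'s the . is a literal period, not a match-anything dot; the | is a literal pipe, not an or; and an inner [ is a literal bracket, not the start of a new character class. In other words, most of the non-backslash special symbols lose their specialness. The first ] still closes the class, so your pattern is really the class [.|[\n] followed by ]*.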

What you will want to do is this:

m = re.search(r'(<PRE>.*</PRE>)', input_string, re.DOTALL)
m.group(1)

.search() will look everywhere in the string for the match (.match() only checks the beginning of the string), and re.DOTALL (or re.S) will have the . match newlines as well.

If you don't want the <PRE> and </PRE> tags included, move the parentheses to surround the .*.
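
A quick sketch of both points, assuming a made-up page string with junk before and after the block, as described in the comments on the question:

```python
import re

# Hypothetical page: unrelated text on both sides of the <PRE> block.
page = "junk before\n<PRE>line one\nline two\n</PRE>\njunk after"

# .match() only tries at position 0, where the junk is, so it fails.
at_start = re.match(r'(<PRE>.*</PRE>)', page, re.DOTALL)

# .search() scans forward until it finds the block.
with_tags = re.search(r'(<PRE>.*</PRE>)', page, re.DOTALL).group(1)

# Parentheses around .* only: the tags are matched but not captured.
inner_only = re.search(r'<PRE>(.*)</PRE>', page, re.S).group(1)

print(at_start)
print(repr(with_tags))
print(repr(inner_only))
```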
