1

So, I am trying to create a simple regex that matches the following string:

<PRE>><A HREF="../cgi-bin/hgTracks?hgsid=160564920&db=hg18&position=chrX:33267175-33267784&hgPcrResult=pack">chrX:33267175-33267784</A> 610bp TGATGTTTGGCGAGGAACTC GCAGAGTTTGAAGAGCTCGG
TGATGTTTGGCGAGGAACTCtactattgttacacttaggaaaataatcta
atccaaaggctttgcatctgtacagaagagcgagtagatactgaaagaga
tttgcagatccactgttttttaggcaggaagaatgctcgttaaatgcaaa
cgctgctctggctcatgtgtttgctccgaggtataggttttgttcgactg
acgtatcagatagtcagagtggttaccacaccgacgttgtagcagctgca
taataaatgactgaaagaatcatgttaggcatgcccacctaacctaactt
gaatcatgcgaaaggggagctgttggaattcaaatagactttctggttcc
cagcagtcggcagtaatagaatgctttcaggaagatgacagaatcaggag
aaagatgctgttttgcactatcttgatttgttacagcagccaacttattg
gcatgatggagtgacaggaaaaacagctggcatggaaggtaggattatta
aagctattacatcattacaaatacaattagaagctggccatgacaaagca
tatgtttgaacaagcagctgttggtagctggggtttgttgCCGAGCTCTT
CAAACTCTGC
</PRE>

I have created the following regex:

<PRE>[.|[\n]]*</PRE>

yet it won't match the string above. Does anyone have a solution to this conundrum, and perhaps an explanation of why this doesn't work?

Sorry about the formatting of this question.

John Kugelman
newToProgramming

3 Answers

2

Stop trying to parse HTML using regexes. You can't do it (robustly). There's a reason there's this famous SO answer. Use lxml instead.
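As a rough illustration of the parser-based approach: lxml is a third-party package, so this sketch swaps in the standard library's html.parser to stay runnable without extra installs. It is not lxml's API. The class name PreTextExtractor and the shortened sample string are made up for the example.

```python
from html.parser import HTMLParser


class PreTextExtractor(HTMLParser):
    """Hypothetical helper: collects the text found inside <pre> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of <pre> elements currently open
        self.chunks = []  # text pieces seen while inside <pre>

    def handle_starttag(self, tag, attrs):
        if tag == "pre":  # HTMLParser reports tag names in lowercase
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


# Shortened, made-up stand-in for the block in the question.
sample = '<PRE>><A HREF="x">chrX:1-2</A> 610bp\nTGATacgt\nCAAA\n</PRE>'

parser = PreTextExtractor()
parser.feed(sample)
parser.close()
pre_text = "".join(parser.chunks)
print(pre_text)
```

The parser strips the A element's markup and keeps only its text, which is the part a regex tends to get wrong once the HTML becomes less regular.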

Community
Hank Gay
1

If you're going to parse HTML, please use lxml, as Hank proposed.

But for this regex to work, you need to change the square brackets to parentheses, giving <PRE>(.|\n)*</PRE>. A | inside square brackets is interpreted as the literal symbol '|' and not as an OR operator.
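A minimal sketch of that difference, assuming Python's re module. The short sample string is a made-up stand-in for the page in the question.

```python
import re

# Made-up stand-in for the multi-line <PRE> block in the question.
sample = "<PRE>line one\nline two\n</PRE>"

# Original pattern: after <PRE> it expects one character from the class
# ('.', '|', '[' or newline), then zero or more literal ']' characters.
original = re.match(r'<PRE>[.|[\n]]*</PRE>', sample)

# With parentheses, '|' is alternation: any non-newline character, or a newline.
fixed = re.match(r'<PRE>(.|\n)*</PRE>', sample)

print(original)                   # None
print(fixed.group(0) == sample)   # True
```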

Another option is to use the flag that's called DOTALL, which makes the dot operator match anything, including a newline. This way the regex becomes very simple:

m = re.match(r'<PRE>(.*)</PRE>', input_string, re.DOTALL)
m.group(1)

This returns the string inside the PRE element, without the <PRE> and </PRE> tags themselves.

Ofri Raviv
0

The issue is that inside []'s the . is a period, not a match-anything dot; the | is a pipe, not an or; and the inner [ is a literal bracket, not the start of a new character class. In other words, the non-backslash special symbols lose their specialness. The first ] then closes the class, so the trailing ]* only matches literal ] characters.
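To see this concretely, here is a small sketch using Python's re module. The test strings are invented for illustration, and the last two checks use only the bracketed part of the question's pattern.

```python
import re

# Inside [...] the dot and the pipe are literal characters.
letter = re.fullmatch(r'[.|]', 'a')   # 'a' is neither '.' nor '|'
dot = re.fullmatch(r'[.|]', '.')
pipe = re.fullmatch(r'[.|]', '|')

# The bracketed part of the question's pattern means:
# one character from the class, then any number of literal ']'.
brackets = re.fullmatch(r'[.|[\n]]*', '[]]')
letters = re.fullmatch(r'[.|[\n]]*', 'ab')

print(letter)                 # None
print(dot is not None)        # True
print(pipe is not None)       # True
print(brackets is not None)   # True
print(letters)                # None
```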

What you will want to do is this:

m = re.search(r'(<PRE>.*</PRE>)', input_string, re.DOTALL)
m.group(1)

.search() will look everywhere in the string for the match (.match() only checks the beginning of the string), and re.DOTALL (or re.S) will have the . match newlines as well.

If you don't want the <PRE> and </PRE> tags included, move the parentheses to surround the .*.
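The difference between the two calls, and the effect of where the parentheses sit, can be sketched as follows. The page string is a made-up example with some markup in front of the <PRE> block.

```python
import re

# Made-up input: text precedes the <PRE> block, as on a full HTML page.
page = "<HTML><BODY>\n<PRE>acgt\nTTGA\n</PRE>\n</BODY></HTML>"

pattern_with_tags = r'(<PRE>.*</PRE>)'
pattern_inner = r'<PRE>(.*)</PRE>'

at_start = re.match(pattern_with_tags, page, re.DOTALL)   # only tries position 0
anywhere = re.search(pattern_with_tags, page, re.DOTALL)  # scans the whole string
inner = re.search(pattern_inner, page, re.S)              # re.S is the short name for re.DOTALL

print(at_start)                  # None
print(repr(anywhere.group(1)))   # '<PRE>acgt\nTTGA\n</PRE>'
print(repr(inner.group(1)))      # 'acgt\nTTGA\n'
```

Note that .* is greedy, so on a page with several PRE blocks it would run to the last </PRE>; .*? is the non-greedy form if only the first block is wanted.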

Ethan Furman