html6="""
<p<ins style="background:#e6ffe6;">re><code</ins>>
int aint bint c<ins style="background:#e6ffe6;"></code></ins></p<ins style="background:#e6ffe6;">re</ins>><p>int d</p>
"""

html6 and html7 are the same, except that html7 has additional "\n" characters:

html7="""
<p<ins style="background:#e6ffe6;">re><code</ins>>int a
int b
int c<ins style="background:#e6ffe6;">
</code></ins></p<ins style="background:#e6ffe6;">re</ins>>
<p>int d</p>
"""

p_to_pre_code_pattern = re.compile(
"""<p
<(?P<action_tag>(del|ins)) (?P<action_attr>.*)>re><code</(?P=action_tag)>
>
(?P<text>.*?)
<(?P=action_tag) (?P=action_attr)>
</code></(?P=action_tag)>
</p
<(?P=action_tag) (?P=action_attr)>re</(?P=action_tag)>
>""",re.VERBOSE)


print re.match(p_to_pre_code_pattern,html6)    
print re.match(p_to_pre_code_pattern,html7)

Neither html6 nor html7 matches. But if I replace "\n" with "", both match:

print re.match(p_to_pre_code_pattern,html6.replace("\n",""))    
print re.match(p_to_pre_code_pattern,html7.replace("\n",""))

How should I change p_to_pre_code_pattern so that it matches both html6 and html7 without calling replace("\n", "")?

Alan Moore
jianjun
  • I'm not too up-to-date with web stuff but would `beautiful soup` not be the tool for this? – Jeff Mar 02 '12 at 16:36
  • You need to add whitespace to the pattern: [This answer](http://stackoverflow.com/questions/4590298/how-to-ignore-whitespace-in-a-regular-expression-subject-string) seems relevant. – ChrisP Mar 02 '12 at 16:47
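
To illustrate the comment above about adding whitespace to the pattern, here is a minimal sketch of how re.VERBOSE treats whitespace. The names `verbose_plain` and `verbose_ws` are made up for this example. Under re.VERBOSE, whitespace written in the pattern is dropped, so a line break in the pattern does not match a line break in the subject; an explicit `\s*` is needed for that.

```python
import re

# Under re.VERBOSE the line break inside this pattern is ignored,
# so the pattern is equivalent to "ab".
verbose_plain = re.compile(r"""a
b""", re.VERBOSE)

# An explicit \s* survives re.VERBOSE and absorbs optional whitespace,
# including line breaks, in the subject string.
verbose_ws = re.compile(r"""a \s* b""", re.VERBOSE)

print(verbose_plain.match("ab") is not None)    # subject without a line break
print(verbose_plain.match("a\nb") is not None)  # subject with a line break
print(verbose_ws.match("a\nb") is not None)     # same subject, tolerant pattern
```

This would explain why the original pattern only matches after replace("\n", ""): the line breaks in the pattern are layout only and never consume the "\n" characters in the HTML.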

1 Answer

Maybe you are missing the re.DOTALL flag when calling re.compile, i.e. re.compile(..., re.VERBOSE|re.DOTALL).

re.S 
re.DOTALL 

Make the '.' special character match any character at all, including a newline;
without this flag, '.' will match anything except a newline.
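
As a sketch of how this could be applied to the question's data: re.DOTALL only changes what '.' matches, so the literal parts of the pattern still need explicit `\s*` wherever the samples contain a line break (at the very start, and between the closing `<ins ...>` and `</code>` in html7). The pattern below is a hypothetical variant, not the asker's original; the name `tolerant_pattern` and the use of `[^>]*` for the attribute group are assumptions made for this example.

```python
import re

html6 = """
<p<ins style="background:#e6ffe6;">re><code</ins>>
int aint bint c<ins style="background:#e6ffe6;"></code></ins></p<ins style="background:#e6ffe6;">re</ins>><p>int d</p>
"""

html7 = """
<p<ins style="background:#e6ffe6;">re><code</ins>>int a
int b
int c<ins style="background:#e6ffe6;">
</code></ins></p<ins style="background:#e6ffe6;">re</ins>>
<p>int d</p>
"""

# Hypothetical variant of p_to_pre_code_pattern (an assumption, not the original).
tolerant_pattern = re.compile(r"""
    \s*                                   # skip the leading line break
    <p
    <(?P<action_tag>del|ins)\s+(?P<action_attr>[^>]*)>re><code</(?P=action_tag)>
    >
    (?P<text>.*?)                         # DOTALL lets this span several lines
    <(?P=action_tag)\s+(?P=action_attr)>
    \s*                                   # optional line break before the closing code tag
    </code></(?P=action_tag)>
    \s*
    </p
    <(?P=action_tag)\s+(?P=action_attr)>re</(?P=action_tag)>
    >
""", re.VERBOSE | re.DOTALL)

match6 = tolerant_pattern.match(html6)
match7 = tolerant_pattern.match(html7)

# Show the captured text for both samples without replace("\n", "").
print(repr(match6.group("text")))
print(repr(match7.group("text")))
```

Note that the space between the tag name and the attributes is written as `\s+` here, because a plain space in the pattern is ignored under re.VERBOSE.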
kev