2

Consider this example, which I ran on Python 2.7:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re

tstr = r'''    <div class="thebibliography">
   <p class="bibitem" ><span class="biblabel">
 [1]<span class="bibsp">   </span></span><a
 id="Xtester"></a><span
class="cmcsc-10">A<span
class="small-caps">k</span><span
class="small-caps">e</span><span
class="small-caps">g</span><span
class="small-caps">c</span><span
class="small-caps">t</span><span
class="small-caps">o</span><span
class="small-caps">r</span>,</span>
   <span
class="cmcsc-10">P. D.</span><span
class="cmcsc-10"> H.  </span> testöng ... .  <span
class="cmti-10">Draftin:</span>
   <a
href="http://www.example.com/test.html" class="url" ><span
class="cmitt-10">http://www.example.com/test.html</span></a> (2001).
</p>
   </div>

'''

# remove <a id>
tout2 = re.sub(r'''<a[\s]*?id=['"].*?['"][\s]*?></a>''', " ", tstr, re.DOTALL)
# remove class= in <a
regstr = r'''(<a.*?)(class=['"].*?['"])([\s]*>)'''
print(  re.findall(regstr, tout2, re.DOTALL))             # finds
print("------") #
print(      re.sub(regstr, "AAAAAAA", tout2, re.DOTALL )) # does nothing?

When I run this, the first regex is substituted as expected (the <a id="Xtester"></a> element is gone); then in the output I get:

[('<a\nhref="http://www.example.com/test.html" ', 'class="url"', ' >')]

... which means that the second regex is written correctly (all three parts are found) - but then, when I try to replace all of that snippet with "AAAAAAA" - nothing happens in that part of output:

------
    <div class="thebibliography">
   <p class="bibitem" ><span class="biblabel">
 [1]<span class="bibsp">   </span></span> <span
class="cmcsc-10">A<span
class="small-caps">k</span><span
class="small-caps">e</span><span
class="small-caps">g</span><span
class="small-caps">c</span><span
class="small-caps">t</span><span
class="small-caps">o</span><span
class="small-caps">r</span>,</span>
   <span
class="cmcsc-10">P. D.</span><span
class="cmcsc-10"> H.  </span> testöng ... .  <span
class="cmti-10">Draftin:</span>
   <a
href="http://www.example.com/test.html" class="url" ><span
class="cmitt-10">http://www.example.com/test.html</span></a> (2001).
</p>
   </div>

Clearly, there is no "AAAAAAA" here, as I'd expect.

What is the problem, and what should I do, to get sub to replace the matches that apparently have been found?

alecxe
sdaau
  • Thanks for the comment @Jerry - however, they are the same: first I'm calling `re.findall(regstr, ...` , then I'm calling `re.sub(regstr, ...`; the regex pattern being stored in a string `regstr` (that's why I put it in a variable in the first place). Cheers! – sdaau Jun 30 '14 at 12:54
  • Oh, oops. There were two different `re`s there, and now I see them. – Jerry Jun 30 '14 at 16:33

5 Answers

2

Why not use an HTML parser for parsing and modifying HTML?

Example, using BeautifulSoup and replace_with():

from bs4 import BeautifulSoup

data = """Your html here"""
soup = BeautifulSoup(data, "html.parser")  # name the parser explicitly

for link in soup('a', id=True):
    link.replace_with('AAAAAA')

print(soup.prettify())

This replaces every link that has an id attribute with the text AAAAAA:

<div class="thebibliography">
<p class="bibitem">
<span class="biblabel">
 [1]
 <span class="bibsp">
 </span>
</span>
AAAAAA
<span class="cmcsc-10">
...

alecxe
  • Thanks for that, @alecxe - but now that I've written a regex which I can see works, I'd like to know why I cannot use `re.sub` with the same. Cheers! – sdaau Jun 30 '14 at 13:00
  • @sdaau you are welcome, I understand, just take a look at [this famous thread](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) :) – alecxe Jun 30 '14 at 13:03
1

Your replacement doesn't work because re.sub is called incorrectly. If you look at the documentation, its signature is:

re.sub(pattern, repl, string, count=0, flags=0)

In your code, however, re.DOTALL sits in the position of count, not flags. That is why the flag is ignored: without it, . does not match a newline, so the pattern cannot span the line break between <a and class=.

Since you don't need the count parameter, you can either pass the flag by keyword (flags=re.DOTALL) or drop the argument and use the inline modifier (?s) instead:

regstr = r'''(?s)(<a.*?)(class=['"].*?['"])([\s]*>)'''
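To make the difference concrete, here is a minimal sketch that runs the same pattern three ways. The one-link `html` string is a shortened stand-in for the question's input (my assumption, not the original data), and the warning filter is only there because newer Python versions may warn about passing count positionally.

```python
import re
import warnings

# Shortened stand-in for the question's HTML: the tag spans two lines.
html = '<a\nhref="http://www.example.com/test.html" class="url" >link</a>'
regstr = r'''(<a.*?)(class=['"].*?['"])([\s]*>)'''

# 1) The call from the question: re.DOTALL lands in the count slot,
#    so flags stays 0 and "." cannot cross the newline.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # newer Pythons may warn about positional count
    positional = re.sub(regstr, "AAAAAAA", html, re.DOTALL)

# 2) The flag passed by keyword.
keyword = re.sub(regstr, "AAAAAAA", html, flags=re.DOTALL)

# 3) The inline modifier, with no flags argument at all.
inline = re.sub("(?s)" + regstr, "AAAAAAA", html)

print(positional == html)  # True: nothing was replaced
print(keyword)             # AAAAAAAlink</a>
print(inline)              # AAAAAAAlink</a>
```

Variants 2 and 3 behave the same here; the keyword form is probably easier to spot when reading the call later.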

However, using something like bs4 is probably more convenient (as you can see in @alecxe's answer).

Casimir et Hippolyte
  • Fantastic, many thanks for that @CasimiretHippolyte !! Indeed, I could just have written `flags=re.DOTALL` in the OP code, and it would have worked! It's a shame I ran out of upvotes for today; will make sure to upvote some other time. Thanks again - cheers! – sdaau Jun 30 '14 at 13:35
1

It's quite simple: the Python Standard Library Reference gives the signature of re.sub as re.sub(pattern, repl, string, count=0, flags=0). So your last sub call is in fact equivalent to this (since re.DOTALL == 16):

re.sub(regstr, "AAAAAAA", tout2, count = 16, flags = 0 )

when you need:

re.sub(regstr, "AAAAAAA", tout2, flags = re.DOTALL )

and that last sub works perfectly ...
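If you want to see the count effect directly, a small sketch like the following should do. The 20-character test string is made up for illustration; the only thing taken from the question is the value of re.DOTALL.

```python
import re

# re.DOTALL is an integer-valued flag; its value is 16.
dotall_value = int(re.DOTALL)
print(dotall_value)  # 16

# Used as count, it simply caps the number of replacements at 16.
# This is what the fourth positional argument of re.sub is bound to.
limited = re.sub("x", "y", "x" * 20, count=dotall_value)
print(limited)  # 16 "y" characters followed by 4 untouched "x" characters
```

So the original call did not fail loudly: it was a valid call meaning "replace at most 16 matches, with no flags", and the pattern then found nothing to replace.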

Serge Ballesta
  • Thanks for that @SergeBallesta - indeed! I saw @CasimiretHippolyte's answer first, so I accepted that; will make sure I upvote here too, once I get some more `:)` Cheers! – sdaau Jun 30 '14 at 13:36
1

Problem is - your arguments were wrong.

Python 2.7 source (signature of sub in the re module):

def sub(pattern, repl, string, count=0, flags=0):
    pass  # body omitted

Here, your re.DOTALL argument is being treated as the count argument.

FIX: Use re.sub(regstr, "AAAAAAA", tout2, flags=re.DOTALL ) instead

Note: if you compile your regex with re.compile(regstr, re.DOTALL), the flag is stored in the pattern object, and its sub method works just fine.
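A rough sketch of the compiled variant, again on a shortened one-link stand-in for the question's HTML (the `html` string is my assumption). Note that the sub method of a compiled pattern takes (repl, string, count=0) and has no flags parameter, so the flag has to go into re.compile.

```python
import re

# Shortened stand-in for the question's HTML: the tag spans two lines.
html = '<a\nhref="http://www.example.com/test.html" class="url" >link</a>'
regstr = r'''(<a.*?)(class=['"].*?['"])([\s]*>)'''

# The flag is bound to the pattern object at compile time.
pat = re.compile(regstr, re.DOTALL)

# Keep groups 1 and 3, replace only the class attribute (group 2).
replaced = pat.sub(r'\1AAAAAAA\3', html)
print(replaced)
```

With this input, the class="url" attribute is swapped for AAAAAAA while the rest of the tag, including the line break, is left alone.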

Vinay Bhargav
  • Thanks, @VinayBhargav - indeed, I just got informed of that; and I just posted my finding about compiled pattern a few minutes ago. Cheers! – sdaau Jun 30 '14 at 13:38
0

Well, in this case, apparently I should have used a compiled regex object (instead of calling the re module function directly), and everything seems to work (I can even use backreferences). But I still don't understand why the problem occurred at all; it would be good to learn why eventually. Anyway, this is the corrected code snippet:

# remove <a id>
tout2 = re.sub(r'''<a[\s]*?id=['"].*?['"][\s]*?></a>''', " ", tstr, re.DOTALL)
# remove class= in <a
regstr = r'''(<a.*?)(class=['"].*?['"])([\s]*>)'''
pat = re.compile(regstr, re.DOTALL)
#~ print(  re.findall(regstr, tout2, re.DOTALL))             # finds
print(  pat.findall(tout2))             # finds
print("------") #
# re.purge() # no need
print(      pat.sub(r'\1AAAAAAA\3', tout2))  # replaces; DOTALL is already compiled into pat

... and this is the output:

[('<a\nhref="http://www.example.com/test.html" ', 'class="url"', ' >')]
------
    <div class="thebibliography">
   <p class="bibitem" ><span class="biblabel">
 [1]<span class="bibsp">   </span></span> <span
class="cmcsc-10">A<span
class="small-caps">k</span><span
class="small-caps">e</span><span
class="small-caps">g</span><span
class="small-caps">c</span><span
class="small-caps">t</span><span
class="small-caps">o</span><span
class="small-caps">r</span>,</span>
   <span
class="cmcsc-10">P. D.</span><span
class="cmcsc-10"> H.  </span> testöng ... .  <span
class="cmti-10">Draftin:</span>
   <a
href="http://www.example.com/test.html" AAAAAAA ><span
class="cmitt-10">http://www.example.com/test.html</span></a> (2001).
</p>
   </div>
sdaau