I have an RSS indexing app written in Python that stores the RSS content as a blurb in the DB. When the app initially processed the article contents, it commented out every link that didn't match certain criteria. For example:

<a href="http://google.com">Google</a>

Became:

<!--<a href="http://google.com">Google</a>--> Google

Now I need to process all these old articles and modify the links. So using BeautifulSoup 4 I can easily find the comments using:

import re
from bs4 import Comment

# soup is the BeautifulSoup object for one stored article
links = soup.find_all(text=lambda text: isinstance(text, Comment))
for link in links:
    # strip the tags to get the visible link text
    text = re.sub(r'<[^>]*>', '', link.string)
    # any html in the link tag was escaped by BS4, so need to convert back
    text = text.replace('&lt;', '<')
    text = text.replace('&gt;', '>')
    find = link.string + " " + text

The output of "find" above is:

<!--<a href="http://google.com">Google</a>--> Google

Which makes it easier to perform a .replace() on the content.
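As a side note, the tag stripping and manual entity replacement in the loop above could probably be folded into one helper using the standard library's html.unescape (Python 3.4+). This is only a sketch: visible_text is a hypothetical name, and it assumes the comment body is available as a plain string.

```python
import html
import re

def visible_text(comment_body):
    """Return the text a reader would see for a commented-out link.

    Hypothetical helper: strips the tags, then converts entities such as
    &lt; and &gt; back to literal characters.
    """
    no_tags = re.sub(r'<[^>]*>', '', comment_body)
    return html.unescape(no_tags)
```

Unlike the two replace() calls, html.unescape also converts other entities such as &amp; and &quot;, which may or may not be what is wanted here.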

Now the problem I'm having (and I'm sure this is simple) is multi-line find/replace. When Beautiful Soup initially commented out the links, some were converted to:

<!--<a href="http://google.com">Google
</a>--> Google

or

<!--<a href="http://google.com">Google</a>--> 
Google

So obviously replace(old, new) won't work here, since the search string I build has no line breaks while the stored content sometimes does.

Can someone help me out with a regex multi-line find/replace? It should be case-sensitive.

tshepang
Joe
  • If you don't want to bother (and can do it without altering the content of the pages), the most obvious way to deal with this is to put all the HTML on one line, so that it becomes easy to perform replacements. – michaelmeyer Jul 01 '13 at 21:23
  • Take a look [here](http://stackoverflow.com/a/7173207/1002152) and [here](http://stackoverflow.com/a/587620/1002152) – marlenunez Jul 02 '13 at 00:48

1 Answer

Try this:

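 re.sub(r'pattern', '', link, flags=re.DOTALL)

Regex matching is case-sensitive by default. Note that re.DOTALL is the flag that lets "." match newlines; re.MULTILINE only changes how ^ and $ behave, so on its own it does not help a pattern span several lines.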

If for some reason the RSS file becomes irregular, your script will fail. In that case you should consider using a proper parser, for instance lxml.
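To make the regex suggestion concrete, here is a minimal sketch using only the re module. The question does not say what the links should become, so this assumes the goal is to restore each commented-out link and drop the plain-text copy that follows it. The names COMMENTED_LINK and restore_links are hypothetical, and the pattern is only as robust as the three samples in the question.

```python
import re

# A commented-out anchor. re.DOTALL lets '.' cross line breaks, so a
# newline inside the link text does not prevent the match.
COMMENTED_LINK = re.compile(r'<!--\s*(<a\b[^>]*>(.*?)</a>)\s*-->', re.DOTALL)

def restore_links(html):
    """Uncomment each link and remove the duplicated text after it."""
    pieces = []
    pos = 0
    for match in COMMENTED_LINK.finditer(html):
        anchor, label = match.group(1), match.group(2)
        end = match.end()
        # Words of the visible label, with any nested tags stripped.
        words = [re.escape(w) for w in re.sub(r'<[^>]*>', '', label).split()]
        if words:
            # Allow any whitespace, including newlines, before and
            # between the words of the duplicated label.
            trailing = re.compile(r'\s*' + r'\s+'.join(words))
            tail = trailing.match(html, end)
            if tail:
                end = tail.end()
        pieces.append(html[pos:match.start()])
        pieces.append(anchor)
        pos = end
    pieces.append(html[pos:])
    return ''.join(pieces)
```

Called on an article body, restore_links(content) returns the text with every matched comment replaced by its original anchor. The duplicated label is matched word by word with \s+ in between, which is what makes the line-break variants from the question behave the same as the single-line case.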

adbar