I need to remove all the HTML tags from the data of a given web page. I tried this using regular expressions:

import urllib2
import re
page = urllib2.urlopen("http://www.frugalrules.com")
from bs4 import BeautifulSoup, NavigableString, Comment
soup = BeautifulSoup(page)
link = soup.find('link', type='application/rss+xml')
print link['href']
rss = urllib2.urlopen(link['href']).read()
souprss = BeautifulSoup(rss)
description_tag = souprss.find_all('description')
content_tag = souprss.find_all('content:encoded')
print re.sub('<[^>]*>', '', content_tag)

But the signature of re.sub is:

re.sub(pattern, repl, string, count=0)

So I replaced the print statement above with:

for row in content_tag:
    print re.sub(ur"<[^>]*>",'',row,re.UNICODE)

But it gives the following error:

Traceback (most recent call last):
  File "C:\beautifulsoup4-4.3.2\collocation.py", line 20, in <module>
    print re.sub(ur"<[^>]*>",'',row,re.UNICODE)
  File "C:\Python27\lib\re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or buffer

What am I doing wrong?

Remy
  • Can you find a minimal code example that also fails? For example, remove all non-stdlib dependencies such as `bs4` unless they are crucial. If they are, then add a tag for them. This makes the question easier to answer and more useful. – Ciro Santilli OurBigBook.com Nov 13 '13 at 15:46
  • Have you seen [this answer](http://stackoverflow.com/a/1732454/1663352) – Noelkd Nov 13 '13 at 15:47
  • I know parsing HTML with regex is a sin, but I really couldn't remove the tags any other way. Could you please suggest a working method instead? :) – Remy Nov 13 '13 at 16:08
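
In reply to the request for a method that does not rely on a regex: one option, sketched here under the assumption of Python 3 and only the standard library, is to let an HTML parser collect the text nodes. The names `TextExtractor` and `html_to_text` are made up for this example. Since BeautifulSoup is already in use in the question, calling `get_text()` on each Tag should give a similar result.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: keeps text nodes, ignores tags and attributes."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for the text found between tags
        self.chunks.append(data)


def html_to_text(markup):
    """Return the text content of an HTML string with all tags dropped."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


sample = '<p>Save <a href="x">money</a> &amp; time</p>'
print(html_to_text(sample))
```

Unlike the `<[^>]*>` pattern, a parser does not depend on a `>` never appearing inside an attribute value.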

1 Answer

For the last line of your code, try:

print(re.sub('<[^>]*>', '', str(content_tag)))
Qui
  • Sorry, my code is written for Python 3. For Python 2, try `print re.sub('<[^>]*>', '', str(content_tag))` – Qui Nov 13 '13 at 15:53
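
On why the `TypeError` appears: `re.sub` accepts only strings, while `find_all` returns a list of Tag objects, so each `row` in the loop is a Tag rather than a string. Also, the fourth positional argument of `re.sub` is `count`, so `re.UNICODE` in that position is read as a count of 32 rather than as a flag. Below is a minimal sketch, assuming Python 3 and no third-party packages. `FakeTag` is a hypothetical stand-in for a BeautifulSoup Tag (an object that is not a string but whose `str()` is its markup), and `strip_tags` is a made-up helper name.

```python
import re

# Pass flags by keyword (or at compile time), never as the 4th positional argument
TAG_RE = re.compile(r"<[^>]*>", flags=re.UNICODE)


def strip_tags(markup):
    """Remove anything that looks like <...> after converting the input to text."""
    return TAG_RE.sub("", str(markup))


class FakeTag(object):
    """Hypothetical stand-in for a bs4 Tag: not a string, but str() gives markup."""

    def __init__(self, markup):
        self.markup = markup

    def __str__(self):
        return self.markup


row = FakeTag("<p>Hello <b>world</b></p>")

error = None
try:
    # Same mistake as in the question: a non-string object handed to re.sub
    re.sub(r"<[^>]*>", "", row)
except TypeError as exc:
    error = exc

print(type(error).__name__)
print(strip_tags(row))
```

Applied to the question's code, the same idea would mean converting each `row` to a string inside the loop before calling `re.sub`.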