
I am doing simple regular expressions in Python.

I am trying re.split, but I get things like ['\r\n', '\r\n'] instead of the answer. Can someone please tell me how to display the actual text?

I tried this statement:

t_html = re.split("<[a-zA-Z0-9\s\w\W]*>[a-zA-Z0-9\s\w\W]*</[a-zA-Z0-9\s\w\W]*>" ,s)

Thanks

SilentGhost
Lilz
  • 5
    uh, please post the regular expression you *tried* to use. – kenm Dec 02 '09 at 23:35
  • I am trying to get all the html tags and their contents...for example if I had this: "helloasfasdf" it would split it up as hello and asfasdf – Lilz Dec 02 '09 at 23:43
  • 2
    Don't use regex to parse HTML. Use Beautiful Soup: www.crummy.com/software/BeautifulSoup – John La Rooy Dec 02 '09 at 23:44
  • 2
    Consider what happens with real html where the tags are nested.
    some stuff
    more stuff
    still more stuff
    – John La Rooy Dec 02 '09 at 23:47
  • 2
    gnibbler is right. Use Beautiful Soup to parse HTML. Do not, repeat, do not attempt to use regular expressions to parse HTML. – steveha Dec 03 '09 at 00:49

2 Answers


re.split, by its very nature, splits on the pattern and does not preserve it. If you want the result to include the text matched by the pattern, put parentheses around the pattern so that it becomes a capturing group: re.split("(R)", string), where R is your expression. If you instead want to find all non-overlapping matches, use re.findall, which returns a list. See here for more details and options.
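A minimal sketch of the difference, assuming a made-up sample string s and a simplified, non-greedy pattern (neither is from the question, and the pattern will not cope with nested tags):

```python
import re

# Hypothetical sample input; the question does not show the real string.
s = "<a>hello</a>\r\n<b>asfasdf</b>\r\n"

# Simplified pattern for one non-nested element (an assumption for this sketch).
pattern = r"<\w+>.*?</\w+>"

# No capturing group: the matched text is dropped, only the pieces
# between matches come back.
without_group = re.split(pattern, s)

# Whole pattern wrapped in parentheses: each match is kept in the result.
with_group = re.split("(" + pattern + ")", s)

# findall returns just the non-overlapping matches.
matches = re.findall(pattern, s)

print(without_group)  # ['', '\r\n', '\r\n']
print(with_group)     # ['', '<a>hello</a>', '\r\n', '<b>asfasdf</b>', '\r\n']
print(matches)        # ['<a>hello</a>', '<b>asfasdf</b>']
```

The first result shows where output like ['\r\n', '\r\n'] comes from: it is the leftover text between the matches, not the matches themselves.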

fridder

If you want to use a regex to parse html, see here.

Community
Matt Anderson