I need to extract the content inside <raw> tags. For example, I need to extract abcd and efgh from this HTML snippet:
<html><body><raw somestuff>abcd</raw><raw somesuff>efgh</raw></body></html>

I used this code in Python:
re.match(r'.*raw.*(.*)/raw.*', DATA)

But this does not return any substring. I'm not good at regex, so a correction to this or a new solution would help me a great deal. I am not supposed to use external libs (due to some restriction in my company).

Cœur
hantan
  • for your enjoyment: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – colinmarc Apr 28 '11 at 06:38
  • not sure how often this question has been asked? –  Apr 28 '11 at 06:54
  • "I am not supposed to use external libs (due to some restriction in my company)" That's a very, very bad idea. Your company is wasting time and money having you trying to reinvent an existing, working, correct, widely-used solution. – S.Lott Apr 28 '11 at 10:25
  • Or the company is a school, and the assignment is a homework ;) – Cyril Duchon-Doris Jan 21 '15 at 13:10

2 Answers

Your company really needs to rethink its policy. Rewriting an XML parser is a complete waste of time; there are already several for Python. Some are included in the stdlib, so if you can import re you should also be allowed to import xml.etree.ElementTree or anything else listed at http://docs.python.org/library/markup.html.

You really should be using one of those. No sense duplicating all of that work.
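
As a rough illustration of the stdlib route, here is a minimal sketch using html.parser.HTMLParser (the Python 3 name of the HTMLParser module from that list). It is used here instead of xml.etree.ElementTree because the snippet's valueless attributes (`<raw somestuff>`) are not well-formed XML. The names RawExtractor and extract_raw are made up for this example, and treating nested <raw> tags as part of the outer element is an assumption.

```python
from html.parser import HTMLParser


class RawExtractor(HTMLParser):
    """Collects the text found inside each top-level <raw> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many <raw> elements are currently open
        self.chunks = []    # text pieces of the <raw> element being read
        self.results = []   # finished text, one entry per top-level <raw>

    def handle_starttag(self, tag, attrs):
        if tag == "raw":
            if self.depth == 0:
                self.chunks = []
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "raw" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.chunks))

    def handle_data(self, data):
        # Keep text only while inside at least one <raw> element.
        if self.depth > 0:
            self.chunks.append(data)


def extract_raw(html_text):
    """Hypothetical helper: return the text of every top-level <raw> element."""
    parser = RawExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.results


DATA = "<html><body><raw somestuff>abcd</raw><raw somesuff>efgh</raw></body></html>"
print(extract_raw(DATA))  # ['abcd', 'efgh']
```

Because the parser tracks nesting depth, `<raw>a<raw>b</raw>c</raw>` comes back as a single entry, 'abc', instead of being cut at the first closing tag.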

jeffcook2150

Non-greedy matching (*?) can do this easily, at least for your example.

re.findall(r'<raw[^>]*?>(.*?)</raw>', DATA)
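
To make the difference concrete, here is a small sketch comparing the pattern from the question with the non-greedy one on the example data. The re.DOTALL part is an addition that is not in the answer above: it only matters if the text inside <raw> can span several lines.

```python
import re

DATA = "<html><body><raw somestuff>abcd</raw><raw somesuff>efgh</raw></body></html>"

# Pattern from the question: the greedy .* in front of the group swallows
# everything up to the last "/raw", so the group itself captures nothing.
greedy = re.match(r'.*raw.*(.*)/raw.*', DATA)
print(repr(greedy.group(1)))  # ''

# Non-greedy version: each (.*?) stops at the first </raw> it can reach.
matches = re.findall(r'<raw[^>]*?>(.*?)</raw>', DATA)
print(matches)  # ['abcd', 'efgh']

# By default "." does not match a newline, so multi-line content is missed
# unless re.DOTALL is passed.
multiline = "<raw>line1\nline2</raw>"
without_flag = re.findall(r'<raw[^>]*?>(.*?)</raw>', multiline)
with_flag = re.findall(r'<raw[^>]*?>(.*?)</raw>', multiline, flags=re.DOTALL)
print(without_flag)  # []
print(with_flag)     # ['line1\nline2']
```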

krousey
  • Sadly, it won't work in general. But it will appear to work for a while. – S.Lott Apr 28 '11 at 10:23
  • That's not a very helpful comment @S.Lott, what exactly are you referring to? – Rob Young Apr 28 '11 at 16:03
  • @Rob Young: A regular expression for a language like XML or HTML can be defeated by deeply nested tags of the same type. So a structure with one tag nested inside another of the same name will defeat any RE. – S.Lott Apr 28 '11 at 16:46
  • @S.Lott I thought I was being cute in the remark, but I guess I should have been more explicit than just noting 'at least for your example.' – krousey Apr 30 '11 at 06:51
  • Not everyone gets the limitations of regular expressions. As you can see by the number of Stack Overflow questions from people trying to parse HTML or XML with RE's. – S.Lott Apr 30 '11 at 11:46
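
The nesting limitation raised in these comments can be reproduced with a short sketch. The nested input string is invented for illustration; the pattern is the one from the answer.

```python
import re

# Hypothetical input: one <raw> element nested inside another.
nested = "<raw>outer <raw>inner</raw> tail</raw>"

# The non-greedy group stops at the first </raw>, which belongs to the
# inner element, so the outer element is cut short and " tail" is lost.
result = re.findall(r'<raw[^>]*?>(.*?)</raw>', nested)
print(result)  # ['outer <raw>inner']
```

The pattern has no way to count how many <raw> tags are open, which is why it pairs the outer opening tag with the inner closing tag.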