
I want to parse a web page and find specific patterns using regex in Python.

My example page has:

<input type="checkbox" name="some name....">
<input type="text", name="somemore name...">
<input type="radio" name="other name...">

I want to find the name values of all radio and checkbox inputs. Separately, these patterns work:

<input type="checkbox" name="(.*?)".*?>
<input type="radio" name="(.*?)".*?>

But I cannot figure out how to combine these two regexes into a single one.

EDIT: This question may drift in other directions, but it is better for me to explain what I want to do and ask whether regex is really suitable for it.

I must query a subscriber and get some basic info about the subscriber, plus a list of the subscriber's available loans and charges. The related module has many scripts that do that kind of job with regex. I also use SGMLParser for some parts of my code, but I sometimes see SGMLParser fail to parse HTML (I did not dig into why, but the basic reason is unexpected character type errors). So I must be sure that I can either handle all types of HTML, or keep doing this with regex.

CONCLUSION: The best choice is to use HTMLParser, and using regex is simply a very bad idea. That is what I take from this question. But since the question itself is more about regex matching than about using regex on HTML, I decided to accept the answer about regex.

Mp0int
  • 1
    possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – outis Nov 21 '11 at 10:30
  • Just in case you decide to use an XML parser, try the `xml.dom.minidom` module, especially the `getElementsByTagName` function and the `attributes` attribute or `Attrs` method. – heltonbiker Nov 21 '11 at 10:47

3 Answers

4
<input type="(checkbox|radio)" name="(?P<name>.*?)".*?>

I've also put a capture group name in there for ease of extraction.
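As a minimal sketch of how the named group might be used, the snippet below runs the combined pattern over the sample markup from the question. The variable names (`html`, `pattern`, `names`) are made up for illustration, and it assumes the attributes always appear in exactly this order.

```python
import re

# Sample markup copied from the question.
html = '''
<input type="checkbox" name="some name....">
<input type="text", name="somemore name...">
<input type="radio" name="other name...">
'''

# Alternation picks checkbox or radio; the named group captures the name value.
pattern = re.compile(r'<input type="(checkbox|radio)" name="(?P<name>.*?)".*?>')

# finditer yields match objects, so the group can be read by its name.
names = [m.group("name") for m in pattern.finditer(html)]
print(names)  # ['some name....', 'other name...']
```

The text input is skipped because `text` is not one of the alternatives.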

But the old rule applies: don't use regex for parsing HTML. It's very fragile. What if the markup you are parsing changed to <input class="aha" type="checkbox" name="some name...."> overnight? The pattern above would stop matching. Use the HTMLParser class or BeautifulSoup.

http://docs.python.org/library/htmlparser.html

http://www.crummy.com/software/BeautifulSoup/
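For comparison, here is a rough sketch of the HTMLParser route, assuming Python 3, where the module lives at `html.parser` (the link above points at the older Python 2 module). The class name `InputNameCollector` and the sample markup are invented for illustration. Attribute order and extra attributes no longer matter here.

```python
from html.parser import HTMLParser


class InputNameCollector(HTMLParser):
    """Collects the name attribute of checkbox and radio inputs."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        # The parser passes tag and attribute names in lower case.
        if tag != "input":
            return
        attributes = dict(attrs)
        input_type = (attributes.get("type") or "").lower()
        name = attributes.get("name")
        if input_type in ("checkbox", "radio") and name is not None:
            self.names.append(name)


# Hypothetical markup with reordered and extra attributes.
sample = '''
<input class="aha" type="checkbox" name="some name....">
<input type="text" name="somemore name...">
<input name="other name..." type="radio">
'''

collector = InputNameCollector()
collector.feed(sample)
collector.close()
print(collector.names)  # ['some name....', 'other name...']
```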

Shawn Chin
Joe
2

This?

<input type="(?:checkbox|radio)" name="(.*?)".*?>

While this works, it is not very robust: it only matches when type comes directly after <input and name directly after type.
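A small sketch of what the non-capturing group `(?:...)` changes in practice, and of the fragility: with `re.findall`, a single capturing group gives a flat list of names, while two capturing groups give tuples. The variable names and the reordered sample tag are only illustrative.

```python
import re

# Sample markup copied from the question.
html = '''
<input type="checkbox" name="some name....">
<input type="text", name="somemore name...">
<input type="radio" name="other name...">
'''

capturing = r'<input type="(checkbox|radio)" name="(.*?)".*?>'
non_capturing = r'<input type="(?:checkbox|radio)" name="(.*?)".*?>'

# Two capturing groups: findall returns (type, name) tuples.
pairs = re.findall(capturing, html)
# One capturing group: findall returns just the names.
names_only = re.findall(non_capturing, html)

print(pairs)       # [('checkbox', 'some name....'), ('radio', 'other name...')]
print(names_only)  # ['some name....', 'other name...']

# An extra attribute before type is enough to break the match.
reordered = '<input class="aha" type="checkbox" name="some name....">'
print(re.findall(non_capturing, reordered))  # []
```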

FailedDev
2

You should never process HTML with regex; there are plenty of threads here showing you why. Maybe you can check out this previous SO thread in which various HTML parsers for Python are discussed.

Community
npinti
  • Thank you, but what I need is more complex than what I wrote here. For that and some other reasons, my best choice is using regex. – Mp0int Nov 21 '11 at 10:37
  • 1
    Trust me. Trust us. Your need is almost certainly not unique. If it is, give us more details in the question (otherwise it may be closed as a duplicate) – Joe Nov 21 '11 at 10:42
  • Maybe it is quite narrow-minded, but all the work has been done that way until today :D Maybe that is why I do not want to change the basic structure of the system. What I need to do is query some __subscriber id__ and get some subscriber info and a list of his loans... – Mp0int Nov 21 '11 at 10:50
  • Yes but where is the data coming from? If this is getting data from internal software, you are adding brittleness to the product and incurring technical debt for the company. If it is getting data from outside you are compromising business continuity and opening yourself up to trouble down the road. – Joe Nov 21 '11 at 11:29
  • Data comes from outside of our system. Something like browsing Stack Overflow, checking questions, and returning myself a list of questions I am interested in... – Mp0int Nov 21 '11 at 12:27
  • @FallenAngel: And tomorrow SO changes its web layout, and all your regexes suddenly fail to match. Or even better, they still match, but they match the wrong content. And even if it doesn't, someone will happen to provide a code sample that your regex matches... – Tim Pietzcker Nov 21 '11 at 14:19
  • @Tim Pietzcker: I guess that's the worst thing that could happen, but still it's quite possible and dangerous – Mp0int Nov 21 '11 at 15:14
  • @FallenAngel: if the actual problem is more complex, all the more reason to use an HTML parser. A regex is the worse choice, not better. – outis Nov 22 '11 at 03:04