
I have code like this:

 <td class="check ABCD" rowspan="2"><center><div class="checkbox {{#if checked}}select{{else}}deselect{{/if}}" id="{{id}}" {{data "tool"}

<td class="check" rowspan="2"><center><div class="checkbox {{#if checked}}select{{else}}deselect{{/if}}" id="{{id}}" {{data "tool"}}>

I want to extract only the class and id values from the code above. I have very little experience with regular expressions in Python.

How can I extract only the class and id values (the ones between the double quotes) using a regular expression? Or is there a better way to do this? If so, please help me find it :)

Thanks in advance.

  • Does this have to be done with regex? – idjaw Mar 21 '16 at 06:06
  • @idjaw Is there any other way to extract it? Other than Regex? – Karthik Hegde Mar 21 '16 at 06:07
  • I don't know if this is part of a much bigger chunk of data. But this definitely looks like HTML, and if you are trying to parse through that, you should use something like [BeautifulSoup](https://pypi.python.org/pypi/beautifulsoup4) – idjaw Mar 21 '16 at 06:09
  • http://stackoverflow.com/questions/2612548/extracting-an-attribute-value-with-beautifulsoup this should help you – ashishmohite Mar 21 '16 at 06:10
  • You can convert the dom element into BeautifulStoneSoup object and then get the attribute values – ashishmohite Mar 21 '16 at 06:11
  • @idjaw Basically this is the difference of two similar .stache files. I want to eliminate the unwanted data from this diff. Anyhow, I will look into BeautifulSoap. Thanks :) – Karthik Hegde Mar 21 '16 at 06:12
  • Obligatory: [You cannot parse XHTML with regular expressions](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – Blckknght Mar 21 '16 at 06:13
  • @NEO-xx Thanks for the help. I will definitely look into it. I didn't know about BeautifulSoap ! – Karthik Hegde Mar 21 '16 at 06:14
  • @KarthikHegde: It is **Soup** not **Soap** (some programmers might mix both, though...) – Jan Mar 21 '16 at 06:50
  • @idjaw How do I parse with BeautifulSoup if that part of the code is a string? I mean, if it is stored in a .txt file? – Karthik Hegde Mar 21 '16 at 09:49

1 Answer


Since you asked for a Regex solution in Python, you'll get one:

import re
# For each line, capture the first class="..." value and the id="..." value.
p = re.compile(r'^.+?class="([^"]+)".+id="([^"]+)".+?$', re.MULTILINE)
test_str = '<td class="check ABCD" rowspan="2"><center><div class="checkbox {{#if checked}}select{{else}}deselect{{/if}}" id="{{id}}" {{data "tool"}\n<td class="check" rowspan="2"><center><div class="checkbox {{#if checked}}select{{else}}deselect{{/if}}" id="{{id}}" {{data "tool"}}>'

print(p.findall(test_str))  # [('check ABCD', '{{id}}'), ('check', '{{id}}')]

See live example over here: https://regex101.com/r/cG8dC5/1
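Note that a pattern anchored to whole lines like this picks up only the first class on each line (the one on the td), not the class on the div. If every class and id value is wanted, a looser variation might look like the sketch below. It is not part of the original answer: the names `ATTR_RE` and `extract_class_and_id` are made up for illustration, and it assumes the attribute values are always wrapped in double quotes, as in the question.

```python
import re

# Matches class="..." or id="..." and captures the attribute name and its value.
# Assumes double-quoted values; {{id}} inside a value is not matched as an attribute.
ATTR_RE = re.compile(r'\b(class|id)="([^"]*)"')


def extract_class_and_id(text):
    """Return, for each line of text, a list of (attribute, value) pairs."""
    return [ATTR_RE.findall(line) for line in text.splitlines()]


# The two lines from the question, including the truncated first line.
sample = (
    '<td class="check ABCD" rowspan="2"><center><div class="checkbox '
    '{{#if checked}}select{{else}}deselect{{/if}}" id="{{id}}" {{data "tool"}\n'
    '<td class="check" rowspan="2"><center><div class="checkbox '
    '{{#if checked}}select{{else}}deselect{{/if}}" id="{{id}}" {{data "tool"}}>'
)

for pairs in extract_class_and_id(sample):
    print(pairs)
```

Because the function takes a plain string, the same call should work on the contents of a text file once it has been read in with `open(...).read()`.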

Nevertheless, as some other users have already noted, regex isn't ideal for parsing (X)HTML. Better to have a look at: https://pypi.python.org/pypi/beautifulsoup4

netblognet
  • When you advise someone not to parse HTML with regex and post a solution nevertheless, isn't this somewhat lurking for rep ;) ? – Jan Mar 21 '16 at 06:25
  • Nope. It's the answer to the question, with a helpful piece of advice. If you ask me "how to make fire with a lens?", I'll answer: hold it between the sun and some straw. Nevertheless, it's not the best way. Better to use a lighter. So I think it's a valid answer to his question, and as long as I told him that there are better ways, this isn't lurking. – netblognet Mar 21 '16 at 06:31
  • I like the fire analogy (+1 for that), however I'd have bought him a lighter :) – Jan Mar 21 '16 at 06:34