How to map multiple tags in a tag using regex in Python

Question

Possible Duplicate:
RegEx match open tags except XHTML self-contained tags

I have to extract the data from multiple <td> tags in a single <tr> using regex in Python.

For example:

<tr>
  <td>data 1</td>
  <td>data 2</td>
  <td>data 3</td>
</tr>

I want to extract data 1, data 2, and data 3 using a single regular expression, and there can be any number of <td> tags.

Currently I'm using multiple regexes, i.e. first I match <tr></tr> and then <td></td>.

Can I do it in single expression?

I want to achieve this using regex, so I can't use beautiful soup or other html parsers.

as long as you can guarantee no nested tags you are probably alright...but regex _CANNOT_ match nested items correctly... — Joran Beasley, Sep 04 '12 at 14:48
*"I want to achieve this using regex, so I can't use beautiful soup or other html parsers."*. Sorry, but that's a bit like asking for help stabbing out your eyes with a fork. You generally do not want to do that, because it's not a good idea! Do you have any more compelling reasons to not want to use a HTML parser? — Martijn Pieters, Sep 04 '12 at 14:50
"i want to achieve this using regex , so i can't use beautiful soup or other html parsers." Why is using a regex so important for you? You **really** shouldn't do it. Please read *[this](http://stackoverflow.com/a/1732454/1248554)*! — BrtH, Sep 04 '12 at 14:51
While @zenpoy is *absolutely* right, this question isn't actually a duplicate. The "close for exact duplicate" selection shouldn't be used just for linking to a question/answer that the OP should read. — David Robinson, Sep 04 '12 at 14:51

score 3 · Answer 1 · edited Sep 28 '12 at 06:56

Although, as others have suggested, you should be parsing your HTML with something designed for the task, the following will work for a subset of cases:

re.findall(r'(?i)<td.*?>([^<]+)</td.*?>', input_str)

re.findall() works on a string, so if your HTML input is in a file you need to read it into one first. The following will read from file.html and store any matches in a list called data:

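import re

# Read the whole file into one string; the with block closes the file.
with open('file.html', 'r') as fh:
    input_str = fh.read()

# Capture the text of every <td> cell that contains no nested tags.
data = re.findall(r'(?i)<td.*?>([^<]+)</td.*?>', input_str)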
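
As a quick check, the pattern can be run against the sample row from the question. This is only a sketch: the names TD_PATTERN and sample_row are made up for illustration, and the second call shows one of the cases the pattern does not cover (a cell whose content starts with a nested tag).

```python
import re

# Pattern from this answer: captures the text of each <td> cell.
TD_PATTERN = r'(?i)<td.*?>([^<]+)</td.*?>'

# Sample row copied from the question.
sample_row = """<tr>
  <td>data 1</td>
  <td>data 2</td>
  <td>data 3</td>
</tr>"""

cells = re.findall(TD_PATTERN, sample_row)
print(cells)

# A cell whose content starts with a nested tag is not captured.
nested = re.findall(TD_PATTERN, '<td><b>bold</b></td>')
print(nested)
```

Because the capture group is [^<]+, any cell containing markup needs a different approach, which is where a real HTML parser becomes the safer choice.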

Andrew Cheong · Answer 2 · 2012-09-04T16:16:47.633

EDIT: I get it, The Pony, He Comes. I too spread the word every now and then, as I absolutely agree with the view. But that view seemed more than sufficiently expressed in the comments above, so I aimed simply to answer the literal question, "Can I do it in [a] single expression?" With a simple: "No, except in .NET, so move along."

To answer your actual question:

No, you cannot do it in a single expression, unless you're using .NET, which, so I've heard, provides captures for each instance matched within a quantified expression.

The best you can do is a finite, non-arbitrary repetition, e.g.

 /<tr>(?:\s*<td>(.*?)</td>)?(?:\s*<td>(.*?)</td>)?(?:\s*<td>(.*?)</td>)?(?:\s*<td>(.*?)</td>)?\s*</tr>/

Of course, the above is crude and doesn't take into account any other tags, comments, etc. I only mean to exemplify the "finite, non-arbitrary" part.
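
In Python, the same finite repetition might be sketched as follows. MAX_CELLS and the other names are assumptions for illustration only; the pattern is built by repeating the optional cell group a fixed number of times, so it is just as crude as the expression above.

```python
import re

MAX_CELLS = 4  # hypothetical upper bound on cells per row

# One optional, capturing <td> group repeated MAX_CELLS times.
ROW_PATTERN = r'<tr>' + r'(?:\s*<td>(.*?)</td>)?' * MAX_CELLS + r'\s*</tr>'

# Sample row copied from the question.
sample_row = """<tr>
  <td>data 1</td>
  <td>data 2</td>
  <td>data 3</td>
</tr>"""

match = re.search(ROW_PATTERN, sample_row)
# Groups that did not participate in the match come back as None.
cells = [g for g in match.groups() if g is not None] if match else []
print(cells)
```

A row with more than MAX_CELLS cells is not captured in full by this pattern, which is the practical cost of the "finite, non-arbitrary" limit.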
