
Possible Duplicate:
RegEx match open tags except XHTML self-contained tags

I need to extract the contents of multiple <td> elements inside a single <tr> using a regex in Python.

For example:

<tr>
  <td>data 1</td>
  <td>data 2</td>
  <td>data 3</td>
</tr>

I want to extract data 1, data 2, and data 3 using a single regular expression, and there can be any number of <td> tags.

Currently I'm using multiple regexes, i.e. first I match <tr></tr> and then <td></td>.

Can I do it in a single expression?

I want to achieve this using regex, so I can't use beautiful soup or other html parsers.

borngold
  • you should parse html with html parser and not with regex – zenpoy Sep 04 '12 at 14:47
  • as long as you can guarantee no nested tags you are probably alright...but regex _CANNOT_ match nested items correctly... – Joran Beasley Sep 04 '12 at 14:48
  • *"I want to achieve this using regex, so I can't use beautiful soup or other html parsers."*. Sorry, but that's a bit like asking for help stabbing out your eyes with a fork. You generally do not want to do that, because it's not a good idea! Do you have any more compelling reasons to not want to use a HTML parser? – Martijn Pieters Sep 04 '12 at 14:50
  • "i want to achieve this using regex , so i can't use beautiful soup or other html parsers." Why is using a regex so important for you? You **really** shouldn't do it. Please read *[this](http://stackoverflow.com/a/1732454/1248554)*! – BrtH Sep 04 '12 at 14:51
  • While @zenpoy is *absolutely* right, this question isn't actually a duplicate. The "close for exact duplicate" selection shouldn't be used just for linking to a question/answer that the OP should read. – David Robinson Sep 04 '12 at 14:51

2 Answers


Although, as others have suggested, you should be parsing your HTML with something designed for the task, the following will work for a subset of cases:

re.findall(r'(?i)<td.*?>([^<]+)</td.*?>', input_str)

Depending on the format of your HTML input, you may need to convert it to a string before working with re.findall(). The following will read from file.html and store any matches in a list called data:

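import re

# Read the whole file into a string; the with-block closes it automatically.
with open('file.html', 'r') as fh:
    input_str = fh.read()

# Collect the text of every <td> element that contains no nested tags.
data = re.findall(r'(?i)<td.*?>([^<]+)</td.*?>', input_str)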
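
As a quick check, here is a minimal sketch that applies the same pattern to the sample row from the question, using an inline string instead of a file. The variable names sample_html and cells are only illustrative, and the pattern still assumes that no cell contains nested tags.

```python
import re

# Sample markup copied from the question (an assumption: cells hold plain text only).
sample_html = """<tr>
  <td>data 1</td>
  <td>data 2</td>
  <td>data 3</td>
</tr>"""

# findall returns the contents of the single capture group for every match.
cells = re.findall(r'(?i)<td.*?>([^<]+)</td.*?>', sample_html)
print(cells)
```

Note that this collects every <td> in the input, regardless of which <tr> it belongs to.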
tojrobinson

EDIT: I get it, The Pony, He Comes. I too spread the word every now and then, as I absolutely agree with the view. But that view seemed more than sufficiently expressed in the comments above, so I aimed simply to answer the literal question, "Can I do it in [a] single expression?" With a simple: "No, except in .NET, so move along."


To answer your actual question:

No, you cannot do it in a single expression, unless you're using .NET, which, so I've heard, provides captures for each instance matched within a quantified expression.

The best you can do is a finite, non-arbitrary repetition, e.g.

 /<tr>(?:\s*<td>(.*?)</td>)?(?:\s*<td>(.*?)</td>)?(?:\s*<td>(.*?)</td>)?(?:\s*<td>(.*?)</td>)?\s*</tr>/

Of course, the above is crude and doesn't take into account any other tags, comments, etc. I only mean to exemplify the "finite, non-arbitrary" part.
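
To illustrate the point in Python, here is a small sketch using the built-in re module and the sample row from the question. It shows that a quantified group keeps only its last repetition, and that the finite version yields one capture per written-out group. The names html, cell, last_only, and cells are only illustrative, and the limit of four cells is arbitrary.

```python
import re

# Sample markup copied from the question.
html = """<tr>
  <td>data 1</td>
  <td>data 2</td>
  <td>data 3</td>
</tr>"""

# A quantified group: the match succeeds, but group 1 holds only the final cell.
repeated = re.search(r'<tr>(?:\s*<td>(.*?)</td>)+\s*</tr>', html)
last_only = repeated.groups()

# Finite repetition: four optional copies of the cell pattern, one group each.
cell = r'(?:\s*<td>(.*?)</td>)?'
finite = re.search(r'<tr>' + cell * 4 + r'\s*</tr>', html)

# Groups that did not participate in the match come back as None.
cells = [c for c in finite.groups() if c is not None]

print(last_only)
print(cells)
```

A row with more than four cells would not match the finite pattern at all, which is the limitation being described.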

Andrew Cheong