Ashamed as I am to admit it, I'm terrible with regex, so here I am asking for your help :)

I have an HTML file that looks something like this:

<table>
  <tr>
    <td sadf="a">
      <a href="">asdf</a>
    </td>
  </tr>
</table>

What I'd like to do, with a Perl regex, is remove everything except the td element and its contents, so the output would be:

<td sadf="a">
  <a href="">asdf</a>
</td>

Please help me out. Thanks!

dolphy
Aelfhere
  • [Some people have mild opinions about this](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – dolphy Jun 22 '11 at 16:42
  • Read the 'Fools Rush In...' section of the regex tag wiki: http://stackoverflow.com/tags/regex/info – mrk Jun 22 '11 at 22:42

3 Answers

An HTML parser would be much better at this task, but if you insist on using a regular expression, try this:

<td[\s\S]*?</td>

After the literal <td, [\s\S]*? matches as few characters as possible (any character, newlines included) up to the first closing </td>.
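For reference, here is a minimal sketch of how that pattern might be applied in Perl. The `$html` string and the `@cells` array are names made up for illustration; in practice you would read the file contents into the string first.

```perl
use strict;
use warnings;

# Sample input from the question, held in a string for illustration.
my $html = <<'HTML';
<table>
  <tr>
    <td sadf="a">
      <a href="">asdf</a>
    </td>
  </tr>
</table>
HTML

# [\s\S] matches any character, newlines included; *? makes the match lazy,
# so each match stops at the first </td> it reaches.
my @cells = $html =~ m{(<td[\s\S]*?</td>)}g;

# Each element is one <td ...>...</td> block, with its original indentation.
print "$_\n" for @cells;
```

With the `/g` flag in list context, every non-nested td block in the input is collected, not just the first one.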

agent-j
  • Out of curiosity, could this handle nested tables within the data element? – dolphy Jun 22 '11 at 17:01
  • @Dolphy. This regex would not support nested tables. I think we can all agree that an html parser would be better suited for that task. – agent-j Jun 22 '11 at 17:30
  • Just curious! I'm no great shakes at regular expressions (notice my lack of even trying to come up with an answer), and this is getting me learning about this whole greedy thing. Sometimes, quick and dirty does the job :D – dolphy Jun 22 '11 at 17:46
  • @Dolphy, a greedy match would get the entire nested table, but if there were multiple TDs, it would swallow all of them as well. So, I chose lazy with the `*?`. I know that some flavors support depths (which could probably reliably parse x-html and xml), but I haven't looked into these. – agent-j Jun 22 '11 at 18:28
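
The greedy versus lazy trade-off discussed in these comments can be seen in a small Perl sketch. The `$row` and `$nested` strings are hypothetical inputs, not taken from the question.

```perl
use strict;
use warnings;

# Hypothetical row with two sibling cells.
my $row = '<tr><td>one</td><td>two</td></tr>';

my ($greedy) = $row =~ m{(<td[\s\S]*</td>)};   # runs on to the LAST </td>
my ($lazy)   = $row =~ m{(<td[\s\S]*?</td>)};  # stops at the FIRST </td>

print "greedy: $greedy\n";  # greedy: <td>one</td><td>two</td>
print "lazy:   $lazy\n";    # lazy:   <td>one</td>

# Hypothetical cell containing a nested table: the lazy match stops at the
# inner </td>, so the outer cell comes back truncated.
my $nested = '<td><table><tr><td>inner</td></tr></table></td>';
my ($cut) = $nested =~ m{(<td[\s\S]*?</td>)};

print "nested: $cut\n";     # nested: <td><table><tr><td>inner</td>
```

Neither quantifier handles both cases, which is why a parser is the safer choice once cells can nest.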

Try using XML::Simple. As others have pointed out, you can't reliably use regex for parsing XML. Note that XML::Simple only works if your HTML is also well-formed XML.

XML::Simple will turn your HTML into a hash structure. From there, you can easily locate the "td" element, and copy the whole thing to another hash reference. Then, you can use XML::Simple to turn it back into HTML.

XML::Simple can't guarantee that the XML it writes back out will be laid out exactly like the original (although it'll be programmatically the same). However, I rarely have problems with turning HTML into a hashref and back into HTML.
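
A rough sketch of that round trip, assuming the XML::Simple module from CPAN is installed and that the input is well-formed XML, as the sample in the question is. The variable names and the option choices (`KeyAttr`, `ForceArray`, `RootName`) are just one reasonable setup, not the only one.

```perl
use strict;
use warnings;
use XML::Simple;   # exports XMLin and XMLout

# Sample input from the question; it happens to be well-formed XML.
my $html = <<'HTML';
<table>
  <tr>
    <td sadf="a">
      <a href="">asdf</a>
    </td>
  </tr>
</table>
HTML

# Parse into nested hashes. The root element (<table>) is dropped by default,
# so the top-level hash holds its children.
my $data = XMLin($html, KeyAttr => [], ForceArray => 0);

# Walk down to the td element and serialise just that part again.
my $td  = $data->{tr}{td};
my $out = XMLout($td, RootName => 'td');

print $out;
```

With `ForceArray => 0`, a table that has several rows or cells would come back as array references instead of hashes, so the `$data->{tr}{td}` path only holds for single-cell input like this one.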

David W.

A simpler way of thinking of this is that you want to grab the tag part with a regular expression (rather than remove everything except the tag part).

In this case, the regular expression is simple, and would probably look something like this for the first line, for example: <td \w+?="\w*"> (you can match \n to grab a multiline block). It's hard to answer without knowing exactly what varies in your input, but if you follow a reference like this one you should be fine.

In addition, it is probably best to do this without regex at all (using an HTML parser instead) if it's anything more than a limited, specific grab. I'll assume you know that you want to use regex, but there are really much better ways of doing this if you've got something more complicated than a very basic search pattern on your hands.
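
As one possible illustration of the parser route, here is a sketch using HTML::TreeBuilder from CPAN. The answer above doesn't name a particular parser, so this choice is an assumption, and the module has to be installed separately.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sample input from the question.
my $html = <<'HTML';
<table>
  <tr>
    <td sadf="a">
      <a href="">asdf</a>
    </td>
  </tr>
</table>
HTML

# Build a parse tree; the parser copes with markup that is not valid XML.
my $tree = HTML::TreeBuilder->new_from_content($html);

# In list context, look_down returns every element that matches the criteria.
my @found = map { $_->as_HTML } $tree->look_down(_tag => 'td');

# Free the tree explicitly once the serialised cells have been collected.
$tree->delete;

print "$_\n" for @found;
```

The serialised output may differ from the source in whitespace, but the element, its attributes, and its children are preserved.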

Dylnuge