
I want to use a regular expression (PCRE-compatible) to select a table cell in an XML or HTML file.
The cell spans several lines and contains other elements with their attributes and values. It is
always the cell in the last column.

For various reasons I can't, and don't want to, use the ". matches newline" option.

For example, in this code:
EDITED:

<table colcount="4">
<tr>
    <td colspan="2">
        <para><text> Mike</text></para>
    </td>
    <td>
        <tab />
    </td>
    <td1>
        <para><text>Jack</text></para>
        <para><text>Sarah</text></para>
    </td>
</tr1>
<tr>
    <td>
        <para><text>Bob</text></para>
        <para><text>Rita</text></para>
    </td>
    <td2 colspan="3" with>
        <para><text>Helen</text></para>
    </td>
</tr2>
<tr>
    <td style="with:445px;">
        <para><text>Sam</text></para>
    </td>
    <td>
        <para><text>Emma</text></para>
        <para><text>George</text></para>
    </td>
    <td>
    </td>
    <td3 colspan="">
        <tab />
    </td>
</tr3>
</table>

/EDITED

I want to find and select the whole last cell, including its start and end tags (<td and </td>),
plus the end tag of the corresponding row (</tr>), that is:

EDITED:

Here is what I want to select in the table like above using RegEx:

Either from <td1 to </tr1> - or from <td2 to </tr2> - or from <td3 to </tr3>

/EDITED

The formatting (indentation and line breaks) has to be preserved; I can't, for example, move
</tr> directly after the closing tag of the cell (</td>).
Indentation uses only space characters.

Thanks for any help...

M. T.
  • Use an XML/HTML parser. Don't use regular expressions. Which language are you using? – Alex Filipovici Aug 28 '13 at 16:26
  • You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. http://stackoverflow.com/a/1732454/1906508 – revo Aug 28 '13 at 16:35
  • Actually I'm not using any specific language, so I'm not trying to parse them. Unfortunately I'm not very familiar with XSLT transformations. I searched and found some here, though so far none worked as I wanted. If it helps: what I use now is Notepad++ and its NppExec plugin with Scintilla scripting. I have done some scripting but had a problem selecting just the last cell in the last column, with the cell's tags plus the closing tag of that row. – M. T. Aug 28 '13 at 20:39
  • As you have been told several times, regexes are the wrong tool for this. Use XSLT or whichever xml-parsing toolkit is available to your favourite language (e.g. gpath in Groovy). Regex matching will fail and fail again at this. If you are not familiar with any such tool, learn one. – itsbruce Aug 28 '13 at 22:32

1 Answer


The best you can do with a regex is:

<td(([^<]|<(?!\/td>))*)<\/td>\s*<\/tr>(?!(.|\r|\n)*<tr)

But this is rather ugly and resource-intensive, and it breaks when you have nested tables. A better route is indeed to use an XML or HTML parser for whichever programming language you're using.
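
As a rough illustration of the parser route, here is a minimal sketch in Python using the standard-library xml.etree.ElementTree module. Python is only an assumption here (the question does not name a language), and SAMPLE is a cleaned-up, well-formed version of the question's table with plain <td> and <tr> tags. Note that a parser hands back elements, not the exact source text, so this does not by itself preserve indentation the way an editor selection would.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed version of the table from the question.
SAMPLE = """<table colcount="4">
<tr>
    <td colspan="2">
        <para><text> Mike</text></para>
    </td>
    <td>
        <tab />
    </td>
    <td>
        <para><text>Jack</text></para>
        <para><text>Sarah</text></para>
    </td>
</tr>
<tr>
    <td>
        <para><text>Bob</text></para>
    </td>
    <td colspan="3">
        <para><text>Helen</text></para>
    </td>
</tr>
</table>
"""

root = ET.fromstring(SAMPLE)

# For every row, take the last direct <td> child (rows without cells are skipped).
last_cells = [
    row.findall("td")[-1]
    for row in root.iter("tr")
    if row.findall("td")
]

# Collect the text of each <text> element inside those cells.
names = [
    [(t.text or "").strip() for t in cell.iter("text")]
    for cell in last_cells
]
print(names)
```

Because the last cell is found by position among the row's children, nested tables and extra attributes do not affect the result.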

If you want to select the last cell from EVERY row, as your updated question suggests, leave out the negative lookahead like so:

<td(([^<]|<(?!\/td>))*)<\/td>\s*<\/tr>

Working example here: http://refiddle.co/gt2
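
Both patterns above can also be checked locally. The following is a small sketch using Python's re module, which is an assumption on my part (the question targets PCRE, but these two patterns use only syntax that re shares with PCRE). SAMPLE is a shortened, cleaned-up version of the question's table with plain <td> and <tr> tags, and the variable names are just illustrative.

```python
import re

# Hypothetical, shortened version of the table from the question.
SAMPLE = """<table colcount="4">
<tr>
    <td colspan="2">
        <para><text> Mike</text></para>
    </td>
    <td>
        <para><text>Jack</text></para>
    </td>
</tr>
<tr>
    <td>
        <para><text>Bob</text></para>
    </td>
    <td colspan="3">
        <para><text>Helen</text></para>
    </td>
</tr>
</table>
"""

# Last cell of every row: a <td ... </td> that is followed only by
# whitespace and </tr>. No ". matches newline" option is needed because
# [^<] already matches line breaks.
LAST_CELL_EVERY_ROW = re.compile(r"<td(([^<]|<(?!\/td>))*)<\/td>\s*<\/tr>")

# Last cell of the last row only: same pattern, plus a negative lookahead
# that rejects the match if another <tr appears anywhere later.
LAST_CELL_LAST_ROW = re.compile(
    r"<td(([^<]|<(?!\/td>))*)<\/td>\s*<\/tr>(?!(.|\r|\n)*<tr)"
)

every_row = [m.group(0) for m in LAST_CELL_EVERY_ROW.finditer(SAMPLE)]
last_row = [m.group(0) for m in LAST_CELL_LAST_ROW.finditer(SAMPLE)]

for match in every_row:
    print(repr(match))
```

Each match runs from the opening <td of the last cell through the row's </tr>, with the original indentation and line breaks intact.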

asontu
  • **Thank you** @funkwurm, your next solution was _excellent_ and it worked. You saved me a lot of time. I'll mark your answer as the final solution. I also tried to check the example you linked, but I couldn't access it; it said "_You are not authorized to access this page_". Maybe it's because of my current location. No matter. I also want to thank everyone else who tried to help me here. – M. T. Aug 29 '13 at 20:48