I want the text between the last /th> and /tbody>

Question

This is what I use

    output = System.Text.RegularExpressions.Regex.Replace(output, "(?s)/th>(.*?)</tbody>", "$1")

Notice that I am using (.*?) because I want the match to be non-greedy. There are several /th> around, and I want to get rid of the text above the LAST /th>.

This is what I got.

<!-- statistics_period -->


<input name="subForm" type="hidden" value="1">
<input name="hidTotal" type="hidden" value="886">

<div class="domlistframe">
<div class="divMainListingTable">
<table width="76%" align="left" class="mainListTable" cellspacing="0" cellpadding="3">
    <tbody><tr>
                                                                        <th nowrap="">&nbsp;<               
                                                        <th colspan="4">&nbsp;</th>



        <th id="sercol" nowrap="" colspan="11">Totals</th>

You see? Several /th> there.

Yes, I know full well the horrible consequences of parsing HTML with regular expressions, as described in "RegEx match open tags except XHTML self-contained tags".

I am mostly parsing tables anyway, and it's working.

Note: here is a simpler problem that is equivalent to the one above. Say I have a text like this:

cow cow cow chicken cat cow cat dog hello bla.

Say I want cat dog hello. That is text between the last cow and bla.

What would be the regular expression for that?

Notice I want the text between the LAST cow and bla.

Doing it

cow.*bla

will give me the whole text

Using cow.*?bla should give me what I want. However, as you can see from the sample I use, it didn't work.

in vb.net in a code. So output is a large html and I want a part of that html between the last /th and the the — user4951, Nov 09 '15 at 10:38

score 2 · Answer 1 · answered Nov 09 '15 at 10:48

HINT

Try the pattern:

.*cow((?!cow).*?)bla

for the cow..bla problem.

The leading .* skips everything until the last cow is encountered
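
To see the hint in action, here is a minimal VB.NET sketch run on the sample sentence from the question. The module and variable names are only for illustration, and the lookahead is left out to keep it short:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module LastCowDemo
    Sub Main()
        Dim sample As String = "cow cow cow chicken cat cow cat dog hello bla."

        ' The greedy .* first runs to the end of the input, then backtracks
        ' until "cow" can match, so the engine settles on the last "cow"
        ' that is still followed by "bla".
        Dim m As Match = Regex.Match(sample, ".*cow(.*?)bla")

        If m.Success Then
            ' Group 1 holds the text between the last "cow" and "bla".
            Console.WriteLine("[" & m.Groups(1).Value & "]")   ' [ cat dog hello ]
        End If
    End Sub
End Module
```

The captured text keeps the surrounding spaces; call Trim() on it if they are not wanted.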

hjpotter92

What does (?!cow) match? – user4951 Nov 09 '15 at 11:05
@JimThio It forces the captured group `((?!cow).*?)` to not have the word cow in it. `(?!` defines a negative lookahead. – hjpotter92 Nov 09 '15 at 11:05
Actually that's what I am looking for. It turns out it's not even necessary. Any reference what ?! is. – user4951 Nov 09 '15 at 11:15
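
For reference, a small sketch of how a negative lookahead behaves in .NET regexes. The test strings are made up for illustration. One detail it tries to show: the assertion is tested only at the position where it is written, so on its own it does not keep "cow" out of everything the group captures.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module LookaheadDemo
    Sub Main()
        ' (?!cow) succeeds only where the next three characters are not "cow".
        ' It consumes nothing, so "cat" is matched by the \w+ that follows.
        Console.WriteLine(Regex.IsMatch("cat", "^(?!cow)\w+$"))   ' True
        Console.WriteLine(Regex.IsMatch("cow", "^(?!cow)\w+$"))   ' False

        ' The lookahead is checked once, right after the first "cow". It sees
        ' " co" (note the leading space) and passes, so the lazy .*? is free
        ' to run across the second "cow".
        Dim m As Match = Regex.Match("cow cow bla", "^cow((?!cow).*?)bla")
        Console.WriteLine("[" & m.Groups(1).Value & "]")          ' [ cow ]
    End Sub
End Module
```

If the group really must not contain the word, the usual approach is to repeat the check at every character, for example ((?:(?!cow).)*?), though with a leading greedy .* that is rarely needed.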

score 0 · Answer 2 · answered Nov 09 '15 at 11:19

This is only a partial answer. Basically I solved the problem by using the technique hjpotter92 uses.

What I did is

    output = System.Text.RegularExpressions.Regex.Replace(output, "(?s).*/th>(.*?)</tbody>", "$1")

Because the first .* is greedy, it matches as much as it can, so the /th> that follows it ends up being the last /th> that is still followed by </tbody>.
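
A rough sketch of the same pattern on a tiny, made-up HTML fragment (a hypothetical stand-in for the real page). It also shows one thing to watch for: Replace only swaps the matched span for group 1, so anything after </tbody> stays in the result, whereas Regex.Match with Groups(1) returns just the captured part.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module LastThDemo
    Sub Main()
        ' Hypothetical, shortened stand-in for the real page.
        Dim html As String =
            "<table><tbody><tr>" & Environment.NewLine &
            "<th>A</th>" & Environment.NewLine &
            "<th>Totals</th>" & Environment.NewLine &
            "</tr><tr><td>886</td></tr></tbody></table><p>footer</p>"

        Dim pattern As String = "(?s).*/th>(.*?)</tbody>"

        ' Replace keeps everything that the pattern did not match,
        ' including the text that follows </tbody>.
        Dim replaced As String = Regex.Replace(html, pattern, "$1")

        ' Match + Groups(1) gives only the text between the last /th> and </tbody>.
        Dim captured As String = Regex.Match(html, pattern).Groups(1).Value

        Console.WriteLine(replaced.EndsWith("</table><p>footer</p>"))   ' True
        Console.WriteLine(captured.Trim())   ' </tr><tr><td>886</td></tr>
    End Sub
End Module
```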

One question remains: why doesn't my original code work?

I suspect it has something to do with the fact that regular expressions work from left to right. Again, any input would be welcome.

I would also like to thank hjpotter92 for telling me about the negative lookahead in regex.

Hmmm... Well, this answer does answer the question of what I should do to make it work, and now it's working. However, it's based on the other answer. Which one should I pick as the answer?
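
As far as I understand it, the left-to-right guess is right: the engine reports the leftmost position where a match is possible, and a lazy quantifier only changes where the match ends, not where it starts. A minimal sketch on the cow sentence from the question (variable names are just for illustration):

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module LeftmostDemo
    Sub Main()
        Dim sample As String = "cow cow cow chicken cat cow cat dog hello bla."

        ' Start positions are tried from left to right. The first "cow" can
        ' already reach "bla", so the match starts at index 0; the lazy .*?
        ' only makes it stop at the nearest "bla".
        Dim lazyMatch As Match = Regex.Match(sample, "cow(.*?)bla")
        Console.WriteLine(lazyMatch.Index)   ' 0
        Console.WriteLine("[" & lazyMatch.Groups(1).Value & "]")
        ' [ cow cow chicken cat cow cat dog hello ]

        ' A leading greedy .* swallows the earlier occurrences, so the
        ' literal "cow" lines up with the last one before "bla".
        Dim lastMatch As Match = Regex.Match(sample, ".*cow(.*?)bla")
        Console.WriteLine("[" & lastMatch.Groups(1).Value & "]")   ' [ cat dog hello ]
    End Sub
End Module
```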
