regular expression: what's wrong with my expression?

Question

I am having difficulty building a regex.
Suppose there is an HTML clip like the one below.
I want to use JavaScript to extract each <tbody> block that contains the link "apple" (where the <a> is inside the <td class="by">). I constructed the following expression:

/<tbody.*?text[\s\S]*?<td class="by"[\s\S]*?<a.*?>apple<\/a>[\s\S]*?<\/tbody>/g

But the result is different from what I wanted: each match contains more than one <tbody> block. What should the expression be? (I tested it on https://regex101.com/ and got the unexpected selection, but I can't figure out the problem.) Regards!

   <tbody id="text_0">
        <td class="by">
                ...lots of other tags
            <a href="xxx">cat</a>
               ...lots of other tags
        </td>
    </tbody>
    <tbody id="text_1">
               ...lots of other tags
        <td class="by">
            <a href="xxx">apple</a>
        </td>
               ...lots of other tags
    </tbody>
    <tbody id="text_2">
               ...lots of other tags
        <td class="by">
            <a href="xxx">cat</a>
        </td>
               ...lots of other tags
    </tbody>
    <tbody id="text_3">
               ...lots of other tags
        <td class="by">
               ...lots of other tags
            <a href="xxx">tiger</a>
        </td>
               ...lots of other tags
    </tbody>
    <tbody id="text_4">
        <td class="by">
            <a href="xxx">banana</a>
        </td>
    </tbody>
    <tbody id="text_5">
        <td class="by">
            <a href="xxx">peach</a>
        </td>
    </tbody>
    <tbody id="text_6">
        <td class="by">
            <a href="xxx">apple</a>
        </td>
    </tbody>
    <tbody id="text_7">
        <td class="by">
            <a href="xxx">banana</a>
        </td>
    </tbody>

And this is what I expect to get:

<tbody id="text_1">
    <td class="by">
        <a href="xxx">apple</a>
    </td>
</tbody>
<tbody id="text_6">
    <td class="by">
        <a href="xxx">apple</a>
    </td>
</tbody>

Try putting it on regex101.com to see what is going wrong. For starters, the `text[\s\S]` doesn't make sense. – neuhaus, Mar 02 '16 at 14:15
Oh, sorry, the condition should also select only the <tbody> elements whose id begins with "text". There are lots of other <tbody> elements with other serial ids, but I didn't put them in the question. – zxiu, Mar 02 '16 at 14:16
Before I posted the question, I tested it on https://regex101.com/ and got the unexpected selection. I have no idea how to figure it out. – zxiu, Mar 02 '16 at 14:18
See this question on SO for more information about *why* regex won't work: https://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not – Nick, Mar 02 '16 at 17:01

neuhaus · Answer 1 · 2016-03-02T14:30:33.390

0

Start with this working regexp and go from there:

/<a href="(.*?)">apple<\/a>/g

If that is too broad and you want to make it more specific, add the next surrounding tag:

/<td.*?>\s*<a href="(.*?)">apple<\/a>/g

Then continue:

/<tbody.*?>\s*<td.*?>\s*<a href="(.*?)">apple<\/a>/g

Also, consider an alternative solution such as XPath. Regular expressions can't really parse all variations of HTML.
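The incremental approach above can be checked step by step in JavaScript. This is only a sketch: the `html` sample and the `href` values are made up for illustration, and it assumes (as the three patterns do) that the <td> follows the <tbody> with nothing but whitespace in between.

```javascript
// Hypothetical sample: <td> directly follows <tbody>, as the patterns assume.
const html = [
  '<tbody id="text_0">',
  '  <td class="by">',
  '    <a href="cat.html">cat</a>',
  '  </td>',
  '</tbody>',
  '<tbody id="text_1">',
  '  <td class="by">',
  '    <a href="apple.html">apple</a>',
  '  </td>',
  '</tbody>',
].join('\n');

// Step 1: the link only.
const step1 = /<a href="(.*?)">apple<\/a>/g;
// Step 2: add the enclosing <td>.
const step2 = /<td.*?>\s*<a href="(.*?)">apple<\/a>/g;
// Step 3: add the enclosing <tbody>.
const step3 = /<tbody.*?>\s*<td.*?>\s*<a href="(.*?)">apple<\/a>/g;

// Each step should still find exactly one "apple" link in the sample.
const counts = [step1, step2, step3].map((re) => (html.match(re) || []).length);

// The step 3 match starts at the opening <tbody> tag of the apple block.
const step3Match = html.match(step3)[0];

console.log(counts);
console.log(step3Match);
```

If a step suddenly stops matching, the piece added in that step is the one to look at, which is the point of building the expression up gradually.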

edited Mar 02 '16 at 14:30

answered Mar 02 '16 at 14:21

neuhaus


Yes, but I want to get the whole block from <tbody> .... </tbody> as the return value of string.match(reg) – zxiu Mar 02 '16 at 14:24
Well, then add it to the regular expression, as in putting `<tbody.*?>\s*<td.*?>\s*` in front of the starting regular expression I gave you. The point is that you have to build them up, starting with something that works – neuhaus Mar 02 '16 at 14:25
This is a simplified HTML clip. The target I am working on has lots of other tags between <tbody> and </tbody>, and I would like to select the whole <tbody> block with the apple inside – zxiu Mar 02 '16 at 14:26
I think I must put something like id="text.*? after <tbody, and before and after the <td> I need [\s\S]*? to match across line breaks – zxiu Mar 02 '16 at 14:28
As I mentioned, regular expressions are not ideal for this. – neuhaus Mar 02 '16 at 14:29
Should I use the DOM for this kind of HTML parsing/analysis, or is there a better (faster) way? It is related to the UI, and with the DOM a parse took almost 2 seconds, which made my boss turn green in the face. – zxiu Mar 02 '16 at 14:48
The latest answer works very well in this case. I will have to research this and the XPath you mentioned. Thanks to everyone who helped me! /<tbody.*?>\s*<td.*?>\s*<a href="(.*?)">apple<\/a>[\s\S]*?<\/tbody>/g – zxiu Mar 02 '16 at 14:57

score 0 · Answer 2 · answered Mar 02 '16 at 14:29

This is not an answer to the regex part of the question, but shouldn't the td elements be embedded in tr elements? tr stands for "table row", while tbody stands for "table body". tbody usually groups the table rows. It is not prohibited to have more than one tbody in the same table, but it is usually not necessary. (tbody is actually optional; you can have tr directly inside the table element.)

Tsundoku


The real HTML I am working on has the right structure, but it is huge, so I simplified it for the question. The difficulty for me is that I can't get the right selection when testing on https://regex101.com/ – zxiu Mar 02 '16 at 14:31

score 0 · Answer 3 · answered Mar 02 '16 at 14:30

First, Regex is not a good solution for parsing anything like HTML or XML.

I can fix your pattern to work with this specific example but I can't guarantee that it will work in all cases. Regex just is not the right tool for the job.

But anyway, replace the first 2 instances of [\s\S] in your pattern with [^<].

<tbody.*?text[^<]*?<td class="by"[^<]*?<a.*?>apple<\/a>[\s\S]*?</tbody>
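To see the difference between the two patterns, they can be run side by side. This is a rough sketch on a shortened, made-up version of the clip from the question; `countBlocks` is a hypothetical helper that counts how many <tbody> openings ended up inside each match. With `[\s\S]*?` the match can start at a block without "apple" and run across `</tbody>` into the next block, while `[^<]*?` stops at the first tag.

```javascript
// Shortened, hypothetical version of the clip from the question.
const html = `
<tbody id="text_0">
    <td class="by">
        <a href="xxx">cat</a>
    </td>
</tbody>
<tbody id="text_1">
    <td class="by">
        <a href="xxx">apple</a>
    </td>
</tbody>
<tbody id="text_2">
    <td class="by">
        <a href="xxx">banana</a>
    </td>
</tbody>
<tbody id="text_3">
    <td class="by">
        <a href="xxx">apple</a>
    </td>
</tbody>`;

// Pattern from the question: [\s\S]*? may cross block boundaries.
const original = /<tbody.*?text[\s\S]*?<td class="by"[\s\S]*?<a.*?>apple<\/a>[\s\S]*?<\/tbody>/g;
// Pattern with the first two [\s\S] replaced by [^<].
// In a regex literal the slash in </tbody> has to be escaped.
const fixed = /<tbody.*?text[^<]*?<td class="by"[^<]*?<a.*?>apple<\/a>[\s\S]*?<\/tbody>/g;

// Hypothetical helper: number of <tbody> openings inside one match.
const countBlocks = (s) => (s.match(/<tbody/g) || []).length;

const originalMatches = html.match(original) || [];
const fixedMatches = html.match(fixed) || [];

console.log(originalMatches.map(countBlocks)); // blocks per match, original pattern
console.log(fixedMatches.map(countBlocks));    // blocks per match, fixed pattern
```

Note that the fix depends on there being no tags between `<tbody ...>` and `<td class="by">`, nor between that `<td>` and the `<a>`; anything beginning with `<` in those gaps makes the match fail.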

Nick


I tried with the DOM and it works well; the only problem is that it is very slow, which gives my boss a bad feeling and me too. When I tried with regex it was much, much faster, but without the right result. The real HTML I work on can be a few hundred KB. – zxiu Mar 02 '16 at 14:33
Thanks, it works like magic. I will try to learn from your answer and make it work with my real "html clip". Thank you very much again!! You saved my day! – zxiu Mar 02 '16 at 14:35
Sorry, it does not work if there are other, unpredictable tags inside the block; for example, the second part will not be selected if extra tags precede the apple link. – zxiu Mar 02 '16 at 14:40
Yes, that is correct. Now you see why everyone is saying not to use regex. *Regular* expressions only work for *regular* languages, and HTML is not a regular language. This is like trying to drive a nail into a board using a screwdriver instead of a hammer. You're using the wrong tool for the job. – Nick Mar 02 '16 at 17:17
Agreed. In the end I used XPath to solve the problem. Speed-wise it is a little slower than the regex but much faster than DOM/jQuery selectors. – zxiu Mar 03 '16 at 16:26
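For anyone who still wants to stay with plain string matching, the limitation discussed in these comments (extra tags in between) can be reduced by working in two steps: first isolate each <tbody> block, then test each block on its own, so nothing can match across a `</tbody>`. This is only a sketch, not a real HTML parser. `tbodyBlocksWithLink` is a hypothetical helper, the sample markup is made up, and it assumes there are no nested tables inside a block.

```javascript
// Hypothetical helper: returns every <tbody> block whose id starts with "text"
// and that contains, after a <td class="by">, a link with exactly the given text.
// Assumes no nested tables, so the first </tbody> always closes the block.
function tbodyBlocksWithLink(html, linkText) {
  // Step 1: isolate the blocks.
  const blocks = html.match(/<tbody\b[^>]*>[\s\S]*?<\/tbody>/g) || [];

  // Step 2: build the per-block checks.
  const idPattern = /^<tbody\b[^>]*\bid="text/;
  // Escape regex metacharacters in the link text.
  const escaped = linkText.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  // Other tags may sit between the <td> and the <a>; the search cannot leave
  // the block because each block is tested separately.
  const linkPattern = new RegExp(
    '<td class="by"[^>]*>[\\s\\S]*?<a\\b[^>]*>' + escaped + '<\\/a>'
  );

  return blocks.filter((b) => idPattern.test(b) && linkPattern.test(b));
}

// Made-up sample with extra tags around the link.
const html = `
<tbody id="text_0">
    <tr><td class="by"><span>x</span><a href="xxx">cat</a></td></tr>
</tbody>
<tbody id="text_1">
    <tr><td class="other">n/a</td>
    <td class="by"><span>x</span> <a href="xxx">apple</a></td></tr>
</tbody>
<tbody id="other_2">
    <tr><td class="by"><a href="xxx">apple</a></td></tr>
</tbody>`;

const result = tbodyBlocksWithLink(html, 'apple');
console.log(result.length);
console.log(result[0]);
```

The check is still loose (it would also accept an "apple" link in a later cell of the same block), so a DOM or XPath query remains the more reliable choice when correctness matters more than speed.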
