-4

I am having difficulty building a regex.
Suppose there is an HTML clip as below.
I want to use JavaScript to cut out each <tbody> block that contains the link "apple" (where the <a> is inside the <td class="by">). I constructed the following expression:

/<tbody.*?text[\s\S]*?<td class="by"[\s\S]*?<a.*?>apple<\/a>[\s\S]*?<\/tbody>/g

But the result is different from what I wanted: each match contains more than one <tbody> block. How should the pattern be written? (I tested with https://regex101.com/ and got the unexpected selection, but I can't figure out the problem.) Regards!

    <tbody id="text_0">
        <td class="by">
                ...lots of other tags
            <a href="xxx">cat</a>
               ...lots of other tags
        </td>
    </tbody>
    <tbody id="text_1">
               ...lots of other tags
        <td class="by">
            <a href="xxx">apple</a>
        </td>
               ...lots of other tags
    </tbody>
    <tbody id="text_2">
               ...lots of other tags
        <td class="by">
            <a href="xxx">cat</a>
        </td>
               ...lots of other tags
    </tbody>
    <tbody id="text_3">
               ...lots of other tags
        <td class="by">
               ...lots of other tags
            <a href="xxx">tiger</a>
        </td>
               ...lots of other tags
    </tbody>
    <tbody id="text_4">
        <td class="by">
            <a href="xxx">banana</a>
        </td>
    </tbody>
    <tbody id="text_5">
        <td class="by">
            <a href="xxx">peach</a>
        </td>
    </tbody>
    <tbody id="text_6">
        <td class="by">
            <a href="xxx">apple</a>
        </td>
    </tbody>
    <tbody id="text_7">
        <td class="by">
            <a href="xxx">banana</a>
        </td>
    </tbody>

And this is what I expect to get:

<tbody id="text_1">
    <td class="by">
        <a href="xxx">apple</a>
    </td>
</tbody>
<tbody id="text_6">
    <td class="by">
        <a href="xxx">apple</a>
    </td>
</tbody>
Mr Lister
zxiu
  • Try putting it on regex101.com to see what is going wrong. For starters, the `text[\s\S]` doesn't make sense. – neuhaus Mar 02 '16 at 14:15
  • Oh, sorry, the condition also selects the <tbody> whose id begins with "text". There are lots of other <tbody> elements with other serial ids, but I didn't put them in the question. – zxiu Mar 02 '16 at 14:16
  • Before I posted the question, I tested with https://regex101.com/ and got the unexpected selection. I have no idea how to figure it out. – zxiu Mar 02 '16 at 14:18
  • Include the link to regex101.com in your question. – neuhaus Mar 02 '16 at 14:18
  • See this question on SO for more information about *why* regex won't work: https://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not – Nick Mar 02 '16 at 17:01

3 Answers

0

Start with this working regexp and go from there:

/<a href="(.*?)">apple<\/a>/g

If that is too broad and you want to make it more specific, add the next surrounding tag:

/<td.*?>\s*<a href="(.*?)">apple<\/a>/g

Then continue:

/<tbody.*?>\s*<td.*?>\s*<a href="(.*?)">apple<\/a>/g

Also, consider an alternative solution such as XPath. Regular expressions can't really parse all variations of HTML.
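
As a rough illustration of the XPath route, here is a minimal sketch. It assumes `doc` is a browser Document (for example one produced by parsing the HTML string), and the helper name `selectTbodiesByLink` is made up for this example. It also assumes the link text contains no double quotes. The numeric constant 7 is the value of XPathResult.ORDERED_NODE_SNAPSHOT_TYPE.

```javascript
// Hypothetical helper: returns every <tbody id="text..."> that contains
// a link with the given text inside a <td class="by">.
// `doc` is assumed to be a browser Document exposing document.evaluate().
function selectTbodiesByLink(doc, linkText) {
  // Assumption: linkText contains no double quotes (XPath 1.0 has no escaping).
  const xpath =
    '//tbody[starts-with(@id, "text")]' +
    '[.//td[@class="by"]//a[normalize-space(.) = "' + linkText + '"]]';

  // Same value as XPathResult.ORDERED_NODE_SNAPSHOT_TYPE in browsers.
  const ORDERED_NODE_SNAPSHOT_TYPE = 7;

  const snapshot = doc.evaluate(xpath, doc, null, ORDERED_NODE_SNAPSHOT_TYPE, null);

  // Copy the snapshot into a plain array for convenience.
  const nodes = [];
  for (let i = 0; i < snapshot.snapshotLength; i++) {
    nodes.push(snapshot.snapshotItem(i));
  }
  return nodes;
}
```

In a browser this could be called as selectTbodiesByLink(document, "apple"), and each returned node's outerHTML would give the block as text. The `.//td` step is deliberately loose, because an HTML parser is likely to insert the missing <tr> elements around the <td> cells.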

neuhaus
0

This is not an answer to the regex part of the question, but shouldn't the td elements be embedded in tr elements? tr stands for "table row", while tbody stands for "table body". tbody usually groups the table rows. It is not prohibited to have more than one tbody in the same table, but it is usually not necessary. (tbody is actually optional; you can have tr directly inside the table element.)

Tsundoku
  • The real HTML I am working on has the right structure, but it is very large, so I simplified it for the question. The difficulty for me is that I can't get the right selection when testing in https://regex101.com/ – zxiu Mar 02 '16 at 14:31
0

First, Regex is not a good solution for parsing anything like HTML or XML.

I can fix your pattern to work with this specific example but I can't guarantee that it will work in all cases. Regex just is not the right tool for the job.

But anyway, replace the first 2 instances of [\s\S] in your pattern with [^<], so that those two gaps can no longer run across a tag into a later block.

<tbody.*?text[^<]*?<td class="by"[^<]*?<a.*?>apple<\/a>[\s\S]*?<\/tbody>
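
To see the difference, here is a small sketch that runs the original pattern and the adjusted one against a reduced copy of the sample (three blocks only; the variable names are made up for this example). The lazy `[\s\S]*?` in the original is still free to cross a `</tbody>` when the first block has no "apple" link, which is why its match starts at text_0. The last check reproduces the limitation of `[^<]`: an extra tag between the <td> and the <a> stops the adjusted pattern from matching at all.

```javascript
// Reduced version of the sample from the question.
const html = [
  '<tbody id="text_0">',
  '    <td class="by">',
  '        <a href="xxx">cat</a>',
  '    </td>',
  '</tbody>',
  '<tbody id="text_1">',
  '    <td class="by">',
  '        <a href="xxx">apple</a>',
  '    </td>',
  '</tbody>',
  '<tbody id="text_2">',
  '    <td class="by">',
  '        <a href="xxx">cat</a>',
  '    </td>',
  '</tbody>',
].join("\n");

// Pattern from the question: the gaps may span any characters, tags included.
const original = /<tbody.*?text[\s\S]*?<td class="by"[\s\S]*?<a.*?>apple<\/a>[\s\S]*?<\/tbody>/g;
// Adjusted pattern: the first two gaps may not contain "<".
const fixed = /<tbody.*?text[^<]*?<td class="by"[^<]*?<a.*?>apple<\/a>[\s\S]*?<\/tbody>/g;

const originalMatches = html.match(original) || [];
const fixedMatches = html.match(fixed) || [];

// The single original match begins at text_0 and swallows text_1 as well.
console.log(originalMatches[0].startsWith('<tbody id="text_0">'));
// The adjusted pattern returns the text_1 block on its own.
console.log(fixedMatches[0].startsWith('<tbody id="text_1">'));

// Limitation: put any extra tag between <td class="by"> and the link.
const htmlWithExtraTag = html.replace(
  '<td class="by">\n        <a href="xxx">apple',
  '<td class="by">\n        <span>new</span>\n        <a href="xxx">apple'
);
const extraTagMatches = htmlWithExtraTag.match(fixed) || [];
console.log(extraTagMatches.length); // the adjusted pattern no longer matches
```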

Nick
  • I tried with the DOM and it works well, with one problem: it is very slow, which gives my boss a bad feeling, and me too. When I tried with regex it was much, much faster, but without the right result. The real HTML I work on could be a few hundred KB. – zxiu Mar 02 '16 at 14:33
  • Thanks, it works like magic. I will try to learn from your answer and make it work with my real "html clip". Thank you very much again!! You saved my day! – zxiu Mar 02 '16 at 14:35
  • Sorry, it does not work if there are other, unpredictable tags inside of <td>; for example, if the second block has extra tags before the link, the apple block will not be selected. – zxiu Mar 02 '16 at 14:40
  • Yes, that is correct. Now you see why everyone is saying not to use Regex. *Regular* expressions only work for *regular* languages and HTML is not a regular language. This is like trying to drive a nail into a board using a screwdriver instead of a hammer. You're using the wrong tool for the job. – Nick Mar 02 '16 at 17:17
  • Agreed. In the end I used XPath to solve the problem. Speed-wise it is a little slower than the regex but much faster than DOM/jQuery selectors. – zxiu Mar 03 '16 at 16:26
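
For readers who still want a regex-only fallback, the extra-tag case raised in the comments can be sidestepped by working in two steps: first cut the input into individual <tbody> blocks, then test each block for the link. This is a different technique from the single pattern in the answer, and still only a heuristic sketch. It assumes <tbody> elements are never nested (no tables inside tables) and that the attributes are written exactly as in the sample. The function name `findTbodyBlocks` is made up for this example.

```javascript
// Hypothetical helper: returns the <tbody id="text..."> blocks whose
// <td class="by"> cell contains a link with the given text.
function findTbodyBlocks(html, linkText) {
  // Step 1: one match per block. The lazy body stops at the first </tbody>,
  // which is only safe if <tbody> elements are never nested.
  const blockPattern = /<tbody\b[^>]*\bid="text[^"]*"[^>]*>[\s\S]*?<\/tbody>/g;
  const blocks = html.match(blockPattern) || [];

  // Step 2: look for the link inside the "by" cell of each block.
  // Escape regex metacharacters in the link text before embedding it.
  const escaped = linkText.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const linkPattern = new RegExp(
    '<td class="by"[^>]*>' +
      "(?:(?!<\\/td>)[\\s\\S])*?" + // any content, but never past </td>
      "<a\\b[^>]*>\\s*" + escaped + "\\s*<\\/a>"
  );
  return blocks.filter((block) => linkPattern.test(block));
}

const sample = [
  '<tbody id="text_0">',
  '    <td class="by">',
  '        <a href="xxx">cat</a>',
  '    </td>',
  '</tbody>',
  '<tbody id="text_1">',
  '    <td class="by">',
  '        <span>new</span>',
  '        <a href="xxx">apple</a>',
  '    </td>',
  '</tbody>',
  '<tbody id="other_2">',
  '    <td class="by">',
  '        <a href="xxx">apple</a>',
  '    </td>',
  '</tbody>',
].join("\n");

const found = findTbodyBlocks(sample, "apple");
console.log(found.length); // 1: only text_1 has both a "text" id and the link
```

Because each block is isolated before the link test runs, a miss in one block cannot drag the match into the next one, which was the problem with the original single pattern.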