I have an HTML document as a string. I want to find every table element (its opening and closing tags) with a regex. I tried the pattern <table(.*?)>.*</table>, but it doesn't work: it matches everything between the first opening table tag and the last closing table tag.

Here is my code:

Pattern pattern = Pattern.compile("<table(.*?)>.*</table>");

and also I've tried:

Pattern pattern = Pattern.compile("<table(.*?)>.*</table>",Pattern.DOTALL);

Here is an example:

    <table id="table1">
    </table>
    <table id="table2">
       <table id="table3">
       </table>
    </table>

My pattern matches everything from the <table id="table1"> opening tag to table2's closing tag.

But I want it to match every table element together with its tags, for example table1's opening and closing tags, table2's opening and closing tags, and so on.

Thanks for your answers.

eLRuLL
Veysel
  • Parsing HTML with regular expressions is considered bad practice. You should use a sophisticated HTML parser instead. See [Using regular expressions to parse HTML: why not?](https://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not) – vanje Dec 27 '17 at 12:19

2 Answers

I think there is a small logic problem here. The following regex captures each opening table tag in a group:

\s*(<table.*>)

It can't match the corresponding closing tag, though. Since the closing tag is always the literal </table>, you can simply append it for the child tables and fix up the parent tables manually.
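As a rough illustration of the idea above, here is a minimal Java sketch that collects only the opening tags. Note that it is a variation, not the exact pattern shown: it uses `[^>]*` instead of `.*` so that two tags on the same line are not merged into one match. The class name `OpeningTableTags` and the method `findOpeningTags` are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OpeningTableTags {
    // [^>]* stops at the first '>', so each match is exactly one opening tag.
    // \b keeps the pattern from matching other tag names that start with "table".
    private static final Pattern OPEN_TABLE = Pattern.compile("<table\\b[^>]*>");

    // Hypothetical helper: returns every opening <table ...> tag in document order.
    static List<String> findOpeningTags(String html) {
        List<String> tags = new ArrayList<>();
        Matcher m = OPEN_TABLE.matcher(html);
        while (m.find()) {
            tags.add(m.group());
        }
        return tags;
    }

    public static void main(String[] args) {
        String html = "<table id=\"table1\">\n</table>\n"
                + "<table id=\"table2\">\n  <table id=\"table3\">\n  </table>\n</table>\n";
        for (String tag : findOpeningTags(html)) {
            System.out.println(tag);
        }
    }
}
```

This only locates the opening tags; it says nothing about which `</table>` belongs to which table.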

tomersss2
  • Your pattern works fine for finding each opening table tag, but I don't understand how to match the closing tag. Can you explain with an example? – Veysel Dec 27 '17 at 12:22
  • You don't need to match closing tags with a pattern; they are fixed, so just add them as they are, or search for them literally. A regex can't find the matching closing tag for each table if there is more than one level of nesting. – tomersss2 Dec 27 '17 at 13:22

I think there is no good solution to your question, because you can't parse HTML with a regex.

Take a look at this answer:

Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML. HTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down HTML into its meaningful parts.

https://stackoverflow.com/a/1732454/2801860
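To make the nesting problem concrete, here is a hedged Java sketch of one possible workaround: the regex only finds individual table tags, and a stack outside the regex pairs each closing tag with the most recent unclosed opening tag. This is not a real HTML parser. It assumes that `<table` and `</table>` never appear inside comments, scripts, or attribute values, and the names `TableSpans` and `findTables` are invented for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TableSpans {
    // Matches either an opening <table ...> tag or a closing </table> tag.
    private static final Pattern TABLE_TAG =
            Pattern.compile("<table\\b[^>]*>|</table\\s*>", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns the source text of every table element,
    // in the order in which the elements are closed.
    static List<String> findTables(String html) {
        List<String> tables = new ArrayList<>();
        Deque<Integer> openStarts = new ArrayDeque<>(); // start offsets of unclosed tables
        Matcher m = TABLE_TAG.matcher(html);
        while (m.find()) {
            if (m.group().startsWith("</")) {
                if (!openStarts.isEmpty()) { // ignore stray closing tags
                    tables.add(html.substring(openStarts.pop(), m.end()));
                }
            } else {
                openStarts.push(m.start());
            }
        }
        return tables;
    }

    public static void main(String[] args) {
        String html = "<table id=\"table1\">\n</table>\n"
                + "<table id=\"table2\">\n  <table id=\"table3\">\n  </table>\n</table>\n";
        for (String table : findTables(html)) {
            System.out.println(table);
            System.out.println("----");
        }
    }
}
```

The counting has to happen in ordinary code because the pattern itself cannot keep track of how many tables are currently open. For anything beyond a quick one-off, a proper HTML parser is still the safer choice.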

fab