What is a proper regex to match all the variations of an HTML tag using Java?

Question

I am trying to practice my skills by putting a formatted HTML table into a Java matrix.

The problem is I am working with regexes and unfortunately they aren't working in the way I want.

For example, for the line:

<TD ALIGN="CENTER" colspan="14"><B class="useNavy">Computer Science</B><br></tr>

I am trying to "clean" the code by making TD ALIGN="CENTER" colspan="14" a plain td.

I use the following code where row contains that line:

row = row.replaceAll("<(td|TD)(.*)?>", "<td>");

I am expecting to get:

<td><B class="useNavy">Computer Science</B><br></tr>

But instead I get a single

<td>

What is wrong with my regex?

I thought I should tell the program to stop at the first match, but replaceFirst doesn't seem to work either.

I tried the following variations of the regex, but the same thing happens:

"<(td|TD).*>", "<(td|TD)(.*)>"

You should sharpen your regex skills on a tester site like http://regexpal.com/ or the like. You'll quickly see that your regex is too greedy. `.*` means to grab any character including the `>` character. — jmargolisvt, Sep 08 '15 at 03:07
Regex is such a difficult way to process html. Why not use an HTML parser instead? — e4c5, Sep 08 '15 at 03:19
[Don't Parse HTML With Regex](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — takendarkk, Sep 08 '15 at 03:28
What were you trying to accomplish with `(.*)?` ? `.*` means to match zero or more characters. `?` means that the pattern may or may not contain the thing you've applied it to, but it's useless here because the pattern will always contain zero or more characters at that point--there aren't any other possibilities! If you meant `?` as a "reluctant" quantifier, i.e. grab as few characters as necessary, then it has to go _inside_ the parentheses, directly after `*`. — ajb, Sep 08 '15 at 03:34
Everyone, I think it might be OK to use a regex in this case, since it only applies to an entity tag, and entity tags don't have a complex nested structure. Yes, we should discourage trying to use regexes on HTML in general. This may be one case where it's not a problem. — ajb, Sep 08 '15 at 03:35

score 1 · Accepted Answer · answered Sep 08 '15 at 03:36


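<(td|TD)[^>]*> should match every opening td tag in your document, attributes included.

[^>]* is the key part. It means "match as many characters as you can find that aren't the closing greater-than character", so the match stops at the first > instead of running on to the last one on the line.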
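
To see the difference on the row from the question, here is a minimal sketch. The class and method names (TdCleaner, cleanTd, cleanTdReluctant) are made up for illustration, and the CASE_INSENSITIVE flag is my own addition in place of (td|TD) so that mixed-case tags such as <Td> are covered too. The reluctant `.*?` variant mentioned in the comments is included for comparison.

```java
import java.util.regex.Pattern;

final class TdCleaner {
    // Negated character class: consumes everything up to the first '>' after "<td".
    // CASE_INSENSITIVE is an assumption added here; the answer uses (td|TD).
    private static final Pattern TD_OPEN =
            Pattern.compile("<td[^>]*>", Pattern.CASE_INSENSITIVE);

    // Replaces every opening td tag, whatever its attributes, with a bare <td>.
    static String cleanTd(String row) {
        return TD_OPEN.matcher(row).replaceAll("<td>");
    }

    // Same idea with a reluctant quantifier: .*? takes as few characters as possible.
    static String cleanTdReluctant(String row) {
        return row.replaceAll("<(td|TD).*?>", "<td>");
    }

    public static void main(String[] args) {
        String row = "<TD ALIGN=\"CENTER\" colspan=\"14\">"
                + "<B class=\"useNavy\">Computer Science</B><br></tr>";

        // Greedy .* runs to the LAST '>' on the line, so the whole row is replaced.
        System.out.println(row.replaceAll("<(td|TD)(.*)?>", "<td>"));
        // prints: <td>

        System.out.println(cleanTd(row));
        // prints: <td><B class="useNavy">Computer Science</B><br></tr>

        System.out.println(cleanTdReluctant(row));
        // prints: <td><B class="useNavy">Computer Science</B><br></tr>
    }
}
```

Both fixes give the same result on this row. The negated class is usually the safer choice for tags because it can never step over a > at all, whereas `.*?` can still be stretched past one when the rest of the pattern needs it to.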

answered Sep 08 '15 at 03:36

jmargolisvt

score 0 · Answer 2 · answered Sep 14 '15 at 23:19


Use this simple regex pattern:

String p = "(\\.td\\.B\\sclass.*)";

Hope this helps

answered Sep 14 '15 at 23:19

james jelo4kul
