This should be simple: I just want to extract some string values from Unicode HTML source.

The original source looks like this:

<div id="encompass">
    <tr class="lineonoff">
        <td class="xsmall">27</td>
        <td>DATE</td>
        <td class="left">TITLE</td>
        <td>STATUS</td>
        <td><a href="javascript:viewData(ID, '')" class="button purple small"><span>A</span></a></td>
    </tr>
    <tr class="lineonoff">
        <td class="xsmall">28</td>
        <td>DATE</td>
        <td class="left">TITLE</td>
        <td>STATUS</td>
        <td><a href="javascript:viewData(ID, '')" class="button purple small"><span>B</span></a></td>
    </tr>
    <tr class="lineonoff">
        <td class="xsmall">29</td>
        <td>DATE</td>
        <td class="left">TITLE</td>
        <td>STATUS</td>
        <td><a href="javascript:viewData(ID, '')" class="button purple small"><span>C</span></a></td>
    </tr>
</div>

I want to extract TITLE, DATE, STATUS, and ID.

I tried many variations of regular expressions, but none of them worked.

final Pattern pattern = Pattern.compile(PATTERN_STRING);
final Matcher matcher = pattern.matcher(result.toString());

How can I extract those values? Thank you!

klados
  • ...and particularly [its legendary answer](http://stackoverflow.com/a/1732454/115145). In short, don't use regular expressions. Parse the HTML with an HTML parser. A search on `java html parser` in a major search engine will turn up many options. – CommonsWare Apr 11 '15 at 17:47

1 Answer

First, you should not use a regex to parse HTML. Prefer a parser.

That said, something quick and dirty like this could do the job:

<tr[\s\S]*?class\="left">([^<]*)[\s\S]*?<td>([^<]*)[\s\S]*?viewData\(([^\(]*),

https://regex101.com/r/lZ6rE0/1
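On the sample rows, the pattern above captures TITLE, STATUS, and ID, but not DATE. Below is a minimal Java sketch of how such a pattern could be applied with `Pattern` and `Matcher`. It uses an adapted variant, not the pattern above: it adds a group for DATE and stops the ID group at the first comma. The class name `RowExtractor`, the method `extract`, and the sample values are hypothetical, and the variant assumes every row has the same cell order as the markup in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RowExtractor {
    // Adapted pattern (an assumption, not the pattern from the answer).
    // Groups: 1 = DATE, 2 = TITLE, 3 = STATUS, 4 = ID.
    private static final Pattern ROW = Pattern.compile(
        "<tr[\\s\\S]*?<td>([^<]*)</td>\\s*"          // first attribute-less <td> is DATE
        + "<td class=\"left\">([^<]*)</td>\\s*"      // TITLE cell
        + "<td>([^<]*)</td>"                         // STATUS cell
        + "[\\s\\S]*?viewData\\(([^,]*),");          // ID is the first argument of viewData

    // Returns one String[] {date, title, status, id} per matched row.
    static List<String[]> extract(String html) {
        List<String[]> rows = new ArrayList<>();
        Matcher m = ROW.matcher(html);
        while (m.find()) {
            rows.add(new String[] {
                m.group(1).trim(), m.group(2).trim(),
                m.group(3).trim(), m.group(4).trim()
            });
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up values in the same shape as the markup in the question.
        String html =
            "<tr class=\"lineonoff\">\n"
            + "  <td class=\"xsmall\">27</td>\n"
            + "  <td>2015-04-11</td>\n"
            + "  <td class=\"left\">Some title</td>\n"
            + "  <td>OPEN</td>\n"
            + "  <td><a href=\"javascript:viewData(101, '')\" class=\"button purple small\">"
            + "<span>A</span></a></td>\n"
            + "</tr>\n";
        for (String[] row : extract(html)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

Like any regex over HTML, this breaks as soon as the markup changes (extra attributes, reordered cells, nested tags), which is why a real HTML parser remains the safer choice.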

Gaël Barbin