A proper solution shouldn't involve regex at all; use an XML/HTML parser such as jsoup instead.
With jsoup your code could look like:
Document doc = Jsoup.connect("http://bacula.nti.tul.cz/~jan.hybs/ada/").get();
// select the second cell (index 1) of every table row
Elements personel = doc.select("tr td:eq(1)");
for (Element person : personel) {
    System.out.println(person.text());
}
select("tr td:eq(1)") finds every tr element and, inside each of them, the td whose sibling index equals 1 (counting from 0). So if a tr has 3 td elements, the middle one has index 1, and that is the one we are after.
Element#text() returns the text that the selected Element represents. For example, <td><a link="foo"> bar </a></td> is rendered as bar in a browser (with link decoration), and that is what text() returns.
But if you really MUST use regex (because someone is threatening you or your family), one idea is to focus not on the content itself but on the context that guarantees the content will be there. In your case it seems you can look for <a href="/zamestnanec/SOME_NUMBER">CONTENT</a> and select CONTENT.
So your regex can look like
String regex = "<a href=\"/zamestnanec/\\d+\">(.*?)</a>";
and all you need to do is extract whatever (.*?) matched (which is group 1).
So your code can look something like
String regex = "<a href=\"/zamestnanec/\\d+\">(.*?)</a>";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(yourHtml);
while (m.find()) {
    System.out.println(m.group(1));
}
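To try the snippet above without fetching the page, here is a minimal self-contained sketch. The class name EmployeeLinkExtractor, the helper extractNames, and the sample markup are all made up for illustration; the real page may be structured differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmployeeLinkExtractor {
    // Same pattern as in the answer: group 1 captures the link text.
    private static final Pattern LINK =
            Pattern.compile("<a href=\"/zamestnanec/\\d+\">(.*?)</a>");

    // Hypothetical helper: collects group 1 of every match into a list.
    static List<String> extractNames(String html) {
        List<String> names = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        // Invented sample rows, not copied from the real page.
        String html =
                "<tr><td>1</td><td><a href=\"/zamestnanec/12\">Jan Novak</a></td></tr>\n"
              + "<tr><td>2</td><td><a href=\"/kontakt\">Contact</a></td></tr>\n"
              + "<tr><td>3</td><td><a href=\"/zamestnanec/345\">Eva Mala</a></td></tr>";
        System.out.println(extractNames(html)); // prints [Jan Novak, Eva Mala]
    }
}
```

The /kontakt link is skipped because its href does not match the /zamestnanec/NUMBER context, which is the whole point of matching on context rather than on content.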
The ? in (.*?) makes * reluctant, so it tries to find the shortest possible match. This code will most likely work without that ?, since . by default can't match line separators, but if your HTML looked like
<a href="..">foo</a><a href="bar">bar</a>
then (.*) in the regex <a href="...">(.*)</a> would match
<a href="..">foo</a><a href="bar">bar</a>
             ^^^^^^^^^^^^^^^^^^^^^^^^
instead of
<a href="..">foo</a><a href="bar">bar</a>
             ^^^
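The difference between the greedy and the reluctant quantifier can be checked with a short sketch like the one below. The class GreedyVsReluctant and the helper firstGroup are hypothetical names, and [^"]* is only my stand-in for the "..." placeholder in the href.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsReluctant {
    // Hypothetical helper: group 1 of the first match, or null if there is none.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<a href=\"..\">foo</a><a href=\"bar\">bar</a>";

        // [^"]* replaces the "..." placeholder (an assumption for this demo).
        String greedy = "<a href=\"[^\"]*\">(.*)</a>";
        String reluctant = "<a href=\"[^\"]*\">(.*?)</a>";

        // Greedy: runs to the last </a> on the line.
        System.out.println(firstGroup(greedy, html));    // foo</a><a href="bar">bar
        // Reluctant: stops at the first </a>.
        System.out.println(firstGroup(reluctant, html)); // foo
    }
}
```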