Specific regex name with all degrees in front of name and behind name

Question

After long hours of trying to figure out how to write a particular regex, I realized I won't be able to solve this without help, since I am a novice in regular expressions. My task is to create a regex which will extract names with degrees from HTML source code.

The website is http://bacula.nti.tul.cz/~jan.hybs/ada/, where you can view the source code. I need to create a regex which will capture all names together with their degrees. The output should look something like this: prof. Ing. Josef Novak, Ph. D., etc. Simply put, everything from the column called "Propojeni" should be extracted.

Order is important for me. (I am putting the results into an ArrayList.)

I am able to write a regex for any single pattern, but not one that covers all of the patterns displayed in "Propojeni".

I would really appreciate any helpful answer.

Can you show what you have already written, and how they fall short in solving this problem? — Scott Hunter, Dec 30 '16 at 23:40
Mandatory link: http://stackoverflow.com/a/1732454. Use HTML parser instead of regex. Jsoup is quite nice and supports CSS selectors. — Pshemo, Dec 30 '16 at 23:56
(Ing|doc|prof)\.\s[A-Z]([a-z]+|\\s[a-z]+) I simply can't come up with a solution that handles the chained degrees at the beginning and at the end. — RickertBrandsen, Dec 30 '16 at 23:59
@Pshemo yes, but that was not my choice to be doing this with regex. — RickertBrandsen, Dec 31 '16 at 00:00
Whose choice was it, and why does whoever think it needs to be a regex? If this is for a work situation, then your manager should only care that the code does what it's supposed to do, not how you do it. Unless you're relying on some tool or library method that only accepts regexes, there should never be a _requirement_ to use a regex for any particular job. — ajb, Dec 31 '16 at 00:06
If multiple degrees is the only problem, you can use `+` like this: `((Ing|doc|prof)\.\s)+`. — ajb, Dec 31 '16 at 00:06
See like [this demo](https://regex101.com/r/ZDy3EZ/1) (for Java [try here, green button](http://fiddle.re/d47qqa)). — bobble bubble, Dec 31 '16 at 00:25

score 0 · Accepted Answer · edited May 23 '17 at 11:45

A proper solution shouldn't involve regex but an XML/HTML parser like jsoup.

With this tool your code could look like:

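import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

Document doc = Jsoup.connect("http://bacula.nti.tul.cz/~jan.hybs/ada/").get(); // may throw IOException
Elements personnel = doc.select("tr td:eq(1)"); // second cell of every row
for (Element person : personnel) {
    System.out.println(person.text());
}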

select("tr td:eq(1)") finds every tr element and, inside it, each td whose sibling index equals 1 (counting from 0). So if a tr has three td elements, the middle one has index 1, and that is the one we are after.

Element#text() returns the text that the selected Element represents. For example, <td><a href="foo"> bar </a></td> is rendered in a browser as bar (with link decoration), and that is what text() will return.

But if you really MUST use regex (because someone is threatening you or your family), then one idea is to focus not on the content itself, but on the context which guarantees that the content will be there. In your case it seems you can look for <a href="/zamestnanec/SOME_NUMBER">CONTENT</a> and select CONTENT.

So your regex can look like

String regex = "<a href=\"/zamestnanec/\\d+\">(.*?)</a>";

and all you need to do is extract the content of (.*?), which is group 1.

So your code can look something like

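import java.util.regex.Matcher;
import java.util.regex.Pattern;

String regex = "<a href=\"/zamestnanec/\\d+\">(.*?)</a>";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(yourHtml); // yourHtml: the page source as a String
while (m.find()) {
    System.out.println(m.group(1)); // group 1 is the link text
}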
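
Since the question mentions filling a list in order, here is a minimal self-contained sketch of the same approach that collects the matches instead of printing them. The class name NameExtractor, the method extractNames, and the sample HTML (including the second name) are made up for illustration; the real page's markup may differ, in which case the pattern would need adjusting.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NameExtractor {
    // Same pattern as above: group 1 is the text between the opening and closing tag.
    private static final Pattern LINK =
            Pattern.compile("<a href=\"/zamestnanec/\\d+\">(.*?)</a>");

    // Returns the link texts in the order they appear in the HTML.
    static List<String> extractNames(String html) {
        List<String> names = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            names.add(m.group(1).trim());
        }
        return names;
    }

    public static void main(String[] args) {
        // Hypothetical sample rows, not copied from the real page.
        String html =
                "<tr><td>1</td><td><a href=\"/zamestnanec/101\">prof. Ing. Josef Novak, Ph. D.</a></td></tr>\n"
              + "<tr><td>2</td><td><a href=\"/zamestnanec/102\">Ing. Jana Svobodova</a></td></tr>";
        for (String name : extractNames(html)) {
            System.out.println(name);
        }
    }
}
```

Because find() scans the input from left to right, the list preserves the order in which the names occur in the page source.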

The ? in (.*?) makes * reluctant, so it tries to find the shortest possible match. This code will most likely work without that ?, since by default . can't match line separators (so a match cannot run past the end of a line), but if your HTML looked like

<a href="..">foo</a><a href="bar">bar</a>

then (.*) in the regex <a href="...">(.*)</a> would match

<a href="..">foo</a><a href="bar">bar</a>
             ^^^^^^^^^^^^^^^^^^^^^^^^

instead of

<a href="..">foo</a><a href="bar">bar</a>
             ^^^
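
The difference can be checked with a small sketch like the one below. The class name GreedyVsReluctant and the helper firstGroup are hypothetical, and [^"]* stands in for the elided href value in the patterns above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GreedyVsReluctant {
    // Returns group 1 of the first match, or null when the regex does not match.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<a href=\"..\">foo</a><a href=\"bar\">bar</a>";

        // Greedy: (.*) runs to the LAST </a> on the line.
        System.out.println(firstGroup("<a href=\"[^\"]*\">(.*)</a>", html));
        // prints: foo</a><a href="bar">bar

        // Reluctant: (.*?) stops at the FIRST </a>.
        System.out.println(firstGroup("<a href=\"[^\"]*\">(.*?)</a>", html));
        // prints: foo
    }
}
```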
