After many hours of trying to work out a particular regex, I realized I won't be able to solve this without help, since I am a novice with regular expressions. My task is to create a regex that extracts names with degrees from HTML source code.

The website is http://bacula.nti.tul.cz/~jan.hybs/ada/, where you can view the source code. I need a regex that captures all names together with their degrees. The output should look something like prof. Ing. Josef Novak, Ph. D. and so on; in short, everything in the column called "Propojeni" should be extracted.

Order matters to me, because I am adding the results to an ArrayList.

I can write a regex for any single pattern, but not one that covers all of the patterns that appear in "Propojeni".

I would really appreciate any help.

RickertBrandsen
  • Can you show what you have already written, and how they fall short in solving this problem? – Scott Hunter Dec 30 '16 at 23:40
  • Mandatory link: http://stackoverflow.com/a/1732454. Use HTML parser instead of regex. Jsoup is quite nice and supports CSS selectors. – Pshemo Dec 30 '16 at 23:56
  • So far I have (Ing|doc|prof)\.\s[A-Z]([a-z]+|\\s[a-z]+) but I simply can't come up with a solution that handles the chained degrees at the beginning and at the end. – RickertBrandsen Dec 30 '16 at 23:59
  • @Pshemo Yes, but doing this with regex was not my choice. – RickertBrandsen Dec 31 '16 at 00:00
  • Whose choice was it, and why does whoever think it needs to be a regex? If this is for a work situation, then your manager should only care that the code does what it's supposed to do, not how you do it. Unless you're relying on some tool or library method that only accepts regexes, there should never be a _requirement_ to use a regex for any particular job. – ajb Dec 31 '16 at 00:06
  • If multiple degrees is the only problem, you can use `+` like this: `((Ing|doc|prof)\.\s)+`. – ajb Dec 31 '16 at 00:06
  • See for example [this demo](https://regex101.com/r/ZDy3EZ/1) (for Java [try here, green button](http://fiddle.re/d47qqa)). – bobble bubble Dec 31 '16 at 00:25
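
As an illustration of the repeated-group idea from the comments above, here is a minimal Java sketch. The class name `ChainedDegrees`, the helper `find`, the sample sentence, and the simplified name part (two capitalised ASCII words) are assumptions made for this example and are not taken from the page. Trailing degrees such as Ph. D. and names with diacritics are deliberately not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ChainedDegrees {
    // One or more leading titles, followed by two capitalised ASCII words.
    // The name part is a simplifying assumption for this sketch.
    private static final Pattern TITLED_NAME =
            Pattern.compile("((Ing|doc|prof)\\.\\s)+[A-Z][a-z]+\\s[A-Z][a-z]+");

    // Returns every full match, in the order it occurs in the text.
    static List<String> find(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = TITLED_NAME.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical input, not copied from the real page.
        String text = "prof. Ing. Josef Novak and Ing. Jana Svobodova";
        System.out.println(find(text));
        // prints [prof. Ing. Josef Novak, Ing. Jana Svobodova]
    }
}
```

The `+` after the group lets any number of titles be chained, which is the only change relative to the pattern in the question's comment.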

1 Answer

A proper solution shouldn't involve regex at all, but an XML/HTML parser such as jsoup.

With this tool your code could look like this:

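import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

Document doc = Jsoup.connect("http://bacula.nti.tul.cz/~jan.hybs/ada/").get();
Elements personnel = doc.select("tr td:eq(1)"); // second cell of each row
for (Element person : personnel) {
    System.out.println(person.text());
}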

select("tr td:eq(1)") finds all tr elements and, inside them, each td whose sibling index equals 1 (counting from 0). So if a tr has 3 td elements, the middle one has index 1, and that is the one we are after.

Element#text() returns the text that the selected Element represents. For example, <td><a link="foo"> bar </a></td> is rendered in a browser as bar (with link decoration), and that is what text() returns.


But if you really MUST use regex (because someone is threatening you or your family), then one idea is to focus not on the content itself but on the context that guarantees the content will be there. In your case it seems you can look for <a href="/zamestnanec/SOME_NUMBER">CONTENT</a> and select CONTENT.

So your regex can look like

String regex = "<a href=\"/zamestnanec/\\d+\">(.*?)</a>";

and all you need to do is extract the content of (.*?), which is group 1.

So your code can look something like

String regex = "<a href=\"/zamestnanec/\\d+\">(.*?)</a>";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(yourHtml);
while(m.find()){
    System.out.println(m.group(1));
}
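
Since the question mentions that order matters and that the results go into an ArrayList, the loop above can be wrapped as in the following sketch. The class name `NameExtractor`, the method `extractNames`, the `trim()` call, and the sample rows are assumptions for illustration; the markup of the real page may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NameExtractor {
    // Same pattern as in the answer: group 1 is the link text.
    private static final Pattern LINK =
            Pattern.compile("<a href=\"/zamestnanec/\\d+\">(.*?)</a>");

    // Collects group 1 of every match, in the order it occurs in the HTML.
    static List<String> extractNames(String html) {
        List<String> names = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            names.add(m.group(1).trim()); // trim() is an addition of this sketch
        }
        return names;
    }

    public static void main(String[] args) {
        // Hypothetical sample rows, not copied from the real page.
        String html =
                "<tr><td>1</td><td><a href=\"/zamestnanec/101\">prof. Ing. Josef Novak, Ph. D.</a></td></tr>\n"
              + "<tr><td>2</td><td><a href=\"/zamestnanec/102\">doc. Ing. Jana Svobodova, CSc.</a></td></tr>";
        for (String name : extractNames(html)) {
            System.out.println(name);
        }
        // prints:
        // prof. Ing. Josef Novak, Ph. D.
        // doc. Ing. Jana Svobodova, CSc.
    }
}
```

Matcher#find() scans the input from left to right, so the list preserves document order without any extra sorting.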

The ? in (.*?) makes * reluctant, so it tries to find the shortest possible match. This code will most likely work without that ?, since by default . can't match line separators, but if your HTML looked like

<a href="..">foo</a><a href="bar">bar</a>

then (.*) in the regex <a href="...">(.*)</a> would capture

<a href="..">foo</a><a href="bar">bar</a>
             ^^^^^^^^^^^^^^^^^^^^^^^^

instead of

<a href="..">foo</a><a href="bar">bar</a>
             ^^^
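
The difference can be checked with a small, self-contained sketch. The class name `GreedyVsReluctant` and the helper `firstGroup` are hypothetical, and `[^"]*` is used here as an assumed stand-in for the href placeholder so that the pattern matches any attribute value.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsReluctant {
    // Returns group 1 of the first match, or null when nothing matches.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<a href=\"..\">foo</a><a href=\"bar\">bar</a>";

        // Greedy: .* runs to the end of the line and backtracks to the last </a>.
        System.out.println(firstGroup("<a href=\"[^\"]*\">(.*)</a>", html));
        // prints foo</a><a href="bar">bar

        // Reluctant: .*? stops at the first </a> it can reach.
        System.out.println(firstGroup("<a href=\"[^\"]*\">(.*?)</a>", html));
        // prints foo
    }
}
```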
Pshemo