
This is the text:

<div class="center-content">  <h2> <a href="https://lapiedradesisifo.com/2019/11/04/la-silenciosa-linea-del-idioma-no-hablado/" class="l:3207185" > La silenciosa línea del idioma no hablado </a>

My code:

Pattern p = Pattern.compile("<div class=\"center-content\"> *<h2> <a.{10,200} >(.{50,200})</a>");
Matcher m = p.matcher(text);

StringBuilder sb = new StringBuilder();
while(m.find()){
    sb.append(m.group(1) + "\n");
}

System.out.println(sb.toString());

This is what I expected to be printed on the screen:

"La silenciosa línea del idioma no hablado"

But nothing is printed. I really don't know why, because I've tried it with similar examples and it works.

To be honest, I got this regex with some help and I still don't really understand how it works. I would really appreciate some help with this one.

  • Your regex captures 50 to 200 characters, but your expected result is less than 50 characters. – iakobski Nov 05 '19 at 22:20
  • [Don't parse HTML with Regex](https://meta.stackoverflow.com/questions/252385/why-do-parsing-html-with-regex-questions-come-up-so-often); [that's not something](https://stackoverflow.com/a/590789/740553) it can actually properly do. Avoid [baking in failure later](https://stackoverflow.com/a/1732454/740553) and instead use the right tool for the job, by using a proper HTML5-compliant parser like [JSoup](https://jsoup.org/). – Mike 'Pomax' Kamermans Nov 05 '19 at 22:31

2 Answers

0

The "." does not match newlines by default, and the HTML you want to parse seems to contain newlines.

You can use Pattern.compile("pattern", Pattern.DOTALL) to make "." match newlines too. Even with that flag, your regex will not match, for two reasons: "La silenciosa línea del idioma no hablado" is shorter than 50 characters, and there is a newline in the "center-content" part, which " *" cannot match because it only matches spaces. An online regex tester can help you spot problems like these.

https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#DOTALL
https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#compile(java.lang.String,%20int)
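
To illustrate the effect of the flag, here is a minimal sketch. The class name DotAllDemo, the helper matches, and the input string are made up for this example; only Pattern.compile(String, int) and Pattern.DOTALL come from the Java API.

```java
import java.util.regex.Pattern;

public class DotAllDemo {
    // Returns true if the whole input matches "<h2>.*</h2>" under the given flags.
    public static boolean matches(String input, int flags) {
        return Pattern.compile("<h2>.*</h2>", flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Made-up input: the heading text is surrounded by line breaks.
        String html = "<h2>\nTitle\n</h2>";

        // Without flags, "." stops at the first line terminator.
        System.out.println(matches(html, 0));              // false

        // With DOTALL, "." also matches line terminators.
        System.out.println(matches(html, Pattern.DOTALL)); // true
    }
}
```

Note that DOTALL only changes what "." matches. A literal space in the pattern still matches only a space, so a line break between two tags needs something like \s* instead.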

Daminox
0

As Mike pointed out in the comment, use a proper HTML parser for processing HTML input. However, if you are interested in how your regex works, I'll try to briefly describe it.

Current pattern

Your current pattern works as follows:

<div class=\"center-content\"> - matches literally <div class="center-content">

" *<h2> " - matches zero or more space characters, followed by <h2> and a single space

<a.{10,200} > - matches <a followed by any character (except a line terminator) 10 to 200 times, followed by a space and the character >

(.{50,200}) - matches any character 50 to 200 times and captures it into a group, which is what m.group(1) returns in your code. The link text in your sample is shorter than 50 characters, so this part cannot match and nothing is printed

</a> - matches </a> literally
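
The breakdown above can be checked with a small sketch. It runs the original pattern against the single-line sample from the question, then runs it again with the lower bound of the capturing group relaxed from 50 to 1. The class name QuantifierDemo, the helper firstTitle, and the choice of 1 as the new lower bound are assumptions made for illustration, not part of the original code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuantifierDemo {
    // Sample from the question, on one line. \u00ed is the accented i in "linea".
    public static final String TEXT =
        "<div class=\"center-content\">  <h2> <a href=\"https://lapiedradesisifo.com/2019/11/04/"
        + "la-silenciosa-linea-del-idioma-no-hablado/\" class=\"l:3207185\" >"
        + " La silenciosa l\u00ednea del idioma no hablado </a>";

    // Returns the trimmed text of group 1, or null if the pattern finds no match.
    public static String firstTitle(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String prefix = "<div class=\"center-content\"> *<h2> <a.{10,200} >";

        // Original quantifier: the link text is under 50 characters, so no match.
        System.out.println(firstTitle(prefix + "(.{50,200})</a>", TEXT)); // null

        // Relaxed lower bound (an assumption): the title is captured.
        System.out.println(firstTitle(prefix + "(.{1,200})</a>", TEXT));
    }
}
```

The captured group includes the spaces around the title, which is why the helper calls trim() before returning it.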

Simplified version

However, if your goal is just to capture the text wrapped within an <a> element, you can simplify your regex to <a\s+href=.*?>(.*?)</a> which works as follows:

<a\s+href= - matches <a, followed by one or more whitespace characters, followed by href=

.*?> - matches the rest of the opening <a> tag, including the URL (any character zero or more times, as few times as possible), followed by >

(.*?) - captures anything in between > and < (as few times as possible) - call .group(1) to get it

</a> - matches </a>
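
As a rough sketch of how the simplified pattern behaves, the following collects the text of every link in a string. The class name LinkTextDemo, the helper linkTexts, and the two-link input are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkTextDemo {
    // The simplified pattern from above, with backslashes escaped for a Java string.
    private static final Pattern LINK = Pattern.compile("<a\\s+href=.*?>(.*?)</a>");

    // Collects the trimmed text of every <a href=...> element found in the input.
    public static List<String> linkTexts(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            result.add(m.group(1).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up input with two links on one line.
        String html = "<a href=\"/one\">First</a> and <a href=\"/two\" class=\"x\" > Second </a>";
        System.out.println(linkTexts(html)); // [First, Second]
    }
}
```

Because both quantifiers are lazy, each match stops at the nearest > and the nearest </a>, so two links on the same line are captured separately instead of being merged into one match.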

jpact
  • Oh I see. Unfortunately I can't use the simplified version, because this is just part of the source code of a webpage, so there are a lot of other things that are also wrapped within "a" elements. But that doesn't matter anymore: now I get it, and it's thanks to you, so thank you very much! – I don't feel so good Nov 05 '19 at 22:50