This is an example of the string I want to extract data from:

<span style='display:block;margin-bottom:3px;'><a style='margin:4px;color:#B82933;font-size:120%' href='/cartelera/pelicula/18312'>Español  </a></span><br><span style='display:block;margin-bottom:3px;'><a style='margin:4px;color:#FBEBC4;font-size:120%' href='/cartelera/pelicula/18313'>Subtitulada  </a></span><br>          </div>

And this is the regular expression I'm using for it:

"pelicula/([0-9]*)'>([\\w\\s]*)</a>"

I tested this regular expression in RegexPlanet, and it turned out OK, it gave me the expected result:

group(1) = 18313
group(2) = Subtitulada

But when I try to implement that regular expression in Java, it won't match anything. Here's the code:

Pattern pattern = Pattern.compile("pelicula/([0-9]*)'>([\\w\\s]*)</a>");
Matcher matcher = pattern.matcher(inputLine);
while (matcher.find()) {
    version = matcher.group(2);
}

What's the problem? The regular expression has already been tested. In the same code I search for several other patterns, and I'm only having trouble with two of them (I'm showing just one here). Thank you in advance!

EDIT

I discovered the problem. If I check the source code of the page, it shows everything, but when I fetch it from Java, I get different source code. Why? Because the page asks for your city so that it can show information for that city. I don't know whether there's a workaround that would let me access the information I want, but that is the cause.

Pundia
  • How do you know that it's not matching? It looks like all you're doing is assigning the group to a variable (`version`). There's no output. Try this in your while loop, instead: `System.out.println(matcher.group());` – David Nov 16 '12 at 02:12
  • I do have a System.out.println, but I omitted it from the posted code for clarity. – Pundia Nov 16 '12 at 15:33
  • _consume from Java, it gets another sourcecode_? What exactly are you consuming in Java, and how? – mzzzzb Nov 17 '12 at 04:24

2 Answers

Your regex is correct, but in Java \w matches only ASCII word characters ([a-zA-Z_0-9]) by default, so it does not match the ñ in "Español".
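To make the point about \w concrete, here is a minimal sketch. The class and method names are made up for illustration, and the ñ is written as \u00F1 so the result does not depend on the source file encoding. It also shows one possible alternative to changing the pattern, assuming Java 7 or later: the Pattern.UNICODE_CHARACTER_CLASS flag, which makes \w Unicode-aware.

```java
import java.util.regex.Pattern;

public class WordClassDemo {
    // Hypothetical helper: true if the whole text consists only of
    // \w and \s characters under the given Pattern flags.
    static boolean allWordOrSpace(String text, int flags) {
        return Pattern.compile("[\\w\\s]*", flags).matcher(text).matches();
    }

    public static void main(String[] args) {
        // Plain ASCII label: matches with the default flags.
        System.out.println(allWordOrSpace("Subtitulada  ", 0));
        // Label containing ñ: does not match with the default flags.
        System.out.println(allWordOrSpace("Espa\u00F1ol  ", 0));
        // Same label with Unicode-aware character classes (Java 7+).
        System.out.println(allWordOrSpace("Espa\u00F1ol  ",
                Pattern.UNICODE_CHARACTER_CLASS));
    }
}
```

Whether the flag or a looser pattern is the better choice depends on how strictly the link text should be validated.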

I changed the regex to

"pelicula/([0-9]*)'>(.*?)</a>"

and it matches both occurrences. Here I've used the reluctant *? quantifier so that .* does not match everything between the first <a> and the last </a>. See "What is the difference between `Greedy` and `Reluctant` regular expression quantifiers?" for an explanation.
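The difference between the greedy and reluctant forms can be checked on a shortened copy of the sample line from the question. This is only a sketch: the class name, the labels helper, and the trimmed-down LINE constant (style attributes dropped) are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReluctantDemo {
    // Shortened version of the question's sample line (assumption:
    // the style attributes are irrelevant to the match).
    static final String LINE =
        "<a href='/cartelera/pelicula/18312'>Espa\u00F1ol  </a></span><br>"
      + "<a href='/cartelera/pelicula/18313'>Subtitulada  </a></span><br>";

    // Hypothetical helper: collects "id=label" for every match of the regex.
    static List<String> labels(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group(1) + "=" + m.group(2).trim());
        }
        return found;
    }

    public static void main(String[] args) {
        // Reluctant: stops at the nearest </a>, so both links are found.
        System.out.println(labels("pelicula/([0-9]*)'>(.*?)</a>", LINE));
        // Greedy: runs to the last </a>, so the two links collapse into one match.
        System.out.println(labels("pelicula/([0-9]*)'>(.*)</a>", LINE).size());
        // Original pattern: skips the label containing ñ.
        System.out.println(labels("pelicula/([0-9]*)'>([\\w\\s]*)</a>", LINE));
    }
}
```

With the original [\w\s]* pattern only the 18313 link is reported, which is consistent with the single result shown in the question.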

@Bohemian is correct in pointing out that you might also need to enable the Pattern.DOTALL flag if the text inside <a> contains line breaks.

mzzzzb

If your input spans several lines (i.e. it contains newline characters), you'll need to turn on "dot matches newline".

There are two ways to do this:

Use the "dot matches newline" regex switch (?s) in your regex:

Pattern pattern = Pattern.compile("(?s)pelicula/([0-9]*)'>([\\w\\s]*)</a>");

or use the Pattern.DOTALL flag in the call to Pattern.compile():

Pattern pattern = Pattern.compile("pelicula/([0-9]*)'>([\\w\\s]*)</a>", Pattern.DOTALL);
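A small sketch of what the switch changes, using a made-up multi-line input (the class name, the found helper, and the MULTILINE constant are assumptions for illustration). Note that (?s) and Pattern.DOTALL only alter the behaviour of the dot, so they matter for a pattern such as (.*?) but make no difference to a pattern like [\w\s]* that contains no dot; \s already matches newlines.

```java
import java.util.regex.Pattern;

public class DotAllDemo {
    // Made-up input in which the link text contains a line break.
    static final String MULTILINE =
        "<a href='/cartelera/pelicula/18313'>Subtitulada\n  </a>";

    // Hypothetical helper: true if the regex is found anywhere in the input.
    static boolean found(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        // Dot does not cross the newline by default: no match.
        System.out.println(found("'>(.*?)</a>", 0, MULTILINE));
        // With the DOTALL flag the dot matches the newline as well.
        System.out.println(found("'>(.*?)</a>", Pattern.DOTALL, MULTILINE));
        // The inline (?s) switch is equivalent to the flag.
        System.out.println(found("(?s)'>(.*?)</a>", 0, MULTILINE));
        // No dot in the pattern: \s matches the newline without any flag.
        System.out.println(found("'>([\\w\\s]*)</a>", 0, MULTILINE));
    }
}
```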
Bohemian
  • The input has several lines; it is the source code of a page. Even using what you suggested, it doesn't work. – Pundia Nov 16 '12 at 15:33
  • I'm applying the regular expressions line by line. As I said in the original post, I search for other things and those searches succeed, but this one does not, and the code is almost the same except for the pattern. I even checked that the pattern works, and it does (not in Java, but in a regular expression checker). – Pundia Nov 16 '12 at 15:38