2

I am looking for every URL that is linked as "eye" in an HTML document. I am using a regex pattern, because a simple contains check is not enough here. So I came up with a pattern like this:

Pattern: href=\"(https?://)?[a-zA-z0-9?/&=\"+-_\\.# ]*>[Ee]ye

It works, more or less, but it matches more than just the URLs linked as "Eye" or "eye". I also get URLs that are linked as "eyebrights" or "eyewears", and that's not what I want.

Is there any way to say "match this, but ignore it when the text continues beyond what I want"?

Pshemo
just_do_IT
  • 1
    To clarify, you want any URL whose text is exactly `Eye` or `eye`? Can you not match `</a>` after eye? – T. Kiley Sep 01 '15 at 10:47
  • Hmm... I'm not sure, but it sounds logical. Damn, I should have tried something like this. I will try it, thanks! – just_do_IT Sep 01 '15 at 11:10
  • 1
    Should `eye` be first word in link description or can it be placed in the middle of text like `blue eye`? – Pshemo Sep 01 '15 at 11:13
  • Yes, eye should be the first word. I tried that solution and it works, but I have some more cases where it's not enough :) So I preferred the \b solution :) – just_do_IT Sep 01 '15 at 11:25

2 Answers

2

You should try to avoid using regex to parse XML/HTML. Use an XML/HTML parser like jsoup instead. With this library your code could look like:

Document doc = Jsoup.parse(html); // html (assumed here) holds the page source as a String
Elements links = doc.select("a[href]:matches(^[eE]ye\\b)");
// Elements extends ArrayList<Element>, so you can iterate over it directly

More info at http://jsoup.org/cookbook/extracting-data/selector-syntax

Pshemo
1

Add \b after eye:

href=\"(https?://)?[a-zA-z0-9?/&=\"+-_\\.# ]*>[Ee]ye\\b

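\b asserts a position at a word boundary, so "eye" followed by a non-word character (or the end of the input) matches, while "eyebrights" does not. It is written as \\b above because the backslash has to be escaped inside a Java string literal.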
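To see the effect of \b in isolation, here is a minimal sketch using only java.util.regex. The class name EyeLinks, the method findEyeUrls, the example.com-style URLs, and the simplified pattern (which captures the href value with [^\"]* instead of the character class from the question) are all assumptions made for illustration, not the asker's actual code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EyeLinks {
    // Simplified, hypothetical pattern: capture the href value, skip any other
    // attributes up to '>', then require link text starting with "Eye"/"eye"
    // immediately followed by a word boundary.
    private static final Pattern EYE_LINK =
            Pattern.compile("href=\"([^\"]*)\"[^>]*>[Ee]ye\\b");

    // Returns the href values of all links whose text starts with the word "eye".
    static List<String> findEyeUrls(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = EYE_LINK.matcher(html);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://a.example/1\">Eye</a>"
                + "<a href=\"http://a.example/2\">eyebrights</a>"
                + "<a href=\"http://a.example/3\">eye care</a>";
        // "eyebrights" is skipped: there is no word boundary between "eye" and "b".
        System.out.println(findEyeUrls(html));
        // prints [http://a.example/1, http://a.example/3]
    }
}
```

Note that "eye care" still matches, since "eye" is the first word of the link text. Separately, the A-z range in the original character class also covers the characters between Z and a (such as [ and ^), and +-_ is read as a range, so A-Z and an escaped or trailing hyphen were probably intended.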

MC Emperor
Kerwin