
I need to get the value between an href's double quotes (") when it matches a specific pattern. I tried the pattern below, but I can't figure out what's wrong. When the pattern occurs multiple times on the same line, I get one huge group containing text that I don't want:

href="(/namehere/nane2here/(option1|option2).*)"

I need the group between the parentheses. This pattern repeats many times in the string, and all the occurrences are on the same line.

Example of a string I'm trying to get the values from:

<div>adasdsda<div>...lots of tags here... <a ... href="/name/name/option1/data1/data2"...anything here ...">src</a>...others HTML text here...<a ... href="/name/name/option2/data1"...
Wall-E
  • `[..]` is a character set, which matches only a *single* character from those listed inside it. For instance, `[abc]` can match a single `a`, `b`, or `c`, but not `abc`. So instead of the character set in `[option1|option2]` you were probably looking for a *group* like `(option1|option2)`. – Pshemo Jun 18 '20 at 22:01
  • @Pshemo, I tried it, but it didn't solve my problem. When the pattern occurs multiple times on the same line, I get one huge group containing text that I don't want. – Wall-E Jun 18 '20 at 22:03
  • Change `.*` to `[^\"]*`. –  Jun 18 '20 at 22:14
  • Is it not sufficient to capture all hrefs (i.e. `href=".+?"`, maybe even capturing the URL in a group) and then filter for what you're looking for? So 3 steps: pluck the URLs, filter the URLs, do your thingamaginga. – Pedro Rodrigues Jun 18 '20 at 22:16
  • Thanks, @saka1029, it seems to have solved my issue. – Wall-E Jun 18 '20 at 22:18
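
The two suggestions in the comments above (replacing `.*` with `[^"]*`, or capturing every href and filtering afterwards) could look roughly like the sketch below. It is only an illustration: the class name `HrefExtractor`, its method names, and the sample HTML are made up for this example, and the `/name/name/` prefix is taken from the sample string in the question rather than from real data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original question.
class HrefExtractor {
    // [^"]* cannot cross a closing quote, so each match ends at the end
    // of its own href value instead of running on to the last quote of the line.
    private static final Pattern DIRECT =
            Pattern.compile("href=\"(/name/name/(?:option1|option2)[^\"]*)\"");

    // Two-step variant, step 1: capture every double-quoted href value.
    private static final Pattern ANY_HREF = Pattern.compile("href=\"([^\"]*)\"");

    // Two-step variant, step 2: keep only values that start with the wanted prefix.
    private static final Pattern WANTED =
            Pattern.compile("/name/name/(?:option1|option2).*");

    static List<String> extractDirect(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = DIRECT.matcher(html);
        while (m.find()) {           // find() moves on to the next occurrence each time
            result.add(m.group(1));
        }
        return result;
    }

    static List<String> extractThenFilter(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = ANY_HREF.matcher(html);
        while (m.find()) {
            String value = m.group(1);
            if (WANTED.matcher(value).matches()) {
                result.add(value);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up input with three links on a single line.
        String html = "<div><a class=\"x\" href=\"/name/name/option1/data1/data2\">src</a>"
                + "<a href=\"/other/path\">skip</a>"
                + "<a href=\"/name/name/option2/data1\">dst</a></div>";
        System.out.println(extractDirect(html));
        System.out.println(extractThenFilter(html));
    }
}
```

With this input both methods return the same two values, `/name/name/option1/data1/data2` and `/name/name/option2/data1`. Neither pattern handles single-quoted or unquoted attributes, which is one reason the parser-based answer below is the safer route when a parser is available.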

3 Answers


First of all, don't use regex on an entire HTML structure. To learn why visit:

Instead, try to parse the HTML structure into an object representing the DOM, which lets us easily traverse all elements and find the ones we are interested in.

One of the easiest HTML parsers to use (IMO) is jsoup, found at https://jsoup.org/. A big plus is its support for CSS selector syntax for finding elements. It is described at https://jsoup.org/cookbook/extracting-data/selector-syntax where we can find

[attr~=regex]: elements with attribute values that match the regular expression; e.g.
img[src~=(?i)\.(png|jpe?g)]

In short, [attr~=regex] lets us find any element whose specified attribute has a value that the regex matches, even partially.

With this your code can look something like:

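import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

String yourHTML =
        "<div>" +
        "   <a href='abc/def/1'>foo</a>" +
        "   <a href='abc/fed/2'>bar</a>" +
        "   <a href='abc/ghi/3'>bam</a>" +
        "</div>";
// Parse the markup into a DOM instead of scanning it as raw text
Document doc = Jsoup.parse(yourHTML);
// Select only <a> elements whose href matches the regex
Elements elementsWithHref = doc.select("a[href~=^abc/(def|fed)]");
for (Element element : elementsWithHref) {
    String href = element.attr("href");
    System.out.println(href);
}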

Output:

abc/def/1
abc/fed/2

(notice that there is no abc/ghi/3 since ^abc/(def|fed) can't be found in it)

Pshemo
  • The problem is that I can't use an HTML parser, eheheheh :). I believe I could build my own, but I can't do it in the time I have. Thank you very much for the justification. – Wall-E Jun 18 '20 at 22:48
  • @Wally Well, to be honest, regex and HTML can work OK for simple HTML documents whose structure is always the same (or at least you know it very well and can handle its traps). But generally an HTML parser is the safer option. – Pshemo Jun 18 '20 at 22:57

Try "(?si)<[\\w:]+(?=(?:[^>\"']|\"[^\"]*\"|'[^']*')*?(?<=\\s)href\\s*=\\s*(?:(['\"])\\s*((?:(?!\\1).)*?/namehere/nane2here/(?:option1|option2)(?:(?!\\1).)*)\\s*\\1))\\s+(?:\".*?\"|'.*?'|[^>]*?)+>"


Features:

  • Finds a specific href value contained in any tag
  • Group 1 contains the delimiter
  • Group 2 contains the href value
  • It just uses regex to operate on tags in general, any that have `href="value"` inside. This regex has proven effective and is built from the tag definitions in standard HTML. –  Jun 18 '20 at 22:53

\b is used to match a word boundary:

href="(/namehere/nane2here/(\\boption1\\b|\\boption2\\b).*)"
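
To show what the word boundary changes, here is a small sketch. It is an assumption-laden variant rather than this answer's exact pattern: it places `\b` after the alternation, uses `[^"]*` instead of `.*` (as suggested in the comments on the question), and the class `WordBoundaryDemo`, the method `firstMatch`, and the sample inputs are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
class WordBoundaryDemo {
    // \b after the alternation requires the option name to end there,
    // so "option1" is accepted but "option10" is not.
    private static final Pattern HREF = Pattern.compile(
            "href=\"(/namehere/nane2here/(?:option1|option2)\\b[^\"]*)\"");

    // Returns the first captured href value, or null when nothing matches.
    static String firstMatch(String input) {
        Matcher m = HREF.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Made-up inputs for illustration.
        System.out.println(firstMatch("<a href=\"/namehere/nane2here/option1/data\">"));
        System.out.println(firstMatch("<a href=\"/namehere/nane2here/option10/data\">"));
    }
}
```

The first call prints `/namehere/nane2here/option1/data` and the second prints `null`, because there is no word boundary between the `1` and the `0` in `option10`. Note that `\b` on its own does not stop a greedy `.*` from running past the closing quote; that part is handled by `[^"]*`.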
Mehdi Ziat