
My regular expression is

<source media="(min-width: 0px)" sizes="70px" data-srcset="(.*?)"/>

The text I'm testing my regex with is

<source media="(min-width: 0px)" sizes="70px" data-srcset="https://static2.therichestimages.com/wordpress/wp-content/uploads/2014/05/52f81afc8b39c.jpg?q=50&amp;fit=crop&amp;w=70&amp;h=70 70w"/>

It does not detect a URL inside the data-srcset attribute.

My code is

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Regex {

    private static final String IMG_PREFIX =
            "<source media=\"(min-width: 0px)\" sizes=\"70px\" data-srcset=\"";
    private static final String IMG_SUFFIX = "\"/>";

    public static void main(String[] args) {
        String line = "<source media=\"(min-width: 0px)\" sizes=\"70px\" data-srcset=\"https://static1.therichestimages.com/wordpress/wp-content/uploads/2012/06/Michael-Bloomberg.jpg?q=50&amp;fit=crop&amp;w=70&amp;h=70 70w\"/>";

        Pattern pattern = Pattern.compile(IMG_PREFIX + "(.*?)" + IMG_SUFFIX);
        Matcher matcher = pattern.matcher(line);

        System.out.println(matcher.find());

    }
}

Edit: the production code is using this HTML source rather than just a single line.

VLAZ
  • try this: (?<=data-srcset=\")(.*)(?= ) as your regex – IWHKYB Aug 01 '18 at 20:38
  • That would work for that particular line, but I'm scraping a webpage and it uses data-srcset for adverts as well. –  Aug 01 '18 at 20:44
  • [Parsing HTML with a regular expression is going to backfire on you.](https://stackoverflow.com/questions/701166/can-you-provide-some-examples-of-why-it-is-hard-to-parse-xml-and-html-with-a-reg) Use a dedicated HTML parser instead. – VGR Aug 01 '18 at 23:13

1 Answer

EDIT

Change your Pattern to:

String regex = "<source media=\"\\(min-width: 0px\\)\" sizes=\"70px\" data-srcset=\"(.+)\"/>";

Pattern pattern = Pattern.compile(regex);

The problem is that your current regex contains parentheses that are meant as literal text, but they are not escaped, so the regex engine treats them as grouping operators instead of matching the "(" and ")" characters.

Specifically

(min-width: 0px)

Should be:

\(min-width: 0px\)

And in a Java string literal, where the backslash itself must also be escaped:

\\(min-width: 0px\\)
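
To see the difference in isolation, here is a minimal sketch (the class name ParenDemo and the helper found are made up for illustration). With unescaped parentheses the pattern looks for the text without any literal parentheses, so it cannot match the attribute as it appears in the HTML:

```java
import java.util.regex.Pattern;

public class ParenDemo {

    // Hypothetical helper: true if the regex matches anywhere in the text.
    public static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "media=\"(min-width: 0px)\"";

        // Unescaped: ( and ) form a capturing group, so the regex expects
        // media="min-width: 0px" with no parentheses in the input.
        System.out.println(found("media=\"(min-width: 0px)\"", text));

        // Escaped: \\( and \\) in the Java literal become \( and \) in the
        // regex, which match the literal parenthesis characters.
        System.out.println(found("media=\"\\(min-width: 0px\\)\"", text));
    }
}
```

The first call should print false and the second true.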

Example:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexExample {
    public static void main(String[] args) {
        String line = "<source media=\"(min-width: 0px)\" sizes=\"70px\" data-srcset=\"https://static1.therichestimages.com/wordpress/wp-content/uploads/2012/06/Michael-Bloomberg.jpg?q=50&amp;fit=crop&amp;w=70&amp;h=70 70w\"/>\n";
        // The literal parentheses are escaped; the only capturing group is (.+).
        String regex = "<source media=\"\\(min-width: 0px\\)\" sizes=\"70px\" data-srcset=\"(.+)\"/>";
        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(line);
        // Print the captured data-srcset value for every match.
        while (matcher.find()) {
            System.out.println(matcher.group(1));
        }
    }
}

The output I get:

https://static1.therichestimages.com/wordpress/wp-content/uploads/2012/06/Michael-Bloomberg.jpg?q=50&amp;fit=crop&amp;w=70&amp;h=70 70w
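
If you would rather keep the IMG_PREFIX and IMG_SUFFIX constants from the question unchanged, one alternative is to let Pattern.quote do the escaping. The sketch below is not the code above: it assumes the tags in the page match the prefix and suffix character for character, it swaps in a lazy group (.+?) so that several tags on one line are captured separately, and the class name, the extract helper, and the example.com URLs are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuotedPrefixDemo {

    public static final String IMG_PREFIX =
            "<source media=\"(min-width: 0px)\" sizes=\"70px\" data-srcset=\"";
    public static final String IMG_SUFFIX = "\"/>";

    // Pattern.quote wraps each constant so every character in it, including
    // the parentheses, is matched literally.
    private static final Pattern IMG_PATTERN = Pattern.compile(
            Pattern.quote(IMG_PREFIX) + "(.+?)" + Pattern.quote(IMG_SUFFIX));

    // Hypothetical helper: collects every data-srcset value whose tag
    // starts with IMG_PREFIX and ends with IMG_SUFFIX.
    public static List<String> extract(String html) {
        List<String> values = new ArrayList<>();
        Matcher matcher = IMG_PATTERN.matcher(html);
        while (matcher.find()) {
            values.add(matcher.group(1));
        }
        return values;
    }

    public static void main(String[] args) {
        // Made-up input: two 70px tags and one 300px tag on a single line.
        String html = IMG_PREFIX + "https://example.com/a.jpg 70w" + IMG_SUFFIX
                + "<source media=\"(min-width: 0px)\" sizes=\"300px\""
                + " data-srcset=\"https://example.com/ad.jpg 300w\"/>"
                + IMG_PREFIX + "https://example.com/b.jpg 70w" + IMG_SUFFIX;

        for (String value : extract(html)) {
            System.out.println(value);
        }
    }
}
```

With this input only the two 70px values should be printed, because the 300px tag does not start with IMG_PREFIX. The captured value still includes the trailing width descriptor (for example " 70w"), the same as in the output above.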
gtgaxiola
  • The reason why I was using the prefix and suffix is because I am scraping a webpage and it has advert pictures that display with the data-srcset attribute. –  Aug 01 '18 at 20:41
  • E.g. –  Aug 01 '18 at 20:41
  • But I wouldn't want my code to match that –  Aug 01 '18 at 20:42
  • I see, that is not the only text you are searching – gtgaxiola Aug 01 '18 at 20:42
  • The whole text is here view-source:https://www.therichest.com/top-lists/top-100-richest-celebrities/ –  Aug 01 '18 at 20:42
  • @RaeesSharif-Aamir updated answer – gtgaxiola Aug 01 '18 at 20:48
  • Unfortunately, that matches other images that aren't pictures of people. E.g. this. All pictures of people have the prefix IMG_PREFIX and the suffix IMG_SUFFIX. I just can't tell what my code is doing wrong –  Aug 01 '18 at 20:53
  • @RaeesSharif-Aamir How can you determine an image will be of people through regex? [See THIS ANSWER](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – gtgaxiola Aug 01 '18 at 20:54
  • The images will be of people when source media="(min-width: 0px)" sizes="70px". I'm just not entirely sure how to make the code extract the link only when the prefix is matched. –  Aug 01 '18 at 20:57
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/177231/discussion-between-raees-sharif-aamir-and-gtgaxiola). –  Aug 01 '18 at 20:58