I have the following code to extract all URLs (http only) from a page's source (String text):

private synchronized void addLinks(String text) {

    // Matches http:// URLs; the last character class keeps trailing punctuation out of the match
    String regex = "\\b(http)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]";

    Pattern urlPattern = Pattern.compile(regex);

    Matcher matcher = urlPattern.matcher(text);
    while (matcher.find()) {

        // group() returns the whole matched URL
        String urlStr = matcher.group();

        // do something
    }
}

I need to extend the regex so that it matches only URLs that link to text pages. Is that possible?

yaroslav
  • Can you clarify what you mean by "text pages"? – Aaron Jun 02 '16 at 09:22
  • The URL itself only hints at what you are about to get, so you cannot sort links reliably by URL alone. The number of possible endings is also very large, and you can get responses from URLs with no ending at all. If you want a proper filter, you will have to retrieve the linked document and actually look at its content. As I don't know how precise your filter has to be or how many URLs you expect, I can't say how big the impact on speed would be if you opened every document and looked into it... – Pepich1851 Jun 02 '16 at 09:23
  • An effective but costly way to make sure you only retrieve text content would be to execute a HEAD request on the extracted URLs and check the Content-Type header of the response. There are still a lot of different content types to check for, but at least you won't be fooled by resources able to generate different types of content. – Aaron Jun 02 '16 at 09:29
  • Didn't think of just asking for the header, that's a lot faster (I mean the header can still lie but at some point you gotta trust the server, right? :P) – Pepich1851 Jun 02 '16 at 09:32
  • Well, even the file could lie. I remember editing the first two bytes of zip files to trick enterprise mail servers into accepting them. But then they're not too useful to the regular user, and neither is a resource with an incorrect content type. – Aaron Jun 02 '16 at 09:38
  • Text pages: links that just point to some text data, like html or css. I just need to filter out links with specified endings. – yaroslav Jun 02 '16 at 10:41
  • In that case it's easier to make a whitelist rather than a blacklist. Allow php, html, css, js, aspx and no ending. I probably forgot some. If I'm not mistaken, it should be solved by appending "(\.htm|\.css|\.js|\.php|\.aspx)?" to the regex, but I am actually not sure, and the list will be missing a lot of possible endings. Just try around a bit :) – Pepich1851 Jun 02 '16 at 11:46
  • Just tried it :) What about `http://stackoverflow.com/questions/37585768/regex-to-find-all-urls-exluding-png-jpg-gif`? It has no extension, but it links to another page with some text on it. – yaroslav Jun 02 '16 at 12:11
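
The HEAD-request idea from the comments above could look roughly like the sketch below. It is only an illustration: the class name `ContentTypeCheck`, the helper names, the 5-second timeouts and the short list of accepted non-`text/` MIME types are all assumptions, not something given in the thread. It also covers extensionless URLs such as the Stack Overflow one, because the decision is based on the server's Content-Type header rather than on the URL.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.Locale;

class ContentTypeCheck {

    // Hypothetical helper: decides whether a Content-Type header value denotes text.
    // The accepted non-"text/" types are an assumed, incomplete list.
    static boolean isTextContentType(String contentType) {
        if (contentType == null) {
            return false; // the server sent no Content-Type header
        }
        // Drop parameters such as "; charset=utf-8" and normalise case.
        String mime = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return mime.startsWith("text/")
                || mime.equals("application/javascript")
                || mime.equals("application/xhtml+xml");
    }

    // Sends a HEAD request (headers only, no body) and inspects the Content-Type.
    static boolean linksToText(String urlStr) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(urlStr).toURL().openConnection();
        try {
            conn.setRequestMethod("HEAD");
            conn.setConnectTimeout(5000); // assumed timeout, in milliseconds
            conn.setReadTimeout(5000);
            return isTextContentType(conn.getContentType());
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) {
        // Offline demonstration of the header check only.
        System.out.println(isTextContentType("text/html; charset=utf-8"));
        System.out.println(isTextContentType("image/png"));
    }
}
```

Inside `addLinks`, each `urlStr` could then be passed to `linksToText(urlStr)`. This costs one network round trip per URL, and as noted above, the header can still be wrong.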

1 Answer

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NewC {
    public static void main(String[] args) {
        String URL_REGEX = "\\b((?:https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|].[^jpg][^png][^gif]$)";

        Pattern p = Pattern.compile(URL_REGEX);
        Matcher m = p.matcher(args[0]); // replace with the string to check
        if (m.find()) {
            System.out.println("String contains URL");
        }
    }
}

  • This does not catch .mp3, .wav and so on; you're only catching a few specific cases. The title says "excluding jpg png and gif", which you did, but in the question he further specified that he wants text pages exclusively, and I don't consider .mp3 a text page at all. I also just realised that your regex check is wrong: `[^jpg][^png][^gif]` only requires three characters, the first not being j, p or g, the second not p, n or g, and the third not g, i or f. It does not exclude the extensions as whole strings. And you didn't escape the dot. – Pepich1851 Jun 02 '16 at 09:17
  • I tried the code above and it doesn't work. I also tried the following code, `String regex = "(?!.*(?:\\.jpe?g|\\.gif|\\.png)$)\\b((http)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|])";`, from http://stackoverflow.com/questions/11198001/match-all-urls-exclude-jpg-gif-png, but again it doesn't solve my problem. – yaroslav Jun 02 '16 at 10:51
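
One likely reason the lookahead version does not help on a whole page is that `$` refers to the end of the input, not to the end of each URL, so it rarely excludes anything. A simpler route, sketched below under some assumptions, is to keep the question's regex for extraction and filter by extension in plain Java afterwards. The class name `TextLinkFilter`, the method names and the whitelist contents are hypothetical. URLs without any extension are kept, following the whitelist suggestion in the comments.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TextLinkFilter {

    // URL regex taken unchanged from the question.
    private static final Pattern URL_PATTERN = Pattern.compile(
            "\\b(http)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]");

    // Assumed whitelist of "text" extensions; extend as needed.
    private static final Set<String> TEXT_EXTENSIONS = new HashSet<>(
            Arrays.asList("htm", "html", "css", "js", "php", "aspx"));

    // True if the URL path has a whitelisted extension or no extension at all.
    static boolean hasTextExtension(String url) {
        // Ignore the query string and fragment.
        String withoutQuery = url.split("[?#]", 2)[0];
        // Find where the path starts, after "scheme://host".
        int schemeEnd = withoutQuery.indexOf("://");
        int pathStart = withoutQuery.indexOf('/', schemeEnd + 3);
        if (pathStart < 0) {
            return true; // host only, e.g. http://example.com
        }
        String lastSegment = withoutQuery.substring(withoutQuery.lastIndexOf('/') + 1);
        int dot = lastSegment.lastIndexOf('.');
        if (dot < 0) {
            return true; // no extension, treated as a page
        }
        String ext = lastSegment.substring(dot + 1).toLowerCase(Locale.ROOT);
        return TEXT_EXTENSIONS.contains(ext);
    }

    // Extracts every http URL from the text and keeps only the likely text links.
    static List<String> extractTextLinks(String text) {
        List<String> result = new ArrayList<>();
        Matcher matcher = URL_PATTERN.matcher(text);
        while (matcher.find()) {
            String url = matcher.group();
            if (hasTextExtension(url)) {
                result.add(url);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String sample = "see http://example.com/index.html and http://example.com/logo.png";
        System.out.println(extractTextLinks(sample));
    }
}
```

Keeping the extension check out of the regex makes the whitelist easy to change, but it is still only a guess based on the URL; an extensionless link can point to any kind of content.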