3

In Android/Java, given a website's HTML source code, I would like to extract all XML and CSV file paths.

What I am doing (with RegEx) is this:

final HashSet<String> urls = new HashSet<String>();
final Pattern urlRegex = Pattern.compile(
        "[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|].(xml|csv)");
final Matcher url = urlRegex.matcher(htmlString);
while (url.find()) {
    urls.add(makeAbsoluteURL(url.group(0)));
}

public String makeAbsoluteURL(String url) {
    if (url.startsWith("http://") || url.startsWith("http://")) {
        return url;
    }
    else if (url.startsWith("/")) {
        return mRootURL+url.substring(1);
    }
    else {
        return mBaseURL+url;
    }
}

Unfortunately, this runs for about 25 seconds for an average website with normal length. What is going wrong? Is my RegEx just bad? Or is RegEx just so slow?

Can I find the URLs faster without RegEx?

Edit:

The source for the valid characters was (roughly) this answer. However, I think the two character classes (square brackets) must be swapped so that you have a more limited character set for the first char of the URL and a broader character class for all remaining chars. This was the intention.

Community
  • 1
  • 1
caw
  • 30,999
  • 61
  • 181
  • 291
  • Can you try: `final static Pattern urlRegex = Pattern.compile( "\\S+\\.(?:xml|csv)");` and see if it improves speed? – anubhava Sep 29 '13 at 10:15
  • Are you really sure that the Regex needs all the time? What size does `htmlString` have? To speed up the edged @FabioDch answer seems tp be the most elegant one. But I doubt that this part of the code needs so Mich time. – jboi Sep 29 '13 at 13:54
  • Yes, I've verified this. It depends on the HTML source. For some web pages, it finishes in a few seconds, but I've found some where parsing with this RegEx takes 20 seconds. I guess it's due to the backtracking, which may be "trapped" in some extreme cases for certain pages. – caw Sep 29 '13 at 15:07
  • Sorry, that I stick on that. I still cannot believe, that a Parser can take 25 seconds to do something in Memory. I haven't seen something like this for the last 10 years. Could you talk a bit more about your test? How much did it take to download the page versus parsing it? What's the size of the HTML string? – jboi Sep 29 '13 at 16:04
  • @anubhava: There is no reason why this should improve speed. In any way, the Pattern is only compiled once. I don't see why making it `static` should improve speed. When I measured the execution time, there was only one instance of that class either. – caw Sep 30 '13 at 01:16
  • @jboi: I did not include the download time in that measurement, of course. It is only parsing. One of the pages that takes about 15-25 seconds has 180KB (HTML only). But what is important is the structure and the contents of that file rather than its pure size. If I have a 10MB file which hardly requires backtracking, the RegExp will be finished quite fast. – caw Sep 30 '13 at 01:17
  • @MarcoW.: More than making it `static final` I wanted to know if pattern `"\\S+\\.(?:xml|csv)"` is making any difference in speed. Is it possible to ascertain whether only above code is taking more time? – anubhava Sep 30 '13 at 03:16
  • @anubhava: Sorry, didn't pay attention to that part. So you're proposing the non-whitespace class `\S` instead of the long list of valid characters, right? This may be a bit faster, but it is not correct. For example, given a link `abc`, the new RegEx would also match the `"`, and then the `=`, and then the word `href`, wouldn't it? – caw Sep 30 '13 at 03:25
  • Yes it would match those characters. I was thinking of going backwards to make our character class shorter. So if we have list that we don't want to match then we can do for example: `[^\s\"=<>]+` – anubhava Sep 30 '13 at 03:33
  • This way, the list will eventually become even longer, probably. The length of the list does not change a lot for the speed, either, I think. It was probably just what U Mad noted below (the enormous backtracking). – caw Sep 30 '13 at 03:39
  • Too be honest I have even longer character classes There is no reason for enormous backtracking. But just to prove/disprove a theory if you can just test with over simplistic (though not correct) `[^\s\"=<>]+` we'll know where is the slowness coming from. – anubhava Sep 30 '13 at 03:43

6 Answers6

4

Your regex is written in a way that makes it slow for long inputs. The * operator is greedy.

For instance for input: http://stackoverflow.com/questions/19019504/regex-to-find-urls-in-html-takes-25-seconds-in-java-android.xml

The [-a-zA-Z0-9+&@#/%?=~_|!:,.;]* part of the regex will consume the whole string. It will then try to match the next character group, which will fail (since whole string is consumed). It will then backtrack in match of first part of the regex by one character and try to match the second character group again. It will match. Then it will try to match the dot and fail because the whole string is consumed. Another backtrack etc...

In essence your regex is forcing a lot of backtracking to match anything. It will also waste a lot of time on matches that have no way of succeeding.

For word forest it will first consume whole word in the first part of expression and then repeatedly backtrack after failing to match the rest of expression. Huge waste of time.

Also:

  • the . in regex is unescaped and it will match ANY character.
  • url.group(0) is redundant. url.group() has same meaning

In order to speed up the regex you need to figure out a way to reduce the amount of backtracking and it would also help if you had a less general start of the match. Right now every single word will cause matching to start and generally fail. For instance typically in html all the links are inside 2 ". If that's the case you can start your matching at " which will speed it up tremendously. Try to find a better start of the expression.

RokL
  • 2,663
  • 3
  • 22
  • 26
  • 1
    Thank you! First, the dot doesn't have have its special meaning inside of character classes, does it? http://stackoverflow.com/a/3675086/89818 Regarding the RegEx itself, I cannot make the start more specific. This is because I have to find relative paths (e.g. `folder/data.xml`), absolute paths (e.g. `/folder/data.xml`) and full URLs (e.g. `http://www.example.org/folder/data.xml`). You see, no possibility to make it specific. In my case, the end is what is really specific. Isn't that enough to reduce backtracking? Then I will inverse the string to search for `lmx.` at the beginning ;) – caw Sep 26 '13 at 23:36
  • So urls can be anywhere in the text? Not just in say `` elements? – RokL Sep 27 '13 at 07:48
  • No, because it is not only about clickable links but also about data sources such as in JavaScript snippets, in meta tags and so on. At least I thought it must be possible to find all URLs, given that I have the very specific ending of `\.(xml|csv)`. But if this was impossible to complete within 5 seconds via RegEx, I would probably have to restrict the search to `` elements. – caw Sep 27 '13 at 14:46
  • Oh well, I've just seen that you were probably talking about the second `.` before the file extension, right? I thought you were talking about the first dot in the character class. The second one, that you were referring to, must definitely be escaped, right? Maybe that will speed the search up a bit, already. – caw Sep 27 '13 at 14:47
  • Don't know how java/android handles laziness, but it may be worth a try. – atomman Sep 28 '13 at 21:54
  • Well, I've tried escaping the second dot, but this did not change anything, at least not significantly. It does still take a very long time to process the text with this expression. – caw Sep 29 '13 at 00:59
  • @MarcoW. Well for starter you can try using a reluctant quantifier, such as: `[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*?[-a-zA-Z0-9+&@#/%=~_|].(xml|csv)` Additionally if you know which characters can preface url you can use positive lookbehind to start the matching. E.g. let's say the character before the url can be either a space or a quote or a doublequote. `(?<=[\\s'\\"])[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*?[-a-zA-Z0-9+&@#/%=~_|].(xml|csv)` Alternatively you know that any regex cannot be prefaced by a letter or a number (e.g. abc.xsd, pointless to try to match bc.xsd which is prefaced by letter a +contd – RokL Oct 07 '13 at 11:46
  • because you know if axxxxxxxx didn't match as an url then xxxxxxx won't either. You can skip all matching of strings prefaced by a specific character with a negative lookbehind. I suggest you try this final regex and see if it's faster: `(?<![-a-zA-Z0-9+&@#/%=~_|])[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*?[-a-zA-Z0-9+&@#/%=~_|].(xml|csv)` Please try this regex, even if you've already found a solution, I'm really interested in how well it will do. – RokL Oct 07 '13 at 11:51
  • `(?<![-a-zA-Z0-9+&@#/%=~_|])[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]‌​\\.(xml|csv)` Edited to escape the dot and remove reluctant quant – RokL Oct 07 '13 at 12:03
3

I've nothing the say in the theoretical overview that U Mad did, he highlighted everything I'd noticed.

What I would like to suggest you, considering what are you look for with the RE, is to change the point of view of your RE :)

You are looking for xml and csv files, so why don't you reverse the html string, for example using:

new StringBuilder("bla bla bla foo letme/find.xml bla bla").reverse().toString()

after that you could look for the pattern:

final Pattern urlRegex = Pattern.compile(
    "(vsc|lmx)\\.[-a-zA-Z0-9+&@#/%=~_|][-a-zA-Z0-9+&@#/%?=~_|!:,.;]*");

urlRegex pattern could be refined as U Mad has already suggested. But in this way you could reduce the number of failed matches.

FabioDch
  • 516
  • 4
  • 8
  • Thank you! This is what came to my mind as well, so I suggested it here: http://stackoverflow.com/questions/19019504/regex-to-find-urls-in-html-takes-25-seconds-in-java-android?noredirect=1#comment28137910_19022509 But I have no idea if this is really the solution to go with, as you will have to reverse the (quite long) HTML string at the beginning and reverse all the results again at the end. I don't know how this will affect performance. – caw Sep 29 '13 at 13:48
  • I did a test by myself, on a PC I've found here, and the test proved that solution was 2 or 3 times faster (even considering the initial phase of reversing). Btw I haven't tested it on a real or simulated android environment, finger crossed :) – FabioDch Sep 29 '13 at 14:05
  • In a real Android environment, this reduced the execution time for the RegEx parsing from 15-20 seconds to 3-5 seconds in those extreme cases where the problem with massive backtracking occurred. Just one new problem is raised: For the first character (which is now the last one), I have special requirements and only want to allow `[a-zA-Z0-9\\_\\/\\?]`. But `[full_char_set]*[reduced_char_set]` does not give any results, because `full_char_set` includes `reduced_char_set` and matches _everything_ due to the `*`. Any way to solve that problem? – caw Sep 30 '13 at 02:16
  • An easy solution that worth to be tried is the use of non-greedy star operator: [full]*?[reduced]. See http://docs.oracle.com/javase/tutorial/essential/regex/quant.html the major issue for this solution is the that the number of backtrack could increase a lot for long URLs – FabioDch Sep 30 '13 at 04:39
  • Sorry, don't give a try of my previous comment ... On a second thought my feeling is that It'll not work due to non-greedy operator. But check out the lookbehind matcher: http://www.regular-expressions.info/lookaround.html . Java should support it, i don't know if android does the same. – FabioDch Sep 30 '13 at 05:03
  • Android and Java do support lookarounds, but how should I use it? Lookbehinds do not work: "Match every `reduced_set_char` that is preceded by a `full_set_char`? That won't work. And lookaheads neither: "Match every `full_set_char` that is followed by a `reduced_set_char` only". Will not work because except the last one of the `full_set_chars`, they do not need to be followed by a `reduced_set_char`. – caw Sep 30 '13 at 18:05
  • 1
    I made a quickly run the UnitTest that @jboi did. I used this expression: "(lmx|vsc)\\.[-a-zA-Z0-9+&@#/%=~_|][-a-zA-Z0-9+&@#/%?=~_|!:,.;]*(?<=[a-zA-Z0-9\\_\\/\\?])". I think it is what are you look for, isn't it? – FabioDch Sep 30 '13 at 18:22
1

Would suggest only using the regex to find file extensions (.xml or .csv). This should be a lot faster and when found, you can look backwards, examining each character before and stop when you reach one that couldn't be in a URL - see below:

final HashSet<String> urls = new HashSet<String>();
final Pattern fileExtRegex = Pattern.compile("\\.(xml|csv)");
final Matcher fileExtMatcher = fileExtRegex.matcher(htmlString);

// Find next occurrence of ".xml" or ".csv" in htmlString
while (fileExtMatcher.find()) {
    // Go backwards from the character just before the file extension
    int dotPos = fileExtMatcher.start() - 1;
    int charPos = dotPos;
    while (charPos >= 0) {
        // Break if current character is not a valid URL character
        char chr = htmlString.charAt(charPos);
        if (!((chr >= 'a' && chr <= 'z') ||
              (chr >= 'A' && chr <= 'Z') ||
              (chr >= '0' && chr <= '9') ||
              chr == '-' || chr == '+' || chr == '&' || chr == '@' ||
              chr == '#' || chr == '/' || chr == '%' || chr == '?' ||
              chr == '=' || chr == '~' || chr == '|' || chr == '!' ||
              chr == ':' || chr == ',' || chr == '.' || chr == ';')) {
            break;
        }
        charPos--;
    }

    // Extract/add URL if there are valid URL characters before file extension
    if ((dotPos > 0) && (charPos < dotPos)) {
        String url = htmlString.substring(charPos + 1, fileExtMatcher.end());
        urls.add(makeAbsoluteURL(url));
    }
}

Small disclaimer: I used part of your original regex for valid URL characters: [-a-zA-Z0-9+&@#/%?=~_|!:,.;]. Haven't verified if this is comprehensive and there are perhaps further improvements that could be made, e.g. it would currently find local file paths (e.g. C:\TEMP\myfile.xml) as well as URLs. Wanted to keep the code above simple to demonstrate the technique so haven't tackled this.

EDIT Following the comment about effiency I've modified to no longer use a regex for checking valid URL characters. Instead, it compares the character against valid ranges manually. Uglier code but should be faster...

Steve Chambers
  • 37,270
  • 24
  • 156
  • 208
  • 1
    Thank you! The only weakness I could see is that this solution creates new `String` objects in the inner loop (which will be executed countless times). A small benchmark did verify this: While your solution is twice as fast as mine from the question, the other solution here (with inverted strings) is again twice as fast as this one. – caw Sep 30 '13 at 02:45
  • Interesting! Have now edited my answer to only use `char` and do the character range manually rather than regex/string - would be interested to know how the test performance now compares... – Steve Chambers Sep 30 '13 at 20:40
  • 1
    Well done! When the lists of valid chars match for both solutions (RegEx and yours), yours is now twice as fast as the RegEx solution. That means it is 4x as fast as your old version that created new String objects all the time in the loop. One can now decide if clarity (RegEx) or speed (yours) is more important. Thank you! – caw Oct 01 '13 at 01:53
  • Good to hear it! Thanks for reporting back the results. – Steve Chambers Oct 01 '13 at 08:30
1

I had my doubts, if there can be a String really long enough to take 25 seconds for parsing. So I tried and must admit now, that with about 27MB of text, it takes around 25 seconds to parse it with the given regular expression.

Being curious I changed the little test program with @FabioDch's approach (so, please vote for him, if you want to vote anywhere :-)

The result is quite impressing: Instead of 25 Seconds, @FabioDch's approach needed less then 1 second (100ms to 800ms) + 70ms to 85ms for reversing!

Here's the code I used. It reads text from the largest text file I've found and copies it 10 time to get 27MB of text. Then runs the regex against it and prints out the results.

@Test
public final void test() throws IOException {
    final Pattern urlRegex = Pattern.compile("(lmx|vsc)\\.[-a-zA-Z0-9+&@#/%=~_|][-a-zA-Z0-9+&@#/%?=~_|!:,.;]*");
    printTimePassed("initialized");

    List<String> lines = Files.readAllLines(Paths.get("testdata", "Aster_Express_User_Guide_0500.txt"), Charset.defaultCharset());
    StringBuilder sb = new StringBuilder();
    for(int i=0; i<10; i++) { // Copy 10 times to get more useful data 
        for(String line : lines) {
            sb.append(line);
            sb.append('\n');
        }
    }
    printTimePassed("loaded: " + lines.size() + " lines, in " + sb.length() + " chars");
    String html = sb.reverse().toString();
    printTimePassed("reversed");

    int i = 0;
    final Matcher url = urlRegex.matcher(html);
    while (url.find()) {
        System.out.println(i++ + ": FOUND: " + new StringBuilder(url.group()).reverse() + ", " + url.start() + ", " + url.end());
    }
    printTimePassed("ready");
}

private void printTimePassed(String msg) {
    long current = System.currentTimeMillis();
    System.out.printf("%s: took %d ms\n", msg, (current - ms));
    ms = current;
}
jboi
  • 11,324
  • 4
  • 36
  • 43
  • Thank you very much for testing! First, I proposed this approach here [1] because U Mad noted [2] that you need something fixed at the beginning, while I have only something fixed at the end. In general, shouldn't you use `System.nanoTime()` instead of `System.currentTimeMillis()` for benchmarks? (Probably, this won't make a difference here, though). Apart from that, impressing results! [1] http://stackoverflow.com/questions/19019504/regex-to-find-urls-in-html-takes-25-seconds-in-java-android/19080733?noredirect=1#comment28137910_19022509 [2] http://stackoverflow.com/a/19022509/89818 – caw Sep 29 '13 at 17:16
  • My pattern is `(gpj|gepj)\\.[a-zA-Z0-9\\-\\.\\_\\~\\:\\/\\?\\[\\]\\@\\!\\$\\&\\(\\)\\*\\+\\,\\;\\=]+[a-zA-Z0-9\\_\\/\\?]` which works fine and increases the speed as intended. But unfortunately, the last group will never match because the one before includes the last one and matches everything (`*`). So I have to remove the last character class. But I want to enforce special requirements for the last (actually: first) character, how do I do it? – caw Sep 30 '13 at 02:20
  • @MarcoW. The second to nth character in the regex is a superset of the last character. Java should ensure that the last character matches the last `[]` in your Regex. Can you make an example with a test string? Maybe in a new question if there's not enough space in the comments or by editing your question. – jboi Sep 30 '13 at 06:16
  • You're right, I must have made some little mistake when I tried it the last time. Thank you! – caw Sep 30 '13 at 19:44
0

I know people love to use regex to parse html, but have you considered using jsoup?

Community
  • 1
  • 1
hooknc
  • 4,854
  • 5
  • 31
  • 60
  • I know that people advise against RegEx for HTML parsing, but as I don't want to parse the full HTML but only find certain words in it, it is fine here, I would say. Thanks for hinting at jsoup. I know it, but it comes with too much overhead if you just want to find URLs, which are not strictly related to HTML, anyway. I'm not parsing the HTML but trying to find URLs, which can also occur in JavaScript snippets etc. – caw Sep 29 '13 at 16:52
0

For sake of clarity I created a separate answer for this regex:

Edited to escape the dot and remove reluctant quant.

(?<![-a-zA-Z0-9+&@#/%=~_|])[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]‌\\​.(xml|csv)

Please try this one and tell me how it goes.

Also here's a class which will enable you to search a reversed string without actually reversing it:

    public class ReversedString implements CharSequence {
    public ReversedString(String input) {
        this.s = input;
        this.len = s.length();
    }
    private final String s;
    private final int len;
    @Override
    public CharSequence subSequence(final int start, final int end) {
        return new CharSequence() {
            @Override
            public CharSequence subSequence(int start, int end) {
                throw new UnsupportedOperationException();
            }

            @Override
            public int length() {
                return end-start;
            }

            @Override
            public char charAt(int index) {
                return s.charAt(len-start-index-1);
            }
            @Override
            public String toString() {
                StringBuilder buf = new StringBuilder(end-start);
                for(int i = start;i < end;i++) {
                    buf.append(s.charAt(len-i-1));
                }
                return buf.toString();
            }
        }; 
    }

    @Override
    public int length() {
        return len;
    }

    @Override
    public char charAt(int index) {
        return s.charAt(len-1-index);
    }

}

You can use this class as such:

pattern.matcher(new ReversedString(inputString));

RokL
  • 2,663
  • 3
  • 22
  • 26