
Suppose I had the brilliant idea of writing an HTML link-tag parser in order to explore the internet, and I use a regex to find and capture each occurrence of a link in a page. This code currently works fine, but I would like to add some members that reflect the "operation status".

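import java.util.ArrayList;
import java.util.Collection;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkScanner {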

    private static final Pattern hrefPattern = Pattern.compile("<a\\b[^>]*href=\"(.*?)\".*?>(.*?)</a>");

    public Collection<String> scan(String html) {
        ArrayList<String> links = new ArrayList<>();
        Matcher hrefMatcher = hrefPattern.matcher(html);
        while (hrefMatcher.find()) {
            String link = hrefMatcher.group(1);
            links.add(link);
        }
        return links;
    }
}

How can I measure this process?


For example, consider this hypothetical measurement implementation:

public class LinkScannerWithStatus {

    private int matched;
    private int total;

    public Collection<String> scan(String html) {
        ArrayList<String> links = new ArrayList<>();
        Matcher hrefMatcher = hrefPattern.matcher(html);
        total = hrefMatcher.getFindCount(); // hypothetical: Matcher has no getFindCount() method
        while (hrefMatcher.find()) {
            String link = hrefMatcher.group(1);
            links.add(link);
            matched++; // assumes progress grows linearly with each match
        }
        return links;
    }
}

I don't know where to start. I don't even know whether the phrase "Matcher processing" is grammatically valid :S

VLAZ
Victor
  • If you want a very sideways-thinking idea: implement the `CharSequence` interface and check which characters are requested from it to track the progress. Not sure it can be done cleanly though; if anybody calls `toString` on it you may lose track. If it can be done, it would be my preferred solution. – Maarten Bodewes Jun 29 '15 at 20:32
  • OK, implemented this, but I'm not sure if it is good enough; I may add another answer later, after some thought. – Maarten Bodewes Jun 29 '15 at 22:15
  • @MaartenBodewes It would be nice to see an example, if you have time, of course... I don't quite see what I can do with a `CharSequence` in this case, although you gave me an idea: knowing which part of the HTML the Matcher is currently processing. The method `hrefMatcher.end()` returns the end index of the previous match. Combined with the total size of the HTML (available through a simple `html.length()` call), I think that can be an inaccurate yet cheap solution for this case. – Victor Jun 30 '15 at 03:40

4 Answers


Unfortunately Matcher doesn't have a listener interface to measure progress. It would probably be prohibitively expensive to have one.
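Even without a listener, the offset of the last match gives a cheap, coarse estimate, as floated in the comments under the question: divide Matcher.end() by the input length. The sketch below is only an illustration under that assumption; the class name EndOffsetProgress and the IntConsumer callback are hypothetical, and the percentage only advances when a link is found, so a long stretch without links shows no movement.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntConsumer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: estimates progress from the end offset of each match.
final class EndOffsetProgress {

    private static final Pattern HREF =
            Pattern.compile("<a\\b[^>]*href=\"(.*?)\".*?>(.*?)</a>");

    static List<String> scan(String html, IntConsumer onPercent) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        int lastReported = -1;
        while (m.find()) {
            links.add(m.group(1));
            // end() is the offset just past the current match
            int percent = (int) (100L * m.end() / html.length());
            if (percent != lastReported) {
                lastReported = percent;
                onPercent.accept(percent);
            }
        }
        // No more matches: the rest of the input has been scanned.
        if (lastReported != 100) {
            onPercent.accept(100);
        }
        return links;
    }
}
```

A caller could pass something like `p -> System.out.println(p + "%")` as the callback.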

If you have the full page as a String instance, you can use Matcher.region(start, end) to restrict matching to one part of the page at a time, and scan these regions in sequence. You can then report to the user which part you are currently scanning. You may have to backtrack a bit so that the regions overlap; otherwise a link that straddles a region boundary would be missed.

If you do backtrack, you could optimize by using hitEnd() to check whether a match was still in progress when the region ended. If it wasn't, you don't need to backtrack.

One problem is that URLs are not really limited in size, so you need to decide what maximum URL size you care to support.
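The region-based scan could be sketched roughly as follows. This is a minimal illustration, not a tuned implementation: the class name ChunkedLinkScanner and the chunkSize, overlap and progress parameters are assumptions made for the example. It always backs up by the overlap (never before the end of the last match) and does not attempt the hitEnd() optimization. An anchor longer than the overlap that straddles a region boundary will be missed, which is exactly the size trade-off described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntConsumer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: scans the page region by region and reports the offset reached.
final class ChunkedLinkScanner {

    private static final Pattern HREF =
            Pattern.compile("<a\\b[^>]*href=\"(.*?)\".*?>(.*?)</a>");

    static List<String> scan(String html, int chunkSize, int overlap, IntConsumer progress) {
        if (overlap < 0 || chunkSize <= overlap) {
            throw new IllegalArgumentException("chunkSize must be larger than overlap");
        }
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        int start = 0;
        while (start < html.length()) {
            int end = Math.min(html.length(), start + chunkSize);
            m.region(start, end); // resets the matcher and limits it to [start, end)
            int lastMatchEnd = start;
            while (m.find()) {
                links.add(m.group(1));
                lastMatchEnd = m.end();
            }
            progress.accept(end); // characters covered so far
            if (end == html.length()) {
                break;
            }
            // Back up by the overlap, but never before the last match,
            // so that no link is reported twice.
            start = Math.max(lastMatchEnd, end - overlap);
        }
        return links;
    }
}
```

Because chunkSize must exceed overlap, each region starts later than the previous one, so the loop always terminates.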

If you create a good regular expression, you should not really have to report progress at all, unless you are processing truly huge files. Even in that case, the I/O should have more overhead than the scan for HTML anchors.

Maarten Bodewes
  • Well, thanks Maarten. I like your ideas for working around this, and your final advice about whether progress is a concern in this scenario. Nice of you to share that. I will try to follow these lines and report how it went. – Victor Jun 29 '15 at 03:51
  • Note that this is a direct answer. There's no parsing XML or HTML with regex without the [default warning for Ponies](http://stackoverflow.com/a/1732454/589259). Pablo's answer reflects this sentiment. – Maarten Bodewes Jun 29 '15 at 18:51
  • Never heard of that in my life until now, thanks for the info... I use regex and it works pretty well. I don't know if I am actually getting "all the links", but I am getting links. Confining myself to the original question: if there is no way to measure a Matcher's progress, I shall mark this answer as correct. – Victor Jun 29 '15 at 19:06
  • Regex is fine for just finding URL's, Victor, I'm just hinting at it because you will get answers indicating that you should parse HTML instead. Be aware that you don't expand into adding too much parsing to it though. That and it is a good laugh :) – Maarten Bodewes Jun 29 '15 at 19:07
  • @MaartenBodewes, that regex/HTML answer is probably the best one on the whole site. Gotta love __bobince__ – Pablo Fernandez Jun 29 '15 at 20:22

Performance and memory issues aside, you can use a DOM parser to evaluate the HTML; that way, you can perform a given action (such as updating a progress counter) while you walk the DOM.

Another possibility is to interpret the given HTML as XML and use SAX. This is efficient but assumes a structure that may not be there.
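As a rough illustration of the SAX route, the sketch below uses only the JDK's built-in SAX parser and collects the href attribute of every a element as the parser reports it. It assumes the input is well-formed XML (XHTML-like); ordinary HTML with unclosed tags or undeclared entities such as &nbsp; will make the parser throw. The class name SaxLinkScanner is made up for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects href values via SAX callbacks.
final class SaxLinkScanner {

    static List<String> scan(String xhtml) {
        List<String> links = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                // Called once per opening tag; this is the place to report progress too.
                if ("a".equalsIgnoreCase(qName)) {
                    String href = attrs.getValue("href");
                    if (href != null) {
                        links.add(href);
                    }
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
        return links;
    }
}
```

For real-world HTML, a lenient HTML parser would be needed instead; that is outside what this sketch covers.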

Pablo Fernandez
  • Hey Pablo Fernandez!! Thanks for the tip, dear friend. Hope to enjoy a pizza one of these days. It's a good tip... I will investigate. Thanks bro :D – Victor Jun 29 '15 at 18:15

As requested by Victor, I'll post another answer. In this case, CharSequence is implemented as a wrapper around another CharSequence. As the Matcher instance requests characters, the CountingCharSequence reports to a listener interface.

It's slightly dangerous to do this, as the CharSequence.toString() method returns a true String instance, which cannot be monitored. On the other hand, the current implementation is relatively simple and it does work. toString() is called, but that seems to be only to populate the groups once a match has been found. Better write some unit tests around it though.

Oh, and as I have to print the "100%" mark manually there is probably a rounding error or off-by-one error. Happy debugging :P

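import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collection;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegExProgress {

    // the original LinkScanner provided by Victor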
    public static class LinkScanner {
        private static final Pattern hrefPattern = Pattern.compile("<a\\b[^>]*href=\"(.*?)\".*?>(.*?)</a>");
        public Collection<String> scan(CharSequence html) {
            ArrayList<String> links = new ArrayList<>();
            Matcher hrefMatcher = hrefPattern.matcher(html);
            while (hrefMatcher.find()) {
                String link = hrefMatcher.group(1);
                links.add(link);
            }
            return links;
        }
    }

    interface ProgressListener {
        void listen(int characterOffset);
    }

    static class SyncedProgressListener implements ProgressListener {
        private final int size;
        private final double blockSize;
        private final double percentageOfBlock;

        private int block;

        public SyncedProgressListener(int max, int blocks) {
            this.size = max;
            this.blockSize = (double) size / (double) blocks - 0.000_001d;
            // percentage represented by one block, e.g. 10% when there are 10 blocks
            this.percentageOfBlock = 100d / (double) blocks;

            this.block = 0;
            print();
        }

        public synchronized void listen(int characterOffset) {
            if (characterOffset >= blockSize * (block + 1)) {
                this.block = (int) ((double) characterOffset / blockSize);
                print();
            }
        }

        private void print() {
            System.out.printf("%d%%%n", (int) (block * percentageOfBlock));
        }
    }

    static class CountingCharSequence implements CharSequence {

        private final CharSequence wrapped;
        private final int start;
        private final int end;

        private ProgressListener progressListener;

        public CountingCharSequence(CharSequence wrapped, ProgressListener progressListener) {
            this.wrapped = wrapped;
            this.progressListener = progressListener;
            this.start = 0;
            this.end = wrapped.length();
        }

        public CountingCharSequence(CharSequence wrapped, int start, int end, ProgressListener pl) {
            this.wrapped = wrapped;
            this.progressListener = pl;
            this.start = start;
            this.end = end;
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            // this may not be needed, as charAt() has to be called eventually
            System.out.printf("subSequence(%d, %d)%n", start, end);
            // both arguments are relative to this sequence, so shift both by this.start
            int newStart = this.start + start;
            int newEnd = this.start + end;
            progressListener.listen(newStart);
            return new CountingCharSequence(wrapped, newStart, newEnd, progressListener);
        }

        @Override
        public int length() {
            System.out.printf("length(): %d%n", end - start);
            return end - start;
        }

        @Override
        public char charAt(int index) {
            //System.out.printf("charAt(%d)%n", index);
            int realIndex = start + index;
            progressListener.listen(realIndex);
            return this.wrapped.charAt(realIndex);
        }

        @Override
        public String toString() {
            System.out.printf(" >>> toString() <<< %n");
            // return only the characters in [start, end), not the whole wrapped sequence
            return wrapped.subSequence(start, end).toString();
        }
    }

    public static void main(String[] args) throws Exception {
        LinkScanner scanner = new LinkScanner();
        String content = new String(Files.readAllBytes(Paths.get("regex - Java - How to measure a Matcher processing - Stack Overflow.htm")));
        SyncedProgressListener pl = new SyncedProgressListener(content.length(), 10);
        CountingCharSequence ccs = new CountingCharSequence(content, pl);
        Collection<String> urls = scanner.scan(ccs);
        // OK, I admit, this is because of an off-by-one error
        System.out.printf("100%% - %d%n", urls.size());

    }
}
Maarten Bodewes
  • Haha, thanks Maarten. I like your idea; it is very clever to track the Matcher's character consumption through the "decorated" `CountingCharSequence`. I don't know how effective it will be in terms of measurement, but I'll definitely study it and try it! – Victor Jul 01 '15 at 14:39

So, to measure your progress through a document, you want to find the total number of matches first; then, as you go match by match, you update the progress and add each link to your stored list of links.

You can count the total number of matches with Apache Commons Lang: int countMatches = StringUtils.countMatches(text, target);

So just look for the String "href", or maybe the anchor tag or some other component of a link. You will then have a hopefully accurate picture of how many links there are, and you can parse them one by one. It's not ideal, because countMatches doesn't accept a regex as the target parameter.
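A minimal sketch of this two-pass idea follows. To stay self-contained it swaps in a small indexOf-based counter in place of StringUtils.countMatches, so the class name ApproximateProgressScanner, the countOccurrences helper and the BiConsumer callback are all assumptions of the example. Note that the total is only an estimate: any "href=" outside an anchor (for instance in a link element) inflates it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: estimates the total up front, then reports matched/total.
final class ApproximateProgressScanner {

    private static final Pattern HREF =
            Pattern.compile("<a\\b[^>]*href=\"(.*?)\".*?>(.*?)</a>");

    // Counts non-overlapping occurrences of target in text (stand-in for countMatches).
    static int countOccurrences(String text, String target) {
        if (target.isEmpty()) {
            return 0;
        }
        int count = 0;
        int i = text.indexOf(target);
        while (i >= 0) {
            count++;
            i = text.indexOf(target, i + target.length());
        }
        return count;
    }

    static List<String> scan(String html, BiConsumer<Integer, Integer> progress) {
        int total = countOccurrences(html, "href="); // rough upper estimate
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        int matched = 0;
        while (m.find()) {
            links.add(m.group(1));
            matched++;
            progress.accept(matched, total);
        }
        return links;
    }
}
```

The cost is a second pass over the input, which is cheap for a plain substring search compared with the regex scan itself.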

Nick Anderson
  • Well, it is a good approximation method... not my personal choice, but good to have just in case. One never knows. Thanks. – Victor Jun 29 '15 at 03:58