
I am trying to parse robots.txt and check whether a URL is allowed, in Java. I arranged the allowed and disallowed parts of robots.txt in lists. I thought that simply using Java's url_string.equals() would be sufficient for matching a URL, but robots.txt also uses a dollar sign ($) to test whether the URL ends with a given pattern, and an asterisk (*) to match any sequence of characters. Here is the asterisk matching function that I am using:

// Returns true if 'str' matches 'pattern' exactly, where '*' in the
// pattern matches any (possibly empty) sequence of characters.
public boolean asteriskWildcardMatch(String str, String pattern) {
    int n = str.length();
    int m = pattern.length();
    if (m == 0) { return n == 0; }

    // matchLookup[i][j] is true if the first i chars of str match
    // the first j chars of pattern. Java initializes boolean arrays
    // to false, so no explicit fill is needed.
    boolean[][] matchLookup = new boolean[n + 1][m + 1];

    // Empty string matches empty pattern.
    matchLookup[0][0] = true;

    // An empty string can still match a leading run of '*'s.
    for (int j = 1; j <= m; j++) {
        if (pattern.charAt(j - 1) == '*') {
            matchLookup[0][j] = matchLookup[0][j - 1];
        }
    }

    for (int i = 1; i <= n; i++) {
        for (int j = 1; j <= m; j++) {
            if (pattern.charAt(j - 1) == '*') {
                // '*' either matches nothing (drop the '*') or
                // consumes one more character of str.
                matchLookup[i][j] = matchLookup[i][j - 1] || matchLookup[i - 1][j];
            } else if (str.charAt(i - 1) == pattern.charAt(j - 1)) {
                // Literal characters must match exactly.
                matchLookup[i][j] = matchLookup[i - 1][j - 1];
            } else {
                matchLookup[i][j] = false;
            }
        }
    }
    return matchLookup[n][m];
}
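For example, with rules that appear in the robots.txt below (the test URLs are made-up inputs of mine):

asteriskWildcardMatch("/u/12345/about", "/u/*/about");    // true:  '*' covers "12345"
asteriskWildcardMatch("/?hl=en&", "/?hl=*&");             // true:  '*' covers "en"
asteriskWildcardMatch("/u/12345/settings", "/u/*/about"); // false: "/about" is never matched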

robots.txt:

User-agent: *
Disallow: /search
Allow: /search/about
Allow: /search/static
Allow: /search/howsearchworks
Disallow: /sdch
Disallow: /groups
Disallow: /index.html?
Disallow: /?
Allow: /?hl=
Disallow: /?hl=*&
Allow: /?hl=*&gws_rd=ssl$
Disallow: /?hl=*&*&gws_rd=ssl
Allow: /?gws_rd=ssl$
Allow: /?pt1=true$
Disallow: /imgres
Disallow: /u/*/about
Disallow: /app/comments$
Allow: /articles/*-admin$
Disallow: /preferences
Disallow: /setprefs

The code works well. But I am confused about how to write the dollar sign matching function for patterns that must match the end of the URL, and there are probably bugs in my function as well. A rule can contain many asterisks, but a dollar sign can only appear at the end. Can anyone help by suggesting a code snippet or Java regex code?
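One direction I am exploring is to translate each rule into a java.util.regex Pattern: quote the literal parts, turn every * into .*, and treat a trailing $ as the end-of-string anchor. Since a rule without a trailing $ matches any URL that starts with the pattern, it behaves as if it ended with *. A rough sketch of that idea (robotsPatternToRegex and the checks in main are my own, not from any library):

import java.util.regex.Pattern;

public class RobotsMatcher {

    // Translates a robots.txt rule into an anchored regex.
    // '*' -> ".*" (any sequence of characters)
    // '$' -> end-of-string anchor (only meaningful at the end of the rule)
    // Everything else is quoted so characters like '?' stay literal.
    static Pattern robotsPatternToRegex(String rule) {
        StringBuilder regex = new StringBuilder("^"); // rules match from the start of the path
        StringBuilder literal = new StringBuilder();
        boolean endAnchor = false;

        for (int i = 0; i < rule.length(); i++) {
            char c = rule.charAt(i);
            if (c == '*') {
                if (literal.length() > 0) {
                    regex.append(Pattern.quote(literal.toString()));
                    literal.setLength(0);
                }
                regex.append(".*");
            } else if (c == '$' && i == rule.length() - 1) {
                endAnchor = true; // '$' only counts at the very end of the rule
            } else {
                literal.append(c);
            }
        }
        if (literal.length() > 0) {
            regex.append(Pattern.quote(literal.toString()));
        }
        if (!endAnchor) {
            regex.append(".*"); // without '$' the rule is a prefix match
        }
        regex.append("$");
        return Pattern.compile(regex.toString());
    }

    public static void main(String[] args) {
        // A few checks against the rules above:
        System.out.println(robotsPatternToRegex("/?hl=*&gws_rd=ssl$")
                .matcher("/?hl=en&gws_rd=ssl").matches());   // true
        System.out.println(robotsPatternToRegex("/u/*/about")
                .matcher("/u/12345/about?tab=1").matches()); // true (prefix match)
        System.out.println(robotsPatternToRegex("/app/comments$")
                .matcher("/app/comments/1").matches());      // false ('$' anchors the end)
    }
}

Is this a reasonable replacement for the dynamic-programming matcher above, or am I missing an edge case?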

Thanks in advance.

  • Parsing robots text files is discussed [here](https://stackoverflow.com/questions/19332982/parsing-robot-txt-using-java-and-identify-whether-an-url-is-allowed). For example, see [this](https://github.com/crawler-commons/crawler-commons) - does any of that help? – andrewJames Mar 25 '20 at 14:16
  • I have already seen these. They have bunches of bugs, so I decided to make my own robots.txt parser. –  Mar 25 '20 at 14:19

0 Answers