
I am trying to crawl with Nutch 1.17, but the URL is being rejected because it contains #!, for example: xxmydomain.com/xxx/#!/xxx/abc.html

I have also tried adding the following rules to my regex-urlfilter:

+^/

+^#!

Aks
  • try adding this rule +(?i).* – kavetiraviteja Sep 19 '20 at 18:42
  • @kavetiraviteja it is accepting all the URLs on the page, but not the relative URLs containing /#!/ – Aks Sep 20 '20 at 05:45
  • I am trying to crawl the URL below. If you view its source there are a few absolute and relative URLs; I am able to crawl the absolute URLs but not the relative ones. https://www.codepublishing.com/CA/AlpineCounty/#!/AlpineCounty01/AlpineCounty01.html – Aks Sep 20 '20 at 05:52

1 Answer

  1. First, check the regex-normalize.xml file. This rule file is applied as part of the urlnormalizer-regex plugin, which is included by default in plugin.includes in nutch-site.xml.

As part of URL normalization, the following rule truncates the URL, removing everything from the fragment onward:

<!-- removes interpage href anchors such as site.com#location -->
<regex>
  <pattern>#.*?(\?|&amp;|$)</pattern>
  <substitution>$1</substitution>
</regex>

You can disable this rule by commenting it out (the recommended way), or you can remove urlnormalizer-regex from the plugin.includes configuration in nutch-site.xml. A sketch of the commented-out rule is shown below.
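
A minimal sketch of the commented-out rule in conf/regex-normalize.xml (only this block is disabled; the rest of the file stays as it is):

<!-- removes interpage href anchors such as site.com#location
     (disabled so that #! fragment URLs survive normalization)
<regex>
  <pattern>#.*?(\?|&amp;|$)</pattern>
  <substitution>$1</substitution>
</regex>
-->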

  2. There is one more place where the URL fragment is dropped during normalization: the urlnormalizer-basic plugin.

BasicURLNormalizer applies general normalization to URLs (i.e. removing repeated slashes and properly percent-encoding the path):

  public String normalize(String urlString, String scope)
      throws MalformedURLException {

    if ("".equals(urlString)) // permit empty
      return urlString;

    urlString = urlString.trim(); // remove extra spaces

    URL url = new URL(urlString);

    String protocol = url.getProtocol();
    String host = url.getHost();
    int port = url.getPort();
    String file = url.getFile();

    boolean changed = false;
    boolean normalizePath = false;

    if (!urlString.startsWith(protocol)) // protocol was lowercased
      changed = true;

    if ("http".equals(protocol) || "https".equals(protocol)
        || "ftp".equals(protocol)) {

      if (host != null && url.getAuthority() != null) {
        String newHost = normalizeHostName(host);
        if (!host.equals(newHost)) {
          host = newHost;
          changed = true;
        } else if (!url.getAuthority().equals(newHost)) {
          // authority (http://<...>/) contains other elements (port, user,
          // etc.) which will likely cause a change if left away
          changed = true;
        }
      } else {
        // no host or authority: recompose the URL from components
        changed = true;
      }

      if (port == url.getDefaultPort()) { // uses default port
        port = -1; // so don't specify it
        changed = true;
      }

      normalizePath = true;
      if (file == null || "".equals(file)) {
        file = "/";
        changed = true;
        normalizePath = false; // no further path normalization required
      } else if (!file.startsWith("/")) {
        file = "/" + file;
        changed = true;
        normalizePath = false; // no further path normalization required
      }

      if (url.getRef() != null) { // remove the ref
        changed = true;
      }

    } else if (protocol.equals("file")) {
      normalizePath = true;
    }

    // properly encode characters in path/file using percent-encoding
    String file2 = unescapePath(file);
    file2 = escapePath(file2);
    if (!file.equals(file2)) {
      changed = true;
      file = file2;
    }

    if (normalizePath) {
      // check for unnecessary use of "/../", "/./", and "//"
      if (changed) {
        url = new URL(protocol, host, port, file);
      }
      file2 = getFileWithNormalizedPath(url);
      if (!file.equals(file2)) {
        changed = true;
        file = file2;
      }
    }

    if (changed) {
      url = new URL(protocol, host, port, file);
      urlString = url.toString();
    }

    return urlString;
  }

You can see from the code that it completely ignores the **url.getRef** information, which contains the URL fragment.

So what we can do is simply replace

url = new URL(protocol, host, port, file);

at the end of the normalize method

with url = new URL(protocol, host, port, file + "#" + url.getRef());
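
For illustration, a minimal sketch of how the end of normalize could look after that change. The null guard is an extra precaution added here (not part of the original suggestion), since url.getRef() returns null when the URL has no fragment and we do not want to append "#null":

    if (changed) {
      String ref = url.getRef();
      if (ref != null) {
        // retain the fragment (e.g. #!/AlisoViejo01/AlisoViejo01.html) when recomposing the URL
        url = new URL(protocol, host, port, file + "#" + ref);
      } else {
        url = new URL(protocol, host, port, file);
      }
      urlString = url.toString();
    }

    return urlString;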

How did I validate this?

scala> val url = new URL("https://www.codepublishing.com/CA/AlisoViejo/#!/AlisoViejo01/AlisoViejo01.html");
url: java.net.URL = https://www.codepublishing.com/CA/AlisoViejo/#!/AlisoViejo01/AlisoViejo01.html

scala> val protocol = url.getProtocol();
protocol: String = https

scala>     val host = url.getHost();
host: String = www.codepublishing.com

scala>     val port = url.getPort();
port: Int = -1

scala>     val file = url.getFile();
file: String = /CA/AlisoViejo/

scala> // when we construct a new URL from the above information, we end up losing the fragment, as shown below

scala> new URL(protocol, host, port, file).toString
res69: String = https://www.codepublishing.com/CA/AlisoViejo/

scala> // if we use the url.getRef information when constructing the URL, we can retain the fragment,

scala> // as shown below

scala> new URL(protocol, host, port, file+"#"+url.getRef).toString
res70: String = https://www.codepublishing.com/CA/AlisoViejo/#!/AlisoViejo01/AlisoViejo01.html

scala> // so we can change the URL construction as explained above to retain the URL fragment information

Note: a URL fragment is a local reference within a page, so in most cases it does not make sense to crawl such URLs separately (which is why Nutch normalizes them away with the rule above): the HTML returned is the same.

kavetiraviteja
  • Hi, I had commented out the regex normalize rule earlier, and as per your suggestion I also removed it from nutch-site.xml, but I am still not able to crawl the pages. Below is the seed URL: https://www.codepublishing.com/CA/AlisoViejo/#!/AlisoViejo01/AlisoViejo01.html I have noticed that when I crawl, even though my URL goes up to .html, only https://www.codepublishing.com/CA/AlisoViejo/ is taken – Aks Sep 21 '20 at 06:45
  • @Aks please share your plugin.includes conf ... I will check it – kavetiraviteja Sep 21 '20 at 07:21
  • plugin.includes protocol-http|urlfilter-(regex|validator)|parse-(html|tika)|index-(basic|anchor|static|replacefacet|split)|indexer-solr|scoring-opic|urlnormalizer-(pass|basic) – Aks Sep 21 '20 at 10:49
  • @Aks I have checked by running the code of BasicURLNormalizer. BasicURLNormalizer is also removing the URL fragment part. I will update my answer with the finding. – kavetiraviteja Sep 21 '20 at 11:56
  • Thank you. As per your suggestion I was able to modify the BasicURLNormalizer and crawl the URL with #!. – Aks Sep 23 '20 at 07:32