I am using crawler4j to crawl user profiles on GitHub. For instance, I want to crawl the URL https://github.com/search?q=java+location:India&p=1. For now, I am adding this hard-coded URL as a seed in my crawler controller, like this:

    String url = "https://github.com/search?q=java+location:India&p=1";
    controller.addSeed(url);

When crawler4j starts, the URL it actually crawls is https://github.com/search?q=java%2Blocation%3AIndia&p=1

which returns an error page. What should I do? I have tried passing an already-encoded URL, but that doesn't work either.
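To illustrate why the two URLs behave differently, here is a minimal sketch using only the JDK. It assumes the server decodes query values in the usual form-encoded way, where a literal `+` means a space and `%2B` means a literal plus sign. The class and method names (`QueryDecodingDemo`, `decodeQueryValue`) are hypothetical and are not part of crawler4j.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Illustrative only: hypothetical names, not part of crawler4j or GitHub.
final class QueryDecodingDemo {

    // Decodes a query value using form-style rules: "+" becomes a space,
    // "%XX" becomes the corresponding character.
    static String decodeQueryValue(String rawValue) {
        try {
            return URLDecoder.decode(rawValue, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to be available on every JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Seed URL as written: the "+" is read as a space.
        System.out.println(decodeQueryValue("java+location:India"));
        // URL as crawled: "%2B" is read as a literal plus sign.
        System.out.println(decodeQueryValue("java%2Blocation%3AIndia"));
    }
}
```

Under that assumption, the seed URL asks for `java location:India`, while the crawled URL asks for `java+location:India`, which is a different query.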

ravi katiyar
  • Read the [source](https://github.com/yasserg/crawler4j/blob/master/crawler4j/src/main/java/edu/uci/ics/crawler4j/url/URLCanonicalizer.java), Luke! See line number 185: `string = string.replace("+", "%2B");` This is what's causing your normal URL to get all funky – absin Feb 08 '18 at 05:56
  • I know what is causing my URL to change. What can I do? I have to add the seed URL to the controller. The URL gets encoded and returns error results – ravi katiyar Feb 08 '18 at 12:22
  • I have used `Nutch` in the past, and it had configuration options to permit parameterized seed URLs. However, after going through the documentation pages for `crawler4j`, I couldn't find anything similar. One thing you can do is change the source code and see for yourself. – absin Feb 08 '18 at 12:34
  • Read [this](https://stackoverflow.com/questions/11379486/should-a-web-crawler-pick-up-queries) to understand why you are observing this seemingly strange but reasonable behavior in your crawler. – absin Feb 08 '18 at 12:40
  • Changing the source code of crawler4j was the first solution that came to my mind; I thought someone would provide a better workaround – ravi katiyar Feb 08 '18 at 12:56
  • Sorry, I can't help; I haven't used `crawler4j`. I will give your question a bump, though. You can try `Nutch`; it's quite powerful and simple. – absin Feb 08 '18 at 13:20
  • The matter is a bit complicated. See the GitHub issue [0] and the linked SO answer [1]. [0] https://github.com/yasserg/crawler4j/issues/374#issuecomment-446751962 [1] https://stackoverflow.com/a/47188851/4510569 – s17t.net Dec 12 '18 at 21:33

1 Answer

I eventually had to make a very small change to the crawler4j source code. File name: URLCanonicalizer.java. Method: percentEncodeRfc3986.

I commented out the first line of this method, and after that I was able to crawl and fetch my results:

//string = string.replace("+", "%2B");

My URL contained a + character, which was being replaced by %2B, so I was getting an error page. I wonder why they specifically replace the + character before encoding the entire URL.
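To make the effect of that one line visible, here is a rough, self-contained sketch. Only the `replace("+", "%2B")` line is quoted in this thread. The rest (decode, re-encode, then tidy up a few characters) is my assumption about what such a method does, not a copy of crawler4j's code, and `PercentEncodeSketch`, `percentEncode`, and the `escapePlusFirst` flag are hypothetical names.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Illustrative model only; NOT crawler4j's actual implementation.
final class PercentEncodeSketch {

    // escapePlusFirst = true models the method with its first line intact;
    // false models the method with that line commented out.
    static String percentEncode(String value, boolean escapePlusFirst) {
        try {
            String s = value;
            if (escapePlusFirst) {
                // The line quoted in the thread: protect "+" from being decoded as a space.
                s = s.replace("+", "%2B");
            }
            // Assumed remaining steps: decode, re-encode, then use "%20" for spaces.
            s = URLDecoder.decode(s, "UTF-8");
            s = URLEncoder.encode(s, "UTF-8");
            return s.replace("+", "%20").replace("*", "%2A").replace("%7E", "~");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available; fall back to the input just in case.
            return value;
        }
    }

    public static void main(String[] args) {
        System.out.println(percentEncode("java+location:India", true));  // java%2Blocation%3AIndia
        System.out.println(percentEncode("java+location:India", false)); // java%20location%3AIndia
    }
}
```

In this model, leaving the line in reproduces the `java%2Blocation%3AIndia` value seen in the crawled URL, while removing it turns the `+` into `%20`, so the query still reaches the server as a space.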

ravi katiyar
  • The matter is a bit complicated. See the GitHub issue [0] and the linked SO answer [1]. [0] https://github.com/yasserg/crawler4j/issues/374#issuecomment-446751962 [1] https://stackoverflow.com/a/47188851/4510569 – s17t.net Dec 12 '18 at 21:32