Crawler4J seed URL gets encoded and an error page is crawled instead of the actual page

Question

I am using crawler4j to crawl user profiles on GitHub. For instance, I want to crawl the URL https://github.com/search?q=java+location:India&p=1. For now I am adding this hard-coded URL in my crawler controller like this:

String url = "https://github.com/search?q=java+location:India&p=1";
controller.addSeed(url);

When crawler4j starts, the URL crawled is: https://github.com/search?q=java%2Blocation%3AIndia&p=1

which gives me an error page. What should I do? I have tried giving an encoded URL, but that doesn't work either.

Read the [source](https://github.com/yasserg/crawler4j/blob/master/crawler4j/src/main/java/edu/uci/ics/crawler4j/url/URLCanonicalizer.java), Luke! See line number 185: `string = string.replace("+", "%2B");` This is what's causing your normal URL to get all funky. — absin, Feb 08 '18 at 05:56
I know what is causing my URL to change. What can I do? I have to add the seed URL to the controller. The URL is getting encoded and gives error results. — ravi katiyar, Feb 08 '18 at 12:22
I have used `Nutch` in the past, and there were configurations to permit parameterized seed URLs. However, after going through the documentation pages for `crawler4j`, I couldn't find anything. One thing you can do is try to change the source code and see for yourself. — absin, Feb 08 '18 at 12:34
Read [this](https://stackoverflow.com/questions/11379486/should-a-web-crawler-pick-up-queries) to understand why you are observing this seemingly strange but reasonable behavior in your crawler. — absin, Feb 08 '18 at 12:40
Changing the source code of crawler4j was the first solution that came to my mind; I thought someone would provide a better workaround. — ravi katiyar, Feb 08 '18 at 12:56
Sorry I can't help, I haven't used `Crawler4j`. I will give your question a bump though. You can try `Nutch`; it's quite powerful and simple. — absin, Feb 08 '18 at 13:20
The matter is a bit complicated. See the github issue [0] and the linked SO [1] [0] https://github.com/yasserg/crawler4j/issues/374#issuecomment-446751962 [1] https://stackoverflow.com/a/47188851/4510569 — s17t.net, Dec 12 '18 at 21:33

score 0 · Accepted Answer · answered Feb 20 '18 at 04:43

I eventually had to make a tiny change to the crawler4j source code. File name: URLCanonicalizer.java. Method: percentEncodeRfc3986.

I just commented out the first line in this method, and I was able to crawl and fetch my results:

//string = string.replace("+", "%2B");

My URL contained a + character, which was being replaced by %2B, so I was getting an error page. I wonder why they specifically replace the + character before encoding the entire URL.
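To make the effect of that one line easier to see, here is a minimal standalone sketch that does not use crawler4j at all. Only the `replace("+", "%2B")` line is quoted in this thread; the decode/re-encode round trip around it, the class name `PlusEncodingSketch`, the method `encodeQueryValue`, and the `replacePlus` flag are my assumptions for illustration, so treat this as an approximation of the method rather than its actual code.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

public class PlusEncodingSketch {

    // Hypothetical helper approximating the encoding step for one query value.
    // replacePlus = true mimics the line quoted from URLCanonicalizer.java;
    // replacePlus = false mimics commenting that line out.
    // ASSUMPTION: the decode/encode round trip below is a guess at the
    // surrounding logic, not code copied from crawler4j.
    static String encodeQueryValue(String value, boolean replacePlus) {
        try {
            String s = value;
            if (replacePlus) {
                // Protect a literal '+' so the decode step keeps it as '+'.
                s = s.replace("+", "%2B");
            }
            // Form-style decoding turns an unprotected '+' into a space.
            s = URLDecoder.decode(s, "UTF-8");
            // Form-style encoding turns a space into '+' and '+' into "%2B".
            s = URLEncoder.encode(s, "UTF-8");
            // Write spaces as "%20" instead of '+'.
            return s.replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String q = "java+location:India";
        System.out.println(encodeQueryValue(q, true));  // java%2Blocation%3AIndia
        System.out.println(encodeQueryValue(q, false)); // java%20location%3AIndia
    }
}
```

Under these assumptions, the first output matches the URL I saw being crawled (the + is sent as a literal plus sign), while the second treats the + as a space, which is what the GitHub search query needs.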

ravi katiyar


The matter is a bit complicated. See the github issue [0] and the linked SO [1] [0] https://github.com/yasserg/crawler4j/issues/374#issuecomment-446751962 [1] https://stackoverflow.com/a/47188851/4510569 – s17t.net Dec 12 '18 at 21:32
