10

I crawled one URL with Nutch 2.1, and now I want to re-crawl pages after they get updated. How can I do this? How can I know whether a page has been updated?

İsmet Alkan
Ilce MKD

3 Answers

7

Simply put, you can't know without fetching: you need to re-crawl a page to check whether it has been updated. So, according to your needs, prioritize the pages/domains and re-crawl them periodically. For that you need a job scheduler such as Quartz.
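Quartz is one option; even the JDK's `ScheduledExecutorService` is enough to fire a recrawl on a fixed period. A minimal sketch, with the Nutch command line and the 24-hour period as placeholder assumptions you would adjust to your installation:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class RecrawlScheduler {
    // Run the given recrawl task immediately and then at a fixed period.
    public static ScheduledExecutorService schedule(Runnable recrawl,
                                                    long period, TimeUnit unit) {
        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(recrawl, 0, period, unit);
        return scheduler;
    }

    public static void main(String[] args) {
        Runnable recrawl = () -> {
            try {
                // Placeholder crawl command; path, seed dir and depth are
                // assumptions -- adapt them to your Nutch setup.
                new ProcessBuilder("bin/nutch", "crawl", "urls", "-depth", "3")
                        .inheritIO().start().waitFor();
            } catch (Exception e) {
                e.printStackTrace();
            }
        };
        // Once a day here; give high-priority domains a shorter period.
        schedule(recrawl, 24, TimeUnit.HOURS);
    }
}
```

You could run several of these with different periods to implement the prioritization mentioned above.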

You also need to write a function that compares the old and new versions of a page. However, Nutch saves pages as index files: it writes the fetched HTML into binary files and combines all crawl results into a single file, so I don't think it's practical to compare those binaries directly. If you want to save pages in raw HTML format for comparison, see my answer to this question.
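If you do keep the raw HTML, the comparison function can be as simple as hashing each fetch and comparing digests. This is a sketch of the idea, not Nutch's own signature code:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class PageSignature {
    // Hash the raw HTML so two fetches can be compared cheaply,
    // without storing or diffing the full previous page.
    public static String signature(String rawHtml) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] digest = md5.digest(rawHtml.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // A page counts as updated when its signature differs from the stored one.
    public static boolean changed(String oldHtml, String newHtml) throws Exception {
        return !signature(oldHtml).equals(signature(newHtml));
    }
}
```

Note that volatile markup (timestamps, ad slots) will change the hash even when the real content is the same, so you may want to compare only the extracted text, as discussed in the comments below.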

İsmet Alkan
  • How does the job scheduler compare the crawls to tell whether a page is updated or the same? I mean, how does Nutch or Solr compare the content? – Emrah Mehmedov Jan 10 '13 at 15:47
  • 1
    So, every page should be checked for changes against the old version, and if there is new content, the page will be re-crawled. If I understand right, I just need a simple function that compares strings? – Ilce MKD Jan 10 '13 at 16:09
  • 1
    That's correct. But you might be looking for a change in a specific area of the page; once you have the raw HTML, you can easily determine what to do. – İsmet Alkan Jan 10 '13 at 16:22
  • 2
    I disagree: Nutch provides the ability to detect pages that are new or updated, and should be able to do this for you. – Jayendra Jan 11 '13 at 06:02
5

You have to schedule a job to fire the crawl.
However, Nutch's AdaptiveFetchSchedule should enable you to crawl and index pages and detect whether a page is new or updated, so you don't have to do it manually.

This article describes the same in detail.
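To enable it you point Nutch's fetch-schedule setting at AdaptiveFetchSchedule in `conf/nutch-site.xml`. A sketch, assuming a stock Nutch configuration; the property names follow `nutch-default.xml`, but the interval values below are illustrative, not recommendations:

```xml
<!-- conf/nutch-site.xml (fragment) -->
<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
</property>
<property>
  <!-- Starting interval; adaptive schedule shrinks it for pages that
       change often and grows it for pages that don't. -->
  <name>db.fetch.interval.default</name>
  <value>86400</value>
</property>
<property>
  <name>db.fetch.schedule.adaptive.min_interval</name>
  <value>3600</value>
</property>
<property>
  <name>db.fetch.schedule.adaptive.max_interval</name>
  <value>604800</value>
</property>
```

As the later comments note, you can also set `db.signature.class` to `org.apache.nutch.crawl.TextProfileSignature` so that change detection looks only at the page's text content rather than at headers.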

Jayendra
  • Ok, I read the article and I have another question. Do I have to use a job scheduler to run my crawl command for the given URL, or do I need the AdaptiveFetchSchedule to do this? And if AdaptiveFetchSchedule is the right one, how can I use it? – Ilce MKD Jan 11 '13 at 16:00
  • You can configure the adaptive schedule within the config. And you would need a scheduler to fire the job, e.g. Autosys, Quartz etc. – Jayendra Jan 11 '13 at 17:13
  • 3
    I will have to disagree with you here. The class you mention works according to the crawled site's "If-Modified-Since" and "Last-Modified" HTTP headers. And I must say, hardly any site out there (except for Google, YouTube, Stack Overflow, etc.) can be trusted on the truthfulness of these headers. – İsmet Alkan Jan 11 '13 at 17:20
  • If you are building the site yourself, it's up to you to take care of this so that crawling works fine for you. – Jayendra Jan 12 '13 at 09:44
  • 1
    I don't really understand you here. You mean you're crawling your own website, one you made yourself? Why? :) – İsmet Alkan Jan 12 '13 at 11:25
  • Why not? :) We had a huge number of intranet and news sites. We want to allow people to search through these sites, and we use Nutch incremental indexing since we cannot always index all the content. Here we can control how to indicate to Nutch when a page was updated. – Jayendra Jan 12 '13 at 11:56
  • @IsmetAlkan: I think Jayendra is right. It is explained in the article. I don't think AdaptiveFetchSchedule relies only on the "If-Modified-Since" and "Last-Modified" HTTP headers. From the article: each time a page is fetched, Nutch computes a signature for the page. At the next fetch, if the signature is the same (*or* if a 304 is returned by the web server because of the If-Modified-Since header), Nutch can tell whether the page was modified. – sunskin Apr 30 '14 at 20:49
  • More importantly, note the last line in it: "By default the signature of a page is built not only with its content, but also with the http headers returned with the page. So even if the content of a page has not changed, if an http header is not the same (like an etag or a date), the signature changes. To solve that problem, there is the TextProfileSignature class. *It is designed to look only at the text content of a page to build the signature*." – sunskin Apr 30 '14 at 20:50
2

What about http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/ ?

This is discussed on: How to re-crawl with Nutch

I am wondering whether the above-mentioned solution will indeed work; I am trying it as we speak. I crawl news sites, and they update their front pages quite frequently, so I need to re-crawl the index/front page often and fetch the newly discovered links.

user1973842