How to distinguish between removed and modified news articles when crawling web pages?

Question

I'm writing a web crawler that extracts cleaned news article text and metadata using the Diffbot API. It also logs changes to an article's title and text if the source was modified since the last extraction. I need an automatic way to distinguish between deleted and changed articles: news portals mostly don't return 404 or another error code when a post has been deleted; often they send 200 and a page with a caption like "Sorry, the article you are looking for was removed". So I need a tool or approach to detect such situations, preferably something written in Python or something with a web API. I'm totally confused and have no idea where to begin, so any reasonable suggestions are much appreciated.

score 0 · Answer 1 · answered Aug 13 '21 at 22:57


You can:

set a minimum expected article length and treat any shorter text as a removed article
compare the Diffbot URI (a unique string) across two extractions of the same URL to notice that the body has changed

These two in tandem should provide you with the diffing capability you seek.
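A minimal sketch of how the two checks might be combined, under stated assumptions: `classify`, `fingerprint`, and `MIN_ARTICLE_CHARS` are hypothetical names, the 500-character threshold is an arbitrary guess to tune per site, and the SHA-256 hash only stands in for whatever per-version unique string your extraction returns (such as the Diffbot URI mentioned above).

```python
import hashlib

# Assumed threshold: anything shorter is treated as a "removed" notice.
# This value is a guess and needs tuning against real pages.
MIN_ARTICLE_CHARS = 500


def fingerprint(text: str) -> str:
    """Stand-in for a per-version unique string (e.g. the Diffbot URI).

    Whitespace is normalized first so that pure formatting changes
    do not count as a modification.
    """
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def classify(previous_id: str, current_text: str, current_id: str,
             min_chars: int = MIN_ARTICLE_CHARS) -> str:
    """Return 'removed', 'modified', or 'unchanged' for a re-crawled URL."""
    # Check 1: a very short body is assumed to be a removal notice.
    if len(current_text.strip()) < min_chars:
        return "removed"
    # Check 2: a different identifier means the body has changed.
    if current_id != previous_id:
        return "modified"
    return "unchanged"
```

You would store the identifier from the previous crawl next to the URL and pass it in as `previous_id` on the next run. The length check runs first, because a removal notice also changes the identifier and would otherwise be reported as a modification.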


Swader


I thought about checking for text that is too short, but there is an issue with it: Diffbot often retrieves some additional text along with the requested article (from hidden HTML, I guess). So even if the main text of the page consists of a single line, that stray junk makes the whole text pretty big, so it can be confused with an article of regular size – Kaderma Aug 15 '21 at 00:46
