
Here is the idea.

I have a set of master articles, say from the BBC News site. Each master article was originally published by BBC News, but it may have been reused by many other sites across the web. I want to find those reused copies.

Approach 1:

Since Google doesn't provide a general web search API, I wrote a program that fetches links from Google search results using Python and mechanize. However, this approach is not advisable because my IP may get blocked, and I don't want to take that risk.

How I did it:

I combined the article title and the author's name into a boolean query so that only articles matching the master article are returned. The results are quite good, but I don't want to go with this approach.
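For illustration, the query construction described above could look roughly like the sketch below. It only builds the query string and does not contact any search engine. The function name `build_query`, the exact-phrase quoting, and the `AND` operator are assumptions made for this example; the operator syntax a given search engine accepts may differ.

```python
def build_query(title, author, operator="AND"):
    """Combine an article title and author into an exact-phrase boolean query.

    Hypothetical helper: the quoting and the default "AND" operator are
    assumptions, not a documented query syntax of any particular engine.
    """
    def phrase(text):
        # Drop embedded double quotes and collapse whitespace so the
        # resulting exact-phrase term stays well-formed.
        cleaned = " ".join(text.replace('"', " ").split())
        return f'"{cleaned}"'

    return f"{phrase(title)} {operator} {phrase(author)}"


if __name__ == "__main__":
    # Made-up title and author, purely for demonstration.
    print(build_query("Example headline about the economy", "Jane Doe"))
```

Quoting both parts keeps the search engine from matching the title words individually, which is what makes the results specific to copies of the master article.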

Approach 2:

I tried Google Custom Search, querying with keywords from the master article and restricting the search to a limited set of sites instead of the whole web. But the results are not good. I need only the links that point to copies of the article on other sites.

Can anyone suggest a better approach? Are there any libraries available for this purpose that I can make use of?

DanGar
user

2 Answers


The conventional way to solve this problem is arguably through information retrieval (IR) and natural language processing (NLP). For starters, see "Similarity between two text documents", or refer to any book on the subject. Appropriate Python libraries are scikit-learn (sklearn) and NLTK.
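As a rough sketch of the document-similarity idea, the snippet below scores candidate pages against a master article using cosine similarity over plain word counts. It uses only the standard library to stay self-contained; this is a simplified stand-in, not the TF-IDF weighting that scikit-learn's `TfidfVectorizer` would provide. The function names, the `0.8` threshold, and the example URLs are all assumptions for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep runs of letters/digits; punctuation is discarded.
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    a, b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def find_reposts(master_text, candidates, threshold=0.8):
    """Return (url, score) pairs at or above the threshold, best first.

    `candidates` maps a URL to the already-downloaded text of that page.
    The threshold is an arbitrary starting point, not a tuned value.
    """
    scored = [(url, cosine_similarity(master_text, text))
              for url, text in candidates.items()]
    matches = [(url, score) for url, score in scored if score >= threshold]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    master = "The cat sat on the mat while the dog slept."
    pages = {  # hypothetical URLs and page texts
        "http://example.com/copy": "The cat sat on the mat while the dog slept!",
        "http://example.org/other": "Stock markets fell sharply on Monday.",
    }
    for url, score in find_reposts(master, pages):
        print(url, round(score, 3))
```

Note that this only ranks pages you have already fetched; it does not solve the separate problem of discovering candidate pages across the web.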

Community
Emre
  • My objective is to find similar articles across the **web**. My question is whether Google is the only way to get those articles, or are there other ideas? – user May 02 '14 at 06:23

If you are afraid of your IP getting banned because you are scraping search results, you might want to consider another search engine's API that does offer the data (or request limits) you need.

For example, Microsoft offers Bing's Web Search API:

http://www.bing.com/developers/s/APIBasics.html

With this approach, you avoid unintentionally violating a search engine's terms of service (TOS).

Since you did not specify exactly what you are searching for, you may be able to find an API suited to your "article" in:

http://www.programmableweb.com/apis/directory/1?apicat=Search

DanGar