
Can someone help me? I want to extract HTML data from http://www.quranexplorer.com/Hadith/English/Index.html. I have found a service that does exactly that, http://diffbot.com/dev/docs/, which supports data extraction via a simple API. The problem is that I have a large number of URLs that need to be processed. They are listed in the file at the link below: http://test.deen-ul-islam.org/html/h.js

I need to create a script that follows each URL and then uses the API to generate JSON from the HTML data (the site's API allows batch requests; check the website docs).

Please note that Diffbot only allows 10000 free requests per month, so I need a way to save my progress and pick up where I left off.

Here is an example I created using PHP.

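$token = "dfoidjhku"; // example token
$url = "http://www.quranexplorer.com/Hadith/English/Hadith/bukhari/001.001.006.html";
// Percent-encode the page URL so it is passed as a single query parameter.
$geturl = "http://www.diffbot.com/api/article?tags=1&token=" . $token . "&url=" . urlencode($url);
$json = file_get_contents($geturl);
// Decode the JSON response into an associative array.
$data = json_decode($json, true);
echo $article_title = $data['title'];
echo $article_author = $data['author'];
echo $article_date = $data['date'];
echo nl2br($article_text = $data['text']);
// Fall back to an empty list when the response contains no tags.
$article_tags = isset($data['tags']) ? $data['tags'] : array();
foreach ($article_tags as $result) {
    echo $result, '<br>';
}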

I don't mind whether the tool is in JavaScript or PHP; I just need a way to get the HTML data in JSON format.

Brian Tompsett - 汤莱恩
user5601

1 Answer


John from Diffbot here. Note: I'm not a developer, but I know enough to write hacky code to do simple things.

You have a list of links -- it should be straightforward to iterate through those, making a call to us for each.
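As a rough illustration of that loop (not the contents of the gist), here is a minimal Python sketch. It assumes the endpoint and the `tags`, `token` and `url` parameters shown in the question's PHP example, which may not match the current API. The function names and the `opener` parameter are made up for this example; `opener` only exists so the network call can be swapped out.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint copied from the question's PHP example (assumption: still valid).
API_ENDPOINT = "http://www.diffbot.com/api/article"


def build_request_url(token, page_url):
    """Return the API request URL for one page, with the page URL percent-encoded."""
    query = urlencode({"tags": 1, "token": token, "url": page_url})
    return API_ENDPOINT + "?" + query


def fetch_article(token, page_url, opener=urlopen):
    """Call the API for one page and return the decoded JSON as a dict."""
    with opener(build_request_url(token, page_url)) as response:
        return json.loads(response.read().decode("utf-8"))


def fetch_all(token, page_urls, opener=urlopen):
    """Yield (page_url, article_dict) pairs, making one API call per link."""
    for page_url in page_urls:
        yield page_url, fetch_article(token, page_url, opener)
```

Calling `fetch_all(token, links)` with a real token makes one request per link, so each link counts against the monthly quota.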

Here's a Python script that does this: https://gist.github.com/johndavi/5545375

I used a quick search regex in Sublime Text to pull out the links from the JS file.
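The same extraction could probably be done in code rather than in an editor. The sketch below assumes the JS file contains absolute `http(s)` links ending in `.html`; the actual layout of the file is not shown here, so the pattern is a guess and may need adjusting. `LINK_PATTERN` and `extract_links` are names invented for this example.

```python
import re

# Assumed pattern: an absolute http(s) URL ending in ".html".
# The real file may use relative paths or another layout.
LINK_PATTERN = re.compile(r"https?://[^\s\"'<>]+?\.html")


def extract_links(js_text):
    """Return the unique .html links found in js_text, in first-seen order."""
    seen = set()
    links = []
    for match in LINK_PATTERN.finditer(js_text):
        link = match.group(0)
        if link not in seen:
            seen.add(link)
            links.append(link)
    return links
```

Reading the downloaded JS file into a string and passing it to `extract_links` gives a de-duplicated list that can be looped over.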

To truncate this, just cut out some of the links, then run it. It will take a while, as I'm not using the Batch API.
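Instead of trimming the list by hand, the cap and the resume point could be tracked in a file, which is what the 10000-requests-per-month limit in the question calls for. The following is a minimal sketch, not what the gist does. It assumes a hypothetical `fetch` callable that takes one link and returns a JSON-serialisable dict, and it stores one JSON record per line in a progress file whose name you choose.

```python
import json
import os


def load_done(progress_path):
    """Return the set of URLs already processed, or an empty set on a first run."""
    if not os.path.exists(progress_path):
        return set()
    with open(progress_path, encoding="utf-8") as handle:
        return {json.loads(line)["url"] for line in handle if line.strip()}


def process_batch(links, fetch, progress_path, max_requests):
    """Process up to max_requests unprocessed links and return how many were fetched.

    Each result is appended to progress_path right away, so an interrupted
    run loses at most the request that was in flight.
    """
    done = load_done(progress_path)
    made = 0
    with open(progress_path, "a", encoding="utf-8") as handle:
        for link in links:
            if made >= max_requests:
                break
            if link in done:
                continue  # already saved by an earlier run
            record = {"url": link, "data": fetch(link)}
            handle.write(json.dumps(record) + "\n")
            handle.flush()
            done.add(link)
            made += 1
    return made
```

Running `process_batch(links, fetch, "progress.jsonl", 10000)` once a month would continue from the first link not yet in the file; the file name here is only an example.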

If you need to improve or change this, it's best to seek out a stronger developer directly. Diffbot is a dev-friendly tool.

  • Hi, thanks for the code. The only problem is that I don't know how to use Python. Any chance you could create a PHP version of the script? – user5601 May 09 '13 at 17:58
  • Hi, thanks -- I'm sorry, that's as far as I can go. I'm sure a handful of PHP guides would easily help you replicate this. Good luck! – Diffbot John May 10 '13 at 18:06