
Edit: I now realize the API is simply inadequate and is not even working, so I would like to redirect my question. I want to be able to automatically search DuckDuckGo using an "I'm feeling ducky"-style query, so that searching for "stackoverflow", for instance, returns the main page ("https://stackoverflow.com/") as the result.

I am using the duckduckgo API.

And I found that when using:

r = duckduckgo.query("example")

The results do not reflect a manual search, namely:

for result in r.results:
    print result

Results in:

>>> 
>>> 

Nothing.

And indexing into results raises an out-of-range error, since the list is empty.

How am I supposed to get results for my search?

It seems the API (according to its documented examples) is supposed to answer questions and give a sort of "I'm feeling ducky" result in the form of r.answer.text.

But the website is made in such a way that I cannot search it and parse the results using normal methods.

I would like to know how I am supposed to parse search results with this API or any other method from this site.

Thank you.

Inbar Rose

5 Answers

31

If you visit the DuckDuckGo API page, you will find some notes about using the API. The first note clearly says:

As this is a Zero-click Info API, most deep queries (non topic names) will be blank.

And here's the list of those fields:

Abstract: ""
AbstractText: ""
AbstractSource: ""
AbstractURL: ""
Image: ""
Heading: ""
Answer: ""
Redirect: ""
AnswerType: ""
Definition: ""
DefinitionSource: ""
DefinitionURL: ""
RelatedTopics: [ ]
Results: [ ]
Type: ""
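To see what this means in practice, here is a minimal sketch (standard library only) that decodes a response of the shape listed above; the JSON payload is a stand-in for what the Zero-click Info API returns for a "deep" query, not a captured real response:

```python
import json

# Stand-in for the JSON the Zero-click Info API returns for a
# deep query (a non topic name): most fields come back blank.
sample_response = '''
{
    "Abstract": "",
    "AbstractText": "",
    "Answer": "",
    "Redirect": "",
    "RelatedTopics": [],
    "Results": [],
    "Type": ""
}
'''

data = json.loads(sample_response)

# The organic search results you would see on duckduckgo.com
# are simply not present in the API response.
print(len(data["Results"]))
```

This prints 0, which is exactly why iterating over r.results in the question produced nothing.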

So it may be a pity, but their API simply truncates a bunch of the results and does not give them to you, possibly to work faster, and it seems nothing can be done about it except using DuckDuckGo.com directly.

So, obviously, in that case the API is not the way to go.

As for me, I see only one way out: retrieving the raw HTML from duckduckgo.com and parsing it using, e.g., html5lib (it's worth mentioning that their HTML is well structured).

It's also worth mentioning that parsing HTML pages is not the most reliable way to scrape data, because the HTML structure can change, while an API usually stays stable until changes are publicly announced.

Here's an example of how such parsing can be achieved with BeautifulSoup:

# Python 2 with the old BeautifulSoup 3 package
from BeautifulSoup import BeautifulSoup
import urllib
import re

site = urllib.urlopen('http://duckduckgo.com/?q=example')
data = site.read()

parsed = BeautifulSoup(data)
# The zero-click box holds the related topics, not the organic results
topics = parsed.findAll('div', {'id': 'zero_click_topics'})[0]
results = topics.findAll('div', {'class': re.compile('results_*')})

print results[0].text

This script prints:

u'Eixample, an inner suburb of Barcelona with distinctive architecture'

The problem with querying the main page directly is that it uses JavaScript to produce the required results (not the related topics), so you can only get results from the HTML version, which lives at a different link (http://duckduckgo.com/html/):

Let's see what we can get:

site = urllib.urlopen('http://duckduckgo.com/html/?q=example')
data = site.read()
parsed = BeautifulSoup(data)

first_link = parsed.findAll('div', {'class': re.compile('links_main*')})[0].a['href']

The result stored in first_link variable is a link to the first result (not a related search) that search engine outputs:

http://www.iana.org/domains/example

To get all the links you can iterate over the found tags (other data besides the links can be retrieved in a similar way):

for i in parsed.findAll('div', {'class': re.compile('links_main*')}):
    print i.a['href']

http://www.iana.org/domains/example
https://twitter.com/example
https://www.facebook.com/leadingbyexample
http://www.trythisforexample.com/
http://www.myspace.com/leadingbyexample?_escaped_fragment_=
https://www.youtube.com/watch?v=CLXt3yh2g0s
https://en.wikipedia.org/wiki/Example_(musician)
http://www.merriam-webster.com/dictionary/example
...

Note that the HTML-only version contains only results; for related searches you must use the JavaScript version (without the html part in the URL).
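In other words, there are two endpoints, and which one you fetch decides what you can scrape. A small Python 3 sketch of building both URLs (urlencode also takes care of queries containing spaces or special characters):

```python
from urllib.parse import urlencode

query = "example"
params = urlencode({"q": query})

# JavaScript version: serves the zero-click box and related topics
js_url = "http://duckduckgo.com/?" + params

# HTML-only version: the one that contains the organic results
html_url = "http://duckduckgo.com/html/?" + params

print(js_url)    # http://duckduckgo.com/?q=example
print(html_url)  # http://duckduckgo.com/html/?q=example
```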

Rostyslav Dzinko
  • thank you. this helps me understand what the problem is, where did you find that? :P i tried writing a parser for the regular html page of duckduckgo, but i was having problems because it uses java or something and the results didnt come out in proper html format... – Inbar Rose Aug 13 '12 at 07:25
  • It works fine for me with BeautifulSoup. Will update the answer – Rostyslav Dzinko Aug 13 '12 at 09:53
  • well, thats wrong, the result you get is from the related searches. – Inbar Rose Aug 13 '12 at 10:01
  • It's just an example of that the page is consistent HTML, you can do this way to get all another results – Rostyslav Dzinko Aug 13 '12 at 10:03
  • so using the html page, can i get more than just one result? – Inbar Rose Aug 13 '12 at 10:58
  • Yes, you can iterate over links, just look at BeautifulSoup documentation how to process parsed data. – Rostyslav Dzinko Aug 13 '12 at 11:07
  • could you post a new answer so i can accept it more concisely. which uses the HTML duckduckgo, and then returns a list of results. – Inbar Rose Aug 13 '12 at 11:09
  • is there any kind of TOS in duckduckgo (like in google) that prohibit scrapping their page? – Mr Alexander Jul 15 '16 at 09:18
  • is it ok to web scrap the search results of duckduckgo ? when open their robots.txt file, `User-agent: * Disallow: /*?`. In the other hand on [link](https://duckduckgo.com/traffic.html) we can see that million of searches are made by bots ? – J. Doe Aug 25 '17 at 14:30
  • lets check : `rp = urllib.robotparser.RobotFileParser(); rp.set_url("https://duckduckgo.com/robots.txt"); rp.read(); rp.can_fetch("*", "https://duckduckgo.com/?q=can we scrap duckduckgo?")` returns True – J. Doe Aug 25 '17 at 14:39
  • however : `rp.can_fetch("*", "https://duckduckgo.com/html/?q=can we scrap duckduckgo")` returns False – J. Doe Aug 25 '17 at 15:10
2

After already getting an answer to my question, which I accepted and awarded the bounty for, I found a different solution, which I would like to add here for completeness. A big thank you to all those who helped me reach it. Even though this isn't the solution I asked for, it may help someone in the future.

Found after a long and hard conversation on this site and with some support mails: https://duck.co/topic/strange-problem-when-searching-intel-with-my-script

And here is the solution code (from an answer in the thread posted above):

>>> import duckduckgo
>>> print duckduckgo.query('! Example').redirect.url
http://www.iana.org/domains/example
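The `!` bang makes the API fill in the Redirect field instead of leaving everything blank. If you would rather not depend on the duckduckgo module, the same field can be read straight from the JSON response; the payload below is a stand-in for the real API output, not a captured response:

```python
import json

# Stand-in for what the JSON API returns for the query "! Example";
# the Redirect field carries the "I'm feeling ducky" destination.
sample = '{"Redirect": "http://www.iana.org/domains/example", "Results": []}'

data = json.loads(sample)
if data.get("Redirect"):
    print(data["Redirect"])
```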
Inbar Rose
0

Try:

for result in r.results:
    print result.text
couchemar
  • same result, nothing. the problem is that the r.results is an empty array, the API is returning no results at all. – Inbar Rose Jul 30 '12 at 14:40
  • r.related returns related searches/queries which is not what i am trying to get though... even though in some instances it could be usefull. obviously it is a sort of "duct-tape solution" – Inbar Rose Jul 30 '12 at 14:47
  • if you try: http://api.duckduckgo.com/?q=example&format=xml&pretty=1 you get empty results too. – couchemar Jul 30 '12 at 14:50
  • true, but obviously my code is not searching for "example" most everything else also returns no results as well. – Inbar Rose Jul 30 '12 at 14:58
0

If it suits your application, you might also try the related searches:

r = duckduckgo.query("example")
for i in r.related_searches:
    if i.text:
        print i.text

This yields:

Eixample, an inner suburb of Barcelona with distinctive architecture
Example (musician), a British musician
example.com, example.net, example.org, example.edu  and .example, domain names reserved for use in documentation as examples
HMS Example (P165), an Archer-class patrol and training vessel of the British Royal Navy
The Example, a 1634 play by James Shirley
The Example (comics), a 2009 graphic novel by Tom Taylor and Colin Wilson
Bolutife Ogunsola
0

For Python 3 users, here is a transcription of @Rostyslav Dzinko's code:

import re
import urllib.parse
import urllib.request
from bs4 import BeautifulSoup

query = "your query"
# quote_plus makes the query safe to embed in the URL
url = "http://duckduckgo.com/html/?q=" + urllib.parse.quote_plus(query)
site = urllib.request.urlopen(url)
data = site.read()
soup = BeautifulSoup(data, "html.parser")

my_list = soup.find("div", {"id": "links"}).find_all(
    "div", {"class": re.compile(".*web-result.*")})[0:15]

(result__snippet, result_url) = ([] for i in range(2))

for i in my_list:
    try:
        result__snippet.append(
            i.find("a", {"class": "result__snippet"}).get_text().strip())
    except AttributeError:
        result__snippet.append(None)
    try:
        result_url.append(
            i.find("a", {"class": "result__url"}).get_text().strip())
    except AttributeError:
        result_url.append(None)
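Once the two lists are filled, pairing snippets with their URLs is straightforward; in this sketch the list contents are made-up stand-ins for scraped values:

```python
# Stand-in values for the lists the scraping loop above fills in;
# a None marks an entry whose snippet failed to parse.
result__snippet = ["Example Domain. This domain is for use in examples.", None]
result_url = ["www.iana.org/domains/example", "twitter.com/example"]

# Pair each URL with its snippet, dropping entries without a snippet.
results = [(u, s) for u, s in zip(result_url, result__snippet) if s is not None]
print(results[0][0])  # www.iana.org/domains/example
```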
J. Doe