
I am attempting to request price data from dukascopy.com, but I am running into a problem similar to this user's: the price data itself is not part of the HTML. Therefore, when I run my basic urllib code to extract the data:

import urllib.request

url = 'https://www.dukascopy.com'
# Spoof a browser user agent so the request is not rejected outright
headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'}

req = urllib.request.Request(url, headers=headers)
resp = urllib.request.urlopen(req)
respData = resp.read()
print(respData.decode('utf-8'))  # decode the raw bytes before printing

the price data cannot be found. Referring back to that post, the user Mark found another URL that the data was called from. Can the same approach be applied to collect the data here as well?
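
The idea would be to find the request that the page's JavaScript makes (via the browser developer tools, Network tab) and call that URL directly with the same urllib approach, roughly like the sketch below; the feed URL here is only a placeholder, not a real dukascopy endpoint:

import json
import urllib.request

# Placeholder URL -- the real one would be whatever request shows up in the
# browser's Network tab while the price widget loads (often an XHR returning JSON)
url = 'https://www.dukascopy.com/path/to/quotes-feed'
headers = {'User-Agent': 'Mozilla/5.0'}

req = urllib.request.Request(url, headers=headers)
with urllib.request.urlopen(req) as resp:
    data = json.loads(resp.read().decode('utf-8'))  # assuming the feed returns JSON
print(data)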

  • You'll need something that emulates a browser and can handle JavaScript to load the price data. Using Selenium is one option (a minimal sketch follows these comments). –  Dec 30 '16 at 16:59
  • Does dukascopy have a developer-friendly way of getting the data? I searched "dukascopy developer" and found a Java API and other links. Not sure if any are helpful to you. – tdelaney Dec 30 '16 at 17:09
  • You will also want to make sure that what you are doing doesn't break the terms of service. In some cases, scraping without permission can be illegal. – SudoKid Dec 30 '16 at 17:10
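
A minimal sketch of the Selenium suggestion above, assuming Firefox and its geckodriver are installed; the fixed sleep is only a crude stand-in for waiting until the price widget has rendered:

import time
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Firefox()             # any browser with a matching driver works
driver.get('https://www.dukascopy.com')
time.sleep(5)                            # crude wait for the JavaScript to populate the prices
html = driver.page_source                # HTML after the scripts have run
driver.quit()

soup = BeautifulSoup(html, 'html.parser')
print(soup)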

1 Answer


Try dryscrape. You can scrape JavaScript-rendered pages with it. Don't parse web pages with the regex module; it's not a good idea. Read this on why you should not parse HTML pages with regex: HTML with regex. Use Beautiful Soup for parsing.

import dryscrape
from bs4 import BeautifulSoup

url = 'https://www.dukascopy.com'

# dryscrape drives a headless WebKit browser, so the page's JavaScript is executed
session = dryscrape.Session()
session.visit(url)
response = session.body()  # HTML after the scripts have run

soup = BeautifulSoup(response, 'html.parser')
print(soup)
  • Well, your answer is not wrong; it's just that the OP asked for help doing it with `urllib`. – SudoKid Dec 30 '16 at 17:04
  • @EmettSpeer Anything that gets the OP closer to a solution can be posted as an answer. And it's valid to say "don't try that, try this instead". – Mohammad Yusuf Dec 30 '16 at 17:05
  • I didn't mean it that way. – SudoKid Dec 30 '16 at 17:07
  • I was just reading over `dryscrape` to see if it would render JavaScript. Thank you for introducing me to it. – SudoKid Dec 30 '16 at 17:08
  • @EmettSpeer Well I don't know how to do it with `urllib`. I'm waiting to learn it from you :P – Mohammad Yusuf Dec 30 '16 at 17:16
  • I didn't say there was a way to do it with `urllib`. Your answer just didn't give the reason not to use `urllib`. – SudoKid Dec 30 '16 at 17:17
  • @EmettSpeer Actually, even I don't know whether it's possible with `urllib` or not. I have just read here and there that you cannot parse JS-rendered pages with `urllib` or `requests`. Now you are compelling me to find out about it :P. OK, I'll do that and update my answer. – Mohammad Yusuf Dec 30 '16 at 17:26
  • I would think it's possible to do with `urllib`, but only if you have a way to render the JavaScript, which would require some kind of third-party package. There is `PyQt4.QtWebKit`, and `pythonwebkit`, though I don't know if it's maintained (a rough sketch follows at the end of this thread). That's all I could find right now. – SudoKid Dec 30 '16 at 17:31
  • Hmm... Thanks. People also use `selenium`; it's very slow but very flexible. – Mohammad Yusuf Dec 30 '16 at 17:38
  • From reading over the ghost.py documentation, it seems to be very Selenium-like. There are also some headless browsers that you can use with Selenium, such as `phantomjs`. – SudoKid Dec 30 '16 at 17:54
  • Thank you for your answer, though I would prefer to find the source of the data directly and avoid having to install additional modules. – L. Chen Dec 30 '16 at 20:33
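
For reference, a rough sketch of the `PyQt4.QtWebKit` approach mentioned in the comments above; the `Render` class and its structure are illustrative only, not specific to dukascopy:

import sys
from PyQt4.QtGui import QApplication
from PyQt4.QtCore import QUrl
from PyQt4.QtWebKit import QWebPage

class Render(QWebPage):
    """Load a URL in a headless WebKit page and keep the rendered HTML."""
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._load_finished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()                            # block until loadFinished fires

    def _load_finished(self, result):
        self.html = self.mainFrame().toHtml()       # page source after JavaScript has run
        self.app.quit()

page = Render('https://www.dukascopy.com')
print(page.html)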