I'm trying to extract the content of a span tag from the Google Translate website. The span holds the translated result and has id="result_box". When I try to print its contents, I get None.

import requests
from bs4 import BeautifulSoup

r = requests.get("https://translate.google.co.in/?rlz=1C1CHZL_enIN729IN729&um=1&ie=UTF-8&hl=en&client=tw-ob#en/fr/good%20morning")

soup = BeautifulSoup(r.content, "lxml")
spanner = soup.find(id = "result_box")

result = spanner.text
Ankit Dev
  • The problem is that requests doesn't execute JavaScript, so if you fetch the link you are trying to scrape, you get something like http://imgur.com/a/lwSc5. That's why find() always returns None. – Roomm Jul 25 '17 at 14:08
  • @AnkitDev The result is probably set by JavaScript, so it is not present in the body when you send the request. To simulate a browser you could use `selenium`: http://selenium-python.readthedocs.io/ – anekix Jul 25 '17 at 14:09
  • If you need Google Translate, you should check this: https://ctrlq.org/code/19909-google-translate-api – Roomm Jul 25 '17 at 14:11
  • Before you put a lot of effort into it, keep in mind that Google will block you if you make a lot of automated requests (although you can still use it after proving with a captcha image that you are not a robot). – 1408786user Jul 25 '17 at 14:11
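
A related point that may help when debugging this: the language pair and the phrase in the question's URL sit after the `#`, which makes them a URL fragment. Fragments are handled by the browser and are not part of the HTTP request, so the server never sees them. The sketch below only splits the URL with the standard library to show which part is which; the variable name `request_target` is made up for illustration.

```python
from urllib.parse import urlsplit

# The URL from the question, split over two lines for readability.
url = ("https://translate.google.co.in/?rlz=1C1CHZL_enIN729IN729&um=1"
       "&ie=UTF-8&hl=en&client=tw-ob#en/fr/good%20morning")

parts = urlsplit(url)

# Everything after '#' is the fragment. It stays on the client side and
# is only acted on by JavaScript running in a browser.
print("fragment:", parts.fragment)

# The path and query string are what an HTTP client actually sends.
request_target = parts.path + "?" + parts.query
print("sent to server:", request_target)
```

Since the text to translate never reaches the server, the HTML that requests receives cannot contain the translated result, whatever parser is used afterwards.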

1 Answer

Requests doesn't execute JavaScript. You could use Selenium with PhantomJS for headless browsing, like this:

from bs4 import BeautifulSoup
from selenium import webdriver

url = "https://translate.google.co.in/?rlz=1C1CHZL_enIN729IN729&um=1&ie=UTF-8&hl=en&client=tw-ob#en/fr/good%20morning"

# PhantomJS runs the page's JavaScript, unlike a plain requests.get()
browser = webdriver.PhantomJS()
browser.get(url)
html = browser.page_source  # HTML after the scripts have run
browser.quit()  # shut down the PhantomJS process

soup = BeautifulSoup(html, 'lxml')
spanner = soup.find(id="result_box")
result = spanner.text

This gives our expected result:

>>> result
'Bonjour'
Vinícius Figueiredo
  • Thank you Vinícius, that was a great idea and it actually fixed many other problems. However, the above code takes about 5-6 seconds to execute and produce the output, and it leaves a phantomjs.exe window on the screen. Is there any way to speed up the execution and get rid of that exe window? – Ankit Dev Jul 27 '17 at 15:11
  • I'm glad to help! I'm not sure about performance, maybe ChromeDriver is faster, but I really don't have this knowledge. About hiding the command line, I never tried it, but this question seems to be what you want: https://stackoverflow.com/questions/25871898/how-to-hide-chromedriver-console-window – Vinícius Figueiredo Jul 27 '17 at 18:17
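
For the ChromeDriver route mentioned in the last comment, here is a rough sketch, not a tested replacement for the answer above. It assumes Selenium with Chrome and a matching ChromeDriver are installed, and that the page still exposes an element with id="result_box", which may no longer hold. The function name `fetch_translation` and its `timeout` parameter are made up for illustration. Whether this is faster than PhantomJS is not something the thread establishes.

```python
def fetch_translation(url, timeout=10):
    """Hypothetical helper: load `url` in headless Chrome and return
    the text of the element with id="result_box"."""
    # Imported here so the module still loads where Selenium is absent.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless")  # run Chrome without a visible window

    browser = webdriver.Chrome(options=options)
    try:
        browser.get(url)
        # Poll until the element exists and has non-empty text.
        # WebDriverWait ignores NoSuchElementException while polling and
        # raises TimeoutException if `timeout` seconds pass first.
        return WebDriverWait(browser, timeout).until(
            lambda d: d.find_element(By.ID, "result_box").text.strip()
        )
    finally:
        # Always close the browser so no driver process is left running.
        browser.quit()
```

Calling `fetch_translation(url)` with the URL from the question would return the translated text if those assumptions hold. The `finally` block is there so the browser process is closed even when the wait times out.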