How to extract tag contents using Beautiful Soup?

Question

I'm trying to extract the content of a span tag from the Google Translate website. The span holds the translated result and has id="result_box". When I try to print its contents, I get None.

import requests
from bs4 import BeautifulSoup

r = requests.get("https://translate.google.co.in/?rlz=1C1CHZL_enIN729IN729&um=1&ie=UTF-8&hl=en&client=tw-ob#en/fr/good%20morning")

soup = BeautifulSoup(r.content, "lxml")
spanner = soup.find(id = "result_box")

result = spanner.text

The problem is that requests doesn't execute JavaScript, so if you visit the link you are trying to scrape, you will see something like http://imgur.com/a/lwSc5. That's why it always returns None. – Roomm, Jul 25 '17 at 14:08
@AnkitDev The result is probably set by JavaScript, so it's not present in the body when you send the request. To simulate a browser you could use `selenium`: http://selenium-python.readthedocs.io/ – anekix, Jul 25 '17 at 14:09
If you need Google Translate, you should check this: https://ctrlq.org/code/19909-google-translate-api – Roomm, Jul 25 '17 at 14:11
Before you put a lot of effort into it, keep in mind that Google will block you if you send many automated requests. (You can still use it after solving a captcha to confirm that you are not a robot.) – 1408786user, Jul 25 '17 at 14:11
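
To illustrate the point made in the comments above, here is a minimal sketch that needs no network access. The two HTML strings are hypothetical stand-ins (not real Google Translate markup): one for the static page that requests receives, one for the page after JavaScript has filled in the result. The helper name text_by_id is also made up for this example. It shows that soup.find() returns None when the element is missing, which is why calling .text on the result fails.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the static HTML that requests receives:
# the result span has not been inserted by JavaScript yet.
STATIC_HTML = '<html><body><div id="app"></div></body></html>'

# Hypothetical stand-in for the HTML after a browser has run the scripts.
RENDERED_HTML = (
    '<html><body><div id="app">'
    '<span id="result_box">Bonjour</span>'
    '</div></body></html>'
)


def text_by_id(html, element_id):
    """Return the text of the element with the given id, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find(id=element_id)
    if tag is None:
        # find() returns None for a missing element; tag.text would raise
        # AttributeError here, so return None explicitly instead.
        return None
    return tag.get_text(strip=True)


missing = text_by_id(STATIC_HTML, "result_box")
present = text_by_id(RENDERED_HTML, "result_box")
print(missing)  # None
print(present)  # Bonjour
```

Checking for None before reading .text makes the failure explicit instead of crashing with an AttributeError.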

score 2 · Accepted Answer · answered Jul 26 '17 at 00:13

Requests doesn't execute JavaScript. You could use Selenium with PhantomJS for headless browsing, like this:

from bs4 import BeautifulSoup
from selenium import webdriver

url = "https://translate.google.co.in/?rlz=1C1CHZL_enIN729IN729&um=1&ie=UTF-8&hl=en&client=tw-ob#en/fr/good%20morning"
browser = webdriver.PhantomJS()  # headless browser that executes JavaScript
browser.get(url)
html = browser.page_source  # HTML of the page as rendered by the browser
browser.quit()  # shut down the PhantomJS process

soup = BeautifulSoup(html, 'lxml')
spanner = soup.find(id="result_box")
result = spanner.text

This gives our expected result:

>>> result
'Bonjour'

answered Jul 26 '17 at 00:13 by Vinícius Figueiredo

Thank you Vinícius, that was a great idea and it actually fixed many other problems. However, the above code takes about 5-6 seconds to execute and produce the output, and it leaves a phantomjs.exe window on the screen. Is there any way to speed up execution and get rid of that exe window? – Ankit Dev Jul 27 '17 at 15:11
I'm glad to help! I'm not sure about performance, maybe ChromeDriver is faster, but I really don't have this knowledge. About hiding the command line, I never tried it, but this question seems to be what you want: https://stackoverflow.com/questions/25871898/how-to-hide-chromedriver-console-window – Vinícius Figueiredo Jul 27 '17 at 18:17
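
Following up on the ChromeDriver suggestion above, here is a rough sketch of the same approach with headless Chrome, which runs without a visible window. It assumes Selenium 4 (which no longer ships a PhantomJS driver), a recent Chrome install, and that the page still exposes an element with id="result_box" as in the question; the page markup may well have changed since then. The function names fetch_rendered_html and extract_result are made up for this example, and whether it is actually faster than PhantomJS has not been measured here.

```python
from bs4 import BeautifulSoup


def fetch_rendered_html(url, element_id, timeout=10):
    """Load url in headless Chrome and return the HTML once element_id exists."""
    # Selenium is imported inside the function so that extract_result()
    # below can be used without a browser or Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless=new")  # assumption: recent Chrome; no window shown
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait for JavaScript to insert the element instead of sleeping.
        # Raises TimeoutException if it never appears within `timeout` seconds.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.ID, element_id))
        )
        return driver.page_source
    finally:
        driver.quit()  # always release the browser process


def extract_result(html, element_id="result_box"):
    """Return the text of the element with element_id, or None if it is missing."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find(id=element_id)
    return tag.get_text(strip=True) if tag is not None else None


# Usage (needs Chrome and network access), with the URL from the question:
# url = "https://translate.google.co.in/?rlz=1C1CHZL_enIN729IN729&um=1&ie=UTF-8&hl=en&client=tw-ob#en/fr/good%20morning"
# print(extract_result(fetch_rendered_html(url, "result_box")))
```

Calling driver.quit() in a finally block should also keep stray browser processes from lingering after an error.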
