
I'm trying to scrape data from Google Translate for educational purposes.

Here is the code

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

#https://translate.google.com/#view=home&op=translate&sl=en&tl=en&text=hello
#tlid-transliteration-content transliteration-content full

class Phonetizer:
    def __init__(self,sentence : str,language_ : str = 'en'):
        self.words=sentence.split()
        self.language=language_
    def get_phoname(self):
        for word in self.words:
            print(word)
            url="https://translate.google.com/#view=home&op=translate&sl="+self.language+"&tl="+self.language+"&text="+word
            print(url)
            req = Request(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0'})
            webpage = urlopen(req).read()
            f= open("debug.html","w+")
            f.write(webpage.decode("utf-8"))
            f.close()
            #print(webpage)
            bsoup = BeautifulSoup(webpage,'html.parser')
            phonems = bsoup.findAll("div", {"class": "tlid-transliteration-content transliteration-content full"})
            print(phonems)
            #break

The problem is that the HTML it gives me contains no element with the CSS class `tlid-transliteration-content transliteration-content full`.

But using the browser's Inspect tool, I have found that the phonemes are inside this CSS class. Here is a screenshot:

(screenshot: google_translate_scrap)

I have saved the HTML; take a look, no `tlid-transliteration-content transliteration-content full` is present, and unlike a normal Google Translate page, it is not complete. I have heard that Google blocks crawlers, bots, and spiders, and that they are easily detected by its systems, so I added the extra User-Agent header, but I still can't access the whole page.

How can I access the whole page and read all the data from the Google Translate page?

Want to contribute to this project?

I have tried this code below :

from requests_html import AsyncHTMLSession
asession = AsyncHTMLSession()
lang = "en"
word = "hello"
url="https://translate.google.com/#view=home&op=translate&sl="+lang+"&tl="+lang+"&text="+word
async def get_url():
    r = await asession.get(url)
    print(r)
    return r
results = asession.run(get_url)
for result in results:
    print(result.html.url)
    print(result.html.find('#tlid-transliteration-content'))
    print(result.html.find('#tlid-transliteration-content transliteration-content full'))

So far, it gives me nothing.
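As an aside, both snippets above concatenate the raw word into the URL, which breaks for anything containing spaces or non-ASCII characters. A minimal sketch of percent-encoding the text with the standard library's `urllib.parse.quote` (the word shown is just an illustration):

```python
from urllib.parse import quote

lang = "en"
word = "héllo world"  # spaces and accents must be percent-encoded

url = ("https://translate.google.com/#view=home&op=translate"
       "&sl=" + lang + "&tl=" + lang + "&text=" + quote(word))
print(url)
```

This does not fix the missing-content problem, but it avoids silently malformed requests for multi-word or accented input.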

Ismael Padilla
Maifee Ul Asad

3 Answers


Yes, this happens because some content is generated by JavaScript and only rendered by the browser on page load; what you see in the inspector is the final DOM, after JavaScript has finished manipulating it (adding content). To solve this you could use Selenium, but it has several downsides, such as speed and memory usage. A more modern and, in my opinion, better way is to use requests-html, which replaces both bs4 and urllib and has a render method, as described in its documentation.

Here is a sample using requests_html. Just keep in mind that the text you are printing contains non-ASCII characters, so you might run into issues printing it in some editors like Sublime; it ran fine using cmd.

from requests_html import HTMLSession
session = HTMLSession()
r = session.get("https://translate.google.com/#view=home&op=translate&sl=en&tl=en&text=hello")
r.html.render()
css = ".source-input .tlid-transliteration-content"
print(r.html.find(css, first=True).text)
# output: heˈlō,həˈlō
Marsilinou Zaky

First of all, I would suggest using the Google Translate API instead of scraping the Google page. The API is a hundred times easier, hassle-free, and the legal and conventional way of doing this.

However, if you want to fix this, here is the solution. You are not dealing with bot detection here; Google's bot detection is so strong that it would simply show the Google reCAPTCHA page rather than your desired web page. The problem is that the translation results are not returned via the URL you used. That URL only displays the basic translator page; the results are fetched later by JavaScript and shown on the page after it has loaded. JavaScript is not executed by python-requests, which is why the class doesn't even exist in the web page you are accessing.

The solution is to trace the network requests and find which URL JavaScript uses to fetch the results. Fortunately, I have found the URL used for this purpose. If you request https://translate.google.com/translate_a/single?client=webapp&sl=en&tl=fr&hl=en&dt=at&dt=bd&dt=ex&dt=ld&dt=md&dt=qca&dt=rw&dt=rm&dt=ss&dt=t&dt=gt&source=bh&ssel=0&tsel=0&kc=1&tk=327718.241137&q=goodmorning, you will get the translator's response as JSON, which you can parse to extract the desired results. Note that you may still hit bot detection here, which can throw an HTTP 403 error straight away.
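A minimal sketch of requesting that endpoint and parsing the JSON with only the standard library. The response layout is undocumented and may change at any time; the index `[0][0][0]` below reflects the nested-array shape observed in such responses and is an assumption, not a stable API:

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE = "https://translate.google.com/translate_a/single"

def build_url(text, sl="en", tl="fr"):
    # dt=t asks for the translation section; adding more dt values
    # (rm, bd, ...) makes the server append more sections to the array.
    params = {"client": "webapp", "sl": sl, "tl": tl,
              "hl": "en", "dt": "t", "q": text}
    return BASE + "?" + urlencode(params)

def parse_translation(data):
    # The endpoint answers with nested JSON arrays rather than objects;
    # in responses observed so far the translated text sits at data[0][0][0].
    return data[0][0][0]

def fetch_translation(text, sl="en", tl="fr"):
    req = Request(build_url(text, sl, tl),
                  headers={"User-Agent": "Mozilla/5.0"})
    return parse_translation(json.load(urlopen(req)))
```

Calling `fetch_translation("goodmorning")` should return the French translation, provided the request is not blocked.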

You can also use Selenium to execute the JavaScript and get the results. The following changes to your code fix it using Selenium:

from selenium import webdriver
from bs4 import BeautifulSoup

#https://translate.google.com/#view=home&op=translate&sl=en&tl=en&text=hello
#tlid-transliteration-content transliteration-content full

class Phonetizer:
    def __init__(self, sentence: str, language_: str = 'en'):
        self.words = sentence.split()
        self.language = language_
    def get_phoname(self):
        for word in self.words:
            print(word)
            url = "https://translate.google.com/#view=home&op=translate&sl=" + self.language + "&tl=" + self.language + "&text=" + word
            print(url)
            driver = webdriver.Chrome()
            driver.get(url)
            webpage = driver.page_source  # already a str, no decoding needed
            driver.close()
            with open("debug.html", "w", encoding="utf-8") as f:
                f.write(webpage)
            bsoup = BeautifulSoup(webpage, 'html.parser')
            phonems = bsoup.findAll("div", {"class": "tlid-transliteration-content transliteration-content full"})
            print(phonems)
Hamza Khurshid
  • I have tried your way, using `requests_html`, but it gives me nothing when I'm trying to select a CSS element, can you help me here? – Maifee Ul Asad Dec 08 '19 at 10:51
  • @MaifeeUlAsad What does _gives me nothing_ mean, exactly? What CSS element are you talking about? – AMC Dec 08 '19 at 11:06
  • `tlid-transliteration-content transliteration-content full` css element, `div` of this class ... it gives me [], array of length 0 @AlexanderCécile – Maifee Ul Asad Dec 08 '19 at 15:53
  • @MaifeeUlAsad Could you try the CSS selector method, see if that works? Writing the classes in the `class_` parameter instead of under `attrs` might do the job. – AMC Dec 08 '19 at 16:14

You should scrape this page with JavaScript support, since the content you're looking for is "hiding" inside a `<script>` tag, which urllib does not render.
I would suggest using Selenium or an equivalent framework.
Take a look here: Web-scraping JavaScript page with Python
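To illustrate the point with a toy page (the HTML below is invented for the example): a static parser only sees the markup the server sent, so data that JavaScript would inject into the DOM exists only as opaque text inside the `<script>` tag.

```python
from bs4 import BeautifulSoup

# Toy server response: the value exists only as a string inside a script
# tag; nothing ever runs the script, so the div stays empty.
html = """
<html><body>
  <div id="result"></div>
  <script>document.getElementById("result").textContent = "heˈlō";</script>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
print(repr(soup.find("div", id="result").text))  # '' - the div is never filled
print(soup.find("script").string)                # the raw JavaScript, as text
```

A JavaScript-capable tool (Selenium, requests-html's render) executes the script first, so the same selector then finds the filled-in div.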

Aviad Levy