python 3 unable to scrape

Question

I am trying to translate Indonesian to English using Google Translate (because I play a game that has a lot of Indonesian players).

import re
from urllib.request import Request, urlopen

lang = "id"
inp = input("Enter to translate: \n").replace(" ","%20")

htmlfile = Request("https://translate.google.co.in/#" + lang + "/en/" + inp, headers = {'User-Agent': 'Mozilla/5.0'})  
htmltext = urlopen(htmlfile).read().decode('utf-8')
regex = '<span id="result_box" class="short_text" lang="en">(.+?)</span>'
pattern = re.compile(regex)
trans = re.findall(pattern, htmltext)
print(trans)

When I enter some text, I get []. Here is the relevant markup from Inspect Element:

<span id="result_box" class="short_text" lang="en">

 <span class="hps">

    greeting

 </span>

I need to extract that "greeting" part.

Avinash Raj · Answer 1 · 2014-10-26T17:05:31.860

The problem is not urllib; it is mainly your regex. By default, . in a regex matches any character except a newline. You need to enable DOTALL mode with (?s) so that . also matches newline characters.

regex = r'(?s)<span id="result_box" class="short_text" lang="en">(.+?)</span>'

Example:

>>> import re
>>> s = """<span id="result_box" class="short_text" lang="en">
... 
...  <span class="hps">
... 
...     greeting
... 
...  </span>"""
>>> re.findall(r'(?s)<span id="result_box" class="short_text" lang="en">(.+?)</span>', s)
['\n\n <span class="hps">\n\n    greeting\n\n ']
>>> re.findall(r'(?s)<span id="result_box" class="short_text" lang="en">(?:(?!</).)*?(\w+)\s*</span>', s)
['greeting']
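The same behaviour can be reproduced in a short script. This is a minimal sketch using the markup from the question; the variable names are my own, and re.DOTALL is the flag form of the inline (?s) modifier.

```python
import re

# Markup copied from the question's "inspect element" output.
html = """<span id="result_box" class="short_text" lang="en">

 <span class="hps">

    greeting

 </span>"""

regex = '<span id="result_box" class="short_text" lang="en">(.+?)</span>'

# Without DOTALL, "." cannot cross the newline after the opening tag,
# so the pattern finds nothing (the empty list from the question).
without_flag = re.findall(regex, html)

# With re.DOTALL, or the equivalent inline "(?s)" prefix, "." also matches newlines.
with_flag = re.findall(regex, html, re.DOTALL)
with_inline = re.findall('(?s)' + regex, html)

print(without_flag)
print(with_flag == with_inline)
```

The captured text still contains the inner tag and whitespace, which is why the second pattern in the session above narrows the capture to the word itself.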

score 0 · Answer 2 · edited May 23 '17 at 12:05

Caveats:

First off, I would advise you to read the famous answer to the question about how to parse HTML with regular expressions. TL;DR: Don't do it. Use BeautifulSoup instead.

That said, I'm not a lawyer, but what you are trying to do might be in violation of Google's Terms of Service. They have a paid API, charging 20 USD per 1M characters of text (as of 26/Oct/2014), which might better suit your needs. Using the API has the additional benefit of protecting you from changes to the markup used that could otherwise break your code.

If you do want to pursue this path:

Your regular expression is not matching newlines. You need to specify the DOTALL flag when you compile your regular expression. Your updated code could be:

import re
from urllib.request import Request, urlopen

lang = "id"
inp = input("Enter text to translate:\n").replace(" ", "%20")

htmlfile = Request("https://translate.google.co.in/#" + lang + "/en/" + inp,
                   headers={'User-Agent': 'Mozilla/5.0'})
htmltext = urlopen(htmlfile).read().decode('utf-8')
# re.DOTALL lets "." match newlines inside the result span
pattern = re.compile(
    r'<span id="result_box" class="short_text" lang="en">\s+<span class="hps">(.+?)</span>',
    re.DOTALL)
trans = pattern.findall(htmltext)
print(trans)

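Note that the regular expression now includes <span class="hps"> so that the inner tag is excluded from the captured text. Given the markup in the question, the captured text still carries surrounding whitespace, which you can remove with strip().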
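For comparison with the regex approach, here is a minimal sketch of extracting the same text with an HTML parser. The caveat above recommends BeautifulSoup; this sketch swaps in the standard library's html.parser.HTMLParser so it runs without third-party packages. The class name ResultBoxParser is hypothetical, and the sample markup is the question's snippet with the outer closing tag added by me.

```python
from html.parser import HTMLParser


class ResultBoxParser(HTMLParser):
    """Hypothetical helper: collects text inside <span id="result_box">."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # span nesting depth inside result_box; 0 means outside
        self.chunks = []  # text fragments found inside result_box

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == "result_box":
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == "span" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


# Assumed sample: the question's markup plus a closing </span> for the outer element.
sample = """<span id="result_box" class="short_text" lang="en">
 <span class="hps">
    greeting
 </span>
</span>"""

parser = ResultBoxParser()
parser.feed(sample)
parser.close()

# Collapse runs of whitespace left over from the page's indentation.
translation = " ".join("".join(parser.chunks).split())
print(translation)
```

Because the parser tracks nesting rather than matching text, it does not depend on where the newlines fall in the markup.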
