I am testing a website that contains user reviews for hotels. The reviews can be in any language (Czech, Russian, Spanish, English, etc.), and I would like to know how to check which language a piece of text is in after reading it with Selenium.

For example, I use Selenium's `element.text` to read the text, and then I want to add a check: if the text is in English, do one thing; for any other language, do something else.

This is one of the HTML elements:

<div class="innerBubble">
<div class="quote"><a href="/ShowUserReviews-g1-d8729164-r427772133-TAP_Portugal-World.html#CHECK_RATES_CONT" onclick="ta.setEvtCookie('Reviews','title','',0,this.href); setPID();" id="r427772133">“<span class="noQuotes">TRES SATISFAITS</span>”</a></div>
<div class="rating reviewItemInline">
<span class="rate sprite-rating_s rating_s"> <img class="sprite-rating_s_fill rating_s_fill s40" width="56" src="https://static.tacdn.com/img2/x.gif" alt="4 of 5 stars">
</span>
<span class="ratingDate relativeDate" title="October 13, 2016">Reviewed 3 days ago
<span class="new redesigned">NEW</span> </span>
</div>
<div class="googleTranslation reviewItem">
<span class="link" onclick="ta.call('ta.overlays.Factory.reviewTranslate', event, this, '/MachineTranslation?g=1&amp;d=8729164&amp;r=427772133&amp;page=review&amp;sl=fr&amp;tl=en'); ta.trackEventOnPage('Reviews', 'google_translate')">
<img alt="Google Translation" src="https://static.tacdn.com/img2/buttons/googleTranslation.gif">
</span>
</div>
<div class="entry">
<p>
Un peu d'appréhension avant mais vite levée. Très bon accueil et bon service de la part des pnc, repas chaud et bon, même pour ce court vol (1h50). Bonne ponctualité et embarquement des plus efficaces
</p>
</div>
thebadguy
  • Does the HTML tag include the [language code](http://www.w3schools.com/tags/ref_language_codes.asp)? That could be an easy way to figure it out. There are also a lot of packages out there, like NLTK, that can detect natural languages. – sytech Oct 17 '16 at 12:53
  • No, it does not have a language code, because the text is pulled in by a JavaScript function. – thebadguy Oct 17 '16 at 12:55
  • I mean the `<html>` tag for the entire document. In case that wasn't clear. – sytech Oct 17 '16 at 12:57
  • see this link...view-source:https://www.tripadvisor.com/Airline_Review-d8729164-Reviews-Cheap-Flights-or560-TAP-Portugal#review_344214941 – thebadguy Oct 17 '16 at 12:59
  • Ok, I understand now. Take a look at the `<span>` tag which is for the Google Translate button. Part of the `onclick` attribute gives away the language. Specifically: `sl=fr&tl=en` tells you that the button will use Google Translate to go from French (`fr`) to English (`en`). You could use this to determine the origin language for each review. – sytech Oct 17 '16 at 13:08
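
The idea in the last comment can be sketched as a small helper that pulls the `sl` parameter out of the `onclick` string. This is only a sketch: the function name `source_language` is hypothetical, and it assumes the attribute always embeds a `/MachineTranslation?...` URL in single quotes, as in the HTML sample from the question.

```python
import re
from urllib.parse import parse_qs, urlsplit

# onclick value copied from the HTML sample in the question.
sample_onclick = (
    "ta.call('ta.overlays.Factory.reviewTranslate', event, this, "
    "'/MachineTranslation?g=1&amp;d=8729164&amp;r=427772133&amp;page=review"
    "&amp;sl=fr&amp;tl=en'); ta.trackEventOnPage('Reviews', 'google_translate')"
)

def source_language(onclick):
    """Return the `sl` (source language) code in an onclick string, or None."""
    # Raw page source contains "&amp;"; normalise it to "&" before parsing.
    onclick = onclick.replace("&amp;", "&")
    # Assumption: the translation URL is a single-quoted string argument.
    match = re.search(r"'(/MachineTranslation\?[^']*)'", onclick)
    if match is None:
        return None
    params = parse_qs(urlsplit(match.group(1)).query)
    values = params.get("sl")
    return values[0] if values else None

print(source_language(sample_onclick))

# Possible Selenium usage, assuming `review` is the WebElement for
# div.innerBubble and `By` comes from selenium.webdriver.common.by:
#   span = review.find_element(By.CSS_SELECTOR, "div.googleTranslation span.link")
#   lang = source_language(span.get_attribute("onclick"))
```

A `None` result means no translation URL was found in the attribute; how to treat that case (for example, as "already in the site's language") is a decision for the test itself.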

2 Answers

Detecting a language is not trivial unless the `<html>` tag declares the page's language.

If you are using Selenium with Python, you can use the function below. It requires NLTK together with its stopwords corpus and the tokenizer data used by `word_tokenize`:

from nltk import word_tokenize
from nltk.corpus import stopwords

def detect_lang(text):
    """Guess the language of text from its overlap with each NLTK stopword list."""
    # Lowercase the tokens once; this set is the same for every language.
    words_set = set(word.lower() for word in word_tokenize(text))

    lang_ratios = {}
    for language in stopwords.fileids():
        stopwords_set = set(stopwords.words(language))
        # Score = number of distinct words that are stopwords in this language.
        lang_ratios[language] = len(words_set & stopwords_set)

    # The language with the most stopword hits wins.
    return max(lang_ratios, key=lang_ratios.get)

With this function you can check which language was used:

lang = detect_lang(text)
if lang == 'english':
    ...
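
One edge case worth noting: if the text shares no stopwords with any language, every score is zero and `max` still returns some language. Below is a minimal, self-contained sketch of the same stopword-overlap idea that returns `None` in that case. The tiny `STOPWORDS` table and the name `detect_lang_or_none` are assumptions made for illustration; they are far smaller than the NLTK corpus and are not part of NLTK.

```python
import re

# Tiny illustrative stopword lists (an assumption for this sketch only).
STOPWORDS = {
    "english": {"the", "and", "is", "of", "to", "in", "for", "was", "it", "very"},
    "french": {"le", "la", "et", "de", "un", "des", "pour", "ce", "mais", "les"},
    "spanish": {"el", "la", "y", "de", "un", "los", "para", "muy", "pero", "que"},
}

def detect_lang_or_none(text, stopword_sets=STOPWORDS):
    """Return the language with the most stopword hits, or None if there are none."""
    # Simple regex tokenisation instead of nltk.word_tokenize.
    words = set(re.findall(r"\w+", text.lower()))
    scores = {lang: len(words & sw) for lang, sw in stopword_sets.items()}
    best = max(scores, key=scores.get)
    # Guard: a top score of zero means nothing matched at all.
    return best if scores[best] > 0 else None

review = ("Un peu d'appréhension avant mais vite levée. Très bon accueil et bon "
          "service de la part des pnc, repas chaud et bon, même pour ce court vol")
print(detect_lang_or_none(review))
print(detect_lang_or_none("12345 !!!"))
```

Very short reviews (a title of two or three words, say) may contain no stopwords at all, so a `None` branch gives the test a place to handle "unknown" explicitly.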
Fran Lendínez

This has nothing to do with the Selenium driver itself: just get the text and pass it to the code below. If you need to detect the language in response to a user action, you could use the Google AJAX Language API:

#!/usr/bin/env python
# Python 2 code: uses urllib2, unicode, and the print statement.
import json
import urllib, urllib2

def detect_language(text,
                    userip=None,
                    referrer="http://stackoverflow.com/q/4545977/4279",
                    api_key=None):
    # Encode unicode input as UTF-8 bytes before URL-encoding.
    query = {'q': text.encode('utf-8') if isinstance(text, unicode) else text}
    if userip: query.update(userip=userip)
    if api_key: query.update(key=api_key)

    url = 'https://ajax.googleapis.com/ajax/services/language/detect?v=1.0&%s'%(
        urllib.urlencode(query))

    request = urllib2.Request(url, None, headers=dict(Referer=referrer))
    d = json.load(urllib2.urlopen(request))

    # responseStatus is part of the JSON body; anything but 200 is a failure.
    if d['responseStatus'] != 200 or u'error' in d['responseData']:
        raise IOError(d)

    return d['responseData']['language']

print detect_language("Python - can I detect unicode string language code?")

OUTPUT

en
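
For readers on Python 3, where `urllib2` and `unicode` no longer exist, the query-string step of the code above might look roughly like this. It is a sketch only: `build_detect_url` and `DETECT_ENDPOINT` are hypothetical names, the endpoint and parameters are taken from the code above, and the function only builds the URL without sending a request or assuming the service is still reachable.

```python
from urllib.parse import urlencode

# Endpoint copied from the Python 2 code above.
DETECT_ENDPOINT = "https://ajax.googleapis.com/ajax/services/language/detect"

def build_detect_url(text, userip=None, api_key=None):
    """Build the detection URL; urlencode handles UTF-8 encoding of str."""
    query = {"v": "1.0", "q": text}
    if userip:
        query["userip"] = userip
    if api_key:
        query["key"] = api_key
    return DETECT_ENDPOINT + "?" + urlencode(query)

print(build_detect_url("très bon"))
```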
Deepesh kumar Gupta