I am trying to get pronunciations of dictionary words using Python. I have tried many different libraries and, long story short, none is significantly better than the rest: because of the sheer number of English words, no library covers even a quarter of the vocabulary.

So I decided to write my own code to get pronunciations. In short, I want to train a grapheme-to-phoneme neural network, and I need to collect the pronunciations first. I quickly decided that the only feasible, reliable way is to get them from the internet.

There are many free online dictionaries, and most of them (if not all, I am not sure) list pronunciations. Combining results from multiple sources should improve both the coverage and the accuracy of the final results.
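
One way the combining step might look, as a minimal sketch: each source contributes at most one transcription per word, and the transcription most sources agree on wins. The function name `merge_pronunciations` and the source names `source_b` and `source_c` are hypothetical, and majority voting is only one possible merging policy.

```python
from collections import Counter

def merge_pronunciations(results):
    """Pick the transcription that most sources agree on.

    `results` maps a source name to the transcription it returned,
    or None when the source had no entry for the word.
    """
    votes = Counter(p for p in results.values() if p)
    if not votes:
        return None
    # most_common() keeps first-seen order among equal counts
    return votes.most_common(1)[0][0]

sample = {
    "collins": "trænsɛndəns",
    "source_b": "trænsɛndəns",  # hypothetical second source
    "source_c": None,           # hypothetical source with no entry
}
print(merge_pronunciations(sample))
```

Words on which the sources disagree could also be logged instead of resolved automatically, so that they can be reviewed by hand.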

Now here is the thing: most of them don't even have APIs, and those that do offer a free API with a limited daily query quota and a paid version with unlimited queries.

I need to process a very large corpus (105230 words), I have very little money, and my project isn't commercial, so I opted to scrape the webpages directly instead, since they don't charge a fee for unlimited queries.

I have even managed to create a working Google Dictionary API:

'https://www.google.com/search?q=define+{word}&client=firefox-b-d&gl=us&hl=en&newwindow=1'

I initially did this using requests and lxml.html, but it is slow due to my "special" network conditions and prone to HTTP 429 (Too Many Requests) responses.
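
A common mitigation for HTTP 429 is to retry with exponential backoff, honoring the server's Retry-After header when one is sent. The sketch below is only an illustration: `fetch` is a hypothetical callable returning `(status, headers, body)`, standing in for whatever HTTP client is used, `max_tries` is assumed to be at least 1, and the delay values are arbitrary.

```python
import time

def fetch_with_backoff(fetch, url, max_tries=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it stops answering 429 or the tries run out."""
    for attempt in range(max_tries):
        status, headers, body = fetch(url)
        if status != 429:
            return status, body
        # Prefer the server's hint; otherwise double the delay each attempt
        retry_after = headers.get("Retry-After", "")
        if retry_after.isdigit():
            delay = float(retry_after)
        else:
            delay = base_delay * 2 ** attempt
        sleep(delay)
    # Still rate-limited after the last try: hand back the final response
    return status, body
```

Backing off trades speed for reliability, so it does not help with the latency problem, only with the rate limiting.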

I have of course passed a User-Agent header, but it made no difference:

UA = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0'}

So I used Selenium instead. The code works and is no longer rate-limited, but it is still slow.

Code:

import os
import re
from selenium import webdriver
from selenium.webdriver import FirefoxProfile
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument("--headless")
options.add_argument("--log-level=3")
options.add_argument("--mute-audio")
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")
options.add_argument('--disable-extensions')
options.add_argument('--disable-gpu')

# Return control once the DOM is interactive instead of waiting for every subresource
capabilities = DesiredCapabilities().FIREFOX
capabilities['pageLoadStrategy'] = 'eager'

profile = FirefoxProfile(os.environ['localappdata'] + '\\Mozilla\\Firefox\\Profiles\\Selenium')
profile.set_preference("http.response.timeout", 1)
profile.set_preference("dom.max_script_run_time", 3)
Firefox = webdriver.Firefox(capabilities=capabilities, options=options, firefox_profile=profile)

def collins(word):
    Firefox.get(f'https://www.collinsdictionary.com/us/dictionary/english/{word}')
    element = Firefox.find_element_by_xpath("//span[@class='pron type-']")
    return element.get_attribute("textContent").strip()

As you can see, I have tried to speed up execution by enabling headless mode and setting the eager page load strategy, but it is still slow; the eager strategy seems to have no effect.

In [55]: %time collins('transcendence')
Wall time: 6.12 s
Out[55]: 'trænsɛndəns'

A single query takes 6.12 seconds!
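
To put that figure in perspective, here is a back-of-the-envelope estimate for the whole corpus, assuming every lookup takes the same 6.12 seconds and the queries run one after another against a single source:

```python
from datetime import timedelta

WORDS = 105_230           # corpus size stated above
SECONDS_PER_QUERY = 6.12  # measured wall time for one lookup

total = timedelta(seconds=WORDS * SECONDS_PER_QUERY)
print(f"{total.total_seconds() / 3600:.1f} hours, about {total.days} full days")
```

That comes to roughly 179 hours, i.e. more than a week of non-stop scraping per dictionary.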

Now here is the thing: as I have mentioned, my network conditions are special. I am behind the Great Firewall of China. I will spare you the details and intricacies; simply put, it is a crime against freedom of information, as it blocks foreign websites.

Luckily I use ExpressVPN, which lets me access blocked websites like Google. Long story short, it increases latency, but without it I can't access the blocked websites at all.

Here is a simple ping test:

PS D:\sequitur> ping www.collinsdictionary.com

Pinging www.collinsdictionary.com [104.20.66.159] with 32 bytes of data:
Request timed out.
Request timed out.
Request timed out.
Request timed out.

Ping statistics for 104.20.66.159:
    Packets: Sent = 4, Received = 0, Lost = 4 (100% loss),
PS D:\sequitur> ping www.collinsdictionary.com
Ping request could not find host www.collinsdictionary.com. Please check the name and try again.
PS D:\sequitur> ping www.collinsdictionary.com

Pinging www.collinsdictionary.com [104.20.66.159] with 32 bytes of data:
Reply from 104.20.66.159: bytes=32 time=348ms TTL=58
Reply from 104.20.66.159: bytes=32 time=348ms TTL=58
Reply from 104.20.66.159: bytes=32 time=348ms TTL=58
Reply from 104.20.66.159: bytes=32 time=348ms TTL=58

Ping statistics for 104.20.66.159:
    Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
    Minimum = 348ms, Maximum = 348ms, Average = 348ms

Without the VPN I can't even access www.collinsdictionary.com, and with it the latency is very high, and this is the result while connected to the best server.

But I still have 100+ Mbps of download bandwidth.

Now here is the thing: if I visit https://www.collinsdictionary.com/us/dictionary/english/transcendence in my browser, the page takes about 2.5 seconds to load at most, possibly 2 seconds or less (I am not sure), but definitely much less than the time taken by the function.

But the status bar stays busy, showing that it is loading external websites, for approximately as long as the code takes.

I then used selenium-wire to inspect the requests, and here is what I found:

Python 3.9.7 (tags/v3.9.7:1016ef3, Aug 30 2021, 20:19:38) [MSC v.1929 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 7.28.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from seleniumwire import webdriver

In [2]: driver = webdriver.Firefox()

In [3]: driver.get('https://www.collinsdictionary.com/us/dictionary/english/transcendence')

In [4]: driver.requests
Out[4]:
# very long output: listing it would push the body past the 30000-character limit. It has exactly 48903 characters and contains 88 requests from a single action.

In [5]: {(i.method, i.url) for i in driver.requests}
Out[5]:
{('GET', 'http://detectportal.firefox.com/canonical.html'),
 ('GET', 'http://detectportal.firefox.com/success.txt?ipv4'),
 ('GET', 'http://detectportal.firefox.com/success.txt?ipv6'),
 ('GET', 'https://api.polarbyte.com/getDeviceId'),
 ('GET',
  'https://api.polarbyte.com/getProfile?d=%7B%22did%22%3A%22118643ee-7b81-21de-be2c-88d3071ad39a%22%7D'),
 ('GET',
  'https://api.polarbyte.com/save?d=%7B%22lang%22%3A%22en-US%22%2C%22page_cat%22%3A%22dictionary%22%2C%22page_type%22%3A%22entry%22%2C%22dict_code%22%3A%22english%22%2C%22entry_id%22%3A%22transcendence%22%2C%22project%22%3A%22HCD%22%2C%22_libv%22%3A%222.0%22%2C%22_did%22%3A%22118643ee-7b81-21de-be2c-88d3071ad39a%22%2C%22_ssize%22%3A%221920x1080%22%2C%22_pl_p%22%3A%22%22%2C%22_url%22%3A%22https%3A%2F%2Fwww.collinsdictionary.com%2Fus%2Fdictionary%2Fenglish%2Ftranscendence%22%2C%22_lang%22%3A%22en-US%22%2C%22_con%22%3Atrue%2C%22_lt%22%3A56915172%7D'),
 ('GET',
  'https://boot.pbstck.com/v1/tag/fa747523-2b8f-4a1f-befd-08f561031537'),
 ('GET',
  'https://cdn.cookielaw.org/consent/2cce4478-3712-4741-b72d-1210a930e08f/2cce4478-3712-4741-b72d-1210a930e08f.json'),
 ('GET',
  'https://cdn.cookielaw.org/consent/2cce4478-3712-4741-b72d-1210a930e08f/f1a60226-4585-4be5-904d-5f1167fa23d5/en.json'),
 ('GET',
  'https://cdn.cookielaw.org/logos/2ae1f452-e8c9-4d90-b0bf-a20e7f5d026e/2cce4478-3712-4741-b72d-1210a930e08f/4a3e9031-8291-4080-8464-67e2b8aad9ca/hcd_logo.png'),
 ('GET',
  'https://cdn.cookielaw.org/scripttemplates/6.25.0/assets/otCommonStyles.css'),
 ('GET',
  'https://cdn.cookielaw.org/scripttemplates/6.25.0/assets/otFlat.json'),
 ('GET', 'https://cdn.cookielaw.org/scripttemplates/6.25.0/otBannerSdk.js'),
 ('GET', 'https://cdn.cookielaw.org/scripttemplates/otSDKStub.js'),
 ('GET', 'https://cdn.pbstck.com/index-monitoring-1cd83bb.js'),
 ('GET', 'https://d1yu67rmchodpo.cloudfront.net/audience.js'),
 ('GET',
  'https://fonts.gstatic.com/s/opensans/v17/mem8YaGs126MiZpBA-UFVZ0b.woff2'),
 ('GET',
  'https://fonts.gstatic.com/s/zillaslab/v5/dFa5ZfeM_74wlPZtksIFYoEf6HOpWw.woff2'),
 ('GET',
  'https://fonts.gstatic.com/s/zillaslab/v5/dFa6ZfeM_74wlPZtksIFajo6_Q.woff2'),
 ('GET', 'https://geolocation.onetrust.com/cookieconsentpub/v1/geo/location'),
 ('GET',
  'https://geolocation.onetrust.com/cookieconsentpub/v1/geo/location/geofeed'),
 ('GET',
  'https://tracking-protection.cdn.mozilla.net/ads-track-digest256/1633028676'),
 ('GET',
  'https://tracking-protection.cdn.mozilla.net/allow-flashallow-digest256/1490633678'),
 ('GET',
  'https://tracking-protection.cdn.mozilla.net/analytics-track-digest256/1637080483'),
 ('GET',
  'https://tracking-protection.cdn.mozilla.net/base-cryptomining-track-digest256/1604686195'),
 ('GET',
  'https://tracking-protection.cdn.mozilla.net/base-fingerprinting-track-digest256/1637080483'),
 ('GET',
  'https://tracking-protection.cdn.mozilla.net/block-flash-digest256/1604686195'),
 ('GET',
  'https://tracking-protection.cdn.mozilla.net/block-flashsubdoc-digest256/1604686195'),
 ('GET',
  'https://tracking-protection.cdn.mozilla.net/content-track-digest256/1604686195'),
 ('GET',
  'https://tracking-protection.cdn.mozilla.net/except-flash-digest256/1604686195'),
 ('GET',
  'https://tracking-protection.cdn.mozilla.net/except-flashallow-digest256/1490633678'),
 ('GET',
  'https://tracking-protection.cdn.mozilla.net/except-flashsubdoc-digest256/1517935265'),
 ('GET',
  'https://tracking-protection.cdn.mozilla.net/google-trackwhite-digest256/1604686195'),
 ('GET',
  'https://tracking-protection.cdn.mozilla.net/mozstd-trackwhite-digest256/1633028676'),
 ('GET',
  'https://tracking-protection.cdn.mozilla.net/social-track-digest256/1604686195'),
 ('GET',
  'https://tracking-protection.cdn.mozilla.net/social-tracking-protection-facebook-digest256/1604686195'),
 ('GET',
  'https://tracking-protection.cdn.mozilla.net/social-tracking-protection-linkedin-digest256/1564526481'),
 ('GET',
  'https://tracking-protection.cdn.mozilla.net/social-tracking-protection-twitter-digest256/1604686195'),
 ('GET', 'https://www.collinsdictionary.com/us/apple-touch-icon.png'),
 ('GET',
  'https://www.collinsdictionary.com/us/common_javascripts/common.js?version=4.0.198'),
 ('GET',
  'https://www.collinsdictionary.com/us/common_javascripts/common_defer.js?version=4.0.198'),
 ('GET',
  'https://www.collinsdictionary.com/us/common_javascripts/common_entry.js?version=4.0.198'),
 ('GET',
  'https://www.collinsdictionary.com/us/common_javascripts/common_hooks.js?version=4.0.198'),
 ('GET',
  'https://www.collinsdictionary.com/us/common_javascripts/common_quiz.js?version=4.0.198'),
 ('GET',
  'https://www.collinsdictionary.com/us/common_javascripts/common_stickyheader.js?version=4.0.198'),
 ('GET',
  'https://www.collinsdictionary.com/us/common_responsive/min1240.css?version=4.0.198'),
 ('GET',
  'https://www.collinsdictionary.com/us/common_responsive/min762.css?version=4.0.198'),
 ('GET',
  'https://www.collinsdictionary.com/us/common_stylesheets/adserver.css?version=4.0.198'),
 ('GET',
  'https://www.collinsdictionary.com/us/dictionary/english/transcendence'),
 ('GET',
  'https://www.collinsdictionary.com/us/external/fonts/icomoon.ttf?1pqdoj&version=4.0.198'),
 ('GET',
  'https://www.collinsdictionary.com/us/external/images/cobuild-logo.png?version=4.0.198'),
 ('GET',
  'https://www.collinsdictionary.com/us/external/images/cross_icon.svg?version=4.0.198'),
 ('GET',
  'https://www.collinsdictionary.com/us/external/images/hooks/hook_quiz.svg?version=4.0.198'),
 ('GET',
  'https://www.collinsdictionary.com/us/external/images/logo.png?version=4.0.198'),
 ('GET',
  'https://www.collinsdictionary.com/us/external/images/placeholder.png?version=4.0.198'),
 ('GET',
  'https://www.collinsdictionary.com/us/external/images/scrabble_background.jpg?version=4.0.198'),
 ('GET',
  'https://www.collinsdictionary.com/us/external/scripts/pb-hcd.min.js?version=4.0.198'),
 ('GET', 'https://www.collinsdictionary.com/us/favicon-16x16.png'),
 ('GET', 'https://www.collinsdictionary.com/us/iaw.min.js?version=4.0.198'),
 ('GET', 'https://www.collinsdictionary.com/us/required.js?version=4.0.198'),
 ('POST', 'http://ocsp.digicert.com/'),
 ('POST', 'http://r3.o.lencr.org/')}

In [6]: 'https://www.collinsdictionary.com/us/favicon-16x16.png'.split('/')
Out[6]: ['https:', '', 'www.collinsdictionary.com', 'us', 'favicon-16x16.png']

In [7]: {(i.method, i.url.split('/')[2]) for i in driver.requests}
Out[7]:
{('GET', 'api.polarbyte.com'),
 ('GET', 'boot.pbstck.com'),
 ('GET', 'cdn.cookielaw.org'),
 ('GET', 'cdn.pbstck.com'),
 ('GET', 'd1yu67rmchodpo.cloudfront.net'),
 ('GET', 'detectportal.firefox.com'),
 ('GET', 'fonts.gstatic.com'),
 ('GET', 'geolocation.onetrust.com'),
 ('GET', 'tracking-protection.cdn.mozilla.net'),
 ('GET', 'www.collinsdictionary.com'),
 ('POST', 'ocsp.digicert.com'),
 ('POST', 'r3.o.lencr.org')}
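
The same per-host breakdown could also show how many requests go to each host. A minimal sketch, using `urllib.parse.urlsplit` instead of splitting on `/`; the helper name `requests_per_host` is hypothetical, and the short `urls` list stands in for `[r.url for r in driver.requests]`:

```python
from collections import Counter
from urllib.parse import urlsplit

def requests_per_host(urls):
    """Count how many request URLs target each host."""
    return Counter(urlsplit(u).hostname for u in urls)

# A few URLs copied from the listing above
urls = [
    "https://www.collinsdictionary.com/us/dictionary/english/transcendence",
    "https://www.collinsdictionary.com/us/favicon-16x16.png",
    "https://cdn.cookielaw.org/scripttemplates/otSDKStub.js",
    "http://ocsp.digicert.com/",
]
for host, n in requests_per_host(urls).most_common():
    print(n, host)
```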

Long story short, to make a simple query for the pronunciation of the word transcendence I made 88 requests to 12 hosts, 11 of them external. Most of those are trackers and advertisement services that need to be blocked, and they contribute the bulk of the I/O overhead:

PS D:\sequitur> ping geolocation.onetrust.com

Pinging geolocation.onetrust.com [104.20.184.68] with 32 bytes of data:
Reply from 104.20.184.68: bytes=32 time=340ms TTL=58
Reply from 104.20.184.68: bytes=32 time=340ms TTL=58
Reply from 104.20.184.68: bytes=32 time=340ms TTL=58
Reply from 104.20.184.68: bytes=32 time=340ms TTL=58

Ping statistics for 104.20.184.68:
    Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
    Minimum = 340ms, Maximum = 340ms, Average = 340ms

I understand that if I block these URLs from loading, the execution time will be shortened tremendously. I know I can do this by editing the C:\Windows\System32\drivers\etc\hosts file, adding lines like 127.0.0.1 cdn.pbstck.com, and then flushing the DNS cache with ipconfig /flushdns. But I don't know whether a system-wide block is a good idea: some of these hosts belong to Google, and far too many websites use Google APIs and services. If I block addresses like fonts.gstatic.com (which do take a very long time to load completely), I can't browse those websites normally. If I block the hosts only in Selenium, however, they can't hog execution time there and my normal browsing is unaffected.

The same goes for images and stylesheets: because I am not viewing the pages at all, they are completely useless here, and their sheer size and load time harm performance, so they need to be blocked as well.
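
Deciding which requests are images, stylesheets or fonts can be done from the URL alone. The following is a sketch under assumptions: the name `is_static_asset` and the suffix list are hypothetical choices, and a URL-based check will miss assets served without a file extension.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# Extensions treated as "not needed for scraping" (an assumption; adjust as needed)
SKIPPED_SUFFIXES = {".png", ".jpg", ".jpeg", ".gif", ".svg", ".ico",
                    ".css", ".woff", ".woff2", ".ttf"}

def is_static_asset(url):
    """True if the URL path ends in an image, stylesheet or font extension."""
    path = urlsplit(url).path  # drops query strings such as ?version=4.0.198
    return PurePosixPath(path).suffix.lower() in SKIPPED_SUFFIXES
```

A predicate like this could then be called from a request hook, for example selenium-wire's `request_interceptor`, assuming that hook is available in the installed version.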

I tried to find a way to disable image loading in Python Selenium with Firefox (geckodriver) and only found this: https://stackoverflow.com/a/31626640/16383578. That method no longer works with Firefox, and I have tried the imageblock extension, but it doesn't seem to work either.

Chrome supports blocking images, but it doesn't support the eager page load strategy (not that the strategy helps here anyway).

And how do I prevent external URLs from loading? Ideally with something like the hosts file but not system-wide. Even that isn't very good, though, because I use multiple sources and would have to inspect the requests and add the URLs one by one...

I am thinking of a global switch, or something that detects whether a URL is internal or external to the target website and simply refuses to load it if it is external. This would speed up execution significantly, but I have not found anything relevant so far...
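
The internal-versus-external test itself is simple to write. Here is a minimal sketch; `is_internal` and `make_interceptor` are hypothetical names, and the request object is only assumed to expose `.url` and `.abort()`, which is how selenium-wire documents its intercepted requests.

```python
from urllib.parse import urlsplit

def is_internal(url, site_host):
    """True if url points at site_host or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    site_host = site_host.lower()
    return host == site_host or host.endswith("." + site_host)

def make_interceptor(site_host):
    """Build a callback that aborts every request leaving site_host."""
    def interceptor(request):
        # Assumption: request has .url and .abort(), as in selenium-wire
        if not is_internal(request.url, site_host):
            request.abort()
    return interceptor
```

With selenium-wire, the wiring would presumably be `driver.request_interceptor = make_interceptor("collinsdictionary.com")` before calling `driver.get(...)`. Because the check is based on the host name, it works for any source without listing the third-party hosts one by one.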

I am also thinking about a timeout, such as a 2-second limit on each page load: if the eager strategy fails to take effect and all other methods fail, stop loading the page immediately, whether or not it is interactive yet, and continue with the code that follows...
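
A hard deadline of this kind might be sketched with Selenium's `set_page_load_timeout` and `TimeoutException`. The helper name `get_with_deadline` is hypothetical, and whether the browser accepts `window.stop()` right after a timeout may depend on the driver version, so treat this as an untested outline rather than a confirmed fix.

```python
try:
    from selenium.common.exceptions import TimeoutException
except ImportError:
    # Fallback so the sketch can be exercised without Selenium installed
    TimeoutException = TimeoutError

def get_with_deadline(driver, url, seconds=2):
    """Load url, but give up waiting after `seconds` and stop the page.

    Returns True if the load finished in time, False if it was cut short.
    """
    driver.set_page_load_timeout(seconds)
    try:
        driver.get(url)
        return True
    except TimeoutException:
        # Ask the browser to abandon whatever is still loading
        driver.execute_script("window.stop();")
        return False
```

After a False return, the element lookup can still be attempted; it succeeds only if the pronunciation span had already arrived before the deadline.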

How can I do these?

BTW, I use uBlock Origin in my normal profile but not in the Selenium profile. Is that relevant to the relatively faster loading time there?

TBA
Ξένη Γήινος