How can I fix broken English text from a webpage?

Question

I have recently been analyzing soccer rating systems and am using scoreboard.com as a data source.

After parsing some sample data, I realized that it was not readable. It looks like broken English text.

Could you take a look at the following Python code and the sample result? I look forward to your help.

Thanks.

import requests

# Fetch the raw HTML of the results page (no JavaScript is executed here).
response = requests.get('https://www.scoreboard.com/soccer/england/premier-league-2016-2017/results/')

html_text = response.text
print(html_text)

-- sample of the result --

Premier League¬ZEE÷dYlOSQOD¬ZB÷198¬ZY÷England¬ZC÷fZHsKRg9¬ZD÷t¬ZE÷8Ai8InSt¬

score 0 · Accepted Answer · answered Dec 29 '17 at 21:44

The page is rendered with JavaScript. The text you are seeing is not displayed on the page: the div that contains it has the CSS property "display:none" applied, so it is hidden and only serves to hold data that the page's JavaScript uses. I assume you want the results. To get them, first install Selenium:

pip3 install selenium

Then get a driver, e.g. ChromeDriver from https://sites.google.com/a/chromium.org/chromedriver/downloads, and put it on your PATH. (If you are on Windows or Mac, you can use a headless version of Chrome, Canary, if you like.)

import unicodedata

from bs4 import BeautifulSoup
from selenium import webdriver

url = 'https://www.scoreboard.com/soccer/england/premier-league-2016-2017/results/'

# Let a real browser run the page's JavaScript, then grab the rendered HTML.
browser = webdriver.Chrome()
browser.get(url)
html_source = browser.page_source
browser.quit()

soup = BeautifulSoup(html_source, 'lxml')
# Finished matches are the table rows with the class "stage-finished".
for tr in soup.find_all('tr', {'class': 'stage-finished'}):
    for td in tr.find_all('td'):
        # NFKD normalization turns non-breaking spaces into plain spaces.
        print(unicodedata.normalize("NFKD", td.text))

Outputs:

May 21, 03:00 PM
Arsenal 
Everton
3 : 1


May 21, 03:00 PM
Burnley
West Ham
1 : 2


May 21, 03:00 PM
Chelsea
Sunderland
5 : 1

...
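
Since the goal is rating analysis, it may be handier to collect each row into a record instead of printing cell by cell. The following is only a sketch: the helper name row_to_match is made up for illustration, and the cell order (kickoff time, home team, away team, score) is an assumption read off the sample output above, not something the site documents.

```python
import unicodedata


def row_to_match(cells):
    """Turn the cell texts of one result row into a dict, or None.

    Assumption (from the sample output only): the non-empty cells are,
    in order, kickoff time, home team, away team and a "home : away" score.
    """
    # Normalize each cell, trim whitespace and drop the empty cells.
    texts = [unicodedata.normalize("NFKD", c).strip() for c in cells]
    texts = [t for t in texts if t]
    if len(texts) < 4:
        return None
    kickoff, home, away, score = texts[:4]
    try:
        home_goals, away_goals = (int(part) for part in score.split(":"))
    except ValueError:
        # The score cell did not look like "3 : 1"; skip the row.
        return None
    return {
        "kickoff": kickoff,
        "home": home,
        "away": away,
        "home_goals": home_goals,
        "away_goals": away_goals,
    }


match = row_to_match(["May 21, 03:00 PM", "Arsenal ", "Everton", "3 : 1", "", ""])
print(match["home"], match["home_goals"], match["away_goals"], match["away"])
```

In the scraping loop this would be called once per row, e.g. with [td.text for td in tr.find_all('td')], and rows that return None can simply be skipped.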

If you don't want to use Selenium, you can use other methods; see my answer to "Scraping Google Finance (BeautifulSoup)".
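
One possible Selenium-free route, which is not necessarily the approach in the linked answer, is to split the hidden data string from the question yourself. The sketch below assumes, purely from the sample in the question, that "¬" separates fields and "÷" separates a key from its value. The function name parse_feed_record is hypothetical, and the meaning of keys such as ZY or ZB is not documented here, so treat any interpretation of them as a guess.

```python
def parse_feed_record(record, field_sep="¬", kv_sep="÷"):
    """Split one raw record into (fields, loose).

    fields: dict of key -> value for chunks shaped like "KEY÷value".
    loose:  chunks without a key/value separator, kept in order.
    The separators are assumptions inferred from the sample string.
    """
    fields = {}
    loose = []
    for chunk in record.split(field_sep):
        if not chunk:
            continue  # skip the empty piece left by a trailing separator
        key, sep, value = chunk.partition(kv_sep)
        if sep:
            fields[key] = value
        else:
            loose.append(chunk)
    return fields, loose


sample = "Premier League¬ZEE÷dYlOSQOD¬ZB÷198¬ZY÷England¬ZC÷fZHsKRg9¬ZD÷t¬ZE÷8Ai8InSt¬"
fields, loose = parse_feed_record(sample)
print(fields["ZY"])  # England
print(loose)         # ['Premier League']
```

"Premier League" ends up in the loose list here only because the sample was cut off before its key; in the full page source it presumably belongs to a preceding key.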
