I'm scraping Wikipedia using BeautifulSoup4 in Python.
import urllib2
from bs4 import BeautifulSoup
data = urllib2.urlopen(wikiurl)
soup = BeautifulSoup(data, 'html.parser')
I then use
for link in soup.find_all('p'):
    completehtml = completehtml + str(link)
to get the HTML for a few paragraphs. (The for loop has a break condition: a counter tracks the number of paragraphs and the loop breaks once it reaches the limit.)
Once this data has been scraped, I need to enter it into a website online, using the scraped HTML itself. The problem is that some characters, such as the en dash, are not HTML-encoded (i.e. not written as HTML entities), so stray symbols appear in their place.
They print out fine in Python. But when I use something like pyautogui or the ActionChains class to send keys and enter the scraped string, they are entered as symbols.
How do I fix this? I'm looking for a solution in Python.
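To make it concrete, this is roughly the output I'm after: every non-ASCII character written as a numeric character reference, so that only plain ASCII is ever typed. This is just a minimal sketch assuming Python 3 and a text (Unicode) string; to_ascii_html is a name I made up for illustration.

```python
def to_ascii_html(markup):
    """Replace every non-ASCII character with a numeric character reference.

    Hypothetical helper: tags and existing entities are plain ASCII, so they
    pass through unchanged.
    """
    return markup.encode("ascii", "xmlcharrefreplace").decode("ascii")


sample = "<p>1914\u20131918</p>"   # \u2013 is the en dash
ascii_html = to_ascii_html(sample)
print(ascii_html)                  # <p>1914&#8211;1918</p>
```

Since the result is pure ASCII, I would expect it to survive the clipboard or send keys regardless of which encoding is assumed along the way.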
EDIT: Okay, so the main issue is non-ASCII characters in the scraped HTML. They seem to be decoded as 'latin-1' when they're copied to the clipboard or entered with the send keys method from Python.
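Here is a small reproduction of what I think is happening, assuming Python 3 and assuming the text leaves Python as UTF-8 bytes that are then read back as Latin-1 (or Windows-1252). I have not confirmed that this is what pyautogui or the clipboard actually does.

```python
en_dash = "\u2013"
utf8_bytes = en_dash.encode("utf-8")      # three bytes: E2 80 93

# Reading those bytes with the wrong codec yields three junk characters.
as_latin1 = utf8_bytes.decode("latin-1")
as_cp1252 = utf8_bytes.decode("cp1252")
print(ascii(as_latin1))
print(ascii(as_cp1252))

# The Latin-1 damage is reversible because Latin-1 maps every byte to a character.
repaired = as_latin1.encode("latin-1").decode("utf-8")
print(repaired == en_dash)
```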
EDIT: I need to convert certain HTML entities to Unicode, then turn them back into HTML entities after replacing certain Unicode substrings.
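A rough sketch of that round trip, assuming Python 3 (html.unescape is in the standard library there). The names decode_entities and rewrite_html and the sample replacement are made up for illustration. Entities that stand for markup characters (&amp;, &lt;, &gt; and quotes) are deliberately left alone so the tags don't break.

```python
import html
import re

# Matches named, decimal and hexadecimal character references.
_ENTITY = re.compile(r"&(?:#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")
_MARKUP_CHARS = {"&", "<", ">", '"', "'"}


def decode_entities(markup):
    """Turn entities into Unicode, except those that encode markup characters."""
    def repl(match):
        decoded = html.unescape(match.group(0))
        return match.group(0) if decoded in _MARKUP_CHARS else decoded
    return _ENTITY.sub(repl, markup)


def rewrite_html(markup, replacements):
    """Decode entities, apply Unicode substring replacements, re-encode as ASCII."""
    text = decode_entities(markup)
    for old, new in replacements.items():
        text = text.replace(old, new)
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


source = "<p>1914&ndash;1918 &amp; caf&eacute;</p>"
result = rewrite_html(source, {"1914\u20131918": "1914 to 1918"})
print(result)  # <p>1914 to 1918 &amp; caf&#233;</p>
```

Calling html.unescape on the whole string would be shorter, but it would also turn &lt; and &amp; into real markup characters, which is why the sketch filters them out.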