
I'm scraping Wikipedia using BeautifulSoup4 in Python.

import urllib2
from bs4 import BeautifulSoup

data = urllib2.urlopen(wikiurl)
soup = BeautifulSoup(data, 'html.parser')

I then use

completehtml = ''
for link in soup.find_all('p'):
    completehtml = completehtml + str(link)

to get the HTML for the first few paragraphs. (The for loop has a break condition: a counter tracks the number of paragraphs and breaks out of the loop once the limit is reached.)

Once this data has been scraped, I need to enter it into a website, using the scraped HTML. The problem is that some characters, such as the en dash, are not encoded as HTML entities, so garbled symbols appear instead.

They print fine in Python, but when I use something like pyautogui or Selenium's ActionChains class to send keys and enter the scraped string, they come out as symbols.

How do I fix this? I'm looking for a solution in Python.

EDIT: Okay, so the main issue occurs when there are non-ASCII characters in the scraped HTML. They're decoded as 'latin-1' when they're copied to the clipboard or entered using the send-keys method from Python.
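
To illustrate the kind of corruption I mean, here is a minimal sketch. It assumes the scraped text ends up as UTF-8 bytes and that something downstream reads those bytes as latin-1; I have not confirmed that this is exactly what pyautogui or the clipboard does. The helper name mojibake is made up for this example, and the snippet is written for Python 3.

```python
def mojibake(text):
    """Encode text as UTF-8, then (wrongly) decode those bytes as latin-1."""
    return text.encode('utf-8').decode('latin-1')


en_dash = '\u2013'           # the en dash that shows up in Wikipedia paragraphs
garbled = mojibake(en_dash)  # one character turns into three

print(len(en_dash), len(garbled))
print(repr(garbled))
```

The single en dash is three bytes in UTF-8, and each byte becomes its own latin-1 character, which would explain why a run of symbols appears in place of one character.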

EDIT: I need to convert certain HTML entities to Unicode, then turn them back into HTML after replacing certain Unicode substrings.

1 Answer


I believe the solution to this post would give you what you need: Convert HTML entities to Unicode and vice versa
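
As a rough sketch of that idea using only the standard library (Python 3; the function names are my own and are not taken from the linked post): decode entities to Unicode while leaving markup-significant ones such as &amp; and &lt; escaped, do the replacement on the Unicode text, then re-encode every non-ASCII character as a numeric character reference so that only ASCII is ever typed or copied.

```python
import html
import re

# Matches named, decimal and hexadecimal character references.
_ENTITY = re.compile(r'&(?:#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);')
# Characters that must stay escaped, or the markup itself would change.
_MARKUP_CHARS = {'&', '<', '>', '"', "'"}


def entities_to_unicode(markup):
    """Decode character references, except those for markup-significant characters."""
    def decode(match):
        ref = match.group(0)
        char = html.unescape(ref)
        return ref if char in _MARKUP_CHARS else char
    return _ENTITY.sub(decode, markup)


def unicode_to_entities(text):
    """Replace every non-ASCII character with a numeric reference like &#8211;."""
    return text.encode('ascii', 'xmlcharrefreplace').decode('ascii')


def replace_in_html(markup, old, new):
    """Decode entities, replace a Unicode substring, then re-encode as ASCII HTML."""
    text = entities_to_unicode(markup)
    return unicode_to_entities(text.replace(old, new))


sample = '<p>caf&eacute; 1990&ndash;2000 &amp; more</p>'
result = replace_in_html(sample, 'caf\u00e9', 'CAF\u00c9')
print(result)
```

Because the result is pure ASCII, it should survive send_keys or the clipboard regardless of which encoding gets applied along the way, and the browser renders the numeric references as the intended characters.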

c3st7n