How do I encode specific characters to HTML in Python?

Question

I'm scraping Wikipedia using BeautifulSoup4 in Python.

import urllib2
from bs4 import BeautifulSoup

data = urllib2.urlopen(wikiurl)
soup = BeautifulSoup(data, 'html.parser')

I then use

completehtml = ''
for link in soup.find_all('p'):
    completehtml = completehtml + str(link)

to get the HTML for the first few paragraphs. (The for loop has a counter that counts the paragraphs and breaks once the limit is reached.)

After the data has been scraped, I need to enter it on a website, using the scraped HTML itself. The problem is that some characters, such as the en dash, are not encoded as HTML entities, which causes garbage symbols to appear instead.

They print out fine in Python. But when I use something like pyautogui or the ActionChains class to send keys and enter the scraped string that way, the characters are entered as symbols.

How do I fix this? I'm looking for a solution in Python.

EDIT: Okay, so the main issue is with non-ASCII characters in the scraped HTML. They're decoded as 'latin-1' when they're copied to the clipboard or entered with the send keys method from Python.

EDIT: I need to convert certain HTML entities to Unicode, then turn them back into HTML entities after replacing certain Unicode substrings.

Do you need to unescape the HTML? I.e., replace `&copy;` with ©? — Marian, Nov 04 '16 at 08:28
No, the opposite: when an en dash is entered, I need it as `&ndash;` instead of `–` — bluescreenofdeath2016, Nov 04 '16 at 08:49
Use search-and-replace. [Here](https://dev.w3.org/html5/html-author/charref) is a list. — Jongware, Nov 04 '16 at 10:36
`html.escape()` for Python 3. See http://stackoverflow.com/questions/2087370/ for more info — Marian, Nov 04 '16 at 11:45

score 0 · Accepted Answer · edited May 23 '17 at 12:19

I believe the answers to this question will give you what you need: "Convert HTML entities to Unicode and vice versa"
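
For a rough idea of what that conversion can look like, here is a minimal sketch. It assumes Python 3 (the question's code uses Python 2's urllib2, where the same table lives in `htmlentitydefs`), and the helper name `to_named_entities` and the sample string are made up for illustration. It leaves ASCII, including the tags themselves, untouched and only rewrites non-ASCII characters, so the result can be typed or pasted as plain ASCII.

```python
import html
from html.entities import codepoint2name


def to_named_entities(text):
    """Replace each non-ASCII character with an HTML entity (hypothetical helper)."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            # ASCII, including the markup itself, passes through unchanged.
            out.append(ch)
        elif cp in codepoint2name:
            # Use the named entity when one exists, e.g. U+2013 -> &ndash;
            out.append("&%s;" % codepoint2name[cp])
        else:
            # Otherwise fall back to a numeric character reference.
            out.append("&#%d;" % cp)
    return "".join(out)


# Illustrative input: an en dash (U+2013) and a copyright sign (U+00A9).
sample = "<p>1914\u20131918 \u00a9</p>"

encoded = to_named_entities(sample)
print(encoded)  # <p>1914&ndash;1918 &copy;</p>

# Numeric references only, using the built-in error handler.
numeric = sample.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(numeric)  # <p>1914&#8211;1918 &#169;</p>

# The reverse direction: entities back to Unicode characters.
print(html.unescape(encoded) == sample)  # True
```

If you need to do replacements on the Unicode text first, one option is to call `html.unescape()` on the scraped string, edit it, and then run the result through the encoder again.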

answered Nov 04 '16 at 10:30

c3st7n

That did the trick. Thank you so much. – bluescreenofdeath2016 Nov 05 '16 at 01:05
