Using BeautifulSoup in IPython, I am trying to scrape a webpage and extract some HTML elements from inside a JavaScript script, but I am having issues with the encoding.

The page is in French, so it contains a lot of accented characters. Some of them are written directly in the source code, and others are written as HTML entities.

Example:

html_doc = """<html>
<body>
<p>voilà</p>
<p>d&eacute;j&agrave; vu</p>
<p>c'est la vie !</p>

<script type="text/javascript">
...
varHTML = '<p>voilà</p>
<p>d&eacute;j&agrave; vu</p>
<p>c\'est la vie !</p>';
...
</script>
</body>
</html>"""

from bs4 import BeautifulSoup
BeautifulSoup(html_doc)

I get this result:

<html>
<body>
<p>voilà</p>
<p>déjà vu</p>
<p>c'est la vie !</p>
<script type="text/javascript">
...
varHTML = '<p>voilà</p>
<p>d&eacute;j&agrave; vu</p>
<p>c'est la vie !</p>';
...
</script>
</body>
</html>

As you can see, in the first part, outside the JavaScript, all the accents are fine. But for the HTML inside the JavaScript, BeautifulSoup does not convert &eacute; and &agrave; into "é" and "à".

How can I solve that?

BONUS question:

With this example, BeautifulSoup correctly converts C\'est into C'est. But with the same escaped apostrophe in the JavaScript part of the HTML page I am reading online, BeautifulSoup keeps the "\" in the result, so I get:

<html>
<body>
<p>voilà</p>
<p>déjà vu</p>
<p>c'est la vie !</p>
<script type="text/javascript">
...
varHTML = '<p>voilà</p>
<p>d&eacute;j&agrave; vu</p>
<p>c\'est la vie !</p>';
...
</script>
</body>
</html>

Do you understand why?

In the end, I want all the HTML inside the JavaScript to be decoded the same way as the part outside the JavaScript.

Thanks a lot for your help! Grégory

GregOizo
  • possible duplicate http://stackoverflow.com/questions/7219361/python-and-beautifulsoup-encoding-issues?rq=1 – muchwow Jun 12 '15 at 14:57
  • hi @jumojer thanks for your reply. I don't think it is the same problem. I tried to use BeautifulSoup ignoring utf-8 but I still had the same problem. Outside the javascript part, BeautifulSoup is reading all the text with the right encoding. The problem is for reading that variable, inside the javascript code, that contains html elements. – GregOizo Jun 15 '15 at 09:45

1 Answer

I finally solved it.

Using a regex, I extract the HTML part from the JavaScript as text, then I apply BeautifulSoup to it again to get readable HTML:

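import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "html.parser")
# Read the script body directly instead of relying on soup.text including it
script_text = soup.find("script").string
# re.DOTALL lets "." also match the newlines inside the JavaScript string
html_from_javascript = re.findall(r"varHTML = '(.*)';", script_text, re.DOTALL)
print(BeautifulSoup(html_from_javascript[0], "html.parser"))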

which gives: <p>voilà</p><p>déjà vu</p><p>c'est la vie !</p>
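
For anyone who wants to try the approach without the original page, here is a minimal self-contained sketch. The sample page, the helper name extract_js_html and the non-greedy pattern are my own assumptions for illustration, not part of the real site. It relies on the fact that the contents of a script tag are kept as raw text by the parser, so the entities are only decoded once the captured string is parsed a second time as ordinary HTML.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample page, modelled on the example in the question.
SAMPLE_PAGE = """<html><body>
<p>d&eacute;j&agrave; vu</p>
<script type="text/javascript">
varHTML = '<p>voilà</p><p>d&eacute;j&agrave; vu</p>';
</script>
</body></html>"""


def extract_js_html(page_html, var_name="varHTML"):
    """Return the HTML held in a single-quoted JavaScript string, re-parsed."""
    soup = BeautifulSoup(page_html, "html.parser")
    # Assumes the value is a single-quoted literal terminated by ';
    pattern = re.compile(re.escape(var_name) + r"\s*=\s*'(.*?)';", re.DOTALL)
    for script in soup.find_all("script"):
        # script.string is the raw script body; entities in it are not decoded.
        match = pattern.search(script.string or "")
        if match:
            # A second parse treats the captured text as ordinary HTML.
            return BeautifulSoup(match.group(1), "html.parser")
    return None


fragment = extract_js_html(SAMPLE_PAGE)
print(fragment)
```

The non-greedy (.*?) stops at the first '; it meets, so it would cut the value short if the embedded HTML itself contained that sequence; adjust the pattern to the page you are scraping.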

About the BONUS question:

The problem was that the original code on the webpage was double-escaped, so the code was not C\'est but C\\\'est.

I solved it by applying this function:

lambda x: x.replace("\\","")
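
Note that the lambda above removes every backslash in the string, which was fine for my page. If the text may contain other backslashes that should stay, a narrower variant could look like the sketch below. The function name unescape_js_quotes and the sample string are hypothetical, and the pattern only targets backslashes that sit directly in front of an apostrophe.

```python
import re


def unescape_js_quotes(text):
    """Replace any run of backslashes followed by an apostrophe with a bare apostrophe."""
    return re.sub(r"\\+'", "'", text)


# The Python literal below holds the characters: c\\\'est la vie !
double_escaped = "c\\\\\\'est la vie !"
print(unescape_js_quotes(double_escaped))
```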

Hope it can help someone one day and that it is not a duplicate!

Grégory

GregOizo