Using BeautifulSoup in IPython, I am trying to scrape a web page and extract some HTML elements embedded in a JavaScript script, but I am running into encoding issues.
The page is in French, so it contains many accented characters. Some of them are written directly in the source, and others are written as HTML entities.
Example:
html_doc = """<html>
<body>
<p>voilà</p>
<p>déjà vu</p>
<p>c'est la vie !</p>
<script type="text/javascript">
...
varHTML = '<p>voil&agrave;</p>
<p>d&eacute;j&agrave; vu</p>
<p>c\'est la vie !</p>';
...
</script>
</body>
</html>"""
from bs4 import BeautifulSoup
BeautifulSoup(html_doc)
I get this result:
<html>
<body>
<p>voilà</p>
<p>déjà vu</p>
<p>c'est la vie !</p>
<script type="text/javascript">
...
varHTML = '<p>voil&agrave;</p>
<p>d&eacute;j&agrave; vu</p>
<p>c'est la vie !</p>';
...
</script>
</body>
</html>
As you can see, in the first part, outside the JavaScript, all the accents are fine. But for the HTML inside the JavaScript, BeautifulSoup is not converting &eacute; and &agrave; into "é" and "à".
How can I solve that?
Bonus question:
With this example, BeautifulSoup correctly converts c\'est into c'est, but with the same apostrophe on the HTML page I am reading online, BeautifulSoup keeps the "\" in the result when the apostrophe is escaped in the JavaScript part, so I get:
<html>
<body>
<p>voilà</p>
<p>déjà vu</p>
<p>c'est la vie !</p>
<script type="text/javascript">
...
varHTML = '<p>voil&agrave;</p>
<p>d&eacute;j&agrave; vu</p>
<p>c\'est la vie !</p>';
...
</script>
</body>
</html>
Can you see why?
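One possible explanation, which I have not confirmed, is that the difference comes from Python rather than from BeautifulSoup: in a regular string literal such as my html_doc above, \' is an escape sequence for a plain apostrophe, whereas text read from a live page keeps the backslash. A minimal sketch of that assumption (the variable names are only for illustration):

```python
# In a regular (non-raw) Python string literal, \' is an escape sequence
# that produces a plain apostrophe, so the backslash is gone before any
# parser sees the text.
literal = 'c\'est la vie !'

# A raw string keeps the backslash, which is closer to what arrives when
# the same characters are read from a downloaded page.
raw = r'c\'est la vie !'

print(literal)
print(raw)
```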
In the end, I want all the HTML inside the JavaScript to come out the same way as the part outside the JavaScript.
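To make the goal concrete, here is a rough sketch of the kind of post-processing I have in mind, using only the standard library. The script_text string is a hypothetical stand-in for what soup.script.string might return, and I am assuming that decoding the entities with html.unescape and stripping the JavaScript escape by hand is acceptable; this is not necessarily the right way to do it.

```python
import html

# Hypothetical script content, standing in for the text of the <script> tag
# (for example, what soup.script.string might return for the page).
script_text = (
    r"varHTML = '<p>voil&agrave;</p>"
    r"<p>d&eacute;j&agrave; vu</p>"
    r"<p>c\'est la vie !</p>';"
)

# html.unescape replaces entities such as &agrave; and &eacute;
# with the characters they stand for.
decoded = html.unescape(script_text)

# Remove the JavaScript escaping in front of the apostrophe.
cleaned = decoded.replace("\\'", "'")

print(cleaned)
```

The cleaned string could then be passed to BeautifulSoup again if the embedded paragraphs need to be parsed as HTML.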
Thanks a lot for your help! Grégory