
I am trying to scrape a webpage whose charset is declared like this:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

When I fetch the page source with Python requests, I get content like this:

&#2453;&#2469;&#2494;&#2527; &#2476;&#2482;&#2503;- &#2478;&#2494;&#2459;&#2503; &#2477;&#2494;&#2468;&#2503; &#2476;&#2494;&#2457;&#2494;&#2482;&#2495;&#2404;</p> <p>&#2453;&#2476;&#2495; &#2440;&#2486;&#2509;&#2476;&#2480; &#2455;&#2497;&#2474;&#2509;&#2468; &#2438;&#2480;&#2503;&#2453; &#2471;&#2494;&#2474; &#2447;&#2455;&#2495;&#2527;&#2503; &#2476;&#2482;&#2503;&#2472;, '&#2477;&#2494;&#2468;-&#2478;&#2494;&#2459; &#2454;&#2503;&#2527;&#2503; &#2476;&#2494;&#2433;&#2458;&#2503; &#2476;&#2494;&#2457;&#2509;&#2455;&#2494;&#2482;&#2495; &#2488;&#2453;&#2482;/ &#2471;&#2494;&#2472;&#2503; &#2477;&#2480;&#2494; &#2477;

How can I get the original content out of these strings in Python?

Martijn Pieters
Arman

1 Answer


These are HTML entities (numeric character references) that encode Unicode code points, so the page is not really relying on UTF-8; it could have been served as ASCII without losing anything. Use an HTML parser, such as BeautifulSoup, which decodes such content for you:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('''\
... <html><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
... </head><body>
... &#2453;&#2469;&#2494;&#2527; &#2476;&#2482;&#2503;- &#2478;&#2494;&#2459;&#2503; &#2477;&#2494;&#2468;&#2503; &#2476;&#2494;&#2457;&#2494;&#2482;&#2495;&#2404;</p> <p>&#2453;&#2476;&#2495; &#2440;&#2486;&#2509;&#2476;&#2480; &#2455;&#2497;&#2474;&#2509;&#2468; &#2438;&#2480;&#2503;&#2453; &#2471;&#2494;&#2474; &#2447;&#2455;&#2495;&#2527;&#2503; &#2476;&#2482;&#2503;&#2472;, '&#2477;&#2494;&#2468;-&#2478;&#2494;&#2459; &#2454;&#2503;&#2527;&#2503; &#2476;&#2494;&#2433;&#2458;&#2503; &#2476;&#2494;&#2457;&#2509;&#2455;&#2494;&#2482;&#2495; &#2488;&#2453;&#2482;/ &#2471;&#2494;&#2472;&#2503; &#2477;&#2480;&#2494; &#2477;
... </body></html>''', 'lxml')
>>> soup
<html><head><meta content="text/html; charset=unicode-escape" http-equiv="Content-Type"/>\n</head><body>\n\u0995\u09a5\u09be\u09df \u09ac\u09b2\u09c7- \u09ae\u09be\u099b\u09c7 \u09ad\u09be\u09a4\u09c7 \u09ac\u09be\u0999\u09be\u09b2\u09bf\u0964 <p>\u0995\u09ac\u09bf \u0988\u09b6\u09cd\u09ac\u09b0 \u0997\u09c1\u09aa\u09cd\u09a4 \u0986\u09b0\u09c7\u0995 \u09a7\u09be\u09aa \u098f\u0997\u09bf\u09df\u09c7 \u09ac\u09b2\u09c7\u09a8, '\u09ad\u09be\u09a4-\u09ae\u09be\u099b \u0996\u09c7\u09df\u09c7 \u09ac\u09be\u0981\u099a\u09c7 \u09ac\u09be\u0999\u09cd\u0997\u09be\u09b2\u09bf \u09b8\u0995\u09b2/ \u09a7\u09be\u09a8\u09c7 \u09ad\u09b0\u09be \u09ad\n</p></body></html>
>>> soup.get_text()
u"\n\n\u0995\u09a5\u09be\u09df \u09ac\u09b2\u09c7- \u09ae\u09be\u099b\u09c7 \u09ad\u09be\u09a4\u09c7 \u09ac\u09be\u0999\u09be\u09b2\u09bf\u0964 \u0995\u09ac\u09bf \u0988\u09b6\u09cd\u09ac\u09b0 \u0997\u09c1\u09aa\u09cd\u09a4 \u0986\u09b0\u09c7\u0995 \u09a7\u09be\u09aa \u098f\u0997\u09bf\u09df\u09c7 \u09ac\u09b2\u09c7\u09a8, '\u09ad\u09be\u09a4-\u09ae\u09be\u099b \u0996\u09c7\u09df\u09c7 \u09ac\u09be\u0981\u099a\u09c7 \u09ac\u09be\u0999\u09cd\u0997\u09be\u09b2\u09bf \u09b8\u0995\u09b2/ \u09a7\u09be\u09a8\u09c7 \u09ad\u09b0\u09be \u09ad\n"
>>> print soup.get_text()


কথায় বলে- মাছে ভাতে বাঙালি। কবি ঈশ্বর গুপ্ত আরেক ধাপ এগিয়ে বলেন, 'ভাত-মাছ খেয়ে বাঁচে বাঙ্গালি সকল/ ধানে ভরা ভ
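If only the entities need decoding and no HTML parsing is required, the standard library may be enough. The following is a minimal sketch, assuming Python 3 (the session above is Python 2); `decode_entities` and `raw` are illustrative names, and the sample string is the first few entities from the question. It also checks the point made above that the entity-encoded source is plain ASCII.

```python
import html

# First few numeric character references from the question's page source.
raw = "&#2453;&#2469;&#2494;&#2527; &#2476;&#2482;&#2503;"


def decode_entities(text: str) -> str:
    """Replace HTML character references (e.g. &#2453;) with the
    Unicode characters they stand for. Markup tags are left untouched."""
    return html.unescape(text)


# The entity-encoded source contains only ASCII characters.
print(raw.isascii())

# Decoding yields the actual Bengali text.
decoded = decode_entities(raw)
print(decoded)
```

Note that `html.unescape` leaves tags such as `<p>` in place, so an HTML parser is still the better choice when the text has to be extracted from a full page.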
Martijn Pieters