I was writing a crawler that extracts the titles of non-English websites. When I print the titles to my console, I get output like this:

শà§à¦°à§à¦²à¦à§à¦à¦¾à¦° ভালৠসà§à¦à¦¨à¦¾
ফà¦à¦¿à¦°à¦¾à¦ªà§à¦²à§ হাতবà§à¦®à¦¾ বিসà§à¦«à§à¦°à¦£, à¦à¦à¦ ১৬
দà§à¦ বাà¦à¦²à¦¾à¦¦à§à¦¶à¦¿à¦à§ নিà§à§ à¦à§à¦à§ বিà¦à¦¸à¦à¦«
à¦à¦¾à¦®à¦¾à§à¦¾à¦¤ নà§à¦¤à¦¾ সà§à¦²à¦¿à¦®à¦¸à¦¹ দà§à¦à¦¨ à¦à§à¦°à§à¦ªà§à¦¤à¦¾à¦°

I have no idea how to get the proper text from the strings above.

Any idea?

Thanks in advance.

sehe
  • First rule about character encodings: Always use Unicode. Second rule of character encodings: Always know exactly what encoding you use when reading text or outputting it. – Joey Mar 08 '13 at 08:27
  • The encoding should be in the HTTP header. You might want to see this: http://stackoverflow.com/questions/4400678/http-header-should-use-what-character-encoding – monkut Mar 08 '13 at 08:28

1 Answer

This looks like UTF-8 encoded Bengali text with interspersed HTML character references, incorrectly interpreted as windows-1252 characters. It could be almost anything else, too, really.
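If that guess is right, the garbling can be reproduced and avoided along the following lines. This is only a minimal sketch in Python: the function names `misread_as_cp1252` and `decode_title` are made up for illustration, and it assumes the crawler has access to the raw bytes of the title and that they really are UTF-8.

```python
import html


def misread_as_cp1252(text: str) -> str:
    """Reproduce the suspected bug: UTF-8 bytes decoded as windows-1252."""
    # errors="replace" is needed because a few byte values are
    # undefined in Python's cp1252 codec and would otherwise raise.
    return text.encode("utf-8").decode("cp1252", errors="replace")


def decode_title(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode raw title bytes, then resolve HTML character references.

    The default encoding is an assumption; a real crawler should take it
    from the page's declared or detected encoding.
    """
    text = raw.decode(encoding, errors="replace")
    # html.unescape turns references such as "&#2476;" into characters.
    return html.unescape(text)
```

Calling `misread_as_cp1252` on a Bengali string gives output of the same "à¦..." shape as in the question, while `decode_title` on the original bytes returns readable text. Once the bytes have already been decoded with the wrong codec, information may be lost, so it is safer to fix the decoding step than to try to repair the garbled string afterwards.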

When crawling web pages, you should do roughly what browsers and general search engines do when deciding on the character encoding. This is far from trivial. In HTML5 RC, section 8.2.2.1, "Determining the character encoding", is an attempt at describing the process.
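As a rough, hands-on approximation of that process, the sketch below checks a byte order mark, then the charset in the HTTP `Content-Type` header, then a `<meta>` declaration near the start of the document, and finally falls back to a default. It is deliberately much simpler than the algorithm in the specification, and the names `sniff_encoding` and `decode_html`, the 1024-byte scan window, and the windows-1252 fallback are all assumptions made for this example.

```python
import codecs
import re
from typing import Optional

# Simplified pattern for <meta charset=...> and
# <meta http-equiv=... content="...; charset=...">.
_META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?\s*([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)
_HEADER_CHARSET = re.compile(r"""charset\s*=\s*["']?([^\s;"']+)""", re.IGNORECASE)


def _known(label: str) -> Optional[str]:
    """Return Python's codec name for label, or None if it is unknown."""
    try:
        return codecs.lookup(label).name
    except LookupError:
        return None


def sniff_encoding(body: bytes, content_type: Optional[str] = None,
                   default: str = "windows-1252") -> str:
    """Guess the encoding of an HTML page (simplified, not the full spec)."""
    # 1. A byte order mark at the very start of the body.
    if body.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if body.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # 2. The charset parameter of the HTTP Content-Type header.
    if content_type:
        match = _HEADER_CHARSET.search(content_type)
        if match and _known(match.group(1)):
            return _known(match.group(1))
    # 3. A <meta> declaration within the first 1024 bytes.
    match = _META_CHARSET.search(body[:1024])
    if match:
        name = _known(match.group(1).decode("ascii"))
        if name:
            return name
    # 4. Nothing usable was declared: fall back to the default.
    return default


def decode_html(body: bytes, content_type: Optional[str] = None) -> str:
    """Decode a page using the guessed encoding, replacing bad bytes."""
    return body.decode(sniff_encoding(body, content_type), errors="replace")
```

With the `requests` library, for example, `body` would correspond to `response.content` and `content_type` to `response.headers.get("Content-Type")`. Pages that declare the wrong encoding, or none at all, still need heuristics that this sketch does not attempt.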

Jukka K. Korpela
  • "Bengali text with interspersed HTML character references": yes, you are right, it's Bengali text. –  Mar 08 '13 at 08:34
  • The link you provided is descriptive. Is there any hands-on reference for solving the issue? –  Mar 08 '13 at 08:37