I know there have been multiple discussions on this topic; one of them is: HTML encoding issues - "Â" character showing up instead of "&nbsp;"

I followed it, but I want to solve this without adding any "meta charset" tag to my HTML; in fact, I am deleting all tags from the header (Nokogiri has some issues with that). Is there a regex I can use to eliminate these Â's from my output? I am writing my output to a CSV file, and I can see those Â's in it.

Thanks!

Rohan Dalvi
  • Worst idea for fixing a character encoding issue ever. – Alohci Sep 25 '13 at 15:15
  • I agree, I tried adding Nokogiri::HTML("filename",'utf-8'), but it still won't work. – Rohan Dalvi Sep 25 '13 at 16:01
  • The best solution is to make sure your encodings are correct from the start. Any other attempt is an uphill battle. Without a sample of your code and a small example of the HTML, we're only guessing and throwing out opinions about what you should do, which isn't the Stack Overflow way. "Questions concerning problems with code you've written must describe the specific problem, and include valid code to reproduce it, in the question itself. See http://SSCCE.org for guidance." – the Tin Man Sep 25 '13 at 16:22

1 Answer

If you intend to fix the problem that a UTF-8 encoded document is interpreted as ISO-8859-1, then you just need to write a regular expression that maps the UTF-8 encoded forms of Unicode characters (about 100,000 in total) to the correct characters. Obviously, this is a Bad Idea from the beginning.
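
To make the mechanism concrete, here is a minimal Ruby sketch (standard library only) of how the stray "Â" appears and how a blanket re-encoding, rather than a regex, undoes it. It assumes the text was mis-decoded as ISO-8859-1 exactly once and that every character in the garbled string fits in ISO-8859-1; the sample string and the helper name `repair_mojibake` are made up for illustration.

```ruby
# A non-breaking space (U+00A0) is the two bytes 0xC2 0xA0 in UTF-8.
utf8 = "price:\u00A0100"

# Simulate the bug: the same bytes are wrongly labelled ISO-8859-1 and then
# converted to UTF-8. Byte 0xC2 becomes U+00C2 ("Â"), byte 0xA0 stays U+00A0.
garbled = utf8.dup.force_encoding("ISO-8859-1").encode("UTF-8")

# Hypothetical helper: undo one round of ISO-8859-1 mis-decoding.
# Raises Encoding::UndefinedConversionError if the text contains a
# character that does not exist in ISO-8859-1.
def repair_mojibake(text)
  text.encode("ISO-8859-1").force_encoding("UTF-8")
end

repaired = repair_mojibake(garbled)
puts repaired == utf8          # => true
puts repaired.valid_encoding?  # => true
```

Repairing after the fact is still a workaround. With Nokogiri, the encoding is the third argument of `Nokogiri::HTML` (the second is the URL), so something like `Nokogiri::HTML(File.read("filename"), nil, "UTF-8")` may avoid the problem at the source.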

Jukka K. Korpela