Encoding issue HTML Â, solution using regex?

Question

I know there have been multiple discussions on this topic; one of them is this: HTML encoding issues - "Â" character showing up instead of "&nbsp;"

I did follow it, but I want to solve this without adding any "meta charset" tag to my HTML. In fact, I am deleting all tags from the header (Nokogiri has some issues with that). Is there any regex I can use to eliminate these Â's from my output? I am writing my output to a CSV file, and I can see those Â's in it.

Thanks!

I agree, I tried adding Nokogiri::HTML("filename",'utf-8'), but it still won't work. — Rohan Dalvi, Sep 25 '13 at 16:01
The best solution is to make sure your encodings are correct from the start. Any other attempt is an uphill battle. Without a sample of your code and a small example of the HTML we're only guessing and throwing out opinions about what you should do, which isn't the Stack Overflow way. "Questions concerning problems with code you've written must describe the specific problem — and include valid code to reproduce it — in the question itself. See http://SSCCE.org for guidance." — the Tin Man, Sep 25 '13 at 16:22

score 3 · Accepted Answer · answered Sep 25 '13 at 15:39

If you intend to fix the problem that a UTF-8 encoded document is interpreted as ISO-8859-1, then you just need to write a regular expression that maps the UTF-8 encoded forms of Unicode characters (about 100,000 in total) to the correct characters. Obviously, this is a Bad Idea from the beginning.
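For context, the stray Â usually appears because a non-breaking space (U+00A0) is stored in UTF-8 as the two bytes 0xC2 0xA0, and 0xC2 read as ISO-8859-1 is "Â". If the text has already been garbled this way, re-encoding tends to be more reliable than a regex. The following is only a minimal sketch using Ruby's standard String methods; `repair_mojibake` is a hypothetical helper name, and it assumes the wrong interpretation really was ISO-8859-1 (if it was Windows-1252, pass that encoding name instead).

```ruby
# Simulate the bug: a UTF-8 string containing a non-breaking space (U+00A0).
original = "price:\u00A0100"

# Misread the UTF-8 bytes as ISO-8859-1, then transcode to UTF-8.
# The byte 0xC2 becomes "Â" (U+00C2) and 0xA0 becomes a second NBSP.
garbled = original.dup.force_encoding("ISO-8859-1").encode("UTF-8")

# Hypothetical helper: undo the misinterpretation by reversing both steps.
# Assumes the text was UTF-8 wrongly decoded as `wrong_encoding`.
def repair_mojibake(text, wrong_encoding = "ISO-8859-1")
  candidate = text.encode(wrong_encoding).force_encoding("UTF-8")
  # If the result is not valid UTF-8, the text was probably not garbled
  # in this way, so return it unchanged.
  candidate.valid_encoding? ? candidate : text
rescue EncodingError
  # Characters outside the wrong encoding cannot be round-tripped.
  text
end

repaired = repair_mojibake(garbled)
puts repaired == original
```

This only papers over the symptom. As the comment above says, the cleaner fix is to read the input with the right encoding in the first place, for example `File.read(path, encoding: "UTF-8")` (where `path` stands for your file). Note also that, as far as I know, `Nokogiri::HTML` takes the encoding as its third positional argument (the second is a URL), so passing `'utf-8'` second may not have the intended effect.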
