Characters coming out of my database are encoded differently from the same characters written directly in the source. For example, the word Permissões produces a different result when the string is written directly in the HTML than when it is output from a database record.

# From the source
Addressable::URI.encode("Permissões.pdf") #=> "Permiss%C3%B5es.pdf"

# From the db
Addressable::URI.encode("Permissões.pdf") #=> "Permisso%CC%83es.pdf"

The encodings are different. But my database is set to UTF-8, and I am using HTML5. What could be causing this?

I am unable to download files I upload to S3 because of this issue. I tried to force the encoding with `attachment.path.encode("UTF-8")`, but that makes no difference.

mu is too short
Jumbalaya Wanton
  • `'Permissões' != 'Permissões'` Is it a typo? – BroiSatse May 12 '14 at 23:10
  • @BroiSatse I don't understand what you mean. I copied everything from my terminal as is. – Jumbalaya Wanton May 12 '14 at 23:13
  • You have this weird tilde accent or whatever over e in first word and over o in the second. – BroiSatse May 12 '14 at 23:15
  • @BroiSatse No, the tilde is over the `o` in both; perhaps you have broken UTF-8 support somewhere. – mu is too short May 12 '14 at 23:17
  • @BroiSatse I'm afraid I don't understand. Here's a screen shot: https://www.evernote.com/shard/s5/sh/63df9e9b-49a9-4fa3-a89f-e0ab8c1116e6/1fd6374e64abd90ec3c6d08dff3eb926 – Jumbalaya Wanton May 12 '14 at 23:18
  • I think you're facing a normalization problem somewhere. `"\xcc\x83"` is a [combining-tilde](http://utf8-chartable.de/unicode-utf8-table.pl?start=768&number=512&utf8=0x) whereas `"\xc3\xb5"` is a simple `õ`. – mu is too short May 12 '14 at 23:19
  • @muistooshort thank you. Then how do I go about fixing this? Do I have to change something in my database, or my web server? – Jumbalaya Wanton May 12 '14 at 23:21
  • @muistooshort OK. I see I have to use a unicode normalization program... – Jumbalaya Wanton May 12 '14 at 23:24
  • @JumbalayaWanton - This is weird. My printscreen: http://i62.tinypic.com/2ptcens.jpg – BroiSatse May 12 '14 at 23:24
  • @BroiSatse What browser and OS is that? Your UTF-8 support is broken and probably broken for other [combining marks](http://utf8-chartable.de/unicode-utf8-table.pl?start=768&number=512&utf8=0x) too. – mu is too short May 12 '14 at 23:26
  • As far as beating this thing into sense goes, look at the unicode_utils gem, it should have tools for dealing with this sort of nonsense. I'd also recommend using simple IDs for your files and naming them in the headers if necessary. I'd give a proper answer but I'm short on time right now. – mu is too short May 12 '14 at 23:28
  • @muistooshort - Ubuntu 14.04, rendered on Firefox 29.0. Looks all right on Chrome though. WTH? – BroiSatse May 12 '14 at 23:30
  • @muistooshort thank you for pointing me in the right direction. Whenever you can, if you could point me to some literature on naming the files in the header I would appreciate it. Thanks! – Jumbalaya Wanton May 12 '14 at 23:30
  • Maybe have a look over here: http://stackoverflow.com/q/93551/479863 – mu is too short May 12 '14 at 23:32
  • @BroiSatse Maybe send a bug report in to the Ubuntu Firefox people, renders correctly in OSX Firefox. – mu is too short May 12 '14 at 23:34
  • Are you sure it is a database problem? If you are using OS X note that if you actually save a file called `Permissões.pdf` (combined letter+accent) the filesystem will do a Unicode normalisation step to decomposed (separate letter and accent). This is a really bad bit of design but we are stuck with it. – bobince May 13 '14 at 10:23
  • @bobince I'm not sure it is the database. It could be the JavaScript that I am using to prepare and submit the form data (name and path). I use `jquery-fileupload`. When files are added I am able to access a data object that gives me the name and path of the file. It could be here that the characters are getting mangled. – Jumbalaya Wanton May 13 '14 at 12:12
  • @muistooshort since I am using Rails, I found `ActiveSupport::Multibyte::Unicode` that has a `normalize()` method. Setting the filename in the headers seems pretty complicated. – Jumbalaya Wanton May 13 '14 at 12:14
  • You can put that down as a self-answer and accept that answer. Unicode has lots of interesting little traps like this so it might help someone in the future. – mu is too short May 13 '14 at 16:39
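
The normalization difference described in the comments can be checked directly in plain Ruby. The following is a minimal sketch, not taken from the original thread: the `hex_bytes` helper and the variable names are hypothetical, and both strings are built from explicit escapes so that the editor's own normalization cannot interfere.

```ruby
# Hypothetical helper: render a string's UTF-8 bytes as uppercase hex.
def hex_bytes(str)
  str.bytes.map { |b| format("%02X", b) }.join(" ")
end

# Precomposed form: U+00F5 LATIN SMALL LETTER O WITH TILDE.
composed = "Permiss\u00F5es.pdf"
# Decomposed form: plain "o" followed by U+0303 COMBINING TILDE.
decomposed = "Permisso\u0303es.pdf"

puts composed == decomposed   # false: they render alike but differ in bytes
puts hex_bytes("\u00F5")      # C3 B5
puts hex_bytes("\u0303")      # CC 83
puts composed.length          # 14
puts decomposed.length        # 15
```

The two byte sequences match the `%C3%B5` and `o%CC%83` fragments in the question, which suggests the stored value is in decomposed form.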

1 Answer

To solve this, since I am using Rails, I used ActiveSupport::Multibyte::Unicode to normalize any Unicode characters before they are inserted into the database.

before_save do
  self.path = ActiveSupport::Multibyte::Unicode.normalize(path)
end
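
For readers on newer stacks: `ActiveSupport::Multibyte::Unicode.normalize` was deprecated in Rails 6.0, and Ruby 2.2+ provides `String#unicode_normalize` in the core library. Below is a minimal sketch of the same idea outside Rails. The `normalize_filename` method is a hypothetical name, NFC is an assumed choice of normalization form, and `ERB::Util.url_encode` from the standard library stands in for `Addressable::URI.encode` only to keep the example dependency-free.

```ruby
require "erb"

# Hypothetical helper: convert a filename to NFC (precomposed) form.
# NFC is an assumption here; pick whichever form the storage side expects.
def normalize_filename(name)
  name.unicode_normalize(:nfc)
end

# Decomposed input, as it might arrive from an upload or the filesystem.
raw_name = "Permisso\u0303es.pdf"
clean_name = normalize_filename(raw_name)

puts ERB::Util.url_encode(raw_name)    # Permisso%CC%83es.pdf
puts ERB::Util.url_encode(clean_name)  # Permiss%C3%B5es.pdf
```

In a model callback, the body of the `before_save` block above could call `path.unicode_normalize(:nfc)` in the same way, assuming `path` is a UTF-8 string.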
Jumbalaya Wanton