
I'm trying to import a CSV file (UTF-8 encoded) in Ruby (2.0.0) into my database (MSSQL 2008R2, collation French_CI_AS), but the special characters (accented French characters) are not stored properly: éèçôü becomes Ã©Ã¨Ã§Ã´Ã¼ (or similar gibberish).

I use this piece of code to read the file:

require 'csv'

CSV.foreach(file, col_sep: ';', encoding: "utf-8") do |row|
  # ...
end

I tried various encodings in the CSV options (utf-8, iso-8859-1, windows-1252), but none of them stored the special characters correctly.
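(For reference, the encoding string Ruby's IO layer accepts can also be a from:to pair, which transcodes while reading. A minimal sketch, assuming — only an assumption — that the file were actually Latin-1 on disk:)

require 'csv'

# Sketch: "external:internal" encoding pair; fields arrive as UTF-8 strings
# even if the bytes on disk are ISO-8859-1.
CSV.foreach(file, col_sep: ';', encoding: 'iso-8859-1:utf-8') do |row|
  # each field in row has been transcoded from Latin-1 to UTF-8
end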

Before you ask: my database collation supports those characters, since we have successfully imported data containing them using PHP importers. If I dump the data using puts or a file logger, everything is correct.
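(A quick sanity check on what is actually in memory after parsing; the literal below stands in for any parsed field such as row[0]:)

str = 'éèçôü'            # substitute a parsed field, e.g. row[0]
puts str.encoding        # => UTF-8 if the reader is configured correctly
puts str.valid_encoding? # false would mean the bytes don't match the declared encoding
p str.bytes.first(2)     # => [195, 169] — the two UTF-8 bytes of "é"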

Is something wrong with my code, or do I need to specify something else (the Ruby source file encoding, for example)?

Thanks

EDIT: The data saving is done by a PHP REST API that works fine with accented characters; it stores the data exactly as it is received.

In Ruby, I parse my data, store it in an object, and then send the JSON-encoded object in the body of my PUT request. But even if I run an SQL query directly from Ruby, the problem remains:

query = <<-SQL
    UPDATE MyTable SET MyTable_title = '#{row_data['title']}' WHERE MyTable_id = '#{row_data['id']}'
SQL
res = db.execute query
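(One thing worth ruling out, per the comments below about client encoding: transcode explicitly before the bytes reach the connection. A sketch, assuming — and this is only an assumption — that the connection expects a single-byte code page such as Windows-1252:)

# Assumption: the DB connection's client charset is Windows-1252, not UTF-8.
title = row_data['title'].encode('windows-1252')
query = <<-SQL
    UPDATE MyTable SET MyTable_title = '#{title}' WHERE MyTable_id = '#{row_data['id']}'
SQL
res = db.execute query
# Note: interpolating values into SQL like this is injection-prone;
# prefer the driver's escaping or parameter support in real code.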
3rgo
  • You dumped the data and everything was correct because you read a `utf-8`-encoded file with a `utf-8` reader. The problem is obviously in _inserting into the database_. Would you mind sharing the code you use to store it in the database? I bet putting `str.encode('iso-8859-1')` instead of just `str` would resolve the issue. – Aleksei Matiushkin May 12 '15 at 14:55
  • Edited my post to answer your question. Adding a manual encode does not solve the problem; the stored value is still gibberish... – 3rgo May 12 '15 at 15:00
  • The API can’t “work fine with accented characters” because there is no notion of an “accented character” for an API. It just receives a bytestream and passes it on to the database adapter. What matters is that the database’s `client_encoding` parameter _must_ match the real encoding of this bytestream. So, manual encoding would be necessary. – Aleksei Matiushkin May 12 '15 at 15:04
  • We use a PHP microframework along with Doctrine2 to link to the Database objects. We never had any issue with accented characters, if they are correctly passed in the request, they will be correctly stored – 3rgo May 12 '15 at 15:09
  • I can’t tell what you mean by “passed correctly.” “ç” in UTF-8 is 2 bytes; the same symbol in ISO-8859-1 is 1 byte. Which representation do you call correct? If the API expects ISO-8859-1, and you have been passing ISO-8859-1 all along, I would suggest telling Ruby to convert the result to ISO-8859-1 as well. – Aleksei Matiushkin May 12 '15 at 15:12 (see the byte-level sketch after these comments)
  • We use UTF-8 as our main charset. If the character is properly encoded in the data sent in the request to the API, the storage will be OK – 3rgo May 12 '15 at 15:17
  • Can I see some more Ruby code? What happens if you `puts query` before the `db.execute` line? – Ismail Moghul May 21 '15 at 00:45
  • What SQL gem are you using? – Ismail Moghul May 21 '15 at 00:50
  • Why not import the csv directly using SSIS? – Zohar Peled Jun 14 '15 at 07:58
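(A byte-level illustration of the “ç” point made in the comments above:)

puts 'ç'.encode('utf-8').bytes.inspect      # => [195, 167] — two bytes in UTF-8
puts 'ç'.encode('iso-8859-1').bytes.inspect # => [231]      — one byte in ISO-8859-1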

1 Answer


I suspected this had something to do with the encoding of your CSV file, so I started digging around on that. I found that windows-1252 encoding can insert control characters.

You can read more about it here: Converting special characters such as ü and à back to their original, Latin alphabet counterparts in C#
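For illustration, the exact gibberish in the question is what you get when UTF-8 bytes are reinterpreted as Windows-1252; a sketch:

utf8 = 'éèçôü'
mangled = utf8.dup.force_encoding('windows-1252').encode('utf-8')
puts mangled   # => "Ã©Ã¨Ã§Ã´Ã¼" — the same mojibake as in the question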

cesta