
Currently I have the charset set to utf8 (doctrine.dbal.charset).

How can I make it possible to store data in any encoding in the database?

My application lets users upload files (csv) with data. The data is extracted and stored in the database, in separate columns.

The problem is that they don't upload only utf8-encoded files, and most converters lose or corrupt data while converting (e.g. cp1251 -> utf8).

igorpromen
  • oh in case this is going under in my rant: conversion to utf8 is pretty damn good. I would even claim most codepoints of all those pesky cpXXXX charsets exist in utf8. but you probably know better, right? I have fought quite some battles with charsets, and only if it was non-utf8. and from all those frustrations: just use utf8 in database, convert everything to utf8 if possible, keep everything else as original file or blob. – Jakumi Sep 21 '19 at 16:28
  • I may not have been accurate in mentioning "files". Originally I parse a csv-file, extract the data and store it in the database – igorpromen Sep 21 '19 at 16:37
  • I get your point regarding one charset (utf-8), but it means I need to somehow convert all encodings into utf-8, which may not be possible without drawbacks – igorpromen Sep 21 '19 at 16:41
  • there are always drawbacks. maybe you can get the user to somehow provide help when it comes to identifying charsets of the files they provided? – Jakumi Sep 21 '19 at 16:53
  • also, some files also have mixed charsets. seen it. sucks. there is no perfect solution. also databases, especially mysql. sure, you can write cp1251 into a latin-1. or the good old problem with utf8 and utf8mb4 (the latter is the correct one, btw.) encodings are just fun. always. – Jakumi Sep 21 '19 at 16:55
  • I just dreamed there was something new in this world that could fix that problem – igorpromen Sep 21 '19 at 17:48
  • You either have to tell your users which one character encoding to use or allow them to tell you which one they have used for the file they are providing. Some systems go farther and allow users to say which MIME type their upload is, including character encoding. Or, tell them to use a file format that they (and you) don't need to be aware of which character encoding is used (e.g, .ods or .xlsx). – Tom Blodget Sep 26 '19 at 20:55
  • Yep, I already provide a "notice" about the acceptable encoding, and show errors in case a non-utf-8 file is uploaded (mime is also checked). I just saw that Google (Google Docs sheets) doesn't suffer from this kind of issue, and I wondered if there might be an easy way nowadays – igorpromen Sep 28 '19 at 07:47

1 Answer


You can declare that the client has data encoded in CHARACTER SET cp1251. You can also specify that inside a LOAD DATA statement, which is the easiest and fastest way to read a CSV file into a MySQL table.
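A sketch of what that LOAD DATA statement might look like (the table name, file path, and column layout here are hypothetical; adjust the field/line terminators to match the actual CSV):

```sql
-- Read a cp1251-encoded CSV into a utf8mb4 table; MySQL converts
-- from cp1251 to each column's charset during the load.
LOAD DATA LOCAL INFILE '/tmp/upload.csv'
INTO TABLE imported_rows
CHARACTER SET cp1251
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;          -- skip the header row
```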

Unless the csv file has some screwball syntax, LOAD DATA does all the parsing, etc, for you.

I think all cp1251 characters have a corresponding utf8 encoding, so you can (and probably should) declare doctrine.dbal.charset to be utf8 (or, better, utf8mb4). Both cp1251 and latin1 are supported CHARACTER SETs.
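You can verify that the conversion is lossless by round-tripping a sample through Python's codecs (a minimal sketch; the sample string is made up):

```python
# A Cyrillic sample as it might arrive in an uploaded cp1251 CSV.
sample = "Привет, мир; тест №5"
cp1251_bytes = sample.encode("cp1251")   # what the user's file contains
text = cp1251_bytes.decode("cp1251")     # cp1251 -> Unicode
utf8_bytes = text.encode("utf-8")        # Unicode -> UTF-8, safe for a utf8mb4 column

# The round trip is exact: nothing was lost or spoiled.
assert utf8_bytes.decode("utf-8").encode("cp1251") == cp1251_bytes
```

If the assertion holds for your data, storing the utf8 bytes loses nothing relative to the original file.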

If you run into Mojibake, truncation, or question marks, see Trouble with UTF-8 characters; what I see is not what I stored

If you don't know what charset a file has, provide the hex of a few dozen characters; I can probably figure it out. MySQL will simply barf on any incorrectly specified charset.

Rick James