0

What I'm doing

I'm working on a chat application (written in PHP), which allows users to create their own chat rooms. A user may name a chat room anything they like and this name is passed on to the MySQL database in a prepared statement as the table name for that respective chat room.

It is understood that there is no log in / security measure for this application and the table holding the chat log is composed of records with simply the user submitted text and timestamp (2 columns, without counting an AUTO_INCREMENT primary key).

What I'm facing

Given the simple nature of this application, I don't have the intention of changing the structure of the database, but I'm now running into the issue when a user enters emoji (or other supplementary characters) as the name for their own chat room. Passing such information on to the database as is will convert the characters into question marks, due to the way MySQL works internally (https://dev.mysql.com/doc/refman/5.7/en/identifiers.html):

Identifiers are converted to Unicode internally. [..] ASCII NUL (U+0000) and supplementary characters (U+10000 and higher) are not permitted in quoted or unquoted identifiers.

What should / can I do to avoid this problem? Is there a best practice for "escaping" / "sanitizing" user input in a situation like this? I put the respective words in quotation marks because I know it is not the proper / typical way of handling user input in a database.

What I'm trying

An idea I had was using rawurlencode() to literally break down the supplementary characters into unique sequences that I can pass on to the database and still be sure that a chat room with the name is not confused with . However, I have the impression based on this answer that this is not good practice: https://stackoverflow.com/a/8700296/1564356.

Tackling this issue another way, I thought of base64_encode(), but again based on this answer it is not an ideal approach: https://stackoverflow.com/a/24175941/1564356. I'm wondering however, if in this case it would still be an acceptable one.

A third option would be to construct the database in a different way by issuing unique identifiers as the table names for each respective chat room and storing the utf8mb4 compatible string in a column. Then a second table with the actual chat log can be linked with a foreign key. This however complicates the structure of the database and doubles the amount of tables required. I'm not a fan of this approach.

Any ideas? Thanks!

IMSoP
  • 89,526
  • 13
  • 117
  • 169
  • No doubt what IMSoP answered below is definitely the most correct and logical way to do what you're actually doing. But just out of curiosity, your base64 approach doesn't seems as bad as what described within the answer, they are mostly pointing out the disadvantages IF the encoded value is data, ordering/searching isn't really relevant when it is table name, but I'd prefer hex value so I can restrict the characters to 0-F though, eg: "Group A" = "47726f75702041". At the end of the day, if its not too troublesome, just go with IMSoP's method. – Chor Wai Chun Oct 19 '17 at 09:31
  • 1
    You might be right, thanks for pointing that out! I know for a fact that this app won't be scaled, so Base64 encoding will not dramatically affect anything. I'm going with constructing the database anew, always good to practice good habits. – Sharam Patel Oct 19 '17 at 17:48

2 Answers2

2

Dynamically created tables, regardless of their naming scheme, are very rarely a sensible design choice. They make every single query you write more complicated, and eliminate a large part of the usefulness of SQL as a language and relational databases as a concept.

Furthermore, allowing users to directly choose table names sounds like a security disaster waiting to happen. Prepared statements will not save you in any way, because the table name is considered part of the query, not part of the data.

Unless you have a very compelling reason for such an unusual design, I would strongly recommend changing to have a single table of chat_logs, with a column of chat_room_id which references a chat_rooms table. The chat_rooms table can then contain the name, which can contain any characters the user wants, along with additional data about the room - creation date, description, extra features, etc. This approach requires exactly 2 tables, however many chat rooms are created.

If you really think you need the separate table for each chat room, because you're trying to do some clever partitioning / sharding, I would still recommend having a chat_rooms table, and then you can simply name the tables after the chat_room_id, e.g. chat_logs_1, chat_logs_2, etc. This approach requires exactly one more table than your current approach, i.e. num_tables = num_chat_rooms + 1.

IMSoP
  • 89,526
  • 13
  • 117
  • 169
0

CHARACTER SET utf8mb4 is needed end-to-end for MySQL in order to store Emoji and some Chinese characters.

In this you will find more on "best practice" and debugging tips when you fail to follow the best practice. It's not just the column charset, it is also the client's charset.

Do not use any encode/decode routines; it only makes the mess worse.

It is best to put the actual characters in MySQL tables, not Unicode strings like U+1F914 or \u1F914, etc.

is 4 bytes of hex F09FA494 when encoded in UTF-8 (aka MySQL's utf8mb4).

And, I agree with IMSoP; don't dynamically create table.

SQL Injection should be countered with mysqli_real_escape_string (or equivalent, depending on the API), not urlencode or base64.

Rick James
  • 135,179
  • 13
  • 127
  • 222
  • Your comment has little to do with what I'm asking! And it would be improper to use **any** `mysql_*` calls, as they are deprecated. The respective MySQLi or PDO driver with prepared statements is an obviously better option, but again this does not affect the question that I asked. – Sharam Patel Oct 19 '17 at 00:40
  • OK, you _must_ do two things -- not have a table per chat room (since table names are limited to utf8, not utf8mb4), _and_ use utf8mb4 throughout. (And I fixed the name of the function, which I had sloppily copied from the link.) – Rick James Oct 19 '17 at 00:45
  • I still fail to see what value your comment brings to this question. I never mentioned that I had a problem with the collation or storing information in columns and I specifically mentioned that table names / identifiers don't support supplementary characters (which is exactly the difference between `utf8` and `utf8mb4`). You also fail to explain _why_ I need **not** have a table per chat room. In fact for the original implementation I was thinking of sharding the database my own way, as guessed by @IMSoP. – Sharam Patel Oct 19 '17 at 03:02
  • A common question in this forum is about having one table per X. The answer is a resounding NO. It is better to have a single table with a distinguishing column. The reasons tend to center around clumsiness. – Rick James Oct 19 '17 at 03:08
  • As for sharding... Yes, a table per chat room would make it simpler to migrate a room to a different server. But the big problem in migrating a room is trying to minimize the downtime. – Rick James Oct 19 '17 at 03:10
  • Emoji in a table name is a show stopper. So convert to something else. Then you find a hacker who adds some punctuation to a table name that leads to a novel form of SQL injection. So you prevent that. Etc. Better to simply have the chat room as a column. Base64 comes close, but one or two punctuation marks will need replacing. Names would be limited to 48(?) ascii letters or 12(?) Emoji (or something in between. Explain that to the user. Etc. – Rick James Oct 19 '17 at 03:15
  • 10K MySQL tables turns into 20K files in a directory, which somewhat chokes the OS. – Rick James Oct 19 '17 at 03:16
  • You _did_ mention "question marks" coming from the chat room names -- this is a common UTF-8 issue with MySQL. – Rick James Oct 19 '17 at 03:19
  • You're assuming a lot and that's dangerous. I still fail to see the value of your comments concerning **this particular question**. – Sharam Patel Oct 19 '17 at 17:50