-1

I want to look for non utf8 characters in my MySQL database when the user inputs addesses there are usually other characters, especially when a user copies directly from PDF file to input box

I tried this but it gives me all the columns doesn't matter if they have non utf8 characters. Is there a SQL query that would do this and only target non UTF-8 characters?

SELECT * FROM MyTable WHERE LENGTH(MyColumn) = CHAR_LENGTH(MyColumn)

This is my database table

table name: employees
emp_num(int)
birth_date(date)
first_name (varchar(15))
last_name (varchar(20))
gender (ENUM('M','F'))
address (varchar(50))

So what I did was

SELECT * FROM employees WHERE LENGTH(address) = CHAR_LENGTH(address)

Don't know if this is correct

enter image description here

this image is from my database, see the weird Y that is what is coming out and other characters too.

learningbyexample
  • 1,527
  • 2
  • 21
  • 42
  • This is exactly what you're looking for: https://dba.stackexchange.com/questions/77101/how-to-find-non-utf8-data-in-mysql/77154 – Martin May 07 '17 at 17:06
  • @Martin That appears to be Exactly what the OP has done? – RiggsFolly May 07 '17 at 17:11
  • @RiggsFolly I can't quite see what the question is then? the encoding of a column's contents is whatever the defined column is (except if it's blob?), – Martin May 07 '17 at 17:31
  • this did not work for me `SELECT * FROM employees WHERE LENGTH(address) = CHAR_LENGTH(address)` – learningbyexample May 07 '17 at 17:33
  • I think there's a conceptual understanding issue here (mine, yours, ours): the SQL columns data contents will be of the character encoding the column is set as -- so if the column is `utf-8` then all the data saved to that column is forced to be `utf-8`, even if it's not the *correct* character, it will be *a* character (or several) so you immediately loose the ability to have multiple recognisable characters sets in a single character set column (unless it's a `blob`, which I guess it isn't). Or.... am I missing something fundamental? – Martin May 07 '17 at 17:37
  • 2
    What do you mean by "non utf8 characters"? UTF-8 is a way of representing Unicode characters as bytes, not a type of character. –  May 07 '17 at 18:17
  • @Martin I have added a image of my database of the characters that show up. Please have a look – learningbyexample May 07 '17 at 21:04
  • @duskwuff I think that's what I knew, but didn't know how to articulate it... – Martin May 07 '17 at 21:07
  • the point is, @learningbyexample , that if the column is set to a `utf8` character set, it will only contain utf-8 characters -- even if they're encoded badly because they're not *originally* utf-8 characters. I suspect you may have to find these characters and correct them on a case-by-case basis. I also suspect you will need to work on your data insert coding to properly qualify your valid characters to save – Martin May 07 '17 at 21:09

1 Answers1

1

Ÿ is a valid utf8 character (hex C5B8: 2 bytes, 1 character). And a valid latin1 character (hex 9f)

So, using utf8:

mysql> SELECT LENGTH('Ÿ'), CHAR_LENGTH('Ÿ');
+--------------+-------------------+
| LENGTH('Ÿ')  | CHAR_LENGTH('Ÿ')  |
+--------------+-------------------+
|            2 |                 1 |
+--------------+-------------------+

So, your test with length vs char_length tests something, but not for "non utf8" characters.

In fact, the only "non utf8" characters are Emoji and some Chinese characters that are in utf8mb4 but not in utf8.

But maybe that was not your intended question???

Since you have not provided (1) the charset of the columns, nor (2) the charset of the connection, nor (3) what the text should have said, there is a limit to what can be diagnosed.

What is the "input box"? Is it an HTML field? Does it have

<form accept-charset="UTF-8">

Use SELECT HEX(col) ... to show us what is currently in the table. The see "Test the data" in here for a preliminary analysis of what the character(s) is.

Other

For searching for non-alphanum:

WHERE col RLIKE '[^a-zA-Z0-9_ ]'

would include rows that have something other than letters, digits, underscore, and space.

WHERE HEX(col) RLIKE '^(..)*[89ABCDEF]'

would check for any byte with the 8th bit on. That is, not entirely 7-bit ascii.

So, either specify your problem better, or learn about REGEXP. I suspect "utf8" is not the term to chase. The above RLIKEs will catch things in latin1, too.

Community
  • 1
  • 1
Rick James
  • 135,179
  • 13
  • 127
  • 222