0

I'm getting a SQL error during a large data conversion for a few entries caused by trying to insert invalid characters into a column with collation utf8_general_ci. An example of an offending string is


I'm already using a regular expression similar to that described in the answer on Remove ✅, 🔥, ✈ , ♛ and other such emojis/images/signs from Java strings however, the above string makes it through that regex filter, because the characters are letters, and hence p{L} is true for them.

For proof:

String walgreens = "\uD835\uDC16\uD835\uDC1A\uD835\uDC25\uD835\uDC20\uD835\uDC2B\uD835\uDC1E\uD835\uDC1E\uD835\uDC27\uD835\uDC2C";
String characterFilter = "[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\p{Sc}\\p{Punct}\\p{ASCII}\\™\\®\\©\\s]";
System.out.println(walgreens.replaceAll(characterFilter, ""));

and similarly, just to illustrate the exact filter used in the above SO answer:

String walgreens = "\uD835\uDC16\uD835\uDC1A\uD835\uDC25\uD835\uDC20\uD835\uDC2B\uD835\uDC1E\uD835\uDC1E\uD835\uDC27\uD835\uDC2C";
String characterFilter = "[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]";
System.out.println(walgreens.replaceAll(characterFilter, ""));

Both print to the console.

I need a way to programmatically remove these characters from Strings, preferably via regex for consistency, so I can properly make updates to my MySql DB. Updating the column collation is not an option due to the size of the tables.

Here's a link to a unicode page for the 'W' in the offending string, if helpful

I'm currently on MySql version 8.0.23 and Java 8, using EclipseLink (JPA) as my ORM.

Heres the error

javax.persistence.PersistenceException: Exception [EclipseLink-4002] (Eclipse Persistence Services - 2.5.0.v20130507-3faac2b): org.eclipse.persistence.exceptions.DatabaseException
Internal Exception: java.sql.SQLException: Incorrect string value: '\xF0\x9D\x99\x88\xF0\x9D...' for column
J. LaF
  • 127
  • 9
  • I've also tried once to remove 4-byte sequences for a MySQL system but tbh since regex classes are not byte-based it gets somewhat cumbersome fast. If you have the option I'd suggest switching to collation utf8mb4_general_ci to simply avoid the issue – Roland Kreuzer Aug 31 '22 at 18:03
  • From the original question: Updating the column collation is not an option due to the size of the tables. Last I checked, thats a linear time operation which would require far too much down time – J. LaF Aug 31 '22 at 18:13
  • Are you trying to deal with bad characters already INSERTed into a MySQL table? Or do you have gibberish in Java? Or is the Java OK, but it gets mangled upon inserting into MySQL?... – Rick James Sep 01 '22 at 05:21
  • I did deduce that utf8mb4 is needed; utf8mb3 will not suffice. – Rick James Sep 01 '22 at 15:28
  • I'm moving data into a MySql table and want to filter out unsupported characters. The collation on the column in question is utf8_general_ci, so I need to remove emojis, other language characters, and . The first two are easy with regex; however, I need help with the last – J. LaF Sep 01 '22 at 16:57

0 Answers0