I'm getting a SQL error during a large data conversion for a few entries caused by trying to insert invalid characters into a column with collation utf8_general_ci. An example of an offending string is
I'm already using a regular expression similar to that described in the answer on Remove ✅, 🔥, ✈ , ♛ and other such emojis/images/signs from Java strings however, the above string makes it through that regex filter, because the characters are letters, and hence p{L}
is true for them.
For proof:
String walgreens = "\uD835\uDC16\uD835\uDC1A\uD835\uDC25\uD835\uDC20\uD835\uDC2B\uD835\uDC1E\uD835\uDC1E\uD835\uDC27\uD835\uDC2C";
String characterFilter = "[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\p{Sc}\\p{Punct}\\p{ASCII}\\™\\®\\©\\s]";
System.out.println(walgreens.replaceAll(characterFilter, ""));
and similarly, just to illustrate the exact filter used in the above SO answer:
String walgreens = "\uD835\uDC16\uD835\uDC1A\uD835\uDC25\uD835\uDC20\uD835\uDC2B\uD835\uDC1E\uD835\uDC1E\uD835\uDC27\uD835\uDC2C";
String characterFilter = "[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]";
System.out.println(walgreens.replaceAll(characterFilter, ""));
Both print to the console.
I need a way to programmatically remove these characters from Strings, preferably via regex for consistency, so I can properly make updates to my MySql DB. Updating the column collation is not an option due to the size of the tables.
Here's a link to a unicode page for the 'W' in the offending string, if helpful
I'm currently on MySql version 8.0.23 and Java 8, using EclipseLink (JPA) as my ORM.
Heres the error
javax.persistence.PersistenceException: Exception [EclipseLink-4002] (Eclipse Persistence Services - 2.5.0.v20130507-3faac2b): org.eclipse.persistence.exceptions.DatabaseException
Internal Exception: java.sql.SQLException: Incorrect string value: '\xF0\x9D\x99\x88\xF0\x9D...' for column