1

I have a script in PHP that stores values to a MySQL database from a web store. The store allows customers to leave a message which can create havoc when they use emojis. To prevent these characters from breaking my script I've used FILTER_SANITIZE_STRING and FILTER_FLAG_STRIP_HIGH on all my strings prior to sending them all to MySQL.

This works well except for when I display it again in a Java program I've written I'll have things like "I'm" instead of "I'm".

Is there a way to have Java find and convert the ASCII values back into characters?

My current plan of attack is to have a function that takes each relevant string column, examines each word looking for &#, finds the position of the simi-colon after the &#, replaces that value with the corresponding ASCII character, and returns the new string.

It's doable, but I'm hoping there is an existing means to do this without re-inventing the wheel.

Edit: Thank you to @rzwitserloot for pointing me in the right direction, for anyone who sees this and does not read my comment in his answer, I ended up using JSoup. Here is a snippet of the final code section related to this on the Java side for anyone else working through this:

// Connect method opens a connection to the MySQL server 
connect();
// Query the MySQL server 
resultSet = statement.executeQuery("select * from order_tracking order by DateOrdered");

// If there is any result, iterate through them until the end is reached. 
while (resultSet.next()) { 
  // Add each returned row into the list to send to the table
  Jsoup.parse(resultSet.getString(2)).text()
.
.
.
}

The .text() at the end of the Jsoup.parse(String) gets rid of the html formatting (i.e. <Head><Body> etc) that Jsoup automatically throws in and returns only the text portion with the &#38; (or whatever ascii value it might be) properly formatted.

Thanks!

Abishur
  • 189
  • 3
  • 13
  • see this [answer](https://stackoverflow.com/a/23392684/2310289) it looks like encoding is optional so maybe don't do it? – Scary Wombat Jan 21 '19 at 02:58
  • What is your DB's character set? If you're using `utf8mb4` you should be able to store the emojis as well. If that is the correct character set maybe you need to use parameterized queries – user3783243 Jan 21 '19 at 03:02
  • @user3783243 my database's character set is utf8_general_ci. I am using currently using parameterized queries. – Abishur Jan 21 '19 at 03:09
  • Are you seeing `"I'm"` or `"I'm"`? That is, is there a semicolon in there you left out in your question, or is that missing? –  Jan 21 '19 at 03:12
  • @ScaryWombat, when I don't filter out the emoji's the insert command fails. I'll check out the link you provided – Abishur Jan 21 '19 at 03:13
  • @duskwuff I am seeing "I'm;" I will edit the post to correct this omission. – Abishur Jan 21 '19 at 03:14
  • If you have `utf8mb4` available you should use that. It will solve your issue. – user3783243 Jan 21 '19 at 03:20

2 Answers2

1

The best solution is to just fix the initial bit: Of course databases (and mysql in general) can store emojis, but mysql is weird. utf8 isn't utf8, it's misnamed. The real utf8 in mysql is called utf8mb4. Use that encoding and you can store smiley's just fine.

If that option somehow doesn't work for you, your strings are HTML encoded. The solution is to HTML-decode them. Java doesn't ship with one out of the box, you need a dependency. There's this, for example: http://commons.apache.org/proper/commons-lang/javadocs/api-3.1/org/apache/commons/lang3/StringEscapeUtils.html#unescapeHtml4(java.lang.String)

rzwitserloot
  • 85,357
  • 5
  • 51
  • 72
  • Unfortunately, the code on the PHP side is so extensive that making use of utf8mb4 and getting rid of the `FILTER_SANITIZE_STRING` would be a massive undertaking. So even though I think changing the database is the most correct action, I'll have to go the HTML-decode route. I've resolved the problem by using JSoup (https://jsoup.org/) on one line on the Java side. Thanks! – Abishur Jan 21 '19 at 06:05
0

You have HTML-escaped entities in your database. This isn't ideal, but it's easy to reverse. Pass the string to html_entity_decode() to reverse this process.

  • I'll remember the html_entity_decode() if I have to reverse it in PHP at some point down the road. For this question, however, I need to decode it in Java. I ended up using Jsoup as it let me solve the issue with a single line of code in my Java program. Thanks! – Abishur Jan 21 '19 at 06:07