I'm using jsoup to get all text from websites.

Document doc = Jsoup.connect("URL").get();
String allText = doc.text().toLowerCase();

Then I'm using Hibernate to persist the object that holds all text to a MySQL DB:

...
@Column(name="all_text")
@Lob
private String allText = null;
...

This works so far, except that I sometimes get a MySQL error when I try to save the object with allText:

java.sql.SQLException: Incorrect string value: '\xF0\x9F\x98\x8A s...' for column 'all_text' at row 1

I already looked this up and it's an encoding error: some of the websites probably contain special characters. I found a way to fix this by changing the encoding in the DB.

But my actual question is: what's the best way to filter and remove the special characters from the allText string and not persist them at all?

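EDIT: To clarify, by special characters I mean emoticons, emoji and similar symbols. Definitely anything the column's encoding cannot store (the \xF0\x9F\x98\x8A in the error above is a 4-byte UTF-8 sequence, which a 3-byte charset such as MySQL's utf8 rejects). I'm not concerned about ~ ^ etc.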

Thanks in advance!

co ting
  • Possible duplicate of [How to remove special characters from a string?](https://stackoverflow.com/questions/7552253/how-to-remove-special-characters-from-a-string) – John Humphreys Jan 17 '18 at 21:13
  • Sorry for close vote; there are a lot of versions of this question on SO already though, and the first asker is supposed to get the points/credit :). Doing this in Java is always a little frustrating. There is a good answer for this here already though: https://stackoverflow.com/a/7552284/857994. Most answers I've seen use regex but it always depends on your use case's definition of "special characters". – John Humphreys Jan 17 '18 at 21:14
  • Just edited. Yes, I saw the other solutions, but none really address my problem. Maybe "special characters" was the wrong expression: I'm talking about all the emoticons and whatnot that people could use. Not sure if there's a regex that can cover all of them? – co ting Jan 17 '18 at 21:25

1 Answer

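Just use a regex:

allText = allText.replaceAll("\\p{C}", "");

Strings are immutable, so assign the result back to allText. String.replaceAll needs no import; java.util.regex.Pattern is only required if you precompile the pattern yourself.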
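
One caveat about \p{C}: it covers Unicode's "Other" categories (control, format, surrogate, private-use and unassigned code points), while an emoji such as U+1F60A, the character in the error message, is classified as a symbol, so it may well survive that replacement. If the column uses MySQL's 3-byte utf8 charset (an assumption, since the question doesn't say), the characters it cannot store are the code points above U+FFFF. Below is a minimal sketch that drops exactly those; the class and method names are made up for illustration.

```java
final class SupplementaryFilter {

    // Hypothetical helper: removes every code point above U+FFFF
    // (the ones that need 4 bytes in UTF-8) and keeps everything else.
    static String stripSupplementary(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints()                                        // iterate by code point, not by char
             .filter(cp -> !Character.isSupplementaryCodePoint(cp))
             .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        // \uD83D\uDE0A is the surrogate pair for U+1F60A, the emoji from the error message.
        String sample = "nice day \uD83D\uDE0A see you";
        System.out.println(stripSupplementary(sample));
    }
}
```

Characters like ~ and ^, accented letters and everything else in the Basic Multilingual Plane pass through unchanged. A regex such as allText.replaceAll("[^\\u0000-\\uFFFF]", "") should have the same effect, since Java's regex engine matches by code point.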

dvijok