I'm working on processing Tweets from Twitter and storing them in a database (MySQL).

I have my process running perfectly but sometimes I get an error like this one:

2012-08-31 08:11:23,303 WARN org.hibernate.engine.jdbc.spi.SqlExceptionHelper  - SQL Error: 1366, SQLState: HY000
2012-08-31 08:11:23,304 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper  - Incorrect string value: '\xF0\x9F\x98\x9D #...' for column 'twe_text' at row 1

When looking for the problematic tweet in my logs I find the following one:

 2012-08-31 08:11:22,971 INFO com.myapp.TweetLoaderJob  - Text for tweet 241175722096480256: RT @totallytoyosi_: My goodies, my goodies, not your goodies  <U+1F61D> #m&ms #sweeties #goodies #food  @ The Ritzy Cinema Café, Brixton htt ...

Finally, looking up what <U+1F61D> is, I discovered that it is an emoticon, which Twitter sends as-is.

I have debugged with only this specific tweet, and Eclipse does not seem to recognize the character. So the question is: how can I handle this exception? I looked into reconfiguring my MySQL database, but I cannot change the encoding (it's a requirement), so my options are to skip tweets like this one or to strip the problematic character.

But how can I do that if Java does not recognize the character?

Alex_ES

1 Answer

You could filter your strings and remove the unwanted part (with a simple regex such as `<U\+[^>]+>`) before storing them in your database.
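
If the tweet contains the actual character U+1F61D rather than the literal text `<U+1F61D>`, a text regex like the one above will not match it. The sketch below is one possible alternative, not something from the original thread: it drops every code point above U+FFFF before the text is stored. It assumes the column uses MySQL's three-byte `utf8` character set (the question does not say), which cannot hold four-byte sequences such as `\xF0\x9F\x98\x9D`. The class and method names are hypothetical.

```java
public class TweetSanitizer {

    // Hypothetical helper: removes supplementary code points (above U+FFFF),
    // which need four bytes in UTF-8. Everything else is kept unchanged.
    static String stripSupplementary(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (!Character.isSupplementaryCodePoint(cp)) {
                out.appendCodePoint(cp);
            }
            // Advance by one char, or by two for a surrogate pair.
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // \uD83D\uDE1D is the surrogate pair for U+1F61D.
        String tweet = "goodies \uD83D\uDE1D #m&ms caf\u00E9 \u00A3";
        System.out.println(stripSupplementary(tweet));
    }
}
```

Unlike an ASCII-only filter, this keeps characters such as £ and é, because they lie inside the Basic Multilingual Plane.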

m4573r
  • I have just tried it, but it doesn't work... I guess the problem is the encoding, not the text received. – Alex_ES Sep 03 '12 at 06:47
  • I just found this: `s = s.replaceAll("[^\\x00-\\x7f]", "");`. Would that work for you? – m4573r Sep 03 '12 at 08:00
  • In the end, I solved the problem with a dirty workaround (it solves the problem in the same way, except that a question mark is stored in place of the emoticon): `new String(status.getText().getBytes("ISO-8859-1"));` – Alex_ES Sep 04 '12 at 09:45
  • I have tested your solution @m4573r. It works fine, but I lose characters like _£_. Where did you find that regex? Maybe I could refine it a bit more... – Alex_ES Sep 04 '12 at 11:38
  • From [a similar question](http://stackoverflow.com/questions/5008422/how-to-remove-high-ascii-characters-from-string-like-in-java) on SO. I guess it's definitely possible to modify the range of characters you want to filter out. – m4573r Sep 04 '12 at 11:46
  • Solved the problem (yesterday) with this regex: `[^\\x00-\\x7f-\\x80-\\xad]`. Thanks! – Alex_ES Sep 05 '12 at 10:31
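
The trade-off discussed in the comments above can be seen side by side in a small sketch. The first method uses the ASCII-only regex quoted in the comments; the second widens the range to `\xff` (the Latin-1 range), which is an illustrative variation and not the exact regex the asker ended up with. The class and method names are hypothetical.

```java
public class RegexFilterDemo {

    // Regex from the comments: keeps ASCII only, so £ is removed as well.
    static String asciiOnly(String s) {
        return s.replaceAll("[^\\x00-\\x7f]", "");
    }

    // Assumed variation: keeps U+0000..U+00FF, so £ survives
    // while the emoticon is still removed.
    static String latin1Only(String s) {
        return s.replaceAll("[^\\x00-\\xff]", "");
    }

    public static void main(String[] args) {
        // \u00A3 is the pound sign; \uD83D\uDE1D is U+1F61D.
        String text = "\u00A35 \uD83D\uDE1D";
        System.out.println(asciiOnly(text));
        System.out.println(latin1Only(text));
    }
}
```

Java's regex engine matches by code point, so the surrogate pair for the emoticon is removed as a single unit rather than leaving half a pair behind.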