I'm using Jsoup to scrape a webpage. It takes the text and enters it directly into the database.
The text on the target webpage looks perfectly fine, but after entering it into the database I get question marks replacing certain characters.
For example, take the right single quotation marks (U+2019) in the following sentence:
I can’t imagine uh, a domain of human endeavor that isn’t impacted by the imagination.
It shows up like this in the database and on the webpage I'm outputting it on:
I can?t imagine uh, a domain of human endeavor that isn?t impacted by the imagination.
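For what it's worth, this is the same substitution Java makes when a string is encoded into a charset that has no mapping for U+2019. Below is a minimal sketch of that effect with no database involved. The class and method names are made up for illustration, and ISO-8859-1 is only an example of such a charset, not something confirmed about this setup:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class QuestionMarkDemo {
    // Encode the text to bytes in the given charset, then decode the bytes
    // back with the same charset. String.getBytes(Charset) replaces characters
    // the charset cannot represent with that charset's replacement byte.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String scraped = "I can\u2019t imagine";

        // UTF-8 can represent U+2019, so the round trip is lossless.
        System.out.println(roundTrip(scraped, StandardCharsets.UTF_8).equals(scraped)); // true

        // ISO-8859-1 has no U+2019, so it is replaced by '?'.
        System.out.println(roundTrip(scraped, StandardCharsets.ISO_8859_1)); // I can?t imagine
    }
}
```

If the text passes through a non-Unicode encoding at any step between the scrape and the insert, the result would look like the sentence above.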
Initially I thought this was just a problem with the charset/collation of the database, but after trying out different ones the problem persists.
The MySQL database I'm currently working in is set to utf8:
mysql> SHOW VARIABLES LIKE 'character\_set\_%';
+--------------------------+--------+
| Variable_name            | Value  |
+--------------------------+--------+
| character_set_client     | utf8   |
| character_set_connection | utf8   |
| character_set_database   | utf8   |
| character_set_filesystem | binary |
| character_set_results    | utf8   |
| character_set_server     | utf8   |
| character_set_system     | utf8   |
+--------------------------+--------+
And the meta tag is set:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
I've tried setting it explicitly in the JDBC URL in Java like so:
url = "jdbc:mysql://localhost:3306/somedb?useUnicode=true&characterEncoding=utf-8";
I've tried SQL queries like:
SET NAMES 'utf8'
SET CHARACTER SET utf8
I've tried creating a new database, and nothing seems to work.
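To narrow down where the character gets lost, one possible check is to dump the code points of the string at each stage, for example right after scraping and right before the INSERT. The following is only a rough sketch using the standard library. `CodePointDump` and `dump` are hypothetical names, not part of Jsoup or JDBC:

```java
public class CodePointDump {
    // Render each code point of the text as U+XXXX, separated by spaces.
    static String dump(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Quote still intact: the string holds U+2019.
        System.out.println(dump("n\u2019t")); // U+006E U+2019 U+0074

        // Quote already lost: the string holds a literal question mark.
        System.out.println(dump("n?t"));      // U+006E U+003F U+0074
    }
}
```

If the dump already shows U+003F before the text reaches the database, the replacement presumably happened on the Java side rather than in MySQL.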
Any ideas why this might be happening?