
My project takes a URL, gets the creation date of that URL, and extracts specific information from the page. All of these values are stored successfully in MySQL as long as the text is in English or Spanish; however, whenever I encounter an excerpt in another script, such as:

بسم الله الرحمن الرحيم نسألكم الدعاء

MySQL stores it as:

??? ???? ?????? ?????? ?????? ??????

I understand that this is a UTF-8 issue. In IntelliJ, the foreign characters display fine when I print the line, so I am assuming that whatever Jsoup retrieved is fine.
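For readers wondering where the question marks come from, here is a minimal sketch of the usual mechanism: when text is encoded with a character set that cannot represent it, each unmappable character is replaced by `?`. The class and method names are illustrative, and ISO-8859-1 is only an assumed stand-in for whatever non-UTF-8 encoding the connection or column is actually using.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class QuestionMarkDemo {
    // Encodes the text with the given charset and decodes it again,
    // imitating a round trip through a link that uses that charset.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // The first two words of the excerpt, written as Unicode escapes so the
        // source file compiles the same way regardless of its own encoding.
        String arabic = "\u0628\u0633\u0645 \u0627\u0644\u0644\u0647";

        // ISO-8859-1 has no Arabic letters, so each one becomes '?'.
        System.out.println(roundTrip(arabic, StandardCharsets.ISO_8859_1));

        // UTF-8 can represent every character, so the text survives intact.
        System.out.println(roundTrip(arabic, StandardCharsets.UTF_8).equals(arabic));
    }
}
```

The spaces survive because they exist in both encodings, which is why the stored value keeps the word boundaries of the original.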

Below is the Java code. In case it is important, I am connecting to the database with c3p0. I am confident that establishing the connection is not the problem, but I can provide that code if needed.

import org.jsoup.Jsoup;
import java.io.IOException;
import java.sql.Connection;
import java.sql.SQLException;
import java.util.logging.Level;
import java.util.logging.Logger;
import java.sql.PreparedStatement;
import org.jsoup.nodes.Document;
import java.beans.PropertyVetoException;
import com.mchange.v2.c3p0.*;


public class Connect {

private static final String URL = "jdbc:mysql://localhost:3306/testdb?allowMultiQueries=true";
private static final String USER = "root";
private static final String PASSWORD = "1234";

//Connection information here

public static void addlink(String url, String body, String createDate, String retrieveDate) { // adds html information to the database
    Connection connection = null;
    PreparedStatement statement = null;
    try {
        connection = cpds.getConnection();
        statement = connection.prepareStatement("INSERT IGNORE INTO testtable(URL, Creation_Date, Retrieval_Date, Body) VALUES(?, ?, ?, ?);");
        statement.setString(1, url);
        statement.setString(2, createDate);
        statement.setString(3, retrieveDate);
        statement.setString(4, body);
        statement.executeUpdate();
    } catch (SQLException e) {
        // error handling
    }
}


public void getPageData(String url, String retrieveDate) throws IOException { // gets the html information
    Document doc = Jsoup.connect(url).userAgent("Mozilla").get();
    String str = doc.body().text();
    int endOfBody = str.length(); //for cutting out needless info in html text
    StringBuilder body = new StringBuilder(str);
    body.replace(0, 25, ""); // cut out unnecessary header info
    body.replace(endOfBody - 128, endOfBody, ""); // cut out unnecessary trailer info
    String finalBody = body.toString();
    String createDate = finalBody.substring(finalBody.length()-10, finalBody.length());
    addlink(url, finalBody, createDate, retrieveDate);
    }
}

As for changes I have made to the database: the Body column is MEDIUMTEXT, and I ran:

mysql> ALTER TABLE testtable
-> DEFAULT CHARACTER SET utf8
-> collate utf8_general_ci
-> ;

Thanks in advance for any insight you all could share.

Edit: This has been marked as a duplicate, but the linked post covers only one of the steps needed to get MySQL to store Unicode.

  • try editing your connection string to something like this `jdbc:mysql://server/database?characterEncoding=UTF-8` – Enwired Nov 03 '16 at 23:51
  • I doubt this is an issue with intellij; you might want to consider removing the tag. – ChiefTwoPencils Nov 03 '16 at 23:56
  • Possible duplicate of [Java PreparedStatement UTF-8 character problem](http://stackoverflow.com/questions/3828818/java-preparedstatement-utf-8-character-problem) – Enwired Nov 03 '16 at 23:56
  • @Enwired I added it in, got the same result, a line of ?s. Could it be Jsoup? – adespotakis Nov 04 '16 at 02:06

1 Answer


It turns out that UTF-8 needs to be specified in several places for this to work. Here is the outline:

1) Append the following to the URL you use to connect to MySQL (credit goes to @Enwired):

useUnicode=yes&characterEncoding=UTF-8

So you get:

URL = "jdbc:mysql://localhost:3306/testdb?useUnicode=yes&characterEncoding=UTF-8";
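The URL in the question already carries a query string (`?allowMultiQueries=true`), so the new parameters have to be joined with `&` rather than `?`. A minimal sketch of that step, where `JdbcUrlDemo` and `withUtf8` are hypothetical names introduced only for illustration:

```java
public class JdbcUrlDemo {
    // Hypothetical helper: appends the encoding parameters from step 1,
    // choosing '?' or '&' depending on whether the URL already has a query string.
    static String withUtf8(String jdbcUrl) {
        String separator = jdbcUrl.contains("?") ? "&" : "?";
        return jdbcUrl + separator + "useUnicode=yes&characterEncoding=UTF-8";
    }

    public static void main(String[] args) {
        // URL without parameters, as in the answer.
        System.out.println(withUtf8("jdbc:mysql://localhost:3306/testdb"));
        // URL with an existing parameter, as in the question.
        System.out.println(withUtf8("jdbc:mysql://localhost:3306/testdb?allowMultiQueries=true"));
    }
}
```

The resulting string is what would be handed to the connection pool in place of the original `URL` constant.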

2) When you are adding the entry, add the following to the code:

java.sql.Statement unicode = null;
try {
    // note: how you obtain the connection does not matter
    connection = cpds.getConnection();
    unicode = connection.createStatement();
    // SET statements do not return a result set, so execute() is used
    // here instead of executeQuery()
    unicode.execute("SET NAMES 'UTF8'");
    unicode.execute("SET CHARACTER SET 'UTF8'");
    // Other prepared statements.
} catch (SQLException e) {
    // ...
}

3) Go into MySQL and change the collation of the database, table, and column that will receive UTF-8 characters (see "How to change the default collation of a database?").
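Step 3 could be sketched as follows. The names `testdb` and `testtable` and the `utf8` / `utf8_general_ci` pair come from the question; the class, the helper names, and the use of `CONVERT TO` are assumptions of this sketch. `CONVERT TO` is chosen because `ALTER TABLE ... DEFAULT CHARACTER SET` changes only the default for columns added later, whereas `CONVERT TO CHARACTER SET` also converts the existing text columns.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Arrays;
import java.util.List;

public class CharsetMigration {
    // Hypothetical helper: builds the ALTER statements for one database and table.
    // The names are concatenated into the SQL, so pass only trusted identifiers.
    static List<String> alterStatements(String database, String table) {
        return Arrays.asList(
            "ALTER DATABASE " + database + " CHARACTER SET utf8 COLLATE utf8_general_ci",
            "ALTER TABLE " + table + " CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci");
    }

    // Runs the statements on an existing connection (for example one from the pool).
    static void apply(Connection connection, String database, String table) throws SQLException {
        try (Statement statement = connection.createStatement()) {
            for (String sql : alterStatements(database, table)) {
                statement.execute(sql);
            }
        }
    }

    public static void main(String[] args) {
        // Prints the statements so they can be reviewed or pasted into the mysql client.
        for (String sql : alterStatements("testdb", "testtable")) {
            System.out.println(sql + ";");
        }
    }
}
```

The same two statements can be typed directly at the `mysql>` prompt; the Java wrapper is only a convenience for running them through the existing connection.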

Your MySQL server should now accept Unicode.
