My project takes a URL, gets the creation date of that URL, and extracts specific information from the page. All of these values are inserted into MySQL successfully as long as the text is in English or Spanish; however, whenever I encounter text in another script, such as:
بسم الله الرحمن الرحيم نسألكم الدعاء
MySQL stores it as:
??? ???? ?????? ?????? ?????? ??????
I understand that this is a UTF-8 issue. In IntelliJ, the foreign characters display correctly when I print the line, so I am assuming that whatever Jsoup retrieved is fine.
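To illustrate where I think the question marks come from, here is a minimal sketch. It assumes the text gets encoded into a single-byte charset such as ISO-8859-1 somewhere between Java and the table; that is only my assumption, and I have not confirmed which layer does it. The class and method names are made up for the illustration. Java replaces every character the target charset cannot represent with '?', while a UTF-8 round trip leaves the text intact.

```java
import java.nio.charset.StandardCharsets;

final class QuestionMarkDemo {

    // Encodes the text as ISO-8859-1 and decodes it again.
    // Characters that ISO-8859-1 cannot represent become '?'.
    static String roundTripLatin1(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Encodes the text as UTF-8 and decodes it again; nothing is lost.
    static String roundTripUtf8(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // First two words of the excerpt above, written as Unicode escapes
        // so the source file encoding does not matter.
        String sample = "\u0628\u0633\u0645 \u0627\u0644\u0644\u0647";
        System.out.println(roundTripLatin1(sample));              // prints "??? ????"
        System.out.println(roundTripUtf8(sample).equals(sample)); // prints "true"
    }
}
```

The pattern of question marks and spaces matches what I see in the table, which is why I suspect an encoding step rather than the scraping itself.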
Below is the Java code. In case it matters, I am connecting to the database with c3p0. I am confident that establishing the connection is not the problem, but I can provide that code if needed.
import org.jsoup.Jsoup;
import java.io.IOException;
import java.sql.Connection;
import java.sql.SQLException;
import java.util.logging.Level;
import java.util.logging.Logger;
import java.sql.PreparedStatement;
import org.jsoup.nodes.Document;
import java.beans.PropertyVetoException;
import com.mchange.v2.c3p0.*;

public class Connect {
    private static final String URL = "jdbc:mysql://localhost:3306/testdb?allowMultiQueries=true";
    private static final String USER = "root";
    private static final String PASSWORD = "1234";

    // Connection information here (sets up the c3p0 pool referenced below as cpds)

    // Adds the HTML information to the database.
    public static void addlink(String url, String body, String createDate, String retrieveDate) {
        String sql = "INSERT IGNORE INTO testtable(URL, Creation_Date, Retrieval_Date, Body) VALUES(?, ?, ?, ?);";
        // try-with-resources closes the statement and returns the connection to the pool
        try (Connection connection = cpds.getConnection();
             PreparedStatement statement = connection.prepareStatement(sql)) {
            statement.setString(1, url);
            statement.setString(2, createDate);
            statement.setString(3, retrieveDate);
            statement.setString(4, body);
            statement.executeUpdate();
        } catch (SQLException e) {
            // error handling
            Logger.getLogger(Connect.class.getName()).log(Level.SEVERE, null, e);
        }
    }

    // Gets the HTML information for the given URL.
    public void getPageData(String url, String retrieveDate) throws IOException {
        Document doc = Jsoup.connect(url).userAgent("Mozilla").get();
        String str = doc.body().text();
        int endOfBody = str.length(); // for cutting out needless info in the HTML text
        StringBuilder body = new StringBuilder(str);
        body.replace(0, 25, ""); // cut out unnecessary header info
        body.replace(endOfBody - 128, endOfBody, ""); // cut out unnecessary trailer info
        String finalBody = body.toString();
        // the creation date is the last 10 characters of the trimmed text
        String createDate = finalBody.substring(finalBody.length() - 10, finalBody.length());
        addlink(url, finalBody, createDate, retrieveDate);
    }
}
As for changes I have made to the database: the Body column is stored as MEDIUMTEXT, and I ran:
mysql> ALTER TABLE testtable
-> DEFAULT CHARACTER SET utf8
-> collate utf8_general_ci
-> ;
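One thing I am unsure about: as far as I understand, ALTER TABLE ... DEFAULT CHARACTER SET only changes the default for columns added later, while ALTER TABLE ... CONVERT TO CHARACTER SET also converts the existing columns. Connector/J also accepts a characterEncoding property on the JDBC URL. Below is a rough sketch of both pieces, not something I have verified against my setup. The class Utf8Setup and its two helper methods are hypothetical names I made up; they only build strings and do not touch the database.

```java
final class Utf8Setup {

    // Hypothetical helper: appends the Connector/J characterEncoding property
    // to a JDBC URL, using '&' if the URL already has a query string.
    static String withUtf8(String jdbcUrl) {
        String separator = jdbcUrl.contains("?") ? "&" : "?";
        return jdbcUrl + separator + "characterEncoding=UTF-8";
    }

    // Hypothetical helper: builds the statement that converts the existing
    // columns of a table, not just the table default.
    static String convertTableSql(String table, String charset, String collation) {
        return "ALTER TABLE " + table
                + " CONVERT TO CHARACTER SET " + charset
                + " COLLATE " + collation;
    }

    public static void main(String[] args) {
        // Same URL, table, charset, and collation as in my code and ALTER TABLE above.
        System.out.println(withUtf8("jdbc:mysql://localhost:3306/testdb?allowMultiQueries=true"));
        System.out.println(convertTableSql("testtable", "utf8", "utf8_general_ci"));
    }
}
```

If this is the right direction, the first string would replace the URL constant in Connect and the second would be run once against the table.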
Thanks in advance for any insight you all could share.
Edit: This has been marked as a duplicate, but the linked post covers only one of the steps needed to get MySQL to store Unicode.