1

I'm using Workbench and MySQL to load an XML file into a database. The XML file (which I didn't create) has tags inside of tags, some of which are in Japanese. I'm not sure if I'm dealing with UTF-8 correctly.

I've tried setting the characters to utf8 and utf8mb4, and collating to utf8mb4_unicode_520_ci, utf8_general_ci, utf8mb4_unicode_ci, and utf8mb4_unicode_520_ci. I've also read through the section in the MySQL reference manual about Chinese, Korean, and Japanese (CKJ) characters. It seems like people usually get "??" with CKJ errors that involve UTF8, but we're getting NULLs. I've tried switching between CHARs, VARCHARs, NVARCHARs, and NCHARs too...

Our XML has entries like:

<entry>
  <ent_seq>1000220</ent_seq>
  <k_ele>
    <keb>明白</keb>
    <ke_pri>ichi1</ke_pri>
    <ke_pri>news1</ke_pri>
    <ke_pri>nf10</ke_pri>
  </k_ele>
  <r_ele>
    <reb>めいはく</reb>
    <re_pri>ichi1</re_pri>
    <re_pri>news1</re_pri>
    <re_pri>nf10</re_pri>
  </r_ele>
  <sense>
    <pos>&adj-na;</pos>
    <gloss>obvious</gloss>
  </sense>
</entry>

And the MySQL is:

CREATE DATABASE main_db DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

USE main_db;

CREATE TABLE IF NOT EXISTS jmdict(
  ent_seq VARCHAR(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL PRIMARY KEY,
  keb VARCHAR(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci,
  reb VARCHAR(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci,
  pos VARCHAR(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci,
  lsource VARCHAR(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci,
  gloss VARCHAR(2000)  CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci # english
);

LOAD XML INFILE "[file location]"
INTO TABLE jmdict
CHARACTER SET utf8mb4
ROWS IDENTIFIED BY '<entry>';

I expected to have "明白" in the keb field and "めいはく" in the reb field, but instead I'm getting NULL in both of those fields (and I correctly get "1000220" in ent_seq, "&adj-na;" for pos, and "overt" for gloss).

Johnson789
  • 11
  • 3

0 Answers0