2

I've got a .txt File, which contains millions of postalcodes. There are all well-formatted (readable). My Goal is, to import the postalcodes into a MySql-Database.
I need to convert the postalcodes from the .txt File into a .sql File.

I wrote an Java Application to Convert the important parts into an SQL-File.

EDIT: The .txt File contains many information. That's why I want to read the File (In Java), filter the File content and finally create a .sql File. Many thanks to any Solution written in "Python, SQL, C++, BrainF***" or other languages, but I would appreciate a Solution for my specific Java Code. The .txt File uses UTF-8 encoding.

After convert into ".sql", the content has several issues. For Example the Cyrillic Character "Я" is not "known" in the SQL-File. I assume, that the encoding is wrong.
It would be nice, if someone could help me to get a clue how to resolve this.

TXT-File: TXT-File

SQL-File: SQL-File

try (BufferedReader br = new BufferedReader(
            new FileReader(GeoData.class.getResource(sourceFilenameInput.getText().trim()).getFile().trim()))) {
        for (String line; (line = br.readLine()) != null;) {
            GeoData geoData = new GeoData();
            geoData.addOrt(getPlaceFromFile(line));
        }

getPlaceFromFile Method:

private String getPlaceFromFile(String line) {
    String[] placeHolder = line.split("\t");
    if (placeHolder .length > 2) {
        for (int i = 0; i < placeHolder .length - 2; i++) {
            if (!placeHolder [i + 2].trim().isEmpty() && placeHolder [i + 2].trim().length() > 3) {
                return filterPlace(placeHolder [i + 2].trim(), "'", "\\", "^", ";", "*", "|");
            }
        }
    }
    return "EMPTY";
}

FilterPlace Method:

private String filterPlace(String place, String... filter) {
    String newPlace = place;
    for (String string : filter) {
        if (newPlace .trim().contains(string))
            newPlace = newPlace .trim().replace(string, " ");
    }
    return newPlace;
}

What have I tried so far?
I replaced the FileReader with an InputStreamReader and used the FileInputStream with the Charset UTF-8. After converting in UTF-8 the SQL-File looked like this:
UTF8 Converted SQL-FILE

Michael
  • 59
  • 1
  • 8
  • Can you get the hex of a line in the .txt file that has `Я`? – Rick James Jan 14 '18 at 18:10
  • `D18F я D0AF Я` -- You are asking about uppercase YA, but showing lowercase YA. – Rick James Jan 14 '18 at 18:16
  • Hex of the complete String from line 6: c390e2809dc391e282acc391efbfbdc390c2bdc390c2bac390c2bec390c2b2c390c2b5c391e280a0202f2044726a616e6b6f766563 – Michael Jan 14 '18 at 18:19
  • The Я was just an example... there are more Cyrillic Characters, which are unknown (as described in my Question) – Michael Jan 14 '18 at 18:21
  • Doing away with the old utility classes FileReader and FileWriter is just fine. Just the small Cyrillic ya seems to have been corrupted. Do you somewhere have a `new String(string.getBytes(str, ...), ...)` / something with 0x8F? – Joop Eggen Jan 15 '18 at 13:02
  • Nope, not using any "new String" and / or "str.getBytes(..., ..)". What you see is what I use (Code). – Michael Jan 15 '18 at 19:50
  • Still not resolved... still no clue. For all new answers: Please read the Question and what Code I'm using. I'm **not** using any "hidden"-Code neither I'm "hiding" any Code. This simple Lines are all I've got. – Michael Jan 16 '18 at 13:38

2 Answers2

1

I've found an answer.

I just edited the encoding Format from eclipse to solve the Problem.

Workspace > Preferences > General > click on Workspace and edit the Text file encoding from Cp1252 to UTF-8.

Thats all and it worked.

Skandix
  • 1,916
  • 6
  • 27
  • 36
Michael
  • 59
  • 1
  • 8
0

If the .txt columns are split on \t, do not convert before loading. Instead simply use the LOAD DATA INFILE ... SQL statement, something like

 LOAD DATA LOCAL INFILE 'the_file.txt'
     INTO tablename
     CHARACTER SET utf8mb4
     COLUMNS TERMINATED BY "\t"
     LINES TERMINATED BY "\n"
     (col1, zip, name);

I don't know how to run that from Java. But none of the code you presented should be needed.

Your comment has some garbage in it --

E2808B     8203=x200B  [​]   BN  ZERO WIDTH SPACE
E2808C     8204=x200C  [‌]   BN  ZERO WIDTH NON-JOINER

(I spotted it by double clicking on the hex to copy it. But it stopped in the middle!)

Д is hex D094, but if you double encode it, you get hex C390 E2809D. See that link for likely causes of such.

Rick James
  • 135,179
  • 13
  • 127
  • 222
  • Sorry, but I dont understand your Answer... I don't know what you exactly want from me. – Michael Jan 14 '18 at 18:23
  • Please show me, where I encode in my whole Code. I can't find it. – Michael Jan 14 '18 at 18:27
  • Toss your conversion code and use `LOAD DATA` instead. If you still get `C3...` instead of `D0...`, then read about "double encoding". You have not set "utf8" in all the places needed. – Rick James Jan 14 '18 at 18:27
  • (Don't worry about the E2...; fix the rest.) – Rick James Jan 14 '18 at 18:28
  • You're being cryptic now, Rick. Where should I add "LOAD DATA"?? I really don't get it. It would be nice, if you could be more specific. For Example: "Hey, second Line of Code, replace XX with YY and you are good to go ;)" – Michael Jan 14 '18 at 18:33
  • I spelled out `LOAD DATA`, but without the specifics for your table. And gave you a reference to the manual page. – Rick James Jan 14 '18 at 19:57
  • I asked about a Solution for my **Java** Application, not for an alternative SQL Solution. – Michael Jan 15 '18 at 11:10