
I have a Gujarati Bible and am trying to insert each verse into a MySQL database using a parser written in Java. When I assign Gujarati text to a Java String variable, the debugger shows junk characters.

E.g. This is my Gujarati text

હે યહોવા તું મારો દેવ છે;

I assign it to Java String variable as shown below

verse._verseText = "હે યહોવા તું મારો દેવ છે;";

What I see in the debug window is all junk characters. Any help is appreciated. If you need more information, let me know and I will provide it.

UPDATE Pasting my parser code here

private Boolean Insert(String _text)
{
    BibleVerse verse = new BibleVerse();
    String[] data = _text.split("\\|");
    try
    {
        if (data[0].equals(bookName) || bookName.equals("All"))
        {
            verse._Version = "Gujarati";
            verse._book = data[0];
            verse._chapter = Integer.parseInt(data[1]); 
            verse._verse = Integer.parseInt(data[2]);
            verse._verseText = new String(data[3].getBytes(), "UTF-8");
            _bibleDatabase.Insert(verse);
            pcs.firePropertyChange("logupdate", null, data[0] + " " + data[1] + "," + data[2] + " - INSERTED.");
        }
        else
        {
            pcs.firePropertyChange("logupdate", null, data[0] + " " + data[1] + "," + data[2] + " - SKIPPED.");
        }
        return true;
    }
    catch(Exception e)
    {
        pcs.firePropertyChange("logupdate", null, "ERROR : " + e.getMessage());
        return false;
    }       
}

Here is the sample line from the text file

Isaiah|25|1|હે યહોવા તું મારો દેવ છે; હું તને મોટો માનીશ, હું તારા નામની સ્તુતિ કરીશ; કેમકે તેં અદભુત કાર્યો કર્યાં છે, તેં વિશ્વાસુપણે તથા સત્યતાથી પુરાતન સંકલ્પો પાર પાડ્યા છે.

UPDATE Here is the code where I open & read file.

try 
    {
        FileReader _file = new FileReader(this._filename);  
        _bufferedReader = new BufferedReader(_file);

        SwingWorker parseWorker = new SwingWorker()
        {
            @Override
            protected Object doInBackground() throws Exception 
            {
                String line;
                String[] data;
                int lineno=0;
                BibleVerse verse = new BibleVerse();

                while ((line = _bufferedReader.readLine()) != null) 
                {
                    ++lineno;
                    pcs.firePropertyChange("pgbupdate", null, lineno);
                    Insert(line);
                }
                _bufferedReader.close();
                return null;
            }

            @Override
            protected void done()
            {
                pcs.firePropertyChange("logupdate", null, "Parsing complete.");
            }
        };
        parseWorker.execute();
    } 
    catch (Exception e) 
    {
        pcs.firePropertyChange("logupdate", null, "ERROR : " + e.getMessage());
    }
Sherebyah Tishbi
  • How are you compiling? Are you specifying a file encoding when you compile? What operating system are you using? (The OS determines the default charset.) What is “debug window”? Is it a terminal, a Windows command window, an IDE output pane? Are you trying to display the hard-coded string shown in your question, or a string you retrieved from the database? – VGR Dec 10 '15 at 15:33
  • you need to use UTF-8 encoding. See my answer for a similar question for Arabic [Answer on Arabic Text](http://stackoverflow.com/questions/34113606/how-can-i-show-arabic-query-search-from-mysql-by-javafx/34131549#34131549) – Sabir Khan Dec 10 '15 at 15:36
  • @VGR I am trying to insert into MySQL from a Java parser. The parser takes a text file as input and parses the whole file line by line, extracting the text and then inserting it into MySQL. The debug window is the Eclipse debug perspective. – Sherebyah Tishbi Dec 10 '15 at 15:55
  • @Sabir_Khan I tried your solution but it seems it is working partially. This is what I see now as an output of UTF-8 encoded string which still has some junk characters `હે યહોવા ત�?ં મારો દેવ છે; હ�?ં તને મોટો માનીશ, હ�?ં તારા નામની સ�?ત�?તિ કરીશ; કેમકે તેં અદભ�?ત કાર�?યો કર�?યાં છે, તેં વિશ�?વાસ�?પણે તથા સત�?યતાથી પ�?રાતન સંકલ�?પો પાર પાડ�?યા છે.` – Sherebyah Tishbi Dec 10 '15 at 16:00
  • Which IDE are you using? Do you see the correct value in the DB (if a select is run directly in some GUI tool)? – Sabir Khan Dec 10 '15 at 16:04
  • When are you examining this text in Eclipse? After you read each line from the file? – VGR Dec 10 '15 at 16:08
  • I am using Eclipse. Yes, reading each line from a txt file. I examine that text in Eclipse debug perspective as my text is assigned to Java String variable. – Sherebyah Tishbi Dec 10 '15 at 16:52
  • So `data[3]` looks incorrect in the Eclipse debug window, even before you attempt to insert it? – VGR Dec 10 '15 at 17:12
  • @VGR - you got that right. data[3] looks incorrect even after UTF-8 encoding. I have pasted sample output in one my earlier comment. Here it is again to show how it looks `હે યહોવા ત�?ં મારો દેવ છે; હ�?ં તને મોટો માનીશ, હ�?ં તારા નામની સ�?ત�?તિ કરીશ; કેમકે તેં અદભ�?ત કાર�?યો કર�?યાં છે, તેં વિશ�?વાસ�?પણે તથા સત�?યતાથી પ�?રાતન સંકલ�?પો પાર પાડ�?યા છે.` – Sherebyah Tishbi Dec 10 '15 at 17:50
  • `new String(data[3].getBytes(), "UTF-8")` is incorrect (and may even be making things worse). It should be replaced with just `data[3]`. Most likely, you are not reading the file using the file's encoding (charset). Can you show the code that opens the file and reads the lines? – VGR Dec 10 '15 at 19:59
  • @VGR when I replace `data[3].getBytes()` with just `data[3]`, it complains that there is no constructor which takes 2 String as arguments(`The constructor String(String, String) is undefined`). It suggests to convert it to Byte array. I updated question with code where I open & read file. – Sherebyah Tishbi Dec 14 '15 at 13:05
  • I did not suggest replacing `data[3].getBytes()`. I suggested replacing the entire `new String` constructor, along with both of its arguments. – VGR Dec 14 '15 at 14:36

3 Answers


See "How to inject Chinese characters using JavaScript?" It's not quite the same problem, but I think the same solution may work in this case.

If the script is inline (in the HTML file), then it's using the encoding of the HTML file and you won't have an issue.

If the script is loaded from another file:

  • Your text editor must save the file in an appropriate encoding such as UTF-8 (it's probably doing this already if you can save the file, close it, and reopen it with the characters still displaying correctly).
  • Your web server must serve the file with the right HTTP header specifying that it's UTF-8 (or whatever the encoding happens to be, as determined by your text editor settings). Here's an example of how to do this with PHP: Set http header to utf-8 php
  • If you can't have your web server do this, try setting the charset attribute on your script tag (e.g. `charset="utf-8"`). I tried to see what the spec says should happen when the charset defined by the tag and the one in the HTTP headers disagree, but couldn't find anything concrete, so just test and see if it helps.
  • If that doesn't work, place your script inline.

GeorgeWL

The problem is this:

FileReader _file = new FileReader(this._filename);

This reads the file using the platform's default charset. If your data file is not encoded in that charset, you will get incorrect characters.

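On Windows, the default charset is usually a legacy single-byte code page that depends on the system locale, such as windows-1252 on Western European and US systems. On most other systems, it's UTF-8.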
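
To make the effect concrete, here is a minimal sketch (not part of the original parser) that prints the platform default charset and decodes the same UTF-8 bytes two ways. The class name `DefaultCharsetDemo` and the `decode` helper are hypothetical, and ISO-8859-1 is used only as a stand-in for a legacy single-byte default because it is guaranteed to exist on every JVM.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DefaultCharsetDemo {
    // First syllable of the sample verse, written with Unicode escapes so the
    // example does not depend on the encoding of this source file.
    static final String SAMPLE = "\u0AB9\u0AC7";

    // Hypothetical helper: decodes raw bytes with the given charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // This is the charset that FileReader(String) uses implicitly.
        System.out.println("Platform default charset: " + Charset.defaultCharset());

        // The bytes as they would sit in a UTF-8 encoded data file.
        byte[] utf8Bytes = SAMPLE.getBytes(StandardCharsets.UTF_8);

        String right = decode(utf8Bytes, StandardCharsets.UTF_8);
        // Assumption: ISO-8859-1 stands in for a mismatched default charset.
        String wrong = decode(utf8Bytes, StandardCharsets.ISO_8859_1);

        System.out.println(right.equals(SAMPLE)); // true
        System.out.println(wrong.equals(SAMPLE)); // false
        System.out.println(wrong.length());       // 6 (one char per UTF-8 byte)
    }
}
```

Each Gujarati character occupies three bytes in UTF-8, so a single-byte charset turns every character into three unrelated ones, which is the kind of junk a debugger would then display.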

The easiest solution is to find out the actual encoding of your data file, so you can specify it explicitly in the code. The encoding of a file can be determined with the file command on Unix and Linux systems. In Windows, you may need to examine it with a binary editor, or install something like Cygwin, which has a file command of its own.
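
If no such tool is at hand, one rough check can be done from Java itself: try to decode the file's bytes as UTF-8 with a decoder that reports errors instead of silently replacing them. This is only a sketch; the class name `Utf8Check` and the method `isValidUtf8` are hypothetical. A file that passes is not proven to be UTF-8 (pure ASCII passes too), but text in a legacy encoding will very rarely decode cleanly.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Utf8Check {
    // Returns true if the bytes form a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java Utf8Check <file>");
            return;
        }
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(isValidUtf8(bytes)
                ? "decodes cleanly as UTF-8"
                : "not valid UTF-8");
    }
}
```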

Once you know what it is, you should pass it explicitly to the construction of your Reader:

// Replace "UTF-8" with the actual encoding of your data file (if it's not UTF-8).
Reader _file = new InputStreamReader(new FileInputStream(this._filename), "UTF-8");

Once you've done that, there is no reason for any other part of your code to concern itself with bytes. You should replace this:

verse._verseText = new String(data[3].getBytes(), "UTF-8");

with this:

verse._verseText = data[3];
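
Putting the two changes together, a self-contained sketch might look like the following. It assumes the data file really is UTF-8, and the names `VerseFileReader`, `readVerses`, and `roundTrip` are hypothetical, not from the original parser. It uses `Files.newBufferedReader`, which takes the charset explicitly, and the `roundTrip` method shows why the `getBytes` detour can lose data, with ISO-8859-1 standing in for a non-UTF-8 default charset.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class VerseFileReader {
    // Reads pipe-delimited lines (book|chapter|verse|text), decoding as UTF-8.
    static List<String[]> readVerses(Path file) throws IOException {
        List<String[]> verses = new ArrayList<>();
        try (BufferedReader reader =
                Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // A limit of 4 keeps any '|' inside the verse text in the last
                // field (a small departure from the question's plain split).
                verses.add(line.split("\\|", 4));
            }
        }
        return verses;
    }

    // Mimics new String(s.getBytes(), "UTF-8"), with the default charset made
    // explicit so the result does not depend on the machine running it.
    static String roundTrip(String s, Charset assumedDefault) {
        return new String(s.getBytes(assumedDefault), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        String verseText = "\u0AB9\u0AC7"; // first syllable of the sample verse
        Path tmp = Files.createTempFile("verses", ".txt");
        try {
            Files.write(tmp, ("Isaiah|25|1|" + verseText + "\n")
                    .getBytes(StandardCharsets.UTF_8));
            String[] data = readVerses(tmp).get(0);
            System.out.println(data[3].equals(verseText)); // true
            // Gujarati cannot be encoded in ISO-8859-1, so each char becomes '?'.
            System.out.println(roundTrip(data[3], StandardCharsets.ISO_8859_1)); // ??
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

One thing to watch for: some Windows editors write a byte order mark at the start of a UTF-8 file, and Java's UTF-8 decoder does not strip it, so the first field of the first line may begin with `\uFEFF`.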
VGR
  • Thanks a lot VGR. I will be doing this shortly and will update you soon. Once again thanks much for this detailed help. – Sherebyah Tishbi Dec 15 '15 at 13:39
  • @VGR-My Issue is resolved now completely. I found out the encoding using good old notepad, it is "UTF-8". Just used that in the code line you mentioned and it is showing me all text exactly as expected without any junk characters. Once again a huge thank you my friend. – Sherebyah Tishbi Dec 15 '15 at 14:17

It looks like if you want to store Gujarati text in Java string, you need to use unicode characters. See this: http://jrgraphix.net/r/Unicode/0A80-0AFF

So for example the first Gujarati character:

char example = '0A80';
String result = Character.toString((char)example);
OPK
  • @JasonS does that mean that I will have to convert each character this way? I have whole bible to insert into database so I am talking about almost close to 1 million words and hence almost 5 million characters – Sherebyah Tishbi Dec 10 '15 at 15:50
  • Your answer has nothing to do with reading a text file. Also, your first line of code won't compile; you probably meant `'\u0A80'`. Also, there is no need to involve the `char` type just to include a Unicode escape in a String; one can just write `"\u0A80"`. – VGR Dec 10 '15 at 20:06