
I am using Java 1.4 (a client requirement) together with lucene-core-2.9.2.jar and lucene-demos-2.9.2.jar, and I build with Ant. Indexing works fine for all directories except those whose names contain Unicode or Scandinavian characters.

When I list a directory with listFiles(), all entries are returned, but the Unicode characters in their names show up as blocks. When the code then calls isDirectory() on those entries, it does not recognize the folders whose names are in other languages (containing Unicode or Scandinavian characters), so they are not indexed.

How can I solve this problem so that names with Unicode and Scandinavian characters are handled?

With Java 6 or 7 it works well. Because the client requires Java 1.4, please don't tell me to use Java 5, 6 or 7; I am looking for other answers. To make the problem easier to understand, I have added my code below.

public void addIntoIndex(File dir, IndexWriter indexWriter) {
    try {
        System.out.println("Now in addIntoIndex");

        /** "Release_Notes" folder will be excluded for indexing */
        if (dir.getName().equals("Release_Notes") && this.searchOption.equals("systemHelp")) {
            System.out.println("'Release_Notes' folder will be excluded for indexing.");
            return;
        }

        File[] htmls = dir.listFiles();
        // listFiles() returns null if dir is not a directory or cannot be read
        if (htmls == null) {
            System.out.println("Cannot list: " + dir.getAbsolutePath());
            return;
        }

        for (int i = 0; i < htmls.length; i++) {
            if (htmls[i].isDirectory()) {
                // Recurse into sub-directories; never treat them as documents
                addIntoIndex(htmls[i], indexWriter);
                continue;
            }

            String htmlPath = htmls[i].getAbsolutePath();
            if (htmlPath.endsWith(".html") || htmlPath.endsWith(".htm")) {
                addDocument(htmlPath, indexWriter);
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
}
Hamid Shatu
SkyWalker
  • Can you check that you are at least using the latest version of 1.4, e.g. 1.4.2 update 30? I would also try Java 5.0, 6 or 7 to see whether that fixes the problem, because it might not; this will tell you whether it is a bug that was fixed later or something else that is wrong. – Peter Lawrey Dec 17 '13 at 09:54
  • What systems does this affect? Linux/Windows/Both? – txtechhelp Dec 17 '13 at 10:29
  • I am using Windows 7. – SkyWalker Dec 17 '13 at 10:32

2 Answers


At last my problem is solved. All the HTML files I am indexing have this structure:

<html>
<head>..</head>
<body>...</body>
</html>

After adding the following two lines to the head section, the problem was solved on my Java 1.4.02 installation.

<meta http-equiv=Content-Type content="text/html; charset=utf-8">
<meta http-equiv="content-script-type" content="text/javascript; charset=UTF-8"/>
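
The meta tags declare the encoding inside the file itself. If you read the HTML content in your own code, a related precaution is to decode it with an explicit charset instead of the platform default. The following is only a sketch under assumptions: the class name Utf8Reader and the method readUtf8 are hypothetical, the files are assumed to be saved as UTF-8, and I have not checked how the Lucene demo HTML parser itself picks the encoding. It uses only classes available in Java 1.4.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;

// Hypothetical helper, not part of Lucene or of the original code.
class Utf8Reader {

    /** Reads the whole file, decoding its bytes as UTF-8 rather than the platform default. */
    static String readUtf8(File file) throws IOException {
        Reader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), "UTF-8"));
        try {
            StringBuffer sb = new StringBuffer(); // StringBuffer keeps this Java 1.4 compatible
            char[] buf = new char[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } finally {
            in.close();
        }
    }
}
```

If the files are saved in another encoding (for example ISO-8859-1), the charset name passed to InputStreamReader would have to match it.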

Special thanks to my project manager, Peter Lawrey and txtechhelp.

SkyWalker

Try this link, which has some relevant answers: https://forums.oracle.com/thread/1288135

You can also look here for some other possibilities: Setting java locale settings

Basically, it sounds like you just need to ensure that the right locale is configured to get the correct Unicode strings.
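
To see what the JVM actually picked up, it may help to print the default encoding and locale and to dump the names returned by listFiles() as \uXXXX escapes, so that the console font or code page cannot hide what is in the strings. This is a minimal diagnostic sketch, not a fix; the class name FileNameDump and the method escapeNonAscii are hypothetical, and the code sticks to Java 1.4 APIs.

```java
import java.io.File;
import java.util.Locale;

// Hypothetical diagnostic class, not part of the original code.
class FileNameDump {

    /** Returns s with every non-ASCII character written as a backslash-u escape. */
    static String escapeNonAscii(String s) {
        StringBuffer sb = new StringBuffer();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 128) {
                sb.append(c);
            } else {
                String hex = Integer.toHexString(c);
                sb.append("\\u");
                // Left-pad to four hex digits
                for (int p = hex.length(); p < 4; p++) {
                    sb.append('0');
                }
                sb.append(hex);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("file.encoding  = " + System.getProperty("file.encoding"));
        System.out.println("default locale = " + Locale.getDefault());

        // Directory to inspect: first argument, or the current directory
        File dir = new File(args.length > 0 ? args[0] : ".");
        File[] entries = dir.listFiles();
        if (entries == null) {
            System.out.println("Cannot list: " + dir.getAbsolutePath());
            return;
        }
        for (int i = 0; i < entries.length; i++) {
            System.out.println(escapeNonAscii(entries[i].getName())
                    + "  isDirectory=" + entries[i].isDirectory());
        }
    }
}
```

If the output shows a literal ? where a non-ASCII letter should be, the name was already damaged before your code received it. If the escapes are correct, the strings are fine and only the console display is affected.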

txtechhelp
  • Actually, I am searching the data after indexing. If an entry has a plain name such as tt.htm and its content contains Unicode, all of that Unicode content can be searched. But if a folder name or file name itself contains Unicode, it cannot be indexed, and that is why the Unicode data cannot be searched. – SkyWalker Dec 17 '13 at 10:26
  • How can I index a folder or file whose name contains Unicode? – SkyWalker Dec 17 '13 at 10:28