In my Java code I use a FileVisitor to traverse a filesystem and create a structure of Paths; later on this is converted to a JSON object for rendering in HTML.

Running on Windows, it works okay even against a Linux filesystem. Running on Linux against the same (now local) filesystem, it fails to render special characters properly when I call toString() on a path.

e.g. the Windows debug output is

CreateFolderTree:createJsonData:SEVERE: AddingNode(1):Duarte Lôbo- Requiem

and html displays ok as

Duarte Lôbo- Requiem

but linux debug output gives

CreateFolderTree:createJsonData:SEVERE: AddingNode(1):Duarte L??bo- Requiem

and the HTML displays two black diamonds with question marks in them instead of the ô character.

Why is this happening? The Paths are provided by the FileVisitor class, so they must be getting constructed properly (i.e. I am not hacking them together myself), and then I just call toString() on the path.

Is it a fonts problem? I have had some issues with fonts on the Linux system, but here I am just returning Strings to the HTML, so I cannot see a connection.

It is probably an encoding issue, but I can't see a place where I am explicitly setting an encoding.

Bulk of code below; the debugging that shows invalid output on Linux is in the createJsonData() method.

Edit: I have fixed the logging issue so that the output is written as UTF-8

  FileHandler fe = new FileHandler(logFileName, LOG_SIZE_IN_BYTES, 10, true);
  fe.setEncoding(StandardCharsets.UTF_8.name());

So we now see Windows is outputting correctly

CreateFolderTree:createJsonData:SEVERE: AddingNode(1):Duarte Lôbo- Requiem

but Linux is outputting

CreateFolderTree:createJsonData:SEVERE: AddingNode(1):Duarte L��bo- Requiem

and if I view this in HexEditor it gives this output for L��bo

4C EF BF BD EF BF BD 62 6F

Edit:Partial Solution

I came across the question "What exactly is sun.jnu.encoding?"

and found it was recommended to add this

 -Dsun.jnu.encoding=UTF-8

and it worked: files were now displayed okay.

Unfortunately, if the user then clicked on such a file and it was sent back to the server, I now get this error

java.lang.NullPointerException
    at java.base/sun.nio.fs.UnixPath.normalizeAndCheck(Unknown Source)
    at java.base/sun.nio.fs.UnixPath.<init>(Unknown Source)
    at java.base/sun.nio.fs.UnixFileSystem.getPath(Unknown Source)
    at java.base/java.nio.file.Paths.get(Unknown Source)
    at com.jthink.songkong.server.callback.ServerFixSongs.configureFileMapping(ServerFixSongs.java:59)
    at com.jthink.songkong.server.callback.ServerFixSongs.startTask(ServerFixSongs.java:88)
    at com.jthink.songkong.server.CmdRemote.lambda$null$36(CmdRemote.java:107) 

I tried adding -Dfile.encoding=UTF-8, both in addition to and instead of the jnu option, and that didn't help; the jnu option was the one I needed.

I shouldn't have to add this undocumented sun.jnu.encoding option, so it seems that the server is broken in some way?

Code

    import com.google.common.base.Strings;
    import com.google.gson.Gson;
    import com.google.gson.GsonBuilder;
    import com.jthink.songkong.analyse.analyser.Counters;
    import com.jthink.songkong.analyse.general.Errors;
    import com.jthink.songkong.cmdline.SongKong;
    import com.jthink.songkong.fileloader.RecycleBinFolderNames;
    import com.jthink.songkong.server.fs.Data;
    import com.jthink.songkong.server.fs.PathWalker2;
    import com.jthink.songkong.server.fs.State;
    import com.jthink.songkong.ui.MainWindow;
    import com.jthink.songkong.ui.progressdialog.FixSongsCounters;
    import spark.Request;
    import spark.Response;

    import java.io.IOException;
    import java.net.UnknownHostException;
    import java.nio.file.*;
    import java.nio.file.attribute.BasicFileAttributes;
    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;
    import java.util.logging.Level;


    /**
     * Count the number of files that can be loaded, for information purposes only
     */
    public class CreateFolderTree
    {
        private Path treeRoot;

        Set<Path> keys = new HashSet<Path>();


        public static class VisitFolder
                extends SimpleFileVisitor<Path>
        {

            private Set<Path> keys;
            private Integer maxDepth;
            private int depth;

            public VisitFolder(Set<Path> keys, Integer maxDepth)
            {
                this.keys=keys;
                this.maxDepth = maxDepth;
            }

            /**
             * Ignore some dirs
             *
             * @param dir
             * @param attrs
             * @return
             * @throws IOException
             */
            @Override
            public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs)
                    throws IOException
            {
                try
                {
                    if (dir.toFile().getName().equals(".AppleDouble"))
                    {
                        return FileVisitResult.SKIP_SUBTREE;
                    }
                    else if (dir.toString().equals("/proc"))
                    {
                        return FileVisitResult.SKIP_SUBTREE;
                    }
                    else if (dir.toString().equals("/dev"))
                    {
                        return FileVisitResult.SKIP_SUBTREE;
                    }
                    else if (RecycleBinFolderNames.isMatch(dir.toFile().getName()))
                    {
                        MainWindow.logger.log(Level.SEVERE, "Ignoring " + dir.toString());
                        return FileVisitResult.SKIP_SUBTREE;
                    }
                    else if (dir.toString().toLowerCase().endsWith(".tar"))
                    {
                        return FileVisitResult.SKIP_SUBTREE;
                    }

                    depth++;

                    if(depth > maxDepth)
                    {
                        depth--;
                        return FileVisitResult.SKIP_SUBTREE;
                    }
                    keys.add(dir);
                    return super.preVisitDirectory(dir, attrs);
                }
                catch(IOException e)
                {
                    MainWindow.logger.warning("Unable visit dir:"+dir + ":"+e.getMessage());
                    return FileVisitResult.SKIP_SUBTREE;
                }
            }


            /**
             *
             * Tar check due to http://stackoverflow.com/questions/14436032/why-is-java-7-files-walkfiletree-throwing-exception-on-encountering-a-tar-file-o/14446993#14446993
             * SONGKONG-294:Ignore exceptions if file is not readable
             *
             * @param file
             * @param exc
             * @return
             * @throws IOException
             */
            @Override
            public FileVisitResult visitFileFailed(Path file, IOException exc) throws IOException
            {

                if (file.toString().endsWith(".tar")) {
                    //We dont log to reports as this is a bug in Java that we are handling not a problem in SongKong
                    MainWindow.logger.log(Level.SEVERE, exc.getMessage());
                    return FileVisitResult.CONTINUE;
                }

                try
                {
                    FileVisitResult result = super.visitFileFailed(file, exc);
                    return result;
                }
                catch(IOException e)
                {
                    MainWindow.logger.warning("Unable to visit file:"+file + ":"+e.getMessage());
                    return FileVisitResult.CONTINUE;
                }
            }

            /**
             * SONGKONG-294:Ignore exception if folder is not readable
             *
             * @param dir
             * @param exc
             * @return
             * @throws IOException
             */
            @Override
            public FileVisitResult postVisitDirectory(Path dir, IOException exc)
                    throws IOException
            {
                depth--;
                try
                {
                    FileVisitResult result = super.postVisitDirectory(dir, exc);
                    return result;
                }
                catch(IOException e)
                {
                    MainWindow.logger.warning("Unable to count files in dir(2):"+dir);
                    return FileVisitResult.CONTINUE;
                }
            }
        }

        public CreateFolderTree(Path treeRoot)
        {
            this.treeRoot = treeRoot;
        }

        public String start(int depth)
        {
            VisitFolder visitFolder;
            try
            {

                if(treeRoot==null)
                {
                    for (Path path : FileSystems.getDefault().getRootDirectories())
                    {
                        visitFolder = new VisitFolder(keys, depth);
                        Files.walkFileTree(path, visitFolder);
                    }
                }
                else
                {
                    visitFolder = new VisitFolder(keys, depth);
                    Files.walkFileTree(treeRoot, visitFolder);
                }

                PathWalker2 pw = new PathWalker2();
                for (Path key : keys)
                {
                    //SONGKONG-505: Illegal character in Filepath problem prevented reportFile creation
                    try
                    {
                        pw.addPath(key);
                    }
                    catch (InvalidPathException ipe)
                    {
                        MainWindow.logger.log(Level.SEVERE, ipe.getMessage(), ipe);
                    }
                }
                Gson gson = new GsonBuilder().create();
                return gson.toJson(createJsonData(pw.getRoot()));
            }
            catch (Exception e)
            {
                handleException(e);
            }
            return "";
        }

        public void handleException(Exception e)
        {
            MainWindow.logger.log(Level.SEVERE, "Unable to count files:"+e.getMessage(), e);
            Errors.addError("Unable to count files:"+e.getMessage());
            MainWindow.logger.log(Level.SEVERE, e.getMessage());
            Counters.getErrors().getCounter().incrementAndGet();
            SongKong.refreshProgress(FixSongsCounters.SONGS_ERRORS);
        }

        /**
         * Add this node and recursively its children,  returning json data representing the tree
         *
         * @param node
         * @return
         */
        private Data createJsonData(PathWalker2.Node node)
        {
            Data data = new Data();
            if(node.getFullPath()!=null)
            {
                data.setId(node.getFullPath().toString());
                if(node.getFullPath().getFileName()!=null)
                {
                    MainWindow.logger.severe("AddingNode(1):"+node.getFullPath().getFileName().toString());
                    data.setText(node.getFullPath().getFileName().toString());
                }
                else
                {
                    MainWindow.logger.severe("AddingNode(2):"+node.getFullPath().toString());
                    data.setText(node.getFullPath().toString());
                }
            }
            else
            {
                try
                {
                    data.setText(java.net.InetAddress.getLocalHost().getHostName());
                    data.setId("#");
                    State state = new State();
                    state.setOpened(true);
                    data.setState(state);
                }
                catch(UnknownHostException uhe)
                {
                    data.setText("Server");
                }
            }

            //Recursively add each child folder of this node
            Map<String, PathWalker2.Node> children = node.getChildren();
            if(children.size()>0)
            {
                data.setChildren(new ArrayList<>());
                for (Map.Entry<String, PathWalker2.Node> next : children.entrySet())
                {
                    data.getChildren().add(createJsonData(next.getValue()));
                }
            }
            else
            {
                data.setBooleanchildren(true);
            }
            return data;
        }

        public static String createFolderJsonData(Request request, Response response)
        {
            if(Strings.nullToEmpty(request.queryParams("id")).equals("#"))
            {
                CreateFolderTree cft = new CreateFolderTree(null);
                String treeData = cft.start(1).replace("booleanchildren", "children");
                return treeData;
            }
            else
            {
                CreateFolderTree cft = new CreateFolderTree(Paths.get(request.queryParams("id")));
                String treeData = cft.start(2).replace("booleanchildren", "children");
                return treeData;
            }
        }

    }


    import java.nio.file.Path;
    import java.util.Collections;
    import java.util.Map;
    import java.util.TreeMap;

    /** Constructs a tree of folders based on a list of filepaths
     *
     * i.e. give it a list of all folders that contain files that have been modified and it creates a hierarchy
     * that can then be used to generate a data structure for use by jstree
     *
     */
    public class PathWalker2
    {
        private final Node root;


        public PathWalker2()
        {
            root = new Node();
        }

        public Node getRoot()
        {
            return root;
        }

        /**
         * Represent a node on the tree (may/not have children)
         */
        public static class Node
        {
            //Keyed on name and node
            private final Map<String, Node> children = new TreeMap<>();

            private Path fullPath;

            public Node addChild(String name)
            {

                if (children.containsKey(name))
                    return children.get(name);

                Node result = new Node();
                children.put(name, result);
                return result;
            }

            public Map<String, Node> getChildren()
            {
                return Collections.unmodifiableMap(children);
            }

            public void setFullPath(Path fullPath)
            {
                this.fullPath = fullPath;
            }

            public Path getFullPath()
            {
                return fullPath;
            }
        }

        /**
         * @param path
         */
        public void addPath(Path path)
        {
            Node node = root.addChild((path.getRoot().toString().substring(0, path.getRoot().toString().length() - 1)));

            //For each segment of the path add as child if not already added
            for (int i = 0; i < path.getNameCount(); i++)
            {
                node = node.addChild(path.getName(i).toString());
            }

            //Set full path of this node
            node.setFullPath(path);
        }


    }
Paul Taylor
  • https://stackoverflow.com/questions/7885073/not-able-to-display-special-characters I don't think this will answer your full question, but it might help. – Remixt Feb 15 '18 at 17:53
  • @Remixt I think that explains why the debugging output is not quite right, but it doesn't explain why the JSON data is actually different. – Paul Taylor Feb 15 '18 at 17:56
  • Sorry about that, I'll keep looking around. – Remixt Feb 15 '18 at 17:59

3 Answers

1

For HTML you either need to set a proper charset matching your needs or, better, stick with ASCII and use HTML character references for all non-ASCII characters. This works even if no specific charset is defined for your HTML display.

https://en.wikipedia.org/wiki/Unicode_and_HTML
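
As a rough illustration of the second option, the Java sketch below replaces every non-ASCII code point with a decimal numeric character reference. The names `HtmlAsciiEscaper` and `escapeNonAscii` are hypothetical and not part of the question's code. It deliberately does not escape `<`, `>` or `&`, and it cannot repair text that was already turned into replacement characters before it reaches this step.

```java
final class HtmlAsciiEscaper {

    /**
     * Returns the text with every code point above U+007F written as a
     * decimal numeric character reference; ASCII is copied unchanged.
     */
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder(text.length());
        // Iterate by code point so characters outside the BMP stay intact.
        text.codePoints().forEach(cp -> {
            if (cp < 0x80) {
                out.appendCodePoint(cp);
            } else {
                out.append("&#").append(cp).append(';');
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // The o-circumflex is written as an escape so the example does not
        // depend on the encoding of the source file.
        System.out.println(escapeNonAscii("Duarte L\u00F4bo- Requiem"));
        // prints: Duarte L&#244;bo- Requiem
    }
}
```

The output is pure ASCII, so the browser renders it the same way regardless of the charset it assumes for the page.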

MrSmith42
  • I don't understand why the data returned from the Linux machine is different to the data returned from the Windows machine (apart from the obvious /, \ differences) – Paul Taylor Feb 15 '18 at 18:01
  • @Kayaman: The problem is only in your log-output? Your two OSes have simply different system charsets (Linux most likely UTF8). – MrSmith42 Feb 15 '18 at 18:01
  • No, the log output is unimportant apart from indicating that there is a difference; the problem is that the HTML is shown differently, I assume because the JSON is different – Paul Taylor Feb 15 '18 at 18:03
  • Sorry I missed the second part of the output. @PaulTaylor The log output **is** important, because whereas on windows it shows that UTF-8 bytes are being interpreted as CP1252, on Linux it's really unclear what's happening. It almost looks like there's a "triple" encoding happening, so `UTF-8` bytes are interpreted first as `ISO-8859-1` and **those** bytes are interpreted as `UTF-8` again, resulting in the replacement characters `??`. – Kayaman Feb 15 '18 at 18:05
  • @Kayaman I've fixed the Logger issue so you can see what toString() is actually returning, does that help? – Paul Taylor Feb 15 '18 at 18:56
  • @PaulTaylor well, yes and no. That `�` (of which there are 2) is the [Unicode replacement character](https://stackoverflow.com/questions/1488866/how-to-replace-%C3%AF-%C2%BD-in-a-string) in `UTF-8`, so it doesn't really change things. It does mean that the data is corrupted before it's being logged though. – Kayaman Feb 15 '18 at 19:00
  • @Kayaman I have solved the original issue by adding -Dsun.jnu.encoding=UTF-8, but it highlights another issue and I'm not clear why it (partially) works; question updated. – Paul Taylor Feb 15 '18 at 19:52
1

It seems that your debug output goes through multiple conversions between charsets. The text you send to the console seems to be converted to bytes using UTF-8 as the encoding, which turns ô into the two bytes that display as Ã´. Then there seems to be another conversion from the byte data back into characters using the system's charset. The Windows console uses cp1252 as its charset, while Linux has different settings on a per-installation basis. In your case it seems to be ASCII, which leads to the two ? for the UTF-8 encoded data because these bytes have values that aren't defined in ASCII.

I don't know the logging framework you're using or what the specific setup of the Logger is you're using, so I can't tell you how to fix that, but for the Linux-variant, you might check the console's charset and change it to UTF-8 to see if that has the desired effect.
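
The double conversion described above can be reproduced in a few lines of Java. This is only a sketch of the general mechanism, not a trace of what the asker's logger actually does; `MojibakeDemo` and its method names are made up for the illustration, and it assumes the `windows-1252` charset is available in the running JVM.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {

    /** Encodes the text as UTF-8, then decodes those bytes with another charset. */
    static String decodeUtf8BytesAs(String text, Charset wrongCharset) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        // The String constructor substitutes U+FFFD for bytes it cannot decode.
        return new String(utf8, wrongCharset);
    }

    /** Formats bytes as space-separated upper-case hex pairs. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String name = "L\u00F4bo"; // o-circumflex written as an escape

        // Windows-style misreading: each UTF-8 byte becomes one cp1252 character.
        String asCp1252 = decodeUtf8BytesAs(name, Charset.forName("windows-1252"));
        System.out.println(asCp1252.length()); // prints: 5 (one character longer than the input)

        // ASCII-style misreading: both bytes of the o-circumflex are invalid.
        String asAscii = decodeUtf8BytesAs(name, StandardCharsets.US_ASCII);
        System.out.println(toHex(asAscii.getBytes(StandardCharsets.UTF_8)));
        // prints: 4C EF BF BD EF BF BD 62 6F
    }
}
```

The second result matches the hex dump in the question, which suggests the name was decoded with an ASCII-like charset somewhere before it was logged as UTF-8.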

Lothar
  • I don't understand why this was downvoted. This is spot on (although it doesn't resolve the issue, just explains it). – Kayaman Feb 15 '18 at 18:43
  • @Lothar I've fixed the Logger issue so you can see what toString() is actually returning, does that help? – Paul Taylor Feb 15 '18 at 18:56
  • Your logger is writing to a file, so you're looking at it using a file-reader like `less` or an app like Notepad or gedit? Then another layer of complexity comes into place because the editor itself does conversions as well before doing the output. BTW: `toString` has nothing to do with all of this, because this is working on text without any charset-conversions happening and because the HTML file contains the characters correctly I ruled out that the corrupted characters are already coming from the `File`-object. Can you provide a small logfile containing the problematic characters (as ZIP)? – Lothar Feb 15 '18 at 19:04
  • @Lothar I've updated the question with the contents in terms of hexadecimal, I guess that is what you meant. – Paul Taylor Feb 15 '18 at 19:14
  • `EF BF BD` is the Unicode Replacement Character that is used if the decoding can't be done because of invalid data. I'm not sure if I can solve that for you without having access to the whole "shebang" (which I don't want ;-) – Lothar Feb 15 '18 at 19:20
  • @Lothar I have solved the original issue by adding -Dsun.jnu.encoding=UTF-8, but it highlights another issue and I'm not clear why it (partially) works; question updated. – Paul Taylor Feb 15 '18 at 19:51
  • @PaulTaylor what does the linux system show for the `locale` command? – Kayaman Feb 15 '18 at 20:47
  • @Kayaman it shows just LANG= LC_CTYPE="POSIX" LC_NUMERIC="POSIX" LC_TIME="POSIX" LC_COLLATE="POSIX" LC_MONETARY="POSIX" LC_MESSAGES="POSIX" LC_PAPER="POSIX" LC_NAME="POSIX" LC_ADDRESS="POSIX" LC_TELEPHONE="POSIX" LC_MEASUREMENT="POSIX" LC_IDENTIFICATION="POSIX" LC_ALL= How do I set it properly? – Paul Taylor Feb 15 '18 at 20:50
  • You could try `export LC_ALL=en_US.UTF-8` to set a nice modern locale, and see how it works with that. – Kayaman Feb 15 '18 at 20:57
  • @Kayaman That's it - simply adding export LC_ALL=en_US.UTF-8 to my script and everything works, and I no longer need sun.jnu.encoding. Create an answer and I will mark it as correct. – Paul Taylor Feb 16 '18 at 10:49
1

So, as always with encoding problems, this has been a lot of work to debug. Not only are there a lot of different things that affect it, they also affect it at different times, so the first task is always to check where it goes wrong first.

As the deal with the � showed, once it goes wrong, it can then go more wrong and if you try to debug starting from the end result, it's like peeling layers from a rotten onion.


In this case the root of the problem was in the OS locale, which was set to POSIX. This old standard makes your OS act like it's from the 70's, with ASCII encoding and other outdated details. The ASCII encoding will prevent the OS from understanding filenames, text or anything containing more exotic characters. This causes weird issues because the JVM is doing just fine by itself, but any time it communicates with the OS (printing to a text file, asking to open a file with a certain name) there's a chance of corruption because the OS doesn't understand what the JVM is saying.

It's like someone is talking to you and every once in a while he puts a word of Chinese in there. You're writing down what he says in English, but every Chinese word you replace with "Didn't understand???".

The locale (in /etc/default/locale) usually contains sane defaults, but as we saw here, you can't always trust that. For any modern system you'll want locale values like en_US.UTF-8. You never want to see POSIX there in this day and age.
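
To see what the JVM has picked up from the environment, a small diagnostic along these lines can be run on the affected machine. It is a sketch under assumptions: `EncodingReport` and its methods are hypothetical names, and `sun.jnu.encoding` is an internal, undocumented property, so it may be absent (reported here as `null`) or behave differently between JDK versions.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

final class EncodingReport {

    /** Collects the locale and encoding settings visible to this JVM. */
    static Map<String, String> collect() {
        Map<String, String> report = new LinkedHashMap<>();
        // Unset variables and properties are recorded as the string "null".
        report.put("env LANG", String.valueOf(System.getenv("LANG")));
        report.put("env LC_ALL", String.valueOf(System.getenv("LC_ALL")));
        report.put("file.encoding", String.valueOf(System.getProperty("file.encoding")));
        report.put("sun.jnu.encoding", String.valueOf(System.getProperty("sun.jnu.encoding")));
        report.put("Charset.defaultCharset()", Charset.defaultCharset().name());
        return report;
    }

    /** Returns true if every character of the name can be encoded in the charset. */
    static boolean canRepresent(String name, Charset charset) {
        return charset.newEncoder().canEncode(name);
    }

    public static void main(String[] args) {
        collect().forEach((key, value) -> System.out.println(key + " = " + value));

        // Checks a sample name against the JVM default charset.
        String sample = "Duarte L\u00F4bo- Requiem";
        System.out.println("default charset can represent sample: "
                + canRepresent(sample, Charset.defaultCharset()));
    }
}
```

Running it once with the POSIX locale and once after `export LC_ALL=en_US.UTF-8` should make the difference visible without touching the application itself.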

Kayaman