
I am currently tasked with writing a tool in Java that checks whether a link is correct, i.e. whether the file it points to exists. The link is fed from the Jericho HTML Parser, and my job is only to check whether the target file exists. That part is done; the hard part is optimizing it, since my code runs, I have to say, rather sluggishly at about 65 ms per call.

public static String checkRelativeURL(String originalFileLoc, String relativeLoc){
        StringBuilder sb = new StringBuilder();
        // built-in helper that converts the relative link to an absolute path
        String absolute = Common.relativeToAbsolute(originalFileLoc, relativeLoc);
        sb.append(absolute);
        sb.append("\t");
        try {
            Path path = Paths.get(absolute);
            sb.append(Files.exists(path));
        } catch (InvalidPathException | NullPointerException ex) {
            sb.append(false);
        }
        sb.append("\t");
        return sb.toString();
    }

and these lines are what take the 65 ms:

Path path = Paths.get(absolute);
sb.append(Files.exists(path));

I have tried using

File file = new File(absolute);
sb.append(file.isFile());

It still ran at around 65–100 ms.

So, is there any faster way to check whether a file exists than this?

I am processing more than 70k HTML files, so every millisecond counts. Thanks :(

EDIT:

I tried listing all the files into a List, but it doesn't really help, since it takes more than 20 minutes just to list all the files...

The code that I use to list all the files:

static public void listFiles2(String filepath){
        Path path = Paths.get(filepath);
        try {
            if (Files.isDirectory(path)) {
                // try-with-resources guarantees the stream is closed even if
                // an exception is thrown mid-iteration
                try (DirectoryStream<Path> stream = Files.newDirectoryStream(path)) {
                    for (Path entry : stream) {
                        String pathString = entry.toString();
                        if (Files.isDirectory(entry)) {
                            listFiles2(pathString);
                        } else if (Files.isRegularFile(entry)) {
                            filesInProject.add(pathString);
                            // note: printing every entry is itself a large
                            // cost when there are many files
                            System.out.println(pathString);
                        }
                    }
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
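
For comparison, here is a rough sketch (not from my original code) of the same traversal using Files.walkFileTree; the visitor receives each entry's attributes from the walk itself, so it avoids the extra per-entry isDirectory()/isFile() checks:

static public void listFiles3(String filepath) throws IOException {
    Files.walkFileTree(Paths.get(filepath), new SimpleFileVisitor<Path>() {
        @Override
        public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
            if (attrs.isRegularFile()) {
                filesInProject.add(file.toString()); // same list as above
            }
            return FileVisitResult.CONTINUE;
        }
        @Override
        public FileVisitResult visitFileFailed(Path file, IOException exc) {
            return FileVisitResult.CONTINUE; // skip unreadable entries
        }
    });
}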
  • `every milliseconds counts` - you don't want to be throwing/catching exceptions then – Scary Wombat Sep 01 '16 at 04:40
  • 4
    since you're "processing more than 70k html files" can you just load the whole directory tree and check? – phuclv Sep 01 '16 at 04:43
  • 2
    If you do it twice, does the second run still take 65ms? – Andreas Sep 01 '16 at 04:44
  • @Andreas Yes, it still takes around 65ms; there's no problem with the parser actually, the problem probably lies in how I load the file :/ – Ega Rinaldo Sep 01 '16 at 04:45
  • Do not instantiate a new StringBuilder on each method call; pass the SB in and use `StringBuilder.setLength(0);` – Scary Wombat Sep 01 '16 at 04:49
  • @ScaryWombat Oh? I don't know if adding exceptions would affect the performance, but before, when I was still using `new File(absolute).isFile()`, it ran around 65ms too :(. – Ega Rinaldo Sep 01 '16 at 04:50
  • 1
    If an exception is thrown, it will take longer. – Scary Wombat Sep 01 '16 at 04:51
  • 1
    If an exception is thrown it will take *slightly* longer, but this is insignificant compared to the cost of the system call and examining the directory for the presence of the file. – user207421 Sep 01 '16 at 04:53
  • If you're checking 70k files, surely caching would be viable vs checking them all each time? – Rogue Sep 01 '16 at 04:59
  • I'd try out the suggestion by @LưuVĩnhPhúc. Load the entire directory tree into a hash of some sort up front then use that to decide if the file exists. It won't catch changes to the file system during the entire process but it's not clear you're looking for absolutely up-to-the-second snapshots. Not saying it will improve your situation but it's worth benchmarking. – paxdiablo Sep 01 '16 at 04:59
  • @EJP Would it help if I list all the files in the directory that I am searching into a `Collection`? I tried using `Collections` once for listing unique CSS IDs in the project, but I ran into a Java heap error. – Ega Rinaldo Sep 01 '16 at 05:00
  • @paxdiablo Hmm, like putting them up in a `Collection`? Tried that once for listing unique IDs and ran into a Java heap memory error when it had finished parsing 6k files. I'll try it though – Ega Rinaldo Sep 01 '16 at 05:04
  • 1
    You may want to try change your hard disk to SSD. – Minh Sep 01 '16 at 05:30
  • @Minh Yeah, that would surely help solve the problem on my PC, but not on other people's PCs, though. – Ega Rinaldo Sep 01 '16 at 06:39
  • (1) Of course you store them already in a DB, so you may live by checking non-present URLs, and defer checking present URLs after some time (last check + #referrals since). (2) Load directory once (every day?) and keep a [directory watch service](https://docs.oracle.com/javase/tutorial/essential/io/notification.html) for the large directories? (3) Let a `sync` run regularly and use its results. – Joop Eggen Sep 01 '16 at 06:58
  • 1
    @EiZenHoweLL: Then you can try using multi-threading. – Minh Sep 01 '16 at 07:46
  • Did you try `Files.walk(...)` ? – Brian Kent Dec 05 '16 at 23:29

1 Answer


If you know the target OS set in advance (usually that is the case), ultimately the fastest way to list that many files is through a shell, by invoking an external process, e.g. using Runtime.exec.

On Windows you can do it with

dir /s /b   

On Linux

ls -R -1

You can check which OS you are running on and use the appropriate command (and raise an error, or fall back to a directory stream, if neither is supported).
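
A minimal sketch of that dispatch, assuming the command strings above and a made-up failure path:

String os = System.getProperty("os.name").toLowerCase();
String[] cmd;
if (os.contains("win")) {
    // redirect inside the shell so we don't have to pump the process output
    cmd = new String[] { "cmd", "/c", "dir /s /b > filelist.txt" };
} else if (os.contains("nux") || os.contains("mac")) {
    cmd = new String[] { "/bin/sh", "-c", "ls -R -1 > /tmp/filelist.txt" };
} else {
    // or fall back to the DirectoryStream approach instead of failing
    throw new UnsupportedOperationException("no shell listing for: " + os);
}
Process p = Runtime.getRuntime().exec(cmd);
if (p.waitFor() != 0)
    throw new IOException("listing command failed");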

If you want simplicity and don't need to report progress, you can avoid dealing with the process I/O and redirect the list into a temporary file, e.g. ls -R -1 > /tmp/filelist.txt. Alternatively, you can read from the process output directly; read it with a buffered stream, a reader or alike, with a large enough buffer.
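
A rough sketch of the direct-read variant on Windows (the charset is an assumption; dir actually writes in the console code page, which you may need to account for):

Process p = Runtime.getRuntime().exec(new String[] { "cmd", "/c", "dir /s /b" });
Set<String> fileNames = new HashSet<>(500_000);
try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8),
        1024 * 1024)) { // large buffer, as suggested above
    for (String line = reader.readLine(); line != null; line = reader.readLine()) {
        fileNames.add(line);
    }
}
if (p.waitFor() != 0)
    throw new IOException("dir has failed");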

On an SSD it will complete in the blink of an eye, and on a modern HDD in seconds (half a million files is not a problem with this approach).

Once you have the list, you can approach it differently depending on the maximum file count and the memory requirements. If the requirements are loose, e.g. for a desktop program, you can get away with very simple code, e.g. pre-loading the complete file list into a HashSet and checking existence when needed. Shortening each path by removing the common root will require much less memory. You can also reduce memory by keeping only a hash of each filename instead of the full name (the common-root removal will probably save more).
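
A sketch of the common-root idea (the ROOT constant and the helper names are made up for illustration):

static final String ROOT = "T:\\"; // assumed common root of all checked files
static final Set<String> relativeNames = new HashSet<>(500_000);

static void remember(String absolutePath) {
    relativeNames.add(absolutePath.substring(ROOT.length()));
}

static boolean exists(String absolutePath) {
    return absolutePath.startsWith(ROOT)
            && relativeNames.contains(absolutePath.substring(ROOT.length()));
}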

Or you can optimize it further if you wish; the question now reduces to the problem of checking the existence of a string in a list of strings stored in memory or in a file, which has many well-known optimal solutions.
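
For example, one well-known option is a sorted array plus binary search, which drops the HashSet's per-entry overhead at the cost of O(log n) lookups (untested sketch; fileNames stands for the loaded name collection and pathToCheck is a placeholder):

// build once after loading the list
String[] sorted = fileNames.toArray(new String[0]);
Arrays.sort(sorted);
// each lookup is then O(log n) with no per-entry HashSet overhead
boolean found = Arrays.binarySearch(sorted, pathToCheck) >= 0;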

Below is a very loose, simplistic sample for Windows. It executes dir on an HDD (not SSD) drive root with ~400K files, reads the list, and benchmarks (well, kind of) time and memory for the string-set and MD5-set approaches:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashSet;
import java.util.Set;

public static void main(String[] args) throws Exception {

    final Runtime rt = Runtime.getRuntime();
    System.out.println("mem " + (rt.totalMemory() - rt.freeMemory())
            / (1024 * 1024) + " Mb");

    long time = System.currentTimeMillis();
    // windows command: cd to t:\ and run recursive dir
    Process p = rt.exec("cmd /c \"t: & dir /s /b   > filelist.txt\"");
    if (p.waitFor() != 0)
        throw new Exception("command has failed");
    System.out.println("done executing shell, took "
            + (System.currentTimeMillis() - time) + "ms");
    System.out.println();

    File f = new File("T:/filelist.txt");

    // load into hash set
    time = System.currentTimeMillis();
    Set<String> fileNames = new HashSet<String>(500000);
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(
            new FileInputStream(f), StandardCharsets.UTF_8),
            50 * 1024 * 1024)) {
        for (String line = reader.readLine(); line != null; line = reader
                .readLine()) {
            fileNames.add(line);
        }
    }
    System.out.println(fileNames.size() + " file names loaded took "
            + (System.currentTimeMillis() - time) + "ms");
    System.gc();
    System.out.println("mem " + (rt.totalMemory() - rt.freeMemory())
            / (1024 * 1024) + " Mb");

    time = System.currentTimeMillis();
    // check files
    for (int i = 0; i < 70_000; i++) {
        StringBuilder fileToCheck = new StringBuilder();
        while (fileToCheck.length() < 256)
            fileToCheck.append(Double.toString(Math.random()));
        // contains() needs a String; a StringBuilder would never match
        if (fileNames.contains(fileToCheck.toString()))
            System.out.println("to prevent optimization, never executes");
    }
    System.out.println();
    System.out.println("hash set 70K checks took "
            + (System.currentTimeMillis() - time) + "ms");
    System.gc();
    System.out.println("mem " + (rt.totalMemory() - rt.freeMemory())
            / (1024 * 1024) + " Mb");

    // Test memory/performance with MD5 hash set approach instead of full
    // names
    time = System.currentTimeMillis();
    Set<String> nameHashes = new HashSet<String>(500000);
    MessageDigest md5 = MessageDigest.getInstance("MD5");
    for (String name : fileNames) {
        // decode the raw MD5 bytes as ISO-8859-1, a lossless byte-to-char
        // mapping; decoding them as UTF-8 would corrupt invalid sequences
        // and could make distinct hashes collide
        String nameMd5 = new String(md5.digest(name
                .getBytes(StandardCharsets.UTF_8)), StandardCharsets.ISO_8859_1);
        nameHashes.add(nameMd5);
    }
    System.out.println();
    System.out.println(fileNames.size() + " md5 hashes created, took "
            + (System.currentTimeMillis() - time) + "ms");
    fileNames.clear();
    fileNames = null;
    System.gc();
    Thread.sleep(100);
    System.gc();
    System.out.println("mem " + (rt.totalMemory() - rt.freeMemory())
            / (1024 * 1024) + " Mb");

    time = System.currentTimeMillis();
    // check files
    for (int i = 0; i < 70_000; i++) {
        StringBuilder fileToCheck = new StringBuilder();
        while (fileToCheck.length() < 256)
            fileToCheck.append(Double.toString(Math.random()));
        String md5ToCheck = new String(md5.digest(fileToCheck.toString()
                .getBytes(StandardCharsets.UTF_8)), StandardCharsets.ISO_8859_1);
        if (nameHashes.contains(md5ToCheck))
            System.out.println("to prevent optimization, never executes");
    }
    System.out.println("md5 hash set 70K checks took "
            + (System.currentTimeMillis() - time) + "ms");
    System.gc();
    System.out.println("mem " + (rt.totalMemory() - rt.freeMemory())
            / (1024 * 1024) + " Mb");
}

Output:

mem 3 Mb
done executing shell, took 5686ms

403108 file names loaded took 382ms
mem 117 Mb

hash set 70K checks took 283ms
mem 117 Mb

403108 md5 hashes created, took 486ms
mem 52 Mb
md5 hash set 70K checks took 366ms
mem 48 Mb
– Fedor Losev