Remove comments from large files using Java

Question

I have .sh, .txt, .sql, .pkb, and similar files larger than 10 MB, which means more than 100k lines each.

I want to remove comments from these files and then use the uncommented content further. I have written the following code for it.

/**
 * Removes all commented parts from the file content and also returns the
 * lines that contain declaration syntax, e.g.
 * Create Package packageName <- each declaration line is stored as a
 * separate string in the array
 * 
 * @param file
 * @return file content
 * @throws IOException
 */
private static String[] filterContent(File file) throws IOException {

    String withoutComment = "";
    String declare = "";
    String[] content;
    List<String> readLines = FileUtils.readLines(file);

    int size = readLines.size();
    System.out.println(file.getName() + " Files number of lines "+ size + " at "+new Date());
    String[] declareLines = new String[size];
    int startComment = 0;
    int endComment = 0;
    Boolean check = false;
    int j = 0;
    int i=0;
    // Reading content line by line
    for (String line:readLines) {
        // If line contains */ that means comment is ending in this line,
        // making a note of the line number
        if (line.toString().contains("*/")) {
            endComment = i;
            // Removing the content before */ from the line
            int indexOf = line.indexOf("*/");
            line = line.replace(line.substring(0, indexOf + 2), "");
        }

        // If startComment is assigned fresh value and end comment hasn't,
        // that means the current line is part of the comment
        // Ignoring the line in this case and moving on to the next one
        if ((startComment > 0 && endComment == 0) || (endComment < startComment) || check)
            continue;

        // If line contains /* that means comment is starting in this line,
        // making a note of the line number
        if (line.contains("/*")) {
            startComment = i;
            // Removing the content after /* from the line
            int indexOf = line.indexOf("/*");
            line = line.replace(line.substring(indexOf), "");
            if (i == 0)
                check = true; // means comment in the very first line
        }

        // If line contains -- that means single line comment is present in
        // this line,
        // removing the content after --
        if (line.contains("--")) {
            int indexOf = line.indexOf("--");
            line = line.replace(line.substring(indexOf), "");
        }
        // If line contains # that means single line comment is present in
        // this line,
        // removing the content after #
        if (line.contains("#")) {
            int indexOf = line.indexOf("#");
            line = line.replace(line.substring(indexOf), "");
        }

        // At this point, all commented part is removed from the line, hence
        // appending it to the final content
        if (!line.isEmpty())
            withoutComment = withoutComment + line + " \n";
        // If line contains CREATE its a declaration line, holding it
        // separately in the array
        if (line.toUpperCase().contains(("CREATE"))) {
            // If next line does not contains Create and the current line is
            // the not the last line,
            // then considering two consecutive lines as declaration line,
            if (i < size - 1 && !readLines.get(i + 1).toString().toUpperCase().contains(("CREATE"))) {
                declare = line + " " + readLines.get(i + 1).toString() + "\n";
            } else if (i < size) {// If the line is last line, including
                                    // that line alone.
                declare = line + "\n";
            }

            declareLines[j] = declare.toUpperCase();
            j++;
        }
        i++;
    }
    System.out.println("Read lines "+ new Date());
    List<String> list = new ArrayList<String>(Arrays.asList(declareLines));
    list.removeAll(Collections.singleton(null));

    content = list.toArray(new String[list.size() + 1]);

    withoutComment = withoutComment.toUpperCase();
    content[j] = withoutComment;
    System.out.println("Returning uncommented content "+ new Date());
    return content;
}


 public static void main(String[] args) throws IOException {
        String[] content = filterContent(new File("abc.txt"));
}

The problem with this code is that it's too slow when the file is huge. For a 10 MB file it takes more than 6 hours to remove comments. (The code ran on an SSH server.)

I can also have files of up to 100 MB, for which it takes days to remove comments. How can I remove comments faster?

Update: The question is not a duplicate, as my problem is not solved just by changing the way lines are read. It's the string handling that makes the process slow, and I need a way to make the comment removal faster.

1. Don't keep the whole file in memory. 2. Why do you want to do that? – Axel, Feb 17 '17 at 07:02
First, don't put it into a List; use an InputStream to read the file and analyse each line directly. You can easily find whether a line contains `/*` or `/* ... */`, remove it, and recreate the new file without the comment. Reading a file of more than 100 MB should never take that long ... – AxelH, Feb 17 '17 at 07:04
Possible duplicate of [How to read a large text file line by line using Java?](http://stackoverflow.com/questions/5868369/how-to-read-a-large-text-file-line-by-line-using-java) – AxelH, Feb 17 '17 at 07:44

score 0 · Answer 1 · answered Feb 17 '17 at 06:47

You may create several threads that do the work (proper splitting of your lines is required)

kamehl23

The file may even have 50 lakh (5 million) lines. Won't creating hundreds of threads overload the thread stack? – Harshita Sethi Feb 17 '17 at 07:20

score 0 · Answer 2 · answered Feb 17 '17 at 07:08

Some ideas to make this code faster:

Use an InputStream to read the file and analyse each line directly, storing the new String in the uncommented file. This avoids going through the file content twice (once to create the List<String> readLines, and once more in your iteration).
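A minimal sketch of that streaming approach, assuming UTF-8 input and Java 7 or later. The class name `StreamingFilter` and the helper `stripLineComment` are hypothetical, and only `--` line comments are handled here to keep the example short:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class StreamingFilter {

    // Hypothetical helper: cuts a line at the first "--" marker, if any.
    static String stripLineComment(String line) {
        int idx = line.indexOf("--");
        return idx >= 0 ? line.substring(0, idx) : line;
    }

    // Reads the source once and writes each cleaned line immediately,
    // so only the current line is held in memory.
    static void copyWithoutLineComments(Path source, Path target) throws IOException {
        // Assumption: both files are UTF-8 encoded.
        try (BufferedReader reader = Files.newBufferedReader(source, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String cleaned = stripLineComment(line);
                // Skip lines that are empty once the comment is removed.
                if (!cleaned.trim().isEmpty()) {
                    writer.write(cleaned);
                    writer.newLine();
                }
            }
        }
    }
}
```

Writing each line as soon as it is cleaned means memory use stays flat regardless of file size, which matters more for the 100 MB files than for the 10 MB ones.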

Design: you could use a mapping for the comment syntaxes instead of this redundant code.
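One way to read that suggestion is to keep the markers in a collection and loop over it, instead of repeating the same `if` block per marker. The sketch below uses a plain list rather than a full map; the class name `CommentMarkers` is hypothetical, and the two markers are the ones from the question's code. It does not try to detect markers inside string literals.

```java
import java.util.Arrays;
import java.util.List;

final class CommentMarkers {

    // Assumed set of single-line comment markers, taken from the question's code.
    static final List<String> LINE_MARKERS = Arrays.asList("--", "#");

    // Cuts the line at the earliest marker found; returns it unchanged if none match.
    static String stripLineComments(String line) {
        int cut = line.length();
        for (String marker : LINE_MARKERS) {
            int idx = line.indexOf(marker);
            if (idx >= 0 && idx < cut) {
                cut = idx;
            }
        }
        return line.substring(0, cut);
    }
}
```

Supporting another syntax (for example `//`) then becomes a one-line change to `LINE_MARKERS`.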

Once this is done, it should be much faster. Of course, multithreading could be a solution, but it would require some checks to make sure you don't split the file in the middle of a comment block. So first improve the code, then think about this.
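If multithreading is tried later, one way to sidestep the splitting problem is to parallelise across whole files rather than inside one file, using a fixed-size pool so the number of threads stays bounded. This is a different technique from splitting lines, shown only as a sketch; the class `ParallelFiles` and the method `processAll` are hypothetical names, and Java 8 is assumed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

final class ParallelFiles {

    // Runs `worker` on every input using at most `threads` threads.
    // Results are returned in the same order as the inputs.
    static <T, R> List<R> processAll(List<T> inputs, Function<T, R> worker, int threads)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<R>> futures = new ArrayList<>();
            for (T input : inputs) {
                Callable<R> task = () -> worker.apply(input);
                futures.add(pool.submit(task));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> future : futures) {
                results.add(future.get()); // blocks until that task has finished
            }
            return results;
        } finally {
            pool.shutdown(); // stop accepting new tasks; submitted ones have already completed
        }
    }
}
```

Here `inputs` would be the list of files and `worker` a function that strips comments from one file; because each file is handled by a single task, no comment block is ever split between threads.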

Harshita Sethi · Accepted Answer · 2017-02-19T13:33:50.393

Turns out the biggest problem in my code was the use of Strings. The method used to read lines didn't make much of a difference, but using StringBuilder instead of String to accumulate the uncommented lines changed the performance drastically. Now the same code with StringBuilder takes seconds to remove comments where it took hours earlier.
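To illustrate why this matters, here is a small standalone comparison. The class `ConcatDemo` and the sample line are made up for the demonstration. Since a `String` is immutable, each `result + line` allocates a new string and copies everything accumulated so far, so the total work grows roughly quadratically with the number of lines, while `StringBuilder` appends into a growable buffer.

```java
import java.util.Arrays;

final class ConcatDemo {

    // Repeated String concatenation: every iteration copies the whole result so far.
    static String joinWithString(String[] lines) {
        String result = "";
        for (String line : lines) {
            result = result + line + " \n";
        }
        return result;
    }

    // StringBuilder: appends go into one buffer that grows as needed.
    static String joinWithBuilder(String[] lines) {
        StringBuilder sb = new StringBuilder();
        for (String line : lines) {
            sb.append(line).append(" \n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] lines = new String[10_000];
        Arrays.fill(lines, "SELECT * FROM dual;"); // made-up sample line

        long t0 = System.nanoTime();
        String a = joinWithString(lines);
        long t1 = System.nanoTime();
        String b = joinWithBuilder(lines);
        long t2 = System.nanoTime();

        System.out.println("same output:   " + a.equals(b));
        System.out.println("String concat: " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("StringBuilder: " + (t2 - t1) / 1_000_000 + " ms");
    }
}
```

The exact timings depend on the machine and JVM, but the gap widens quickly as the line count grows, which is consistent with hours turning into seconds on a file with more than 100k lines.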

Here's the updated method. For better performance I've also replaced the List with a BufferedReader.

/**
     * Removes all commented parts from the file content and also returns the
     * lines that contain declaration syntax, e.g.
     * Create Package packageName <- each declaration line is stored as a
     * separate string in the list
     * 
     * @param file
     * @return file content
     * @throws IOException
     */
    private static List<String> filterContent(File file) throws IOException {

        StringBuilder withoutComment = new StringBuilder();
//      String declare = "";
//      String[] content;
//      List<String> readLines = FileUtils.readLines(file);
//
//      int size = readLines.size();
        System.out.println(file.getName() + "  at " + new Date());
        List<String> declareLines = new ArrayList<String>();
        // String line = null;
        int startComment = 0;
        int endComment = 0;
        Boolean check = false;
        Boolean isLineDeclaration = false;

        int j = 0;
        int i = 0;

        InputStream in = new FileInputStream(file);
        BufferedReader reader = new BufferedReader(new InputStreamReader(in));
        String line;
        // Reading content line by line
        while ((line = reader.readLine()) != null) {
            // for (int i = 0; i < size; i++) {
            // line = readLines.get(i).toString();// storing current line data
            // If line contains */ that means comment is ending in this line,
            // making a note of the line number
            if (line.toString().contains("*/")) {
                endComment = i;
                // Removing the content before */ from the line
                int indexOf = line.indexOf("*/");
                line = line.replace(line.substring(0, indexOf + 2), "");
            }

            // If startComment is assigned fresh value and end comment hasn't,
            // that means the current line is part of the comment
            // Ignoring the line in this case and moving on to the next one
            if ((startComment > 0 && endComment == 0) || (endComment < startComment) || check)
                continue;

            // If line contains /* that means comment is starting in this line,
            // making a note of the line number
            if (line.contains("/*")) {
                startComment = i;
                // Removing the content after /* from the line
                int indexOf = line.indexOf("/*");
                line = line.replace(line.substring(indexOf), "");
                if (i == 0)
                    check = true; // means comment in the very first line
            }

            // If line contains -- that means single line comment is present in
            // this line,
            // removing the content after --
            if (line.contains("--")) {
                int indexOf = line.indexOf("--");
                line = line.replace(line.substring(indexOf), "");
            }
            // If line contains # that means single line comment is present in
            // this line,
            // removing the content after #
            if (line.contains("#")) {
                int indexOf = line.indexOf("#");
                line = line.replace(line.substring(indexOf), "");
            }

            // At this point, all commented part is removed from the line, hence
            // appending it to the final content
            if (!line.isEmpty())
                withoutComment.append(line).append(" \n");
            // If line contains CREATE its a declaration line, holding it
            // separately in the array
            if (line.toUpperCase().contains(("CREATE"))) {
                // Start a new declaration entry with this line
                declareLines.add(line.toUpperCase());

                isLineDeclaration = true;
                j++;
            } else if (isLineDeclaration && !line.toUpperCase().contains(("CREATE"))) {
                // The line right after a CREATE line is treated as its
                // continuation and appended to the previous declaration entry
                declareLines.set(j - 1, declareLines.get(j - 1) + " " + line.toUpperCase());
                isLineDeclaration = false;
            }
            i++;
        }

        reader.close();
        System.out.println("Read lines " + new Date());
//      List<String> list = new ArrayList<String>(Arrays.asList(declareLines));
        declareLines.removeAll(Collections.singleton(null));

//      content = list.toArray(new String[list.size() + 1]);

//      withoutComment = withoutComment..toUpperCase();
        declareLines.add(withoutComment.toString().toUpperCase());
        System.out.println("Returning uncommented content " + new Date());
        return declareLines;
    }
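The method above tracks block comments with line indices and a `check` flag that is never reset once set, so a `/*` on the very first line causes every later line to be skipped. A possible alternative, shown here only as a sketch, is a single boolean that records whether the scanner is currently inside `/* ... */`. The class name `CommentStripper` is hypothetical; the sketch does not upper-case the output or collect CREATE lines, and it does not recognise markers inside string literals.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

final class CommentStripper {

    // Strips /* ... */ blocks (which may span lines) plus "--" and "#" line comments.
    // Assumption: comment markers inside string literals are not treated specially.
    static String strip(Reader source) throws IOException {
        StringBuilder out = new StringBuilder();
        boolean inBlock = false; // true while inside an unclosed /* ... */
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                StringBuilder kept = new StringBuilder();
                int pos = 0;
                while (pos < line.length()) {
                    if (inBlock) {
                        int end = line.indexOf("*/", pos);
                        if (end < 0) {
                            break; // the rest of this line is still comment
                        }
                        inBlock = false;
                        pos = end + 2;
                    } else if (line.startsWith("/*", pos)) {
                        inBlock = true;
                        pos += 2;
                    } else if (line.startsWith("--", pos) || line.charAt(pos) == '#') {
                        break; // line comment: drop the rest of the line
                    } else {
                        kept.append(line.charAt(pos));
                        pos++;
                    }
                }
                // Keep only lines that still have content after stripping.
                if (!kept.toString().trim().isEmpty()) {
                    out.append(kept).append('\n');
                }
            }
        }
        return out.toString();
    }
}
```

It could be called as `CommentStripper.strip(new FileReader(file))`; the reader is closed by the method once the input has been consumed.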
