Text Justification Algorithm Memory Leak and also incorrect amount of characters per line

Question

I'm having trouble implementing a text justification algorithm on a large book.The program is able to take in smaller passages but once I load in the whole book I get a memory leak. More importantly, my algorithm is not putting the right amount of characters on a single line. I'm not quite sure why this is but if someone can please take a look at this and help me figure it out it will be greatly appreciated!

public class TextJustification {

 public String justify(String words[], int width) {

    int cost[][] = new int[words.length][words.length];

    //next 2 for loop is used to calculate cost of putting words from
    //i to j in one line. If words don't fit in one line then we put
    //Integer.MAX_VALUE there.
    for (int i = 0; i < words.length; i++) {
        cost[i][i] = width - words[i].length();
        for (int j = i + 1; j < words.length; j++) {
            cost[i][j] = cost[i][j - 1] - words[j].length() - 1;
        }
    }

    for (int i = 0; i < words.length; i++) {
        for (int j = i; j < words.length; j++) {
            if (cost[i][j] < 0) {
                cost[i][j] = Integer.MAX_VALUE;
            } else {
                cost[i][j] = (int) Math.pow(cost[i][j], 2);
            }
        }
    }

    //minCost from i to len is found by trying
    //j between i to len and checking which
    //one has min value
    int minCost[] = new int[words.length];
    int result[] = new int[words.length];
    for (int i = words.length - 1; i >= 0; i--) {
        minCost[i] = cost[i][words.length - 1];
        result[i] = words.length;
        for (int j = words.length - 1; j > i; j--) {
            if (cost[i][j - 1] == Integer.MAX_VALUE) {
                continue;
            }
            if (minCost[i] > minCost[j] + cost[i][j - 1]) {
                minCost[i] = minCost[j] + cost[i][j - 1];
                result[i] = j;
            }
        }
    }
    int i = 0;
    int j;

    System.out.println("Minimum cost is " + minCost[0]);
    System.out.println("\n");
    //finally put all words with new line added in
    //string buffer and print it.
    StringBuilder builder = new StringBuilder();
    do {
        j = result[i];
        for (int k = i; k < j; k++) {
            builder.append(words[k] + " ");
        }
        builder.append("\n");
        i = j;
    } while (j < words.length);

    return builder.toString();
}

public static void main(String args[]) throws IOException {

    File read = new File("TaleOfTwoCities.txt");
    Scanner in = new Scanner(read);

    ArrayList<String> temporary = new ArrayList<String>();

    while (in.hasNext()) {
        temporary.add(in.next());
    }

    String[] words1 = temporary.toArray(new String[temporary.size()]);


    //String words1[] = {"I", "am", "so", "stuck,", "please,", "help", "me"};
    TextJustification awl = new TextJustification();
    System.out.println(awl.justify(words1, 60));
}
}

Here is my code I hope someone is able to help me out as I have been scratching my head for a few days. Also the link to the txt file I am trying to parse is https://www.dropbox.com/s/5sy5zp4n3b6wgfz/TaleOfTwoCities.txt?dl=0 Thanks again guys and hope someone can help out!

EDIT: This is an image of how I am trying to justify the text since I didn't make that clear enough before: https://www.dropbox.com/s/f9xt83nflwj1q5p/project1.png?dl=0

You should be more explicit about what you mean is a "memory leak". OutOfMemoryException? Stack overflow? ...?? — Highbrainer, Mar 15 '19 at 13:44
*Running out* of memory is not the same thing as a *memory leak*. Pure Java code does not leak memory unless the JVM is buggy. It's possible that your program is holding on to objects that it doesn't need any more, which is similar to a leak, but it's more likely that your VM just needs more memory to work with to hold the data of "a large book". — John Bollinger, Mar 15 '19 at 13:46
@Highbrainer Im sorry the error is java.lang.OutOfMemoryError — Danny Qanaah, Mar 15 '19 at 14:02
@JohnBollinger My apologies I should've clarified it's a java.lang.OutOfMemoryError. I know parsing this data is definitely possible, but I'm not sure how to fix my issue. Thanks — Danny Qanaah, Mar 15 '19 at 14:04
"Pure Java code does not leak memory" is not true. You can have JVM holding objects that cannot be referenced from running code that wont be collected by GC. — Raven221221221, Mar 15 '19 at 14:04
Pure Java _can_ leak memory. There is a nice example here on SO using a `ClassLoader` repeatedly. — Raven221221221, Mar 15 '19 at 14:21
@Raven221221221, I said that if pure Java leaks memory then that constitutes a JVM bug, not that pure Java never leaks memory. However, it's still unclear to me what scenario you have in mind. There are so-called "ClassLoader leaks", if that's what you mean, but for these (1) it is not clear that the situations in which they arise involve objects that cannot be reached from any running code, and (2) it is *certainly* unclear that if the objects indeed can't be reached, then protecting them from GC should not be considered buggy. — John Bollinger, Mar 15 '19 at 14:37
@JohnBollinger I think I quoted you correctly "Pure Java code does not leak memory unless the JVM is buggy." (1) not sure what you mean by _not clear_ - these objects are actually unreachable. (2) I would think it works as designed - I have never read/seen any mention of this being thought of as a bug. This `ClassLoader` case is pure Java and every JVM works like this, leading me to believe this is not a bug. SO OP https://stackoverflow.com/questions/6470651/creating-a-memory-leak-with-java — Raven221221221, Mar 15 '19 at 15:12

ruakh · Answer 1 · 2019-03-15T14:38:27.057

1

The program is able to take in smaller passages but once I load in the whole book I get a memory leak.

I don't think your code has an actual "memory leak" — that is, your code isn't holding on to any object references that your algorithm doesn't actually need. Rather, it's just that your algorithm needs a lot of memory when applied to a large text.

Specifically, the problem is that you're using the algorithm for something it wasn't designed for. English prose is broken into paragraphs that are separated by line breaks (plus some extra horizontal and/or vertical space) and justified separately. Your code is eliminating all line-breaks in the entire novel, and trying to justify the entire thing as a single enormous paragraph. If that's intentional, then you'll probably want to use a less-expensive algorithm that provides less-perfect justification but doesn't require as much memory and computation.

More importantly, my algorithm is not putting the right amount of characters on a single line.

The algorithm (intentionally) allows the last line of the paragraph to be any length. That is, again, in keeping with the conventions of English prose. If that's not what you want, you'll need to make several adjustments to the algorithm. (Before you do that, you'll want to spend some time making sure you understand the algorithm.)

Edited to add (per comments):

Yes, that is completely what I want, to strip the text of all formatting and apply a line break once words of maximum 60 characters are in the line. I figured that the last line being able to hold any amount is fine since the rest of the text before would be formatted correctly, so the last line would always be a maximum of 60 characters. You speak of a less expensive algorithm, where can I find that?

The less expensive algorithm is to keep a variable currentLineLength (initially zero), and for each word:

If (currentLineLength == 0 ? 0 : currentLineLength + 1) + word.length() <= 60, print a space, print the word, and update currentLineLength to (currentLineLength == 0 ? 0 : currentLineLength + 1) + word.length().
Otherwise:
- While word.length() > 60:
  - Print a newline, print word.substring(0, 59) (= the first 59 characters of word), print a hyphen.
  - Update word = word.substring(59).
- Print a newline, print word, and update currentLineLength to word.length().

edited Mar 15 '19 at 14:38

answered Mar 15 '19 at 14:02

ruakh

175,680
26
273
307

Yes, that is completely what I want, to strip the text of all formatting and apply a line break once words of maximum 60 characters are in the line. I figured that the last line being able to hold any amount is fine since the rest of the text before would be formatted correctly, so the last line would always be a maximum of 60 characters. You speak of a less expensive algorithm, where can I find that? – Danny Qanaah Mar 15 '19 at 14:15
@DannyQanaah: I can't even begin to imagine *why* you would want to convert the entire book, chapter headings and all, into a single paragraph; but if that's what you want, then the behavior you want is pretty simple: just keep track of the current line length, and whenever adding a word would put you over 60 characters, finish the current line and start a new one. The only special case you need to worry about is, what if a single word is more than 60 characters? In that case you'll probably want to force-hyphenate it after 59 characters and leave the rest of the word for the next line. – ruakh Mar 15 '19 at 14:31
Thank you @ruakh You have been helpful and hopefully I can implement this. Also, to be honest this is a programming assignment that I have to complete so I have to adhere to rules specified. I also don't know why anyone would want to do something like this – Danny Qanaah Mar 15 '19 at 14:46
So, @ruakh, it turns out that it *is* true that the OP really only needs to hold on to a few words at a time. Only one, really. – John Bollinger Mar 15 '19 at 14:50
@DannyQanaah: My pleasure! – ruakh Mar 15 '19 at 15:25
@JohnBollinger: Both of our answers suggested that the OP consider using a simpler algorithm that doesn't balance as well. The problem with your answer is that it simply *assumed* that the sophisticated DP algorithm was a bug. (I don't believe for a second that you understood what it was for.) The OP turned out to be OK with our suggestion; if you want to brag about that, don't let me stop you. – ruakh Mar 15 '19 at 15:34
@ruakh, what I assumed was that the task was a workable exercise. No, I did not analyze the original algorithm past seeing that it required an unworkable amount of memory. I *concluded* that the algorithm was a bug on the basis that the problem -- assumed solvable -- could not reasonably be solved *that way*. – John Bollinger Mar 15 '19 at 15:43
@JohnBollinger: There are multiple ways that could be workable. (The obvious one is to preserve the novel's paragraphs and justify each independently.) So your conclusion did not follow from your assumption. – ruakh Mar 15 '19 at 15:55

John Bollinger · Answer 2 · 2019-03-15T14:47:47.483

You do not have a memory leak. You are simply trying to use more memory than the JVM has available or can obtain. Sometimes a good solution to such a problem is simply to allow the VM to use more memory (this can be done via command-line options), but your particular program is flagrantly inefficient in its memory use, and I don't think you have any chance of making it work as written for large texts on anything short of a supercomputer.

First, you read the entire text into memory, as an ArrayList of separate Strings. This is pretty wasteful already, for you really need only to hold on to a few words at a time -- fewer than will fill two justified lines -- in order to compute the required justification.

But the real killer is this:

    int cost[][] = new int[words.length][words.length];

Your words is an array containing all the words in the work. For A Tale of Two Cities, that's about 135000 words, and you're creating a 2D array with the square of that number of elements, each of them four bytes wide. That would require around 73 GB of memory.

If you are at liberty to choose a different justification algorithm, then a good solution might be to switch to a line-by-line justification approach, enabling you to read only enough words at a time as needed, and to output each line as it is justified.

If you must use the current approach then you probably need to perform justification on smaller blocks of text -- around one tenth as many words at a time or fewer. For a book with distinct chapters or sections, it would make sense to justify on a chapter-by-chapter basis.

As for

More importantly, my algorithm is not putting the right amount of characters on a single line.

we cannot really address that, as you haven't specified how the right number of characters is supposed to be determined, or indeed any details of the specific form of justification you intend to implement.

Update

Per your comment on another answer, the justification rule you want to apply seems to be simply to greedily put as many words as possible onto each line, up to the specified maximum line length. The justification algorithm you have implemented does not do exactly that, however, and it is far more costly, both in memory and in processing time, than you need for the justification style you want.

You can and should perform a rewrite to use a simpler algorithm, as I described above. Read words from the input and pack them into lines as you go, inserting a space between words when the next one fits on the current line, and a line break instead when it doesn't. You don't really need to retain even a whole line in memory at a time, just a count of the current line length, and the single next word. As for getting the right number of characters, don't forget to count the spaces between words.

Re: "you really need only to hold on to a few words at a time -- fewer than will fill two justified lines -- in order to compute the required justification": That's not true. This algorithm uses a cost function to balance the entire text, not just two consecutive lines. In order to determine whether a given word should go at the end of one line or the beginning of the next, you may need to consider consequences that arise many lines later. — ruakh, Mar 15 '19 at 14:09
Well I suppose that's a question of what justification rules are in fact wanted / required @ruakh. The OP does not specify. I agree that my comments suppose that a line-by-line justification -- i.e. a different algorithm -- would be satisfactory, and I will momentarily update this answer to clarify. — John Bollinger, Mar 15 '19 at 14:15
I'm sorry I will update my post with exactly how I want the text justified. Here is an image of what I want, https://www.dropbox.com/s/f9xt83nflwj1q5p/project1.png?dl=0 — Danny Qanaah, Mar 15 '19 at 14:22

Text Justification Algorithm Memory Leak and also incorrect amount of characters per line

2 Answers2