Adding N line breaks in a paragraph for the narrowest result

Question

Let's say we have a paragraph such as this one:

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

Assuming a fixed width font, we want to add exactly N line breaks (by replacing space characters only) to produce a N+1 line block of text.

Here's an example of output for N=8, we get a max line width of 51:

Lorem ipsum dolor sit amet, consectetur adipiscing 
elit, sed do eiusmod tempor incididunt ut labore 
et dolore magna aliqua. Ut enim ad minim veniam, 
quis nostrud exercitation ullamco laboris nisi ut 
aliquip ex ea commodo consequat. Duis aute irure 
dolor in reprehenderit in voluptate velit esse 
cillum dolore eu fugiat nulla pariatur. Excepteur 
sint occaecat cupidatat non proident, sunt in culpa 
qui officia deserunt mollit anim id est laborum.

How do we find which space characters to replace with line breaks to obtain the narrowest (least number of characters on the longest line) with the fewest attempts?

sorry, I am just trying to understand the question. Is N given? — vabii, Jun 02 '16 at 16:11
Oh yes, basically it would be a function where the paragraph string and N would be provided as inputs and the function would edit the paragraph string with the right line breaks. — Michel Rouzic, Jun 02 '16 at 16:13

score 6 · Answer 1 · answered Jun 02 '16 at 16:20

Suppose the text consists of m words, which we will number from 1 to m. Define f(i, j) to be the maximum width (in characters) of any line in an optimal (minimal-width) solution to the subproblem consisting of just the first i words, under the restriction that exactly j line breaks are used. Then the width of the best possible sequence of breaks to the entire problem will be given by f(m, n). This can be solved fairly efficiently using dynamic programming.

Let the total length in characters of the fragment between word i and word j >= i, inclusive, be len(i, j). (This is easy to calculate: Just compute an array of m+1 values len0[j], for 0 <= j <= m, each giving the total length in characters of the fragment consisting of the first j words; then len(i, j) is just len0[j] - len0[i-1] - 1, with the convention that len0[0] = -1.)

The base case is easy:

f(i, 0) = len(0, i)   (i.e., if there are no line breaks)

The recursive case is:

f(i, j) = the minimum over all 0 <= k < i of max(f(k, j-1), len(k+1, i))

That is, to find the best way to break the first i words into j+1 lines (i.e. using j line breaks), we can try the following for every shorter k-word prefix: determine the best way to break that k-word prefix into j lines (i.e. using j-1 line breaks), and compare the maximum width we get from that with the width that results from putting the rest of the i-k words on a single line at the end. Each prefix gives us a different candidate solution, so we can pick the best of all of them.

Constructing a solution

Now that we can calculate the optimal width f(m, n), how can we use this to actually construct a solution? Fortunately there is a standard technique in dynamic programming for this. The fastest way is to record, during calculation of f(i, j), the (actually a, since there may in general be multiple optimal solutions) value of k that produced the minimum in a predecessor table pred[i][j]. Having calculated f(m, n) and filled in the predecessor table, we can then construct an optimal solution by walking backwards through it: pred[i][j] tells us a value k such that we can produce an optimal solution by adding a line break after word k, so add a line break there, and then look at pred[k][j-1] to find the position of the previous line break, continuing on until j reaches 0.

Time and space complexity

If the recursion is memoised using dynamic programming, then there are at most O(mn) different parameter combinations that f() could be called with (i ranges between 0 and m, and j ranges between 0 and n), and the time spent outside of recursive calls is O(m) (k can range between 0 and m, and the computation for each value of k is O(1)), so the time complexity of this solution is O(nm^2). The space complexity is O(mn), but by switching to a bottom-up DP this can easily be reduced to O(m), since while computing f(i, j) we only ever need access to values of f() for which the second parameter is j-1, meaning it suffices to actually store just the size-(m+1) array of computed values f(q, j-1) for 0 <= q <= m.

I just finished writing the code and when I was going to post it, I saw your answer. You have described the procedure so well that I decided to not post my code. Upvoted for the explanation. :) — Vedang Mehta, Jun 02 '16 at 16:25
@VedangMehta: Thanks, but by all means post your answer! Code is useful to see :) — j_random_hacker, Jun 02 '16 at 16:26
I'd like to see some code too :) that might make it easier to understand! — Michel Rouzic, Jun 02 '16 at 16:27

score 5 · Accepted Answer · edited May 23 '17 at 11:52

5

(Adapted from here, How to partition an array of integers in a way that minimizes the maximum of the sum of each partition?)

If we consider the word lengths as a list of numbers, we can binary search the partition.

Our max length ranges from 0 to sum (word-length list) + (num words - 1), meaning the spaces. mid = (range / 2). We check if mid can be achieved by partitioning into N sets in O(m) time: traverse the list, adding (word_length + 1) to the current part while the current sum is less than or equal to mid. When the sum passes mid, start a new part. If the result includes N or less parts, mid is achievable.

If mid can be achieved, try a lower range; otherwise, a higher range. The time complexity is O(m log num_chars). (You'll also have to consider how deleting a space per part, meaning where the line break would go, features into the calculation.)

edited May 23 '17 at 11:52

Community

1
1

answered Jun 02 '16 at 17:29

גלעד ברקן

23,602
3
25
61

Nice, this will be faster than mine and easier to implement! (Except in the incredibly unlikely case that mn < log(num_chars)... Unlikely to happen with any human language ;-) – j_random_hacker Jun 02 '16 at 18:05
@j_random_hacker Thanks. I thought you said the DP can be reduced to O(m). – גלעד ברקן Jun 02 '16 at 18:50
That's the *space* complexity. – j_random_hacker Jun 02 '16 at 18:58
So basically just setting a threshold, filling each line up to that threshold, seeing if it's too little or too much and doing it over again with a threshold? That's almost disappointingly too simple ;). It can probably be made to take less attempts than a naive binary tree approach by doing more educated guesses of thresholds, I'll have to experiment with that. Btw would it really be the faster approach? I thought there might be a more direct approach, but I guess not. – Michel Rouzic Jun 02 '16 at 20:55
@MichelRouzic what does your data look like? That is, how many words in the paragraph we're optimizing, and the range in word length? – גלעד ברקן Jun 02 '16 at 22:56
Data is just typical blocks of English text, it's really for general purpose text displaying. My point is if you want 100 lines you don't need to start with ratios of 1/2, then 1/4, 1/8, 1/16, 1/32, and so on, that's a waste of time. You already know that 1/64 will likely be too much and 1/128 too little, you can start with an educated guess that it will be about 1/100 + I guess about half the length of a typical word and go up or down more or less depending on what you get. – Michel Rouzic Jun 02 '16 at 23:12
@MichelRouzic not sure. I guess I would experiment with different initial guesses. – גלעד ברקן Jun 02 '16 at 23:21
It seems that I can get within ±1 characters of the correct limit on the first estimation in most cases! And that first estimate is itself just a very few (2-4) characters from the lower bound, so in most cases if done right it might take only 1 or 2 attempts. I'll try it more extensively and post the resulting algorithm. – Michel Rouzic Jun 03 '16 at 09:17
Here's a plot of the difference between the fractional lower bound and the optimal threshold for a text depending on how many lines are wanted: http://i.imgur.com/yrCYzvs.png It seems it averages to half an average word (3.28 for the text used here), so that'll probably be the first guess then I'll go one up or down until I go one too far. – Michel Rouzic Jun 03 '16 at 18:06
In average? Well I'll use the actual average for the text, since I'm already counting the words and the total characters. And I made sure that the threshold couldn't be smaller than the longest word. – Michel Rouzic Jun 04 '16 at 00:46

RomCoo · Answer 3 · 2016-06-02T17:30:36.200

My try and error attempt. I am not quite sure if you always get the shortest line width, but the algorithm is fast and easy to understand/implement. And I think in most cases this should fit the needs

Let's assume you have M number of characters and you want to insert N linebreaks. Then you get the shortest possible linewidth: L_min = RoundToInfinity((M-N)/N+1) (N-M because we clear N spaces)
Fill each line with words, so that the linewidth is smaller or equal to L_min. This way your last line will contain a lot more characters then the others.
Now always search the largest line (at start this will be the last line), take the first word of it and put it at the and of the previous line. Repeat till first line is the longest.
At all time you should store your actual maximal linewidth L, so you can restore the situaltion when L was smallest.

Adding N line breaks in a paragraph for the narrowest result

3 Answers3

Constructing a solution

Time and space complexity