
I want to split a string into chunks, each of which is within a maximum character count, say 2000 and does not split a word.

I have tried doing as below:

text.chars.each_slice(2000).map(&:join)

but sometimes, words are split. I have tried some regex:

text.scan(/.{1,2000}\b|.{1,2000}/).map(&:strip)

from this question, but I don't quite get how it works and it gives me some erratic behavior, sometimes giving chunks that only contain periods.
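A minimal reproduction may help pin down the stray periods. It assumes the text contains a line break directly after sentence-ending punctuation; the sample string below is made up for illustration. `.` does not match a newline by default, and `\b` only matches next to a word character, so the first alternative stops just before the final period and the second alternative then picks up the period by itself:

```ruby
# Illustrative input: two sentences separated by a line break.
text = "Hello world.\nBye now."

chunks = text.scan(/.{1,2000}\b|.{1,2000}/).map(&:strip)

# The greedy .{1,2000} runs to the end of the line, then backtracks to the
# last word boundary, which sits between "d" and ".". The period is left
# over and is matched alone by the second alternative.
p chunks
# prints ["Hello world", ".", "Bye now", "."]
```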

Any pointers will be greatly appreciated.

sawa
Muaad
  • A Google search for "ruby wrap paragraph" yields 300,000 results: https://www.google.com/search?q=ruby+wrap+paragraph – Phlip Mar 03 '18 at 17:36
  • Show an explicit example of that "erratic behavior". – Stefan Pochmann Mar 03 '18 at 17:41
  • @StefanPochmann Something like this: `["Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa. Cum sociis natoque penatibus et magnis dis parturient montes", ".", "nascetur ridiculus mus. Donec quam felis, ultricies nec, pellentesque eu, pretium quis, sem. Nulla consequat massa quis enim. Donec pede justo, fringilla vel, aliquet nec, vulputate eget, arcu. In enim justo, rhoncus ut, imperdiet a, venenatis vitae, justo. Nullam dictum felis eu pede mollis pretium", "."]` – Muaad Mar 03 '18 at 17:46
  • @Phlip I am not looking for that. I just have a very long text that I need to split given the conditions above. I am not looking for paragraphs. – Muaad Mar 03 '18 at 17:47
  • I'm pretty sure there isn't going to be a tiny, simple solution. My guess is you'll need to manually loop to find each break point. – Max Mar 03 '18 at 17:51
  • @Max No one-line regex solution? This one: `text.scan(/.{1,2000}\b|.{1,2000}/).map(&:strip)` comes close. Works most of the time. – Muaad Mar 03 '18 at 17:56
  • How about `/.{1,2000}(?: |$)/`? – Stefan Pochmann Mar 03 '18 at 17:57
  • @StefanPochmann This `/.{1,2000}(?: |$)/` seems to work in my tests. I gave it an 8000-character string and it split it into chunks of sizes 1993, 1996, 1991, 1997 and 19 and, as far as I could see, words seem to be intact too. I will put it in my code, give it many different strings, and see how it behaves. Thanks. – Muaad Mar 03 '18 at 18:20
  • @StefanPochmann Your solution is closest to what I want. The only problem is that most chunks have character counts that are way lower than the maximum of 2000 characters. Tested with some really long strings and am getting chunks that are very short. Is there a way for character counts to be just short of 2000 for each chunk with your solution? – Muaad Mar 03 '18 at 18:40
  • @Muaad 1993, 1996, 1991 and 1997 aren't much lower than 2000, are they? If you do get much lower than 2000 most of the time, that must come from very different data. Which I can't see. – Stefan Pochmann Mar 03 '18 at 19:32
  • @StefanPochmann - There is a little more to it than that. The regex has to handle whitespace and conditions where a sequence is > 2000. –  Mar 03 '18 at 19:50
  • @sln Well, for other whitespace use `\s`. I somewhat assumed input is one paragraph, where you wouldn't have other whitespace. Not sure what you mean with sequence. – Stefan Pochmann Mar 03 '18 at 20:06
  • @StefanPochmann See example here: https://browserbot.muaad.me/pages/string_test – Muaad Mar 03 '18 at 20:20
  • So I guess you want to combine paragraphs? Try `text.gsub(/\s+/, ' ').scan(/.{1,2000}(?: |$)/).map(&:strip)` – Stefan Pochmann Mar 03 '18 at 20:28
  • @StefanPochmann This last solution gives me the outcome I had in mind. Chunk sizes are now reasonably just under 2000, words are not split and the maximum character count of 2000 per chunk is maintained. Thanks. – Muaad Mar 03 '18 at 22:00
  • Yeah, it's easy if there are no line breaks or ws controls converted to a space, but your text is altered from the original. –  Mar 03 '18 at 22:36
  • @sln Yes. I lost some line breaks, which would have made things easier to read. I am still looking into that. I am getting an error with your solution, though I think it would have worked. The regex is just too complicated for me to understand; I need to brush up on my regex. Thanks, both of you. You have pointed me in the right direction. – Muaad Mar 03 '18 at 22:43
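The approach the comments converge on can be sketched as a small method. The name `chunk_words` and its `max` parameter are assumptions for illustration, not part of the thread. Two caveats seem worth knowing: each raw match may include one trailing space beyond the limit (removed by `strip`), and a word longer than the limit is partly dropped without any error, because `scan` simply moves on to the next position where the pattern fits.

```ruby
# Hypothetical helper wrapping the one-liner from the comments.
def chunk_words(text, max)
  text.gsub(/\s+/, ' ')               # collapse whitespace runs, including line breaks
      .scan(/.{1,#{max}}(?: |$)/)     # up to max characters, then a space or the end
      .map(&:strip)
end

p chunk_words("Now is the time for all good people to party", 10)
# prints ["Now is the", "time for", "all good", "people to", "party"]

p chunk_words("abcdefghij kl", 5)
# prints ["fghij", "kl"]  (the first five letters are lost)
```

If oversized words can occur in the input, checking that the chunks still add up to the collapsed text is one way to detect the loss.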

3 Answers

3

Code

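def max_groups(str, n)
  arr = []
  pos = 0
  loop do
    break arr if pos == str.size
    # \G anchors the match at pos, so no characters can be skipped.
    # The greedy .{1,n} takes the longest piece that either ends in a space
    # or is followed by a space or the end of the string.
    m = str.match(/\G.{1,#{n}}(?:(?<=[ ])|(?=[ ]|\z))/, pos)
    return nil if m.nil?  # some word is longer than n characters
    arr << m[0]
    pos += m[0].size
  end
end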

Examples

str = "Now is the time for all good people to party"
  #    12345678901234567890123456789012345678901234
  #    0         1         2         3         4

max_groups(str, 5)
  #=> nil
max_groups(str, 6)
  #=> ["Now is", " the ", "time ", "for ", "all ", "good ", "people", " to ", "party"]
max_groups(str, 10)
  #=> ["Now is the", " time for ", "all good ", "people to ", "party"]
max_groups(str, 14)
  #=> ["Now is the ", "time for all ", "good people to", " party"]
max_groups(str, 15)
  #=> ["Now is the time", " for all good ", "people to party"]
max_groups(str, 29)
  #=> ["Now is the time for all good ", "people to party"]
max_groups(str, 43)
  #=> ["Now is the time for all good people to ", "party"]
max_groups(str, 44)
  #=> ["Now is the time for all good people to party"]

str = "How        you do?"
  #    123456789012345678
  #    0         1

max_groups(str, 4)
  #=> ["How ", "    ", "   ", "you ", "do?"]
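The splitting rule illustrated by the examples (take the longest piece of at most `n` characters that either ends in a space or is followed by a space or the end of the string) can also be sketched with `String#scan`. The method name `lossless_chunks` is hypothetical. Because nothing is stripped, joining the chunks gives back the original string, and that property doubles as the failure check:

```ruby
# Hypothetical variant of max_groups built on String#scan.
def lossless_chunks(str, n)
  # \G forces every match to start where the previous one ended.
  chunks = str.scan(/\G.{1,#{n}}(?:(?<=[ ])|(?=[ ]|\z))/)
  # If scanning stopped early, some word did not fit within n characters.
  chunks.join == str ? chunks : nil
end

str = "Now is the time for all good people to party"
p lossless_chunks(str, 10)
# prints ["Now is the", " time for ", "all good ", "people to ", "party"]
p lossless_chunks(str, 5)
# prints nil ("people" has six letters)
```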
Cary Swoveland
1

You could do a Notepad style word wrap.
Just construct the regex using the maximum characters per line quantifier range {1,N}.

The example below uses 32 max per line.

https://regex101.com/r/8vAkOX/1

Update: To include linebreaks within the range, add the dot-all modifier (?s)
Otherwise, stand alone linebreaks are filtered.

(?s)(?:((?>.{1,32}(?:(?<=[^\S\r\n])[^\S\r\n]?|(?=\r?\n)|$|[^\S\r\n]))|.{1,32})(?:\r?\n)?|(?:\r?\n|$))

The chunks are in $1, and you could replace with $1\r\n to get a display
that looks wrapped.

Explained

 (?s) # Span line breaks
 (?:
      # -- Words/Characters 
      (                       # (1 start)
           (?>                     # Atomic Group - Match words with valid breaks
                .{1,32}                 #  1-N characters
                                        #  Followed by one of 4 prioritized, non-linebreak whitespace
                (?:                     #  break types:
                     (?<= [^\S\r\n] )        # 1. - Behind a non-linebreak whitespace
                     [^\S\r\n]?              #      ( optionally accept an extra non-linebreak whitespace )
                  |  (?= \r? \n )            # 2. - Ahead a linebreak
                  |  $                       # 3. - EOS
                  |  [^\S\r\n]               # 4. - Accept an extra non-linebreak whitespace
                )
           )                       # End atomic group
        |  
           .{1,32}                 # No valid word breaks, just break on the N'th character
      )                       # (1 end)
      (?: \r? \n )?           # Optional linebreak after Words/Characters
   |  
      # -- Or, Linebreak
      (?: \r? \n | $ )        # Stand alone linebreak or at EOS
 )
  • This works well. Thanks. Is it possible to have character counts closer to the maximum for each chunk? Chunk sizes are now all over the place. Like, if my maximum character count per chunk is 2000, character counts of just under 2000 would make things more consistent. Is there a way to control the chunk sizes so that they can be close to 2000? Thanks. – Muaad Mar 03 '18 at 19:37
  • @Muaad - I used `32` in my example. I can assure you that `{1,2000}` will work if your engine supports an upper range of _2000_. I use this exact regex in a commercial product. Make a test page for an online Ruby tester using this regex with your sample data... Post back that link. –  Mar 03 '18 at 19:41
  • @Muaad - Also, the regex is designed to get the _maximum_ ( <= 2000 ) chunk/line size without breaking up words. So, yes it is as close to 2000 as it gets. –  Mar 03 '18 at 19:45
  • I set up a page here (https://browserbot.muaad.me/pages/string_test) with some example text and how this regex splits it up. See the different chunk sizes. – Muaad Mar 03 '18 at 20:18
  • @Muaad - I can't see the code there, but you could use the same regex and use the _Dot-All_ modifier. I.e. put `(?s)` at its beginning. Also, put some delimiters in the replacement `$1\r\n-----------\r\n`. –  Mar 03 '18 at 20:41
  • @Muaad - I've updated the regex to span line breaks. –  Mar 03 '18 at 21:31
  • Honestly, I have no clue how this regex works. Haven't done much regex. Am getting a `SyntaxError undefined group option:` error with your update. Any idea? – Muaad Mar 03 '18 at 21:58
  • @Muaad - `(?XX)` where XX are inline modifiers. Take it out and add the _dot_all_ modifier (usually `s`) as a function option: either `/regex/gs` or the instance option new regex("regex", "options"). I'm not sure how Ruby works. –  Mar 03 '18 at 22:14
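On the `undefined group option` error: Ruby's regex engine does not accept `(?s)`. The dot-all behaviour is enabled with the `m` flag instead (or inline as `(?m)`). Below is a sketch of the pattern adapted to Ruby on that basis. The method name `wrap_chunks` is made up for illustration, and this is a starting point rather than a verified port of the original.

```ruby
# Hypothetical Ruby adaptation of the regex above; n is the maximum chunk length.
def wrap_chunks(text, n)
  # Same pattern as in the answer, with the /m flag replacing (?s).
  re = /(?:((?>.{1,#{n}}(?:(?<=[^\S\r\n])[^\S\r\n]?|(?=\r?\n)|$|[^\S\r\n]))|.{1,#{n}})(?:\r?\n)?|(?:\r?\n|$))/m
  # scan returns one array per match; group 1 is nil for stand-alone line breaks.
  text.scan(re).flatten.compact
end

p wrap_chunks("aaa bbb ccc", 5)
# prints ["aaa ", "bbb ", "ccc"]
```

Chunks can carry one trailing whitespace character beyond `n`, so applying `rstrip` to each chunk may be needed when the limit is strict.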
0

This is what worked for me (thanks to @StefanPochmann's comments):

text = "Some really long string\nwith some line breaks"

The following first collapses every run of whitespace (including line breaks) into a single space, then breaks the string up.

text.gsub(/\s+/, ' ').scan(/.{1,2000}(?: |$)/).map(&:strip)

The resulting chunks of strings will lose all the line breaks (\n) from the original string. If you need to maintain the line breaks, you need to replace them all with some random placeholder (before applying the regex), for example: (br), that you can use to restore the line breaks later. Like this:

text = "Some really long string\nwith some line breaks".gsub("\n", "(br)")

After we run the regex, we can restore the line breaks for the new chunks by replacing all occurrences of (br) with \n like this:

chunks = text.gsub(/\s+/, ' ').scan(/.{1,2000}(?: |$)/).map(&:strip)
chunks.each{|chunk| chunk.gsub!('(br)', "\n")}

Looks like a long process but it worked for me.

Muaad
  • Consider using [String#squeeze](http://ruby-doc.org/core-2.4.0/String.html#method-i-squeeze) rather than `gsub`. Also, it might be safer to make the placeholder a non-printing character and make it a constant. For example, `PLACEHOLDER = 0.chr`. – Cary Swoveland Mar 09 '18 at 02:56
  • @CarySwoveland Am not sure how #squeeze can be used instead of #gsub in this case. Could you elaborate some more? – Muaad Oct 09 '18 at 20:03
  • Since all newline characters have been removed when you execute `text.gsub(/\s+/, ' ')`, that's the same as `text.squeeze(' ')` if spaces are the only remaining whitespace characters. That won't work, however, if, for example, the text contains tabs that you wish to remove. Incidentally, the question does not say that extra spaces are to be removed. – Cary Swoveland Oct 09 '18 at 20:26
  • OK. Makes sense now. Thanks. – Muaad Oct 10 '18 at 10:40
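Putting the placeholder idea and the suggestion of a non-printing constant together gives roughly the sketch below. The method name `chunk_keeping_newlines` is hypothetical, and the input is assumed never to contain the placeholder character. `"\u0001"` is used here rather than `0.chr` because `String#strip` is documented to treat null characters as whitespace and would remove them at chunk edges.

```ruby
# Assumption: this control character never occurs in the input text.
PLACEHOLDER = "\u0001"

# Hypothetical helper: split into chunks of at most max characters
# while keeping the original line breaks.
def chunk_keeping_newlines(text, max)
  text.gsub("\n", PLACEHOLDER)        # protect line breaks from the next step
      .gsub(/\s+/, ' ')               # collapse remaining whitespace runs
      .scan(/.{1,#{max}}(?: |$)/)     # up to max characters, then a space or the end
      .map { |chunk| chunk.strip.gsub(PLACEHOLDER, "\n") }
end

p chunk_keeping_newlines("Some really long string\nwith some line breaks", 20)
# prints ["Some really long", "string\nwith some", "line breaks"]
```

Each placeholder counts as one character toward the limit, and two words separated only by a line break are treated as a single unit when looking for a break point.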