101

I have been looking for an elegant and efficient way to chunk a string into substrings of a given length in Ruby.

So far, the best I could come up with is this:

def chunk(string, size)
  (0..(string.length-1)/size).map{|i|string[i*size,size]}
end

>> chunk("abcdef",3)
=> ["abc", "def"]
>> chunk("abcde",3)
=> ["abc", "de"]
>> chunk("abc",3)
=> ["abc"]
>> chunk("ab",3)
=> ["ab"]
>> chunk("",3)
=> []

You might want chunk("", n) to return [""] instead of []. If so, just add this as the first line of the method:

return [""] if string.empty?

Would you recommend any better solution?

Edit

Thanks to Jeremy Ruten for this elegant and efficient solution: [edit: NOT efficient!]

def chunk(string, size)
    string.scan(/.{1,#{size}}/)
end

Edit

The string.scan solution takes about 60 seconds to chop 512k into 1k chunks 10000 times, compared with the original slice-based solution which only takes 2.4 seconds.
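
A comparison like the one above could be reproduced with Ruby's standard Benchmark module. This is a minimal sketch, not the original test: the method names chunk_slice and chunk_scan, the input data, and the iteration count are assumptions chosen for illustration, and timings will vary with Ruby version and machine.

```ruby
require 'benchmark'

# Slice-based version from the question.
def chunk_slice(string, size)
  (0..(string.length - 1) / size).map { |i| string[i * size, size] }
end

# Regex-based version from the first edit.
def chunk_scan(string, size)
  string.scan(/.{1,#{size}}/)
end

data = 'x' * (512 * 1024)   # assumed input: 512 * 1024 ASCII characters
runs = 20                   # far fewer than 10000, to keep the sketch quick

Benchmark.bm(6) do |bm|
  bm.report('slice') { runs.times { chunk_slice(data, 1024) } }
  bm.report('scan')  { runs.times { chunk_scan(data, 1024) } }
end
```

Both methods should return the same chunks for newline-free input, so any difference in the report is purely about speed.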

android.weasel
MiniQuark
  • 1
    Your original solution is about as efficient and elegant as possible: there's no need to inspect each character of the string to know where to chop it, nor any need to turn the whole thing into an array and then back again. – android.weasel Jun 11 '19 at 10:23

10 Answers

176

Use String#scan:

>> 'abcdefghijklmnopqrstuvwxyz'.scan(/.{4}/)
=> ["abcd", "efgh", "ijkl", "mnop", "qrst", "uvwx"]
>> 'abcdefghijklmnopqrstuvwxyz'.scan(/.{1,4}/)
=> ["abcd", "efgh", "ijkl", "mnop", "qrst", "uvwx", "yz"]
>> 'abcdefghijklmnopqrstuvwxyz'.scan(/.{1,3}/)
=> ["abc", "def", "ghi", "jkl", "mno", "pqr", "stu", "vwx", "yz"]
Paige Ruten
24

Here is another way to do it:

"abcdefghijklmnopqrstuvwxyz".chars.to_a.each_slice(3).to_a.map {|s| s.join }

Or,

"abcdefghijklmnopqrstuvwxyz".chars.each_slice(3).map(&:join)

Either way, the result is:

=> ["abc", "def", "ghi", "jkl", "mno", "pqr", "stu", "vwx", "yz"]
dawg
Jason
  • 25
    Alternatively: `"abcdefghijklmnopqrstuvwxyz".chars.each_slice(3).map(&:join)` – Finbarr Nov 17 '12 at 00:52
  • 3
    I like this one because it works on strings that contain newlines. – Steve Davis Aug 16 '13 at 15:12
  • 1
    This should be the accepted solution. Using scan might drop last token if length won't match _pattern_. – count0 Oct 26 '16 at 20:56
  • Finbarr's alternative returned the output in this answer for me (one array with 9 string objects, max length 3). The code in the answer itself is returning 8 arrays of 3 letters each and a final one with two: `["y", "z"]`. I'm on Ruby 3.0.1, fwiw. – Tyler James Young Dec 14 '21 at 04:37
6

I think this is the most efficient solution if you know your string's length is a multiple of the chunk size:

def chunk(string, size)
    (string.length / size).times.collect { |i| string[i * size, size] }
end

and to split into a given number of parts:

def parts(string, count)
    size = string.length / count
    count.times.collect { |i| string[i * size, size] }
end
davispuh
  • 4
    Your string doesn't have to be a multiple of chunk size if you replace `string.length / size` with `(string.length + size - 1) / size` -- this pattern is common in C code that has to deal with integer truncation. – nitrogen Aug 19 '15 at 02:25
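
The rounding-up trick from the comment above can be sketched as follows. The name chunk_ceil is hypothetical, chosen only to distinguish it from the chunk method in the answer.

```ruby
# Slice-based chunking that rounds the chunk count up, so a shorter
# final chunk is kept. chunk_ceil is a hypothetical name for this sketch.
def chunk_ceil(string, size)
  count = (string.length + size - 1) / size   # integer ceiling of length / size
  count.times.collect { |i| string[i * size, size] }
end

p chunk_ceil("abcdef", 3)  # => ["abc", "def"]
p chunk_ceil("abcde", 3)   # => ["abc", "de"]
p chunk_ceil("", 3)        # => []
```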
6

I made a little test that chops about 593MB of data into 18991 32KB pieces. Your slice+map version ran for at least 15 minutes using 100% CPU before I pressed ctrl+C. This version using String#unpack finished in 3.6 seconds:

def chunk(string, size)
  string.unpack("a#{size}" * (string.size/size.to_f).ceil)
end
Per Wigren
  • How would you recommend handling UTF8 strings? (the "a" specifier in unpack doesn't seem to work very well with UTF8) – user1070300 Feb 04 '22 at 18:15
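
On the UTF-8 question: the "a" directive in unpack counts bytes, so it can split a multibyte character across two chunks. One character-based alternative is sketched below. The name chunk_chars is hypothetical, and this approach has not been benchmarked here; it is likely slower than unpack on large inputs.

```ruby
# Chunk by characters rather than bytes, so multibyte characters stay whole.
# chunk_chars is a hypothetical name for this sketch.
def chunk_chars(string, size)
  string.each_char.each_slice(size).map(&:join)
end

text = "h\u00e9llo w\u00f6rld"   # 11 characters, 13 bytes in UTF-8

# Character-based chunks: lengths are counted in characters.
p chunk_chars(text, 3).map(&:length)   # => [3, 3, 3, 2]

# Byte-based chunks from unpack: sizes are counted in bytes.
byte_chunks = text.unpack("a3" * (text.bytesize / 3.0).ceil)
p byte_chunks.map(&:bytesize)          # => [3, 3, 3, 3, 1]
```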
4

Here is another solution for a slightly different case: processing large strings when there is no need to store all the chunks at once. It holds a single chunk at a time and performs much faster than slicing strings:

io = StringIO.new(string)
until io.eof?
  chunk = io.read(chunk_size)
  do_something(chunk)
end
prcu
  • For very large strings, this is _by far_ the **best way to do it**. This will avoid reading the entire string into memory and getting `Errno::EINVAL` errors like `Invalid argument @ io_fread` and `Invalid argument @ io_write`. – Joshua Pinter Oct 25 '20 at 17:42
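
The StringIO loop above can be wrapped in a small helper that yields each chunk to a block. This is a sketch under assumptions: each_chunk is a hypothetical name, and since io.read(n) counts bytes, a multibyte character can be split across two chunks.

```ruby
require 'stringio'

# Yields successive chunks without building an array of all of them.
# each_chunk is a hypothetical helper name. io.read(n) counts bytes,
# so multibyte characters may be split across chunks.
def each_chunk(string, chunk_size)
  io = StringIO.new(string)
  yield io.read(chunk_size) until io.eof?
end

each_chunk("abcdefgh", 3) { |chunk| puts chunk }   # prints abc, def, gh on separate lines
```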
1

A better solution which takes into account the last part of the string which could be less than the chunk size:

def chunk(inStr, sz)
  return [inStr] if inStr.length < sz
  m = inStr.length % sz # length of the last, shorter part (0 if there is none)
  partial = (inStr.length / sz).times.collect { |i| inStr[i * sz, sz] }
  partial << inStr[-m..-1] if m != 0 # add the last part
  partial
end
3limin4t0r
kirkytullins
1
test.split(/(...)/).reject {|v| v.empty?}

The reject is necessary because split otherwise includes the empty strings between the captured groups. My regex-fu isn't quite up to seeing how to fix that right off the top of my head.

Chuck
  • The scan approach will drop unmatched characters: if you slice a 10-character string into chunks of 3, you get 3 chunks and 1 character is dropped. Your approach doesn't do that, so it's better. – vinicius gati Jan 24 '14 at 18:52
0

Just text.scan(/.{1,4}/m) resolves the problem
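
The m flag matters when the input contains newlines: without it, . does not match "\n", so newlines are silently dropped and chunks break at line ends. A small illustration, using a made-up sample string:

```ruby
text = "ab\ncdefg"

# Without m, the newline is skipped and never appears in any chunk.
p text.scan(/.{1,4}/)    # => ["ab", "cdef", "g"]

# With m, . matches the newline too, so every character is kept.
p text.scan(/.{1,4}/m)   # => ["ab\nc", "defg"]
```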

Vyacheslav
0

I personally followed the idea of user8556428, to avoid the costly intermediate values that most proposals introduce, and to avoid modifying the input string. And I want to be able to use it as a generator (for instance to use s.each_slice.with_index).

My use case is really about bytes, not characters. For character-based chunks, strscan is a great solution.

class String
    # Slices of fixed byte-length.  May cut multi-byte characters.
    def each_slice(n = 1000, &block)
        return if self.empty?

        if block_given?
            # bytesize/byteslice count bytes; length/slice would count characters.
            last = (self.bytesize - 1) / n
            (0 .. last).each do |i|
                yield self.byteslice(i * n, n)
            end
        else
            enum_for(__method__, n)
        end
    end
end


p "abcdef".each_slice(3).to_a # => ["abc", "def"]   
p "abcde".each_slice(3).to_a  # => ["abc", "de"]    
p "abc".each_slice(3).to_a    # => ["abc"]          
p "ab".each_slice(3).to_a     # => ["ab"]           
p "".each_slice(3).to_a       # => []               
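
For the character-based case, the standard library's StringScanner (from strscan) can walk the string without modifying it. A minimal sketch, assuming a hypothetical helper named each_char_slice and valid, non-binary input:

```ruby
require 'strscan'

# Yields character-based slices of at most n characters each.
# each_char_slice is a hypothetical name for this sketch.
def each_char_slice(string, n)
  scanner = StringScanner.new(string)
  # The m flag lets . match newlines, so every character lands in a slice.
  pattern = /.{1,#{n}}/m
  yield scanner.scan(pattern) until scanner.eos?
end

slices = []
each_char_slice("abcde", 3) { |s| slices << s }
p slices   # => ["abc", "de"]
```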
akim
0

Are there some other constraints you have in mind? Otherwise I'd be awfully tempted to do something simple like

[0..10].each {
   str[(i*w),w]
}
Charlie Martin
  • I don't really have any constraint, apart from having something simple, elegant and efficient. I like your idea, but would you mind translating it into a method please? The [0..10] would probably become slightly more complex. – MiniQuark Apr 16 '09 at 01:37
  • I fixed my example to use str[i*w,w] instead of str[i*w...(i+1)*w]. Tx – MiniQuark Apr 16 '09 at 01:44
  • This should be (1..10).collect rather than [0..10].each. [1..10] is an array consisting of one element -- a range. (1..10) is the range itself. And +each+ returns the original collection that it's called on ([1..10] in this case) rather than the values returned by the block. We want +map+ here. – Chuck Apr 16 '09 at 05:25
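
Putting the comments together, the idea might be written as a method like the one below. It is a sketch rather than Charlie Martin's code: the name chunk_by_index and the rounded-up chunk count that replaces the fixed 0..10 range are assumptions added to make it runnable.

```ruby
# chunk_by_index is a hypothetical name; str and w follow the answer's naming.
def chunk_by_index(str, w)
  count = (str.length + w - 1) / w        # number of chunks, rounded up
  (0...count).map { |i| str[i * w, w] }   # a real range plus map, as the last comment suggests
end

p chunk_by_index("abcdefgh", 3)  # => ["abc", "def", "gh"]
```

Using map instead of each is what makes the method return the chunks rather than the range it iterated over.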