How to split a list of text up into sections based on the length of the elements

Question

Given the following:

# from : https://stackoverflow.com/company
s = """
Founded in 2008, Stack Overflow’s public platform is used by nearly everyone who codes comes to learn, share their knowledge, collaborate, and build their careers.

Our products and tools help developers and technologists in life and at work. These products include Stack Overflow for Teams, Stack Overflow Advertising, and Stack Overflow for Talent and Jobs.

Stack Overflow for Teams, our core SaaS collaboration product, is helping thousands of companies around the world as the transition to remote work, address business continuity challenges, and undergo digital transformation.

Whether it’s on Stack Overflow or within Stack Overflow for Teams, community is at the center of all that we do.
"""

list_to_split = s.split()

I'd like to split it into lists where each list has at most m characters.

Here's an attempt:

import itertools

# max list char count
m = 50

count = 0
indices = [0]
for i, el in enumerate(list_to_split):
    count += len(str(el))
    if count >= m:
        indices.append(i)
        count = 0

split_lists = [list_to_split[s[0] : s[1]] for s in zip(indices[:-1], indices[1:])]
# check
flat = list(itertools.chain.from_iterable(split_lists))
flat == list_to_split

This returns False.

The elements which aren't in flat but are in list_to_split are

['that', 'we', 'do.']

Ordering is very important for this (it should be in the same order as given) - and no data can be lost.

edit

Whoever flagged this as a duplicate of "Split string every nth character?" clearly didn't read both questions.

Does this answer your question? [Split string every nth character?](https://stackoverflow.com/questions/9475241/split-string-every-nth-character) — PacketLoss, Sep 28 '20 at 00:15
Also from @PacketLoss question: do you need the split to respect words e.g. if 50th character is the middle of a word, should the word go to the next element on the list? — Sia, Sep 28 '20 at 00:22
@Sia if a word breaks the limit then it should be put into the next list — baxx, Sep 28 '20 at 09:09

Paddy3118 · Answer 1 · 2020-09-28T09:32:27.810

This will work as long as m is at least as large as the longest word.

In [38]: def splitter(lst, m):
    ...:     tot, this = [], []
    ...:     for wrd in lst:
    ...:         if len(''.join(this)) + len(wrd) <= m:
    ...:             this.append(wrd)
    ...:         else:
    ...:             tot.append(this)
    ...:             this = [wrd]
    ...:     tot.append(this)
    ...:     return tot

In [39]: splitter(list_to_split, 40)
Out[39]: 
[['Founded', 'in', '2008,', 'Stack', 'Overflow’s', 'public'],
 ['platform', 'is', 'used', 'by', 'nearly', 'everyone', 'who', 'codes'],
 ['comes', 'to', 'learn,', 'share', 'their', 'knowledge,'],
 ['collaborate,', 'and', 'build', 'their', 'careers.', 'Our'],
 ['products', 'and', 'tools', 'help', 'developers', 'and'],
 ['technologists', 'in', 'life', 'and', 'at', 'work.', 'These'],
 ['products', 'include', 'Stack', 'Overflow', 'for', 'Teams,'],
 ['Stack', 'Overflow', 'Advertising,', 'and', 'Stack'],
 ['Overflow', 'for', 'Talent', 'and', 'Jobs.', 'Stack', 'Overflow'],
 ['for', 'Teams,', 'our', 'core', 'SaaS', 'collaboration'],
 ['product,', 'is', 'helping', 'thousands', 'of', 'companies'],
 ['around', 'the', 'world', 'as', 'the', 'transition', 'to', 'remote'],
 ['work,', 'address', 'business', 'continuity'],
 ['challenges,', 'and', 'undergo', 'digital'],
 ['transformation.', 'Whether', 'it’s', 'on', 'Stack'],
 ['Overflow', 'or', 'within', 'Stack', 'Overflow', 'for', 'Teams,'],
 ['community', 'is', 'at', 'the', 'center', 'of', 'all', 'that', 'we', 'do.']]

In [40]: len(list_to_split)
Out[40]: 107

In [41]: sum(len(x) for x in splitter(list_to_split, 40))
Out[41]: 107


I split on 40 here; 50 works too and the result is still correct, but the auto-formatted output gets rather long.
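To illustrate the caveat about m and the longest word, here is a minimal sketch of a variant that keeps a running character count instead of re-joining the current group for every word. The name splitter_running and the sample words are hypothetical, not part of the answer above. It assumes that a word longer than m should simply get a group of its own, and it differs from splitter in that it never emits an empty group (splitter returns [[]] for an empty input and a leading [] when the first word is longer than m).

```python
def splitter_running(lst, m):
    """Hypothetical variant of splitter that tracks the group length in a counter."""
    groups, current, current_len = [], [], 0
    for word in lst:
        # Start a new group only if the current one is non-empty and would overflow.
        if current and current_len + len(word) > m:
            groups.append(current)
            current, current_len = [], 0
        current.append(word)
        current_len += len(word)
    if current:  # flush the last group, skipping it when the input was empty
        groups.append(current)
    return groups


words = "a bb ccc dddddddd ee".split()
print(splitter_running(words, 5))
# [['a', 'bb'], ['ccc'], ['dddddddd'], ['ee']]
```

The 8-character word exceeds m = 5, so it ends up alone in a group that is longer than the limit; order is preserved and no word is dropped.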

Paddy3118 · Answer 2 · 2020-09-29T22:09:39.380

Sometimes I think of splitting as grouping. The following uses groupby, with a class whose instances act as the key function and hold the running length of the words in the current group.

from itertools import groupby
from pprint import pprint as pp

class Grouper():
    def __init__(self, m):
        self.m = m
        self.glen = 0           # group length so far
        self.group_num = 0
    
    def __call__(self, wrd):
        wlen = len(wrd)
        self.glen += wlen
        if self.glen > self.m:
            self.glen = wlen
            self.group_num += 1
        return self.group_num
            
def splitter2(lst, m):
    # group the words of lst (not the global list_to_split) by group number
    return [list(y) for x, y in groupby(lst, 
                                        Grouper(m))]


if __name__ == '__main__':
    
    s = """
    Founded in 2008, Stack Overflow’s public platform is used by nearly everyone who codes comes to learn, share their knowledge, collaborate, and build their careers.
    
    Our products and tools help developers and technologists in life and at work. These products include Stack Overflow for Teams, Stack Overflow Advertising, and Stack Overflow for Talent and Jobs.
    
    Stack Overflow for Teams, our core SaaS collaboration product, is helping thousands of companies around the world as the transition to remote work, address business continuity challenges, and undergo digital transformation.
    
    Whether it’s on Stack Overflow or within Stack Overflow for Teams, community is at the center of all that we do.
    """
    
    list_to_split = s.split()
    
    split = splitter2(list_to_split, 40)
    assert len(list_to_split) == sum(len(x) for x in split)
    pp(split)

Output:

[['Founded', 'in', '2008,', 'Stack', 'Overflow’s', 'public'],
 ['platform', 'is', 'used', 'by', 'nearly', 'everyone', 'who', 'codes'],
 ['comes', 'to', 'learn,', 'share', 'their', 'knowledge,'],
 ['collaborate,', 'and', 'build', 'their', 'careers.', 'Our'],
 ['products', 'and', 'tools', 'help', 'developers', 'and'],
 ['technologists', 'in', 'life', 'and', 'at', 'work.', 'These'],
 ['products', 'include', 'Stack', 'Overflow', 'for', 'Teams,'],
 ['Stack', 'Overflow', 'Advertising,', 'and', 'Stack'],
 ['Overflow', 'for', 'Talent', 'and', 'Jobs.', 'Stack', 'Overflow'],
 ['for', 'Teams,', 'our', 'core', 'SaaS', 'collaboration'],
 ['product,', 'is', 'helping', 'thousands', 'of', 'companies'],
 ['around', 'the', 'world', 'as', 'the', 'transition', 'to', 'remote'],
 ['work,', 'address', 'business', 'continuity'],
 ['challenges,', 'and', 'undergo', 'digital'],
 ['transformation.', 'Whether', 'it’s', 'on', 'Stack'],
 ['Overflow', 'or', 'within', 'Stack', 'Overflow', 'for', 'Teams,'],
 ['community', 'is', 'at', 'the', 'center', 'of', 'all', 'that', 'we', 'do.']]
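
The state does not have to live in a class; a closure can hold it as well. The following is a minimal sketch of that alternative, with make_key and splitter3 as hypothetical names. Like the Grouper version, it assumes groupby calls the key function exactly once per element, in order. The final two assertions check the requirements from the question: flattening the result gives back the original list, and no group exceeds m characters (given that no single word is longer than m).

```python
from itertools import chain, groupby


def make_key(m):
    """Return a stateful key function (hypothetical helper) for groupby."""
    group_len = 0  # characters in the current group so far
    group_num = 0  # label shared by all words of the current group

    def key(word):
        nonlocal group_len, group_num
        group_len += len(word)
        if group_len > m:
            # This word would overflow the group, so it starts the next one.
            group_len = len(word)
            group_num += 1
        return group_num

    return key


def splitter3(lst, m):
    return [list(group) for _, group in groupby(lst, make_key(m))]


words = "community is at the center of all that we do.".split()
split = splitter3(words, 10)
print(split)
# [['community'], ['is', 'at', 'the'], ['center', 'of'], ['all', 'that', 'we'], ['do.']]

assert list(chain.from_iterable(split)) == words        # order kept, nothing lost
assert all(sum(map(len, g)) <= 10 for g in split)       # every group fits in m
```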
