I am trying to write code that loops through a list of strings and merges each element that starts with a lowercase letter into the previous element. For example, given this list:

test_list = ['Example','This is a sample','sentence','created to illustrate','the problem.','End of example']

I would like to end up with the following list:

test_list = ['Example','This is a sample sentence created to illustrate the problem.','End of example']

Here is the code I have tried (which doesn't work):

for i in range(len(test_list)):
    if test_list[i].islower():
        test_list[i-1:i] = [' '.join(test_list[i-1:i])]

I think the problem may be in how I am applying the join repeatedly inside the loop. Could someone recommend a way to solve this?

As background: I have many PDF documents of varying sizes converted to text, and I split each one into paragraphs with re.split('\n\s*\n',document) so that I can extract specific items. This works for most documents, but some of them have '\n\n' after every other word, or at random places that do not correspond to the end of a paragraph, so I am trying to merge those fragments into a more reasonable list of paragraphs. If anyone has a better idea for splitting raw extracted text into paragraphs, that would be welcome too. Thanks in advance for the help!

Tomerikoo
Mirasok
  • I've marked one of several duplicates it would take to cover this problem -- I think you have the others in hand. Make a new list; start a new element for each string with upper-case, appending the lower-case ones as you go. Solve that problem with a nested loop, if need be, but *solve* it before you try for the cascaded and concatenated one-liner. – Prune Mar 24 '20 at 21:25

1 Answer

You could build a new list instead of modifying test_list in place:

output = [test_list[0]]  # start with the first element
for b in test_list[1:]:
    if b[0].islower():
        # starts with a lowercase letter: join it onto the previous element
        output[-1] = f'{output[-1]} {b}'
    else:
        output.append(b)
output

Output:

['Example',
 'This is a sample sentence created to illustrate the problem.',
 'End of example']
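
If you want to apply this straight to the result of your re.split call, the same loop can be wrapped in a function. This is only a sketch: merge_fragments is a name I made up, the sample document string is invented for illustration, and skipping empty fragments is my assumption about what you want (without that check, b[0] raises IndexError on an empty string).

```python
import re


def merge_fragments(parts):
    """Join each fragment that starts with a lowercase letter onto the previous item."""
    merged = []
    for part in parts:
        part = part.strip()
        if not part:
            continue  # assumption: empty fragments carry no content, so drop them
        if merged and part[0].islower():
            # continuation of the previous fragment
            merged[-1] = f"{merged[-1]} {part}"
        else:
            # first fragment, or one that starts a new paragraph
            merged.append(part)
    return merged


test_list = ['Example', 'This is a sample', 'sentence', 'created to illustrate',
             'the problem.', 'End of example']
print(merge_fragments(test_list))

# Hypothetical extracted text with stray blank lines in the middle of a sentence
document = ("Example\n\nThis is a sample\n\nsentence\n \ncreated to illustrate"
            "\n\nthe problem.\n\nEnd of example")
paragraphs = merge_fragments(re.split(r'\n\s*\n', document))
print(paragraphs)
```

Keep in mind that this is a heuristic. A fragment that continues a sentence but happens to start with a capital letter or a digit (a name, for instance) will not be merged, so it is worth checking the result on a few of the badly split documents.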
kederrac