
I want to be able to split the following string:

"This is a string with an embedded list.  1. My first list item.  2. My second item.  a. My first sub-item.  b. My second sub-item.  3. My last list item."

I would like to split it as:

"This is a string with an embedded list."
"1. My first list item."
"2. My second item."
"a. My first sub-item."
"b. My second sub-item."
"3. My last list item."

I cannot guarantee that each embedded list item will always be preceded by two spaces, but it will be preceded by at least one space unless it starts the string. I also cannot guarantee that the first word of a list item will be capitalized. Lastly, the numbers could go into the teens, so an entry may start with, say, "10. ". If there is no embedded list, I would like to get the original string back, with no splitting.

In terms of rules to identify an embedded list item, here are some of my thoughts:

  1. It is always preceded by whitespace (one or more spaces), or it starts the string.
  2. After the whitespace or start of string comes either 1 to 2 digits followed by a period, or a single letter followed by a period. The letter may or may not be capitalized.

While this is not an exhaustive set of conditions, I think it will find a good number of embedded lists.
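
As a rough sketch, the two rules above could be written as a verbose pattern like the one below. The name `ITEM_MARKER` and the sample string are made up for illustration only:

```python
import re

# Hypothetical name; this pattern only encodes the two rules listed above.
ITEM_MARKER = re.compile(
    r"""
    (?:^|\s+)            # rule 1: start of string, or one or more whitespace characters
    (\d{1,2}|[A-Za-z])   # rule 2: 1 to 2 digits, or a single letter of either case
    \.                   # ...followed by a period
    """,
    re.VERBOSE,
)

s = "Intro text. 1. first item.  10. tenth item. a. sub-item."

# findall returns the captured marker (the digits or the letter) for each hit
print(ITEM_MARKER.findall(s))
```

This only locates the markers; it does not split the string. It would also flag a sentence that ends in a single-letter word, which is one of the cases the rules do not cover.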

  • this may [help](https://stackoverflow.com/questions/4576077/how-can-i-split-a-text-into-sentences) – sahasrara62 May 27 '20 at 00:27
  • As a person, how do you identify the list items? Is it something like "any short (<2 character) string followed by a period"? That's assuming none of the list items will end in a two-letter word. Or is it perhaps single characters even? You need to think of a rule that's always true to identify a list item before thinking about turning it into a regex. – Grismar May 27 '20 at 00:28
  • can you `split()` it with some extra rules or do you NEED regex? – bherbruck May 27 '20 at 00:31
  • @ Tenacious B, I am open to using split() if that can solve this. – Bruce Walthers May 27 '20 at 00:38
  • @ Grismar, I have added some thoughts on how to identify an embedded list item. It will not be exhaustive, but will do a fair job finding what I am looking for. – Bruce Walthers May 27 '20 at 00:48

1 Answer

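You could split using this regex, which matches one or more whitespace characters and uses a lookahead to require that they are followed by either digits and a period or a single letter and a period. Because the lookahead consumes nothing, the list marker stays in the resulting piece: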

\s+(?=(?:\d+|[a-z])\.)

In Python (note the use of the `re.I` flag to match both upper and lower case letters):

import re

s = "This is a string with an embedded list.  1. My first list item.  2. My second item.  a. My first sub-item.  b. My second sub-item.  3. My last list item."

print(re.split(r'\s+(?=(?:\d+|[a-z])\.)', s, flags=re.I))

Output:

[
 'This is a string with an embedded list.',
 '1. My first list item.',
 '2. My second item.',
 'a. My first sub-item.',
 'b. My second sub-item.',
 '3. My last list item.'
]
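
The question also asks for markers of 1 to 2 digits and for a string with no list to come back unchanged. A minimal sketch along those lines, assuming a wrapper named `split_embedded_list` (a made-up name) and swapping `\d+` for `\d{1,2}`:

```python
import re

# Hypothetical helper; same lookahead idea as the regex above,
# but limited to 1-2 digit markers as described in the question.
_ITEM_BOUNDARY = re.compile(r"\s+(?=(?:\d{1,2}|[a-z])\.)", flags=re.IGNORECASE)


def split_embedded_list(text):
    """Return a list of pieces, or the original string if nothing was split."""
    parts = _ITEM_BOUNDARY.split(text)
    # A single part means no boundary matched, so hand back the input as-is.
    return parts if len(parts) > 1 else text


print(split_embedded_list("No list here."))
print(split_embedded_list("Intro.  1. One. 2. Two."))
```

Returning either a string or a list mirrors the request literally; always returning a list may be easier for callers to handle. A sentence ending in a single-letter word (for example "vitamin C.") would still be treated as a list marker by this pattern.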
Nick