I am trying to break up a string that contains multiple lists, each with different formatting. What is the best way to do this?

string = "something here: 1) A i) great ii) awesome 2) B"

another_string = "But sometimes it is different (1) yep (2) not the same i. or this ii. another bullet (3.1) getting difficult huh? 3.1.1 okay i'm done"

Ideally, I want to be able to split on any possible numbering or bullet style.

Desired output for string:

something here: 1) A 
i) great 
ii) awesome 
2) B

Desired output for another_string:

But sometimes it is different (1) yep
(2) not the same
i. or this 
ii. another bullet
(3.1) getting difficult huh?
3.1.1 okay i'm done
echan00
  • What is your desired output? – Ajax1234 Sep 28 '18 at 03:27
  • @Ajax1234 just revised my question – echan00 Sep 28 '18 at 03:30
  • Ok sure, you could theoretically split with a regex on numbers. However, to make the code more general, how would we handle the fact that the text itself could contain numbers? For instance: `(3.1)` 2.4 meters – Anton vBR Sep 28 '18 at 03:38
  • @AntonvBR I suppose 2.4 will also get cut off as another substring of the string. Not sure there is another way around it. – echan00 Sep 28 '18 at 03:44
  • @echan00 Yeah, but the question here is: what is it you are trying to do? Do you want to validate the splits before output? You could possibly build a program that either splits or appends. – Anton vBR Sep 28 '18 at 03:45
  • @AntonvBR I am trying to parse phrases/sentences out of a large number of documents. I'm using NLTK to parse sentences, but I see that many sentences have run on and included the numbered lists. I'm hoping to break those sentences into multiple pieces. – echan00 Sep 28 '18 at 03:54
  • Ok, I googled quickly and think you can start by looking here: https://stackoverflow.com/questions/46331543/use-regex-to-split-numbered-list-array-into-numbered-list-multiline – Anton vBR Sep 28 '18 at 04:00

1 Answer

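You can use re.split with the following regex (the Roman numeral part is borrowed from paxdiablo) to split the input string. Because the whole pattern is wrapped in a capturing group, re.split keeps each separator in the result, so you can pair every separator with the text that follows it using an iterator and join the pairs with newlines: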

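import re

def split(s):
    # The capturing group makes re.split keep each list marker in the result.
    i = iter(re.split(r'(\(?\d+(?:\.\d+)+\)?|\(?\d+\)|\(?\b(?=M|(?:CM|CD|D?C)|(?:XC|XL|L?X)|(?:IX|IV|V?I))M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})[.)])', s, flags=re.IGNORECASE))
    # The first piece is the text before any marker; the rest alternate marker, text.
    return next(i) + '\n'.join(map(''.join, zip(i, i)))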

so that with your sample inputs:

split(string)

would return:

something here: 1) A 
i) great 
ii) awesome 
2) B

and:

split(another_string)

would return:

But sometimes it is different (1) yep 
(2) not the same 
i. or this 
ii. another bullet 
(3.1) getting difficult huh? 
3.1.1 okay i'm done
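
The one-line pattern is hard to read, so here is a sketch of the same three alternatives written out with re.VERBOSE. The names LIST_MARKER and split_items are hypothetical, not part of the answer, and this version returns a list of pieces rather than one joined string:

```python
import re

# Hypothetical names; the three alternatives mirror the one-line pattern above.
LIST_MARKER = re.compile(r"""
    (                              # capture the marker so re.split keeps it
        \(?\d+(?:\.\d+)+\)?        # dotted numbers: 3.1, (3.1), 3.1.1
      | \(?\d+\)                   # plain numbers with a closing paren: 1), (2)
      | \(?\b                      # Roman numerals: i), ii., (iv)
        (?=M|(?:CM|CD|D?C)|(?:XC|XL|L?X)|(?:IX|IV|V?I))   # need at least one numeral
        M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})
        [.)]                       # must end with a dot or a closing paren
    )
""", re.IGNORECASE | re.VERBOSE)


def split_items(text):
    parts = LIST_MARKER.split(text)
    head, rest = parts[0], parts[1:]
    # rest alternates marker, body, marker, body, ...
    items = [marker + body for marker, body in zip(rest[0::2], rest[1::2])]
    if not items:
        return [head]
    # Keep the lead-in text attached to the first marker, as split() does.
    items[0] = head + items[0]
    return items


print('\n'.join(split_items("something here: 1) A i) great ii) awesome 2) B")))
```

Note that the Roman numeral branch only checks the letters and the trailing punctuation, so an ordinary word made entirely of numeral letters and followed by a dot (for example "mix.") would also be treated as a marker.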
blhsing
  • Huh, big regex there, (makes me dizzy), :-), at least it still works:-) – U13-Forward Sep 28 '18 at 04:15
  • This is awesome, but I'm noticing a few other problems. Lists written as (a), (b), (c), (A), (B), (C) are getting cut off. Abbreviations like Mr. and Mrs. are also getting cut off, as are website URLs such as www.google.com and www.sfc.com.hk. – echan00 Sep 28 '18 at 05:11
  • @echan00 I see. I've updated my answer with a fix. Please try it again. – blhsing Sep 28 '18 at 05:38
  • I see a weird case where (i) wasn't split up and (ii) was split as '(' and 'ii)'. I also see a case where "(Cap." and "615)" was split up but not "(Cap. 486)". Any idea why? – echan00 Sep 28 '18 at 06:12
  • @blhsing you've been awesome regardless, will mark answer as solved – echan00 Sep 28 '18 at 06:13
  • You're welcome. Well you never mentioned that `(i)` and `(ii)` are valid separators. Updated my answer to account for them then. As for `486)`, it's a valid separator because it's in the same category as `1)`, unless you can define a clearer rule to exclude `486)` from being considered a separator. – blhsing Sep 28 '18 at 06:26
  • @blhsing what would you change to have the first bullet after the colon also split separately on its own? – echan00 Dec 03 '18 at 09:58
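
For the follow-up about putting the first bullet on its own line, one possible sketch (not blhsing's code) is to split on the whitespace that comes just before a marker, using a lookahead so the marker itself is not consumed. MARKER and split_before_markers are hypothetical names; MARKER is the answer's pattern with the outer group made non-capturing:

```python
import re

# Hypothetical helper. MARKER is the answer's pattern without the capturing group.
MARKER = (r'(?:\(?\d+(?:\.\d+)+\)?|\(?\d+\)|'
          r'\(?\b(?=M|(?:CM|CD|D?C)|(?:XC|XL|L?X)|(?:IX|IV|V?I))'
          r'M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})[.)])')


def split_before_markers(s):
    # Split on whitespace that is followed by a marker; the lookahead
    # leaves the marker at the start of the next piece.
    return re.split(r'\s+(?=' + MARKER + ')', s, flags=re.IGNORECASE)


print('\n'.join(split_before_markers("something here: 1) A i) great ii) awesome 2) B")))
```

With the first sample from the question this puts "something here:" on its own line, followed by "1) A", "i) great", "ii) awesome" and "2) B". It assumes every marker is preceded by whitespace, and it inherits the same false positives as the original pattern.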