10

Say I have a string:

teststring =  "1.3 Hello how are you 1.4 I am fine, thanks 1.2 Hi There 1.5 Great!" 

That I would like as:

testlist = ["1.3 Hello how are you", "1.4 I am fine, thanks 1.2 Hi There", "1.5 Great!"]

Basically, splitting only on increasing numbers where the difference is .1 (e.g. 1.2 to 1.3).

Is there a way to split this with regex but only capturing increasing sequential numbers? I wrote code in Python that iterates through the numbers sequentially, using a custom re.compile() for each one. It works, but it is extremely unwieldy.

Something like this (where parts1_temp is a given list of the x.x numbers in the string):

import re

parts1_temp = ['1.3','1.4','1.2','1.5']
# Build the candidate sequence from the first number in the list
major, minor = parts1_temp[0].split('.')
parts_num = range(int(minor), int(minor)+30)
parts_search = ['.'.join([major, str(parts_num_el)]) for parts_num_el in parts_num]
#parts_search should be ['1.3','1.4','1.5',...,'1.32']

parts_fin = []
for k in range(len(parts_search)-1):
    # Match from one number up to (but not including) the next number in the sequence
    rxtemp = re.compile(r"(?:"+re.escape(parts_search[k])+r")([\s\S]*?)(?=(?:"+re.escape(parts_search[k+1])+r"))", re.MULTILINE)
    parts_fin.extend(match.group(0) for match in rxtemp.finditer(teststring))

But man is it ugly. Is there a way to do this more directly in regex? I imagine this is a feature that someone would have wanted at some point with regex, but I can't find any ideas on how to tackle this (and maybe it is not possible with pure regex).

sfortney
  • 2
    `But man is it ugly` yep... It'll be uglier with a single regex too! – ctwheels Feb 16 '18 at 22:04
  • Haha maybe. I'm very very far from a regex expert so I don't know – sfortney Feb 16 '18 at 22:11
  • What about the scenario "1.3 ..... 1.4 ..... 1.2 ...... 1.3....." Would you match the second instance of 1.3?. – Preston Martin Feb 16 '18 at 22:13
  • You can do that with Perl. I doubt you can do it with Python `re`/`regex` (at least "nicely"). – Wiktor Stribiżew Feb 16 '18 at 22:19
  • 2
    I suggest two steps: (1) overgenerate with RegEx, (2) postprocess to fix errors. Ie. you split before each x.x occurrence (with a RegEx containing `\d\.\d`), then check pairs of neighboring parts to reattach what was erroneously split apart. – lenz Feb 16 '18 at 22:32
  • PrestonM no I would not want it to match that – sfortney Feb 16 '18 at 22:55
  • Later, you may find this [interesting for study](https://stackoverflow.com/questions/39306590/match-list-of-incrementing-integers-using-regex). – revo Feb 16 '18 at 23:52
  • Just split on every `\d\.\d` number, then test if it's increasing and if not, push the concatenated string+number+next_string back on the parser. (Ultimately, why is this text formatted in this weird way, and what are you trying to achieve after you split it?) – smci Feb 17 '18 at 12:07
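
The two-step idea from the comments above (overgenerate by cutting before every x.x number, then reattach pieces that are out of sequence) might look roughly like the sketch below. The name `split_then_merge`, the use of `Decimal`, and the fixed step of 0.1 are assumptions made for illustration; text before the first number is dropped.

```python
import re
from decimal import Decimal

NUMBER = re.compile(r'\d+\.\d+')
STEP = Decimal('0.1')  # assumed increment, per "the difference is .1"

def split_then_merge(text):
    # Step 1: overgenerate by cutting before every x.x number.
    starts = [m.start() for m in NUMBER.finditer(text)]
    pieces = [text[i:j] for i, j in zip(starts, starts[1:] + [None])]

    # Step 2: reattach any piece whose number is not the last accepted number + STEP.
    merged = []
    last = None
    for piece in pieces:
        value = Decimal(NUMBER.match(piece).group())
        if last is None or value == last + STEP:
            merged.append(piece)
            last = value
        else:
            merged[-1] += piece
    return [p.strip() for p in merged]

teststring = "1.3 Hello how are you 1.4 I am fine, thanks 1.2 Hi There 1.5 Great!"
print(split_then_merge(teststring))
```

With this logic a repeated earlier number (the "1.3 ... 1.4 ... 1.2 ... 1.3" scenario raised above) is not treated as a split point, because only the last accepted number plus the step is accepted next.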

3 Answers

3

Doing this with a regex only seems overly complex. What about this processing:

import re

teststring =  "1.3 Hello how are you 1.4 I am fine, thanks 1.2 Hi There 1.5 Great!" 
res = []
expected = None
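for s in re.findall(r'\d+(?:\.\d+)?|\D+', teststring):
    if s[0].isdigit() and expected is None:
        # First number seen: it fixes the number of decimals and the increment.
        expected = s
        fmt = '{0:.' + str(max(0, len(s) - (s+'.').find('.') - 1)) + 'f}'
        inc = float(re.sub(r'\d', '0', s)[0:-1] + '1')
    if s == expected:
        # The next number in the sequence starts a new part.
        res.append(s)
        expected = fmt.format(float(s) + inc)
    elif expected:
        # Anything else (text or an out-of-sequence number) extends the current part.
        res[-1] = res[-1] + s

print(res)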

This also works if the numbers happen to have 2 decimals or more, or none.

trincot
  • Just FYI this doesn't seem to work going from say 8.9 to 8.10. I could be wrong but I think that is correct – sfortney Feb 19 '18 at 22:15
  • It would go from 8.9 to 9.0, which I assumed was what was expected (you wrote: *"where the difference is .1"*). If it has to be different the logic would have to change a bit. But anyway, I suppose you already have your answer, since you accepted one ;-) – trincot Feb 19 '18 at 22:17
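
If section-style numbering is wanted instead (8.9 followed by 8.10), one possible change is to compute the expected label by incrementing the last component as an integer rather than adding a float. A rough sketch; `next_section` is a hypothetical helper and not part of the answer above, and it assumes labels without zero padding:

```python
def next_section(label):
    """Return the label that follows `label`, treating the last component as an integer counter."""
    head, _, last = label.rpartition('.')
    bumped = str(int(last) + 1)
    # A label without a dot (e.g. '3') has an empty head.
    return f"{head}.{bumped}" if head else bumped

print(next_section('8.9'))  # 8.10
print(next_section('1.3'))  # 1.4
```

In the loop above, `expected = next_section(s)` could then stand in for the `fmt.format(float(s) + inc)` line.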
2

This method uses finditer to find all locations of \d+\.\d+, then tests whether the match was numerically greater than the previous. If the test is true it appends the index to the indices array.

The last line uses list comprehension as taken from this answer to split the string on those given indices.

Original Method

This method ensures the previous match is smaller than the current one. It doesn't work sequentially; instead, it works based on number size. So, assuming a string has the numbers 1.1, 1.2, 1.4, it would split on each occurrence since each number is larger than the last.


import re

indices = []
string =  "1.3 Hello how are you 1.4 I am fine, thanks 1.2 Hi There 1.5 Great!"
regex = re.compile(r"\d+\.\d+")
lastFloat = 0

for m in regex.finditer(string):
    x = float(m.group())
    if lastFloat < x:
        lastFloat = x
        indices.append(m.start(0))

print([string[i:j] for i,j in zip(indices, indices[1:]+[None])])

Outputs: ['1.3 Hello how are you ', '1.4 I am fine, thanks 1.2 Hi There ', '1.5 Great!']


Edit

Sequential Method

This method is very similar to the original; however, in the case of 1.1, 1.2, 1.4, it wouldn't split on 1.4, since 1.4 doesn't follow 1.2 given the .1 sequential separator.

The method below only differs in the if statement, so this logic is fairly customizable to whatever your needs may be.


import re

indices = []
string =  "1.3 Hello how are you 1.4 I am fine, thanks 1.2 Hi There 1.5 Great!"
regex = re.compile(r"\d+\.\d+")
lastFloat = 0

for m in regex.finditer(string):
    x = float(m.group())
    if (lastFloat == 0) or (x == round(lastFloat + .1, 1)):
        lastFloat = x
        indices.append(m.start(0))

print([string[i:j] for i,j in zip(indices, indices[1:]+[None])])
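
A note on the `round(lastFloat + .1, 1)` in the condition: binary floats cannot represent most decimal fractions exactly, so a plain `lastFloat + .1 == x` comparison can fail even when the numbers look sequential. The short check below illustrates this; the variable names are only for demonstration, and the one-decimal rounding assumes numbers of the form x.x.

```python
# Adding 0.1 to 1.3 in binary floating point does not land exactly on 1.4.
raw_equal = (1.3 + .1 == 1.4)
# Rounding the sum to one decimal place makes the comparison behave as intended.
rounded_equal = (round(1.3 + .1, 1) == 1.4)

print(raw_equal)      # False
print(rounded_equal)  # True
```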
ctwheels
  • Why the downvote? The output is what the OP expects. – ctwheels Feb 16 '18 at 22:35
  • @sfortney it's not splitting on every `x.x` it's splitting on `x.x` if `x.x` is larger than the last – ctwheels Feb 16 '18 at 22:44
  • My bad. Changed my vote sorry. Actually I think this might be a neat way to do it. I was hoping for a slightly more regex-y solution though. – sfortney Feb 16 '18 at 22:48
  • Oh also this doesn't only catch iterations by 1 which is something I asked for in the question – sfortney Feb 16 '18 at 22:50
  • 1
    @sfortney Oh I get it now! Give me a couple of minutes to fix it. Sorry, didn't see you want only increasing of 0.1 – ctwheels Feb 16 '18 at 22:51
  • 1
    @sfortney please see my edit. I've altered this to work sequentially. I've kept the original in case it helps future readers. – ctwheels Feb 16 '18 at 23:00
  • Yeah that's probably about the best you can do I think. I was really hoping there was a secret regex command I didn't know about but that doesn't seem to be the case. :) – sfortney Feb 16 '18 at 23:06
  • 1
    @sfortney unfortunately regex doesn’t know what’s sequential. It’s a hack to get sequences working in regex alone. See [my answer](https://stackoverflow.com/a/48589051/3600709) for matching sequential alphabetic letters in regex alone – ctwheels Feb 16 '18 at 23:08
  • Hey ctwheels one more quick question, how would this change if the string was "Section 1.3 Hello how are you Section 1.4 I am fine, thanks Section 1.2 Hi There Section 1.5 Great!" and the desired output, ["Section 1.3 Hello how are you", "Section 1.4 I am fine, thanks Section 1.2 Hi There", "Section 1.5 Great!"] – sfortney Feb 20 '18 at 16:16
  • 1
    @sfortney you would change the regex to `Section (\d+\.\d+)` and set `x = float(m.group(1))` instead of `m.group()` as seen [here](https://tio.run/##TZCxbsMgFEXn8BW3nkCxUNw0S6Subbp06uZ4QMmzTWoDwkRN@vPuo8mQBT3g3sMR4Zp679bzbMfgY0IkIaw72gNNeEXdiClF6zqeUVR6jR0Ng0fvf2Ai4erPqPQLPmBGtNZRidQb9z3x6TN2Fl89cazSG7xHMumpEJE6ujAukj74MdiBZCz2x@Ve81IoMZgpvQ3eJM6shGh9xAjr8N/TbXZLFOVNS23FIsPaXJCj7qI/B6mUWNgW8gHFLAVGyUueOeWOD9dL6KpEpTJu8Shw4f39N7QJgbg16imZmOSKXxGBJZKsbzK13Z4aZGFbnrLyrw3y3i5xH@pq2yzrT@@oUQ0T5vkP) – ctwheels Feb 20 '18 at 16:18
  • Never mind your solution does work. I was just putting it into a function and getting some weirdness from that. Thanks! – sfortney Feb 20 '18 at 16:46
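
Applying the change described in the comments above (a `Section (\d+\.\d+)` pattern and `m.group(1)` for the number) to the sequential method gives roughly the following. The `.strip()` call and the `parts` variable are additions for illustration, so that the pieces match the desired output without trailing spaces.

```python
import re

string = "Section 1.3 Hello how are you Section 1.4 I am fine, thanks Section 1.2 Hi There Section 1.5 Great!"
regex = re.compile(r"Section (\d+\.\d+)")
indices = []
lastFloat = 0

for m in regex.finditer(string):
    x = float(m.group(1))  # group(1) is the number without the "Section " prefix
    if (lastFloat == 0) or (x == round(lastFloat + .1, 1)):
        lastFloat = x
        indices.append(m.start(0))  # start(0) keeps "Section " in the output piece

# Slice between consecutive accepted positions; strip() is an added convenience.
parts = [string[i:j].strip() for i, j in zip(indices, indices[1:] + [None])]
print(parts)
```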
2

You can also mutate the string so that a marker is placed in front of each number that is part of the increasing sequence. Then you can split at that marker:

import re
teststring =  "1.3 Hello how are you 1.4 I am fine, thanks 1.2 Hi There 1.5 Great!" 
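numbers = re.findall(r'[.\d]+', teststring)
# Prefix '*' to every number that is not smaller than the one before it (compared as strings)
marked = [numbers[0]] + [numbers[i] if numbers[i] < numbers[i-1] else '*' + numbers[i] for i in range(1, len(numbers))]
# Put the marked numbers back into the text, then split on the ' *' marker
final_string = re.sub(r'[.\d]+', '{}', teststring).format(*marked).split(' *')
print(final_string)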

Output:

['1.3 Hello how are you', '1.4 I am fine, thanks 1.2 Hi There', '1.5 Great!']
Ajax1234