Splitting a string (case sensitive)

Question

I have been facing an issue with splitting a string.

Here are the cases:

test_str = "4years3months5days" -- output=4.3
test_str = "3 months 2 days" -- output=0.3
test_str = "4 years 2 months" -- output=4.2
test_str = "4 Years 3 Months" -- output=4.3
test_str = "4.6" -- output=4.6
test_str = "4Y3M" -- output=4.3

(Note that the letter case varies between inputs, so matching has to be case-insensitive.)

code:

import re

test_string = "3months5days"

print("length=",len(test_string))
# printing original string  
print("The original string : " + test_string) 

if type(test_string) is bool:
    print(0)
elif len(test_string) == 1 or len(test_string) == 3:
    test_string = float(test_string)
    print("converted=",test_string)

else:
    temp = re.findall(r'\d+', test_string) 
    res = list(map(int, temp)) 
    print(res)
    if len(res)==1:
        print(float(res[0]))
    else:
        print(str(res[0])+'.'+str(res[1]))

I can write code (or find it online) that handles each case individually, but not code that handles all of them together. Any help?

If you are looking to extract integers from a str, this post covers it pretty well: https://stackoverflow.com/questions/4289331/how-to-extract-numbers-from-a-string-in-python — lww, Dec 17 '20 at 14:41
But it is not working for second case. For example if I am expecting an output 0.3, how can I list it with the first case code — RSK Rao, Dec 17 '20 at 14:46
I see, but you didn't provide the code you are using, so it is impossible to say why it is not working. I think you should post the code as well — lww, Dec 17 '20 at 15:05
Is it possible to get less complete input? What if you get a single integer, like `'3'`? Should it assume that's a year? A day? — CrazyChucky, Dec 17 '20 at 16:37

Sagun Devkota · Answer 1 · 2020-12-17T16:15:14.860

0

import re

# General form of a match call (pattern and string are placeholders):
# result = re.match(pattern, string)

You can create a different pattern for each input format and match it with a regular expression.

regex = r"(\d+)years(\d+)months(\d+)days"
match = re.search(regex, "4years3months5days")
print(match)
if match is not None:
    print("Match at index %s, %s" % (match.start(), match.end()))
    # group(0) is the whole match; groups 1-3 are years, months and days.
    print(match.group(0), match.group(1), match.group(2), match.group(3))

If you are not familiar with regular expressions, see the Python documentation on regular expressions.
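The pattern above only matches inputs that contain all three units in lowercase with no spaces. As a rough sketch (not part of the original answer), the years and months parts can be made optional, with `\s*` allowing spaces and `re.IGNORECASE` handling capitalization. The names `YEARS_MONTHS` and `years_months` are hypothetical, and purely numeric input such as "4.6" is not handled here and would need a separate check.

```python
import re

# Hypothetical helper, not from the answer above. Both units are optional,
# \s* tolerates spaces, and re.IGNORECASE accepts "Years", "years", "Y" or "y".
YEARS_MONTHS = re.compile(
    r"\s*(?:(\d+)\s*y(?:ears?)?)?"
    r"\s*(?:(\d+)\s*m(?:onths?)?)?",
    re.IGNORECASE,
)

def years_months(text):
    # groups(default="0") substitutes "0" for a unit that did not match.
    years, months = YEARS_MONTHS.match(text).groups(default="0")
    return f"{years}.{months}"

for sample in ["4years3months5days", "3 months 2 days", "4 Years 3 Months", "4Y3M"]:
    print(sample, "->", years_months(sample))
```

Because every part of the pattern is optional, `match` always succeeds, so input with no recognizable unit comes back as "0.0" rather than raising an error.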

edited Dec 17 '20 at 16:15

answered Dec 17 '20 at 14:53

Sagun Devkota


I tried this....it works for first case but not with second case. temp = re.findall(r'\d+', test_string) res = list(map(int, temp)) If I extract into list, the second case gives [3,4] and the output is 3.4 which is incorrect. There I want [0,3,4]. Is it possible? How can I differentiate the result is 0 years 3 months and 4 days? – RSK Rao Dec 17 '20 at 14:57
@Tomerikoo I have edited my answer check it – Sagun Devkota Dec 17 '20 at 15:40
Better. But I'm not sure you understand what `[years]+` means. It will match 1 or more of any of the letters `years`. So for example `yysea` will match as well. You can use https://regex101.com/ to test out different patterns with many inputs – Tomerikoo Dec 17 '20 at 15:57
@Tomerikoo Yeah it will match with strings like years and yysea and months and monntthhs but for this question we may not get these cases as suggested by question. That's why I used this code. – Sagun Devkota Dec 17 '20 at 16:07
What I mean is that there is no need for a character group in that case... Just `(\d+)years` will do... – Tomerikoo Dec 17 '20 at 16:08
@Tomerikoo Thanks dude has my edit corrected the issue? – Sagun Devkota Dec 17 '20 at 16:16
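To illustrate the point made in the comments above about `[years]+`: a character class matches any run of the listed letters, while the literal `years` matches only that exact word. A small sketch, with variable names chosen just for this example:

```python
import re

# [years]+ means "one or more of the letters y, e, a, r, s, in any order".
char_class = re.compile(r"(\d+)[years]+")
# The literal version only accepts the exact word "years".
literal = re.compile(r"(\d+)years")

print(bool(char_class.match("4yysea")))  # True: every letter is in the class
print(bool(literal.match("4yysea")))     # False: "yysea" is not "years"
print(bool(literal.match("4years")))     # True
```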

score 0 · Answer 2 · answered Dec 17 '20 at 15:11

Basically, you need to find the substrings "years" and "months" and catch the substring that comes before (or between) them.

So here is a "naive" approach based on substring search and without regular expressions:

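test_str = "4 years 2 months"  # sample input
lowered = test_str.lower()

# Everything before "years" is the year count; default to 0 if absent.
years = lowered.find("years")
if years == -1:
    result = "0."
else:
    result = test_str[:years].strip() + "."

# The month count sits between "years" and "months" (or at the start).
months = lowered.find("months")
if months == -1:
    result += "0"
elif years != -1:
    result += test_str[years + len("years"):months].strip()
else:
    result += test_str[:months].strip()

print(result)  # 4.2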

answered Dec 17 '20 at 15:11

Gerd


This is working fine but what if I give 10Y3M or 10Y 3M? It throws an error...and if my string is "10" or "10.6" (just test_string = "10" or test_string="10.6") – RSK Rao Dec 17 '20 at 15:27
I just added this code for int and float test_string = float(test_string) print("test_string=",test_string) it is working..but didn't get for 10Y or 10Y3M... – RSK Rao Dec 17 '20 at 15:51

CrazyChucky · Answer 3 · 2020-12-21T16:28:05.730

Here's a flexible version that passes all your test cases, using regular expressions. First, I'll define and compile* the regular expressions:

import re

# This pattern checks for the numeric version of your input: one or more
# digits, followed by a period, and then one or more digits.
numeric_pattern = re.compile(r'\d+\.\d+$')

# This one looks for two optional groups: each is one or more digits
# followed by 'y' or 'years', or 'm' or 'months'. The capturing groups
# are named, so we can tell which is which even if we only find one.
word_pattern = re.compile(
    # Written on two lines for clarity, but Python automatically
    # combines string literals inside parentheses:
    r'(?:(?P<years>\d+)y(?:ears)?)?'
    r'(?:(?P<months>\d+)m(?:onths)?)?'
)

Then I define a function to check these patterns against a supplied string:

def get_year_month(string):
    # Rather than deal with spaces and capitalization in our regexes,
    # we can normalize the input string first.
    string = string.lower().replace(' ', '') 
    
    # Check for the simpler case first. If it's a match, return as-is.
    if numeric_pattern.match(string):
        return string
    
    # Otherwise, check for words. (This pattern will ALWAYS match,
    # because each half is an optional group.)
    match = word_pattern.match(string)

    # Whatever it doesn't find is set to 0.
    years = match.group('years') if match.group('years') else 0
    months = match.group('months') if match.group('months') else 0
    return f'{years}.{months}'

Looping over a list of your inputs and expected outputs is a simple way to verify if it's working. It doesn't throw an error, so we know all the tests pass.

tests = [
    ('4years3months5days', '4.3'),
    ('3 months 2 days', '0.3'),
    ('4 years 2 months', '4.2'),
    ('4 Years 3 Months', '4.3'),
    ('4.6', '4.6'),
    ('4Y3M', '4.3'),
]

for string, result in tests:
    assert get_year_month(string) == result
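The comments on the other answers also ask about bare numbers such as "10" and about input that matches nothing. Here is one possible sketch of how the same idea could be extended. The names `NUMERIC`, `WORDS` and `parse_duration` are hypothetical, and treating a bare integer as whole years is an assumption, since the question does not say what "10" should mean.

```python
import re

# Hypothetical extension of the approach above; all names are illustrative.
NUMERIC = re.compile(r'\d+(?:\.\d+)?')
WORDS = re.compile(
    r'(?:(?P<years>\d+)y(?:ears?)?)?'
    r'(?:(?P<months>\d+)m(?:onths?)?)?'
    r'(?:\d+d(?:ays?)?)?'
)

def parse_duration(text):
    cleaned = text.lower().replace(' ', '')
    if NUMERIC.fullmatch(cleaned):
        # Assumption: a bare integer such as "10" means whole years.
        return cleaned if '.' in cleaned else f'{cleaned}.0'
    match = WORDS.fullmatch(cleaned)
    # fullmatch rejects trailing junk; the empty string is rejected explicitly
    # because every group in WORDS is optional.
    if not cleaned or match is None:
        raise ValueError(f'unrecognized duration: {text!r}')
    years = match.group('years') or '0'
    months = match.group('months') or '0'
    return f'{years}.{months}'
```

Using `fullmatch` instead of `match` means a string like "soon" raises a `ValueError` instead of silently becoming "0.0". Whether that is preferable depends on how much you trust the input.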

*Even though Python caches regexes, I've found that, when your regexes will be reused multiple times, compiling them can still be dramatically faster, for some reason, even when the number of regexes isn't anywhere near maxing out the cache limit.

Regardless of performance, defining your regexes all in one place and giving them clear names can often make your code clearer and more readable.
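If you want to check the performance difference on your own machine, a rough `timeit` comparison along these lines is one way to do it. This is only a sketch: the pattern and sample string are borrowed from above, the function names are made up for this example, and the timings will vary with Python version and workload.

```python
import re
import timeit

PATTERN = r'(?:(?P<years>\d+)y(?:ears)?)?(?:(?P<months>\d+)m(?:onths)?)?'
COMPILED = re.compile(PATTERN)
SAMPLE = '4years3months5days'

def with_module_function():
    # re.match looks the pattern up in the module's internal cache on every call.
    return re.match(PATTERN, SAMPLE)

def with_compiled_pattern():
    # The precompiled object skips that lookup.
    return COMPILED.match(SAMPLE)

module_time = timeit.timeit(with_module_function, number=100_000)
compiled_time = timeit.timeit(with_compiled_pattern, number=100_000)
print(f're.match:        {module_time:.3f}s')
print(f'compiled.match:  {compiled_time:.3f}s')
```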

[Is it worth using Python's re.compile?](https://stackoverflow.com/questions/452104/is-it-worth-using-pythons-re-compile) — Tomerikoo, Dec 20 '20 at 23:05
@Tomerikoo I should probably get around to posting on that question myself or asking a followup, because in my testing, it's *quartered* my execution time, even for only 10–20 regular expressions. I don't understand how, because there's no way that should overflow the cache, but I can't argue with results. And like I said, too... sometimes just the act of defining things all in one place and naming them can make intent clearer, regardless of performance. — CrazyChucky, Dec 20 '20 at 23:09
P.S. I double-checked—the program in question has, in fact, 36 regexes. Still far below the usual cache limit of 100, though. — CrazyChucky, Dec 20 '20 at 23:16
I'm not trying to dispute what you say. I just put it here because it's related. I came across this post when I was considering myself to change all my regexes in a program to compiled ones and then decided not to... — Tomerikoo, Dec 21 '20 at 07:52
It might be worth trying! Maybe at some point I can do a deeper dive and figure out more about why it had that effect for me (and presumably doesn't in other circumstances). — CrazyChucky, Dec 21 '20 at 16:24
As described in the answer of the link, the regular `search` etc. functions already cache compiled patterns themselves. So I guess it's a function of that cache's size and the amount of different patterns you have in a program... — Tomerikoo, Dec 21 '20 at 16:26
I suppose, but like I said, the cache size is 100 and I only have 36 regexes, so there's no way it should be overflowing. Something weird is going on, I'm just not sure what. — CrazyChucky, Dec 21 '20 at 16:27

score 0 · Answer 4 · answered Dec 17 '20 at 17:22

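Use the third-party `regex` module (installable with `pip install regex`). Unlike the built-in `re`, it supports the variable-length lookbehind used in the first pattern: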

import regex

a_list = [
    '4years3months5days',
    '3 months 2 days',
    '4 years 2 months',
    '4 Years 3 Months',
    '4.6',
    '4Y3M',
    '10Y 3M',
    '10Y3M'
]

# Raw strings keep the backslashes intact without invalid-escape warnings.
patterns = [
    r'(?:(?<!years(?:\s*?)))(\d+?)(?:\s*?)(?:(?=months))',
    r'(\d+?)(?:\s*?)(?:years)(?:\s*?)(\d+?)(?:(?=(?:\s*?)months))',
    r'\b(\d+?)(?:\.)(\d+?)\b',
    r'(\d+?)(?:\s*?)Y(?:\s*?)(\d+?)(?:(?=m))'
]

# Pad every input to the length of the longest one so the output lines up.
width = max(len(s) for s in a_list)

for text in a_list:
    for pattern in patterns:
        m = regex.match(pattern, text, flags=regex.I)
        if m:
            groups = m.groups()
            # A single captured group means only months were found.
            value = '.'.join(groups) if len(groups) > 1 else '0.' + groups[0]
            print(f'{text:{width}} => {value}')

4years3months5days => 4.3
3 months 2 days    => 0.3
4 years 2 months   => 4.2
4 Years 3 Months   => 4.3
4.6                => 4.6
4Y3M               => 4.3
10Y 3M             => 10.3
10Y3M              => 10.3
