This is a really basic question, but for some reason I'm struggling to build the regex. I have a bunch of strings that start with X followed by a space, then a list of strings (each may contain multiple words) separated by commas, with a dot at the end.

Examples:

X abc, abd.
X abc, abd, abcd.
X abc abd, abc.
X asdas, asdasd, adsasda, asdasda.
X asdas asdasda, asdasdas asdasda, asdasdasas, asdasddas.

I'm trying to use the re module to get a list of all the strings between commas, so I get:

['abc', 'abd']
['abc', 'abd', 'abcd']
['abc abd', 'abc']
['asdas', 'asdasd', 'adsasda', 'asdasda']
['asdas asdasda', 'asdasdas asdasda', 'asdasdasas', 'asdasddas']

I tried:

match = re.search('X\s+((.*)\,)+(.*)\.', content.text)

But it does not seem to work.

Which regex could I use here?

Please note that the strings could contain numbers and special characters (like :;() and others).

vesii
  • You cannot do it with plain `re`-compatible regex. You either use a bit of code, or use PyPi regex library. See [RegEx with multiple groups?](https://stackoverflow.com/questions/4963691/regex-with-multiple-groups) for a generic approach. With PyPi regex, `(?:\G(?!\A)\W+|X\s+)(\w+)` [can be used](https://regex101.com/r/wu5rii/1), or `(?<=X\s.*)\w+` ([demo](https://regex101.com/r/wu5rii/2)) – Wiktor Stribiżew Sep 06 '21 at 17:14
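
The limitation the comment refers to can be seen in a small sketch. It assumes the items consist of word characters only, and the pattern is an illustrative variant of the one in the question, not a recommended solution:

```python
import re

line = "X abc, abd, abcd."

# A capture group inside a repetition only remembers its final repetition,
# so earlier items are overwritten instead of being collected.
m = re.search(r"X\s+(?:(\w+),\s*)+(\w+)\.", line)
groups = m.groups()
print(groups)  # ('abd', 'abcd'): 'abc' is lost
```

This is why a single `re.search` cannot return a variable number of items, and why the answers combine a match with a split or use `re.findall`.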

4 Answers

This is a way to achieve what you want using only regex:

import re

lst = ['X abc, abd.',
       'X abc, abd, abcd.',
       'X abc abd, abc.',
       'X asdas, asdasd, adsasda, asdasda.',
       'X asdas asdasda, asdasdas asdasda, asdasdasas, asdasddas.']

# raw strings avoid invalid-escape warnings for \s and \.
result = [re.split(", ", re.search(r"X\s(.*)\.", i).group(1)) for i in lst]
print(result)
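
A slightly more defensive variant of the same idea, as a sketch only: `parse_line` is a hypothetical helper name, and it assumes each line consists of X, whitespace, the items, and a final dot. It returns `None` for lines that do not fit instead of raising an error on `.group(1)`:

```python
import re

def parse_line(line):
    """Return the comma-separated items of one 'X ... .' line, or None."""
    # fullmatch checks the whole line: X, whitespace, body, final dot.
    m = re.fullmatch(r"X\s+(.*)\.", line.strip())
    if m is None:
        return None
    # Split on commas and trim the whitespace around each item.
    return [item.strip() for item in m.group(1).split(",")]

print(parse_line("X abc abd, abc."))   # ['abc abd', 'abc']
print(parse_line("X a:b (c), d;e."))   # ['a:b (c)', 'd;e']
print(parse_line("Y abc."))            # None
```

Because the body is captured with `.*`, special characters inside the items are kept as they are.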

This method uses regex only for the split:

import re

lst = ['X abc, abd.',
       'X abc, abd, abcd.',
       'X abc abd, abc.',
       'X asdas, asdasd, adsasda, asdasda.',
       'X asdas asdasda, asdasdas asdasda, asdasdasas, asdasddas.']

[[j.strip() for j in re.split(",", i.strip("X."))] for i in lst]

Ananay Mital

Seems like you could achieve that easily without a regex:

string = 'X asdas asdasda, asdasdas asdasda, asdasdasas, asdasddas.'

result = string.lstrip('X ').rstrip('.').split(', ')

should do what you want. Result:

['asdas asdasda', 'asdasdas asdasda', 'asdasdasas', 'asdasddas']

You can also shorten this to

result = string.strip('X .').split(', ')

but this will remove the given characters from both ends of the string.

If you have your whole text in one multi-line string, you can still do it in one line with a list comprehension:

text = '''X abc, abd.
X abc, abd, abcd.
X abc abd, abc.
X asdas, asdasd, adsasda, asdasda.
X asdas asdasda, asdasdas asdasda, asdasdasas, asdasddas.'''

result = [t.strip('X .').split(', ') for t in text.splitlines()]

Result:

[['abc', 'abd'], 
 ['abc', 'abd', 'abcd'], 
 ['abc abd', 'abc'], 
 ['asdas', 'asdasd', 'adsasda', 'asdasda'], 
 ['asdas asdasda', 'asdasdas asdasda', 'asdasdasas', 'asdasddas']
]

Please note: This only works if the first item does not start, and the last item does not end, with any of the characters being stripped (X, space, or .). This is because strip doesn't mean "remove this substring from the ends of the string", but instead "remove any characters from the given set of characters from the ends of the string".
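
A small sketch of that pitfall, using a made-up input line whose first item happens to start with X:

```python
line = "X Xavier, Alex."

# strip("X .") removes ANY of 'X', ' ', '.' from both ends,
# so it also eats the X that belongs to the first item.
stripped = line.strip("X .").split(", ")
print(stripped)  # ['avier', 'Alex']

# Slicing off exactly two leading characters and one trailing character
# keeps the item intact.
sliced = line[2:-1].split(", ")
print(sliced)    # ['Xavier', 'Alex']
```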

If the prefix looked like this, for example,

line = 'asdasX asdas asdasda, asdasdas asdasda, asdasdasas, asdasddas.'

the above approach would not work, because strip would also remove the a, s and d characters that belong to the items themselves.

Instead, you could trim the string based on the length of your pattern, optionally after verifying that it does in fact start and end with the patterns you are looking for:

# 7 == len('asdasX '), i.e. the prefix plus the following space
if line.startswith('asdasX ') and line.endswith('.'):
    result = line[7:-1].split(', ')

Result:

['asdas asdasda', 'asdasdas asdasda', 'asdasdasas', 'asdasddas']

Or again, as a list comprehension:

text = '''asdasX abc, abd.
asdasX abc, abd, abcd.
asdasX abc abd, abc.
asdasX asdas, asdasd, adsasda, asdasda.
asdasX asdas asdasda, asdasdas asdasda, asdasdasas, asdasddas.'''

result = [t[7:-1].split(', ') for t in text.splitlines() if t.startswith('asdasX') and t.endswith('.')]

Result:

[['abc', 'abd'], 
 ['abc', 'abd', 'abcd'], 
 ['abc abd', 'abc'], 
 ['asdas', 'asdasd', 'adsasda', 'asdasda'], 
 ['asdas asdasda', 'asdasdas asdasda', 'asdasdasas', 'asdasddas']
]

On Python 3.9 and newer you can use the removeprefix and removesuffix methods to remove an entire substring from either side of the string:

# include the space in the prefix, otherwise the first item keeps a leading space
result = [t.removeprefix('asdasX ').removesuffix('.').split(', ') for t in text.splitlines()]

See this SO post.
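
For Python versions before 3.9, the same behaviour can be approximated by hand. This is a minimal sketch; `strip_affixes` is a hypothetical helper name, not a standard library function:

```python
def strip_affixes(line, prefix, suffix):
    """Remove an exact prefix and an exact suffix, each only if present."""
    if prefix and line.startswith(prefix):
        line = line[len(prefix):]
    if suffix and line.endswith(suffix):
        line = line[:-len(suffix)]
    return line

line = 'asdasX asdas asdasda, asdasdas asdasda, asdasdasas, asdasddas.'
result = strip_affixes(line, 'asdasX ', '.').split(', ')
print(result)
# ['asdas asdasda', 'asdasdas asdasda', 'asdasdasas', 'asdasddas']
```

Unlike strip, this removes whole substrings, so items that begin with the same letters as the prefix are left untouched.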

buddemat

Assuming that we can phrase the problem as finding every run of one or more space-separated words, we can try using re.findall. The lookbehind (?<=.) requires a preceding character, which keeps the leading X from being matched:

import re

inp = ["X abc, abd.", "X abc, abd, abcd.", "X abc abd, abc.", "X asdas, asdasd, adsasda, asdasda.", "X asdas asdasda, asdasdas asdasda, asdasdasas, asdasddas."]
for i in inp:
    matches = re.findall(r'(?<=.)\w+(?: \w+)*', i)
    print(matches)

This prints:

['abc', 'abd']
['abc', 'abd', 'abcd']
['abc abd', 'abc']
['asdas', 'asdasd', 'adsasda', 'asdasda']
['asdas asdasda', 'asdasdas asdasda', 'asdasdasas', 'asdasddas']
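
Since the question mentions that items may contain special characters, note that a \w-based pattern breaks such items apart. A possible variant, sketched under the assumption that items never contain commas (`items` is a hypothetical helper name):

```python
import re

def items(line):
    # Drop the leading "X " and the final ".", then take every run of
    # non-comma characters and trim the surrounding whitespace.
    body = re.sub(r"^X\s+|\.$", "", line)
    return [m.strip() for m in re.findall(r"[^,]+", body)]

sample = "X a:b (c), d;e."
word_based = re.findall(r'(?<=.)\w+(?: \w+)*', sample)
print(word_based)     # ['a', 'b', 'c', 'd', 'e']
print(items(sample))  # ['a:b (c)', 'd;e']
```
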
Tim Biegeleisen

The easiest way to do this is not with regex, but with a simple Python script:

strings = ["X abc, abd.", "X abc, abd, abcd.", "X abc abd, abc.", "X asdas, asdasd, adsasda, asdasda.", "X asdas asdasda, asdasdas asdasda, asdasdasas, asdasddas."]

def split_words(list_of_strings):
    words_per_string = []

    for s in list_of_strings:
        # remove the leading "X " and the trailing dot
        s = s[2:].rstrip(".")
        # split on commas and trim the whitespace around each item
        words_per_string.append([words.strip() for words in s.split(",")])

    return words_per_string

print(split_words(strings))
matwasilewski