
I'm trying to match all of the "words" that contain an intrusive asterisk, including one at the beginning or the end (but no other punctuation).

For example, I'm expecting seven matches below. Instead, I got two.

import re

text = "star *tar s*ar st*r sta* (*tar) (sta*) sta*."
p = re.compile(r"\b\w*\*+\w*\b")
p.findall(text)  # ['s*ar', 'st*r']
# Expected: ['*tar', 's*ar', 'st*r', 'sta*', '*tar', 'sta*', 'sta*']

I understand the reason: the asterisk is not a word character, so the \b meta-character does not treat it as part of the word. But even after reading all of Python's How-to, I still don't quite know how to get what I want.

bongbang

4 Answers


Thanks for editing in the expected output.

So, in addition to the excellent solution by @benvc, this one handles repeated asterisks: if a "word" in the text contains several *'s, the entire word is captured instead of being cut off at the second *.

# Acting on your original text string
>>> text = "star *tar s*ar st*r sta* (*tar) (sta*) sta*."
>>> re.findall(r'((?:[a-z\*]*(?:\*)(?:[a-z\*]*)))+', text)
['*tar', 's*ar', 'st*r', 'sta*', '*tar', 'sta*', 'sta*']



# Acting on a slightly MORE COMPLEX string and returning it accurately
>>> text = "*tar *tar* star s*a**r *st*r* sta* (*tar) st*r** (sta**) s*ta*."
>>> re.findall(r'((?:[a-z\*]*(?:\*)(?:[a-z\*]*)))+', text)
['*tar', '*tar*', 's*a**r', '*st*r*', 'sta*', '*tar', 'st*r**', 'sta**', 's*ta*']


Let me know if you'd like me to explain how this works, in case you need it for future reference.

FailSafe
  • Thank you. Your answer gave me ideas, even though it unfortunately will match "words" with only asterisks such as `*` and `****`, which, as @benvc intuits, is not desired. – bongbang Mar 13 '19 at 00:05
  • You brought up a great point about the possibility of non-consecutive asterisks in a word, though. I do want such a word captured. I'm going with your solution and dealing w/ asterisk-only cases outside regex. – bongbang Mar 13 '19 at 00:14
  • Wow. I hadn't thought to test `***`. Hmmm... getting rid of that will be quite the challenge, but at the very least you can likely use another regex to check whether there are any letters in the string and, if not, discard it. – FailSafe Mar 13 '19 at 01:03
  • Why do you need those three non-capturing groups? I don't see what they add to the pattern. `'([a-z\*]*\*[a-z\*]*)+'` seems to yield the same result. – bongbang Mar 13 '19 at 05:54
  • They aren't absolutely necessary. I use them due to force of habit to design for scalability. As long as `(?:\*)`, `\*`, or `[\*]+`, appears in the middle, it should work fine as you mention. – FailSafe Mar 13 '19 at 12:11
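
The approach discussed in these comments (match with a simplified pattern, then discard asterisk-only results outside the regex) might look roughly like the sketch below. The names `STARRED` and `starred_words` are made up for illustration, and the pattern is the simplified form from the comments with the outer group dropped.

```python
import re

# Simplified pattern: a run of lowercase letters/asterisks containing at least one '*'.
# (Hypothetical name; written as a raw string so '\*' is passed to re unchanged.)
STARRED = re.compile(r'[a-z*]*\*[a-z*]*')

def starred_words(text):
    """Return letter/asterisk runs that contain an asterisk and at least one letter."""
    candidates = STARRED.findall(text)
    # Post-filter: drop asterisk-only matches such as '*' or '****'.
    return [word for word in candidates if any(ch.isalpha() for ch in word)]

print(starred_words("star *tar s*a**r * **** (sta*) sta*."))
# ['*tar', 's*a**r', 'sta*', 'sta*']
```

Like the pattern in the answer, this only recognises lowercase ASCII letters; widening the character class would be needed for anything else.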

You don't need the word boundaries with re.findall since it will find all the matches in a string for your specified regex. You also need to ensure that the match includes at least one word character so you don't match a single asterisk. For example:

import re

text = 'star *tar s*ar st*r sta* (*tar) (sta*) sta*.'

matches = re.findall(r'\w+\*\w*|\w*\*\w+', text)
print(matches)
# ['*tar', 's*ar', 'st*r', 'sta*', '*tar', 'sta*', 'sta*']
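
To make the two alternatives easier to read, here is the same pattern spelled out with `re.VERBOSE`, as a small sketch. The sample string is invented for illustration and includes a lone `*` to show that it is not matched.

```python
import re

pattern = re.compile(r"""
    \w+ \* \w*    # at least one word character before the asterisk
  | \w* \* \w+    # or at least one word character after it
""", re.VERBOSE)

# A bare '*' satisfies neither alternative, so it is skipped.
print(pattern.findall("star * *tar sta* (s*ar)"))
# ['*tar', 'sta*', 's*ar']
```

Note that each alternative allows exactly one asterisk per match, so a word with several asterisks would be split into pieces.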
benvc
  • Thanks, but I do *not* want the parentheses, nor any other punctuation. I just want the "word". – bongbang Mar 12 '19 at 02:17
  • @bongbang - in that case, you are really close. You just don't need the word boundaries with `re.findall`. See edit. – benvc Mar 12 '19 at 02:19

Try using this regex:

(\w*\*+\w*)+

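First off, I suggest using an online tool like regexr.com to test your regexes.

Second, \b matches a word boundary, that is, the position between a word character and a non-word character (or the start or end of the string). An asterisk is not a word character, so \b cannot anchor a word that begins or ends with one. What you want is the word character \w. The regex shown above matches zero or more word characters, then one or more asterisks, then zero or more word characters, and the outer + lets that unit repeat so that it matches entire words instead of just fragments. Note that this outer quantifier cannot be the asterisk quantifier, because the pattern could then match the empty string. Finally, the expression is wrapped in a capturing group for later use.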

Python code:

import re

pattern = r"(\w*\*+\w*)+"
text = "star *tar s*ar st*r sta* (*tar) (sta*) sta*"
p = re.findall(pattern, text)
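
One thing worth checking with this pattern: when a pattern contains a capturing group, `re.findall` returns the group's text rather than the whole match, and for a repeated group that is only the last repetition. The sketch below uses an invented sample string with a multi-asterisk word to show the difference; the variable names are just for illustration.

```python
import re

text = "s*a**r st*r"

# Repeated capturing group: findall reports only the last repetition of the group.
grouped = re.findall(r"(\w*\*+\w*)+", text)
print(grouped)  # ['**r', 'st*r']

# Taking group(0) from finditer yields the whole match instead.
whole = [m.group(0) for m in re.finditer(r"(\w*\*+\w*)+", text)]
print(whole)    # ['s*a**r', 'st*r']

# A non-capturing group gives the same whole-match result with findall.
non_capturing = re.findall(r"(?:\w*\*+\w*)+", text)
print(non_capturing)  # ['s*a**r', 'st*r']
```

For the text in the question every word has a single asterisk, so the difference does not show up there.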

Edit: thanks to @benvc, I was able to update my expression to exclude ‘star’.

  • Note that this will match the word "star" with no asterisk since you are matching any of either a word character or asterisk but it does not require both of those to be present in the match. – benvc Mar 12 '19 at 02:33

You can try this one. It is even simpler.

import re

text = 'star *tar s*ar st*r sta* (*tar) (sta*) sta*.'

p = re.findall(r'[\w*]+', text)
print(p)

Output:

['star', '*tar', 's*ar', 'st*r', 'sta*', '*tar', 'sta*', 'sta*']
YusufUMS
  • Note that this will match the word "star" with no asterisk since you are matching any of either a word character or asterisk but it does not require both of those to be present in the match, but OP's expected output excludes the word "star". – benvc Mar 12 '19 at 03:39
  • Ah, I didn't read the question carefully. Thanks. So, you can change it to this: `(\w*\*+\w*)+` – YusufUMS Mar 12 '19 at 04:36