
I have a string s = '10000'. Using only Python's re.findall, I need to count how many times 0\d0 occurs in s. For example, for s = '10000' it should return 2.

Explanation: the first occurrence is the 000 at indices 1-3 (1[000]0), and the second is the overlapping 000 at indices 2-4 (10[000]).

I only need the number of occurrences; I'm not interested in the matched text itself.

I've tried the following regex statements:

re.findall(r'(0\d0)', s) #output: ['000']
re.findall(r'(0\d0)*', s) #output: ['', '', '000', '', '', '']

Finally, how can I generalize this regex to match any digit, followed by any digit, followed by the first digit again?

eyllanesc
Mohamed Mohsen
  • Capture in lookahead using `(?=(0\d0))`: `re.findall(r'(?=(0\d0))', s)` outputs `['000', '000']` – ctwheels Jan 02 '20 at 18:21
  • For the second one, I'm assuming you mean something like `(\d)\d\1`? – ctwheels Jan 02 '20 at 18:23
  • @ctwheels It works perfectly, you have saved my life, thanks a lot, post it as an answer to mark it as accepted answer :) could you give me some explanation for what is the ?= done to make it work – Mohamed Mohsen Jan 02 '20 at 21:40
  • I've converted my comments into an answer, hopefully I've made it more clear :) – ctwheels Jan 02 '20 at 21:57
  • If it doesn't matter what the match is, you can just use the bump-along effect like this: `res = re.findall(r"(?=(\d)\d\1)", targ)`. The length of the res list tells you how many were found. –  Jan 02 '20 at 22:21

1 Answer


How to get all possible occurrences?

The regex

As I mentioned in my comment, you can use the following pattern:

(?=(0\d0))

How it works:

  • (?=...) is a positive lookahead: it asserts that what follows matches, without consuming any characters. Because nothing is consumed, the pattern is tested at every position in the string; an ordinary match would instead resume after the characters it consumed.
  • (0\d0) is a capture group matching 0, then any digit, then 0
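To make the difference concrete, here is a minimal sketch comparing the consuming pattern with the lookahead version on the string from the question. The variable names `consuming` and `overlapping` are only illustrative.

```python
import re

s = '10000'

# Without the lookahead, each match consumes its three characters,
# so the scan resumes after them and the second '000' is skipped.
consuming = re.findall(r'0\d0', s)

# With the lookahead, each match is zero-width, so the scan advances
# one position at a time and every starting index gets tested.
overlapping = re.findall(r'(?=(0\d0))', s)

print(consuming)         # ['000']
print(overlapping)       # ['000', '000']
print(len(overlapping))  # 2
```

Taking `len()` of the second list gives the count asked for in the question.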

The code

Your code becomes:

re.findall(r'(?=(0\d0))', s)

The result is:

['000', '000']

The Python documentation for re.findall states the following:

If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.

This means that the list contains the contents of capture group 1 rather than the full match, as many would expect. Here the full match is always empty, because a lookahead consumes nothing.
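As a small illustration of that rule, the sketch below runs the same lookahead with zero, one, and two capture groups. The two-group pattern is made up purely to show the tuple case and is not part of the solution.

```python
import re

s = '10000'

# No group: findall returns the full matches, which are empty strings
# because a lookahead is zero-width.
no_group = re.findall(r'(?=0\d0)', s)

# One group: findall returns a list of that group's contents.
one_group = re.findall(r'(?=(0\d0))', s)

# Two groups: findall returns a list of tuples, one entry per group.
two_groups = re.findall(r'(?=(0)(\d0))', s)

print(no_group)    # ['', '']
print(one_group)   # ['000', '000']
print(two_groups)  # [('0', '00'), ('0', '00')]
```

All three lists have length 2, so the count is the same either way; only the contents differ.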


How to generalize the pattern?

The regex

You can use the following pattern:

(\d)\d\1

How this works:

  • (\d) captures any digit into capture group 1
  • \d matches any digit
  • \1 is a backreference that matches the same text as most recently matched by capture group 1
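A quick way to see the backreference at work is to try the bare pattern on a few inputs. This is only a sketch; the sample strings are chosen for illustration.

```python
import re

pattern = re.compile(r'(\d)\d\1')

# The third character must equal whatever group 1 captured.
samples = ['121', '000', '123']
matched = {text: pattern.search(text) is not None for text in samples}
print(matched)  # {'121': True, '000': True, '123': False}

# On its own (no lookahead) the pattern consumes characters, and because
# it has one capture group, findall returns only the captured digit.
print(pattern.findall('12131'))  # ['1']
```

The second result shows why the lookahead and an outer capture group are still needed to get every overlapping triple.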

The code

Your code becomes:

x = re.findall(r'(?=((\d)\d\2))', s)  # each item is a tuple: (full triple, first digit)
print([n[0] for n in x])  # keep only the full triple

Note: The code above has two capture groups, so the backreference must change to \2 to refer to the digit group. With two capture groups, re.findall returns tuples, as the documentation states, so a list comprehension extracts the first element of each tuple to get the expected result.

The result is:

['000', '000']
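If only the count matters, as in the question, the outer capture group can be dropped. The helper below is a sketch of that idea; the name `count_overlapping` is hypothetical, and it uses re.finditer rather than re.findall, so treat it as an alternative outside the question's findall-only constraint.

```python
import re

def count_overlapping(s):
    """Count start positions of: a digit, any digit, the first digit again."""
    # Each zero-width lookahead match marks one qualifying start position,
    # so counting the matches counts the overlapping occurrences.
    return sum(1 for _ in re.finditer(r'(?=(\d)\d\1)', s))

print(count_overlapping('10000'))  # 2
print(count_overlapping('12131'))  # 2
print(count_overlapping('12345'))  # 0
```

In '12131' the two hits are 121 at index 0 and 131 at index 2, which share the middle 1.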
ctwheels
  • Thanks, it's a really useful answer. I found that we can also do it without the \d, changing the pattern (\d)\d\1 to something like (.)(?=.\1) – Mohamed Mohsen Jan 03 '20 at 15:38
  • @MohamedMohsen yes, you can also do it that way, but `\d` ensures it's a digit whereas `.` matches any character except newline characters. – ctwheels Jan 03 '20 at 15:45