3

I'm learning about regex. If I want to find all the 5 letter words in a string, I could use:

import re
text = 'The quick brown fox jumps over the lazy dog.'
print(re.findall(r"\b[a-zA-Z]{5}\b", text))

But I want to write a simple function that takes the string and the length of the words to find as arguments. I tried this:

import re
def findwords(text, n):
    return re.findall(r"\b[a-zA-Z]{n}\b", text)

print(findwords('The quick brown fox jumps over the lazy dog.', 5))

But this returns an empty list. The n is not being recognized.

How can I specify an argument with the number of repetitions (or in this case, the length of the word)?

CAustin
cDub
  • Something like `r"\b[a-zA-Z]{" + n + r"}\b"` ? –  Mar 20 '18 at 20:45
  • Possible duplicate of [How do I put a variable inside a String in Python?](https://stackoverflow.com/questions/2960772/how-do-i-put-a-variable-inside-a-string-in-python) – Aran-Fey Mar 20 '18 at 21:03

3 Answers

5

Python does not substitute the value of n into a string literal on its own. You have to do that explicitly, either with format:

r"\b[a-zA-Z]{{{}}}\b".format(n)

or, if you are running Python >= 3.6, use the new f-strings (which can be combined with the r prefix denoting a raw string):

fr"\b[a-zA-Z]{{{n}}}\b"

In both cases the doubled outer braces {{ and }} produce the literal { and } that the regex quantifier needs, and the innermost pair is the format placeholder.

If you want to avoid escaping the literal {}, you can use the older %-formatting to achieve the same thing. For this, n must always be an integer (which it is here):

r"\b[a-zA-Z]{%i}\b" % n
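As a minimal sketch, the three formatting styles above can be compared with each other and then used in the function from the question. The helper name build_pattern is only illustrative, and the character class is written as [a-zA-Z] here.

```python
import re

def build_pattern(n):
    """Return the same pattern string built three different ways."""
    with_format = r"\b[a-zA-Z]{{{}}}\b".format(n)
    with_fstring = fr"\b[a-zA-Z]{{{n}}}\b"  # needs Python >= 3.6
    with_percent = r"\b[a-zA-Z]{%i}\b" % n
    return with_format, with_fstring, with_percent

def findwords(text, n):
    # The three strings are identical, so any one of them will do.
    return re.findall(build_pattern(n)[0], text)

text = 'The quick brown fox jumps over the lazy dog.'
print(build_pattern(5))
print(findwords(text, 5))  # ['quick', 'brown', 'jumps']
```

Printing the pattern before passing it to re.findall is a quick way to check that the braces came out as intended.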
Graipher
  • This explains a lot. I see now how to use fr. But would using 6 brackets be clean enough Python? Is it something you'd see in professional programming? – cDub Mar 20 '18 at 21:16
  • @Christy Yes, I think so. There is always the alternative of using `%` formatting in that case, though: `r"\b[a-zA-Z]{%i}\b" % n`. – Graipher Mar 20 '18 at 21:36
4

It's simpler than you may realize. There is nothing special about a "regex string": it is a simple, basic, everyday text string. About the only remotely remarkable thing is that it is usually written with the r prefix, because the backslash also means something in ordinary (unprefixed) Python strings and you don't want to double every backslash. The string is then fed as-is into Python's re module.

So it doesn't really matter where the string comes from! Construct it any way you like, then feed the result into re.findall:

import re

def findwords(text, n):
    return re.findall(r"\b[a-zA-Z]{" + str(n) + r"}\b", text)

>>> findwords(text, 3)
['The', 'fox', 'the', 'dog']
>>> findwords(text, 4)
['over', 'lazy']

Note the repeated use of r: it is a Python peculiarity, not a regex one, so you need to prefix every separate string literal with it to keep Python from interpreting the backslashes and messing up your carefully constructed expression.
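To see why every piece needs the prefix, here is a small sketch comparing the raw and non-raw versions of the same concatenation. The variable names raw_pattern and broken_pattern are made up for illustration.

```python
import re

text = 'The quick brown fox jumps over the lazy dog.'
n = 3

# Without the r prefix, "\b" is one character (a backspace, \x08);
# with it, r"\b" is two characters: a backslash and the letter b.
print(len("\b"), len(r"\b"))

raw_pattern = r"\b[a-zA-Z]{" + str(n) + r"}\b"
broken_pattern = "\b[a-zA-Z]{" + str(n) + "}\b"

print(re.findall(raw_pattern, text))     # ['The', 'fox', 'the', 'dog']
print(re.findall(broken_pattern, text))  # [] because the text has no backspaces
```

The non-raw version does not raise an error. It quietly looks for literal backspace characters around the word, which is why this mistake tends to show up as an unexpected empty list.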

(The same goes for the input to this function. Unless you test the argument and reject non-numbers, this will also work:

>>> findwords(text, '5} {1')
['quick ', 'brown ', 'jumps ']

... which I did not.)
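If you do want to reject such input, one possible check looks like the sketch below. Raising ValueError and requiring n to be at least 1 are choices made for this example, not something the question asks for.

```python
import re

def findwords(text, n):
    # Accept only positive whole numbers. bool is excluded explicitly
    # because True and False are also instances of int.
    if isinstance(n, bool) or not isinstance(n, int) or n < 1:
        raise ValueError("n must be a positive integer")
    return re.findall(r"\b[a-zA-Z]{" + str(n) + r"}\b", text)

text = 'The quick brown fox jumps over the lazy dog.'
print(findwords(text, 4))  # ['over', 'lazy']

try:
    findwords(text, '5} {1')
except ValueError as exc:
    print("rejected:", exc)
```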

Jongware
  • Still working on understanding; why would we change n into a string if it represents a length? – cDub Mar 20 '18 at 21:19
  • @Christy Because `"a" + 5` is not defined in Python, whereas `"a" + str(5) == "a5"`. – Graipher Mar 20 '18 at 21:39
  • @Christy: don't forget that a regex argument is still a *string*. There are no 'numbers' in it. The regex parser is responsible for recognizing any numbers as such, not Python. – Jongware Mar 20 '18 at 22:25
2

This can be done very easily without building the regex pattern dynamically. Simply extract all words with a fixed pattern, then use a list comprehension to gather the words of length n.

import re

text = 'The quick brown fox jumps over the lazy dog.'
words = re.findall(r"[a-zA-Z]+", text)

print([w for w in words if len(w) == 3])

Result: ['The', 'fox', 'the', 'dog']
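Wrapped in a function with the signature the question asks for, the same idea might look like this sketch:

```python
import re

def findwords(text, n):
    # Extract every run of letters, then keep the runs of length n.
    words = re.findall(r"[a-zA-Z]+", text)
    return [w for w in words if len(w) == n]

text = 'The quick brown fox jumps over the lazy dog.'
print(findwords(text, 3))  # ['The', 'fox', 'the', 'dog']
print(findwords(text, 5))  # ['quick', 'brown', 'jumps']
```

This is not strictly equivalent to the \b-based pattern. For a token such as "abc123", this version returns "abc" for n = 3, while \b[a-zA-Z]{3}\b finds nothing there because there is no word boundary between a letter and a digit.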

ctwheels