2

I am working with a string of text and want to find only the 4-letter words in it. My code works, except that it also matches words longer than 4 letters.

import re
test ="hello, how are you doing tonight?"
total = len(re.findall(r'[a-zA-Z]{3}', text))
print (total)

It finds 15, although I am not sure how it found that many. I thought I might have to use \b to pick the beginning and the end of the word, but that didn't seem to work for me.

Aran-Fey
netrate

3 Answers

12

Try this

re.findall(r'\b\w{4}\b', test)

The regex matches:

\b, which is a word boundary. It matches the beginning or end of a word.

\w{4} matches exactly four word characters (a-z, A-Z, 0-9 or _; in Python 3, \w also matches Unicode letters and digits unless re.ASCII is set).

\b is yet another word boundary.

As a side note, your code contains a typo: the second argument to re.findall should be the name of your string variable, which is test. Also, your string does not contain any 4-letter words, so the suggested pattern returns an empty list and the count is 0.
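
To make the difference concrete, here is a minimal sketch comparing the unanchored pattern from the question with the word-boundary pattern. The second string, sample, is a made-up example (not from the question) that actually contains four-letter words.

```python
import re

test = "hello, how are you doing tonight?"

# Without word boundaries, {3} matches every non-overlapping run of three
# letters, including pieces of longer words.
chunks = re.findall(r'[a-zA-Z]{3}', test)
print(chunks)  # ['hel', 'how', 'are', 'you', 'doi', 'ton', 'igh']

# With \b on both sides, only whole words of exactly four characters match.
exact = re.findall(r'\b\w{4}\b', test)
print(exact)  # []

# Hypothetical string with some four-letter words, for illustration only.
sample = "this text will have some four letter words"
four = re.findall(r'\b\w{4}\b', sample)
print(four)       # ['this', 'text', 'will', 'have', 'some', 'four']
print(len(four))  # 6
```

Note that "letter" and "words" are skipped because \b forces the four characters to span the whole word.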

Wiktor Stribiżew
diypcjourney
  • Yes, thank you. I noticed that afterwards. I have made the correction and added the \b as well. Great! – netrate Feb 19 '18 at 23:44
0

Here's a way without regex:

from string import punctuation

s = "hello, how are you doing tonight?"

[i for i in s.translate(str.maketrans('', '', punctuation)).split(' ') if len(i) == 4]

# [] (this string has no four-letter words)
jpp
0

You can use re.findall to collect every run of letters (each word), and then filter by length:

import re
test = "hello, how are you doing tonight?"
final_words = list(filter(lambda x: len(x) == 4, re.findall('[a-zA-Z]+', test)))
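
As a rough sketch, the same idea can be wrapped in a small helper. The function name words_of_length and the extra test strings are assumptions made for illustration, not part of the original answer. The last comparison shows one way this letters-only approach can differ from \b\w{4}\b, since \w also accepts digits and underscores.

```python
import re

def words_of_length(text, n=4):
    """Return the runs of ASCII letters in `text` that are exactly `n` long."""
    return [w for w in re.findall(r'[a-zA-Z]+', text) if len(w) == n]

print(words_of_length("hello, how are you doing tonight?"))
# []
print(words_of_length("what time does the show start?"))
# ['what', 'time', 'does', 'show']

# Hypothetical string mixing digits and letters.
mixed = "room 101a has wifi"
print(words_of_length(mixed))              # ['room', 'wifi']
print(re.findall(r'\b\w{4}\b', mixed))     # ['room', '101a', 'wifi']
```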
Ajax1234