-1

I'm generally curious why re.findall does such weird things as returning empty strings and tuples (what is that supposed to mean?). It seems it does not treat parentheses () normally, and it also interprets | unexpectedly: ab|cd is read as (ab)|(cd), not as a(b|c)d like I would expect. Because of that I can't define the regex I need.
But in this example I see clearly wrong behaviour with a simple pattern:

([a-zA-Z0-9]+\.+)+[a-zA-Z0-9]{1,3}

which describes simple URLs like gskinner.com or www.capitolconnection.org, as you can see in the regex help at https://regexr.com/. With re.findall I get:

hotmail.
living.
item.
2.
4S.

That is, letters followed by just a dot. How can that be?

Full code, where I try to filter out junk from the text:

import re

singles = r'[()\.\/$%=0-9,?!=; \t\n\r\f\v\":\[\]><]'

digits_str = singles + r'[()\-\.\/$%=0-9 \t\n\r\f\v\'\":\[\]]*'

#small_word = '[a-zA-Z0-9]{1,3}'

#junk_then_small_word = singles + small_word + '(' + singles + small_word + ')*'

# raw strings keep the regex backslashes (\S, \.) from being read as string escapes
email = singles + r'\S+@\S*'

http_str = r'[^\.]+\.+[^\.]+\.+([^\.]+\.+)+?'

http = '(http|https|www)' + http_str

web_address = r'([a-zA-Z0-9]+\.+)+[a-zA-Z0-9]{1,3}'

pat = email + '|' + digits_str

d_pat = re.compile(web_address)

text =  '''"Lucy Gonzalez" test-defis-wtf <stagecoachmama@hotmail.com> on 11/28/2000 01:02:22 PM
http://www.living.com/shopping/item/item.jhtml?.productId=LC-JJHY-2.00-10.4S.I will send checks
 directly to the vendor for any bills pre 4/20.  I will fax you copies.  I will also try and get the payphone transferred.

www.capitolconnection.org <http://www.capitolconnection.org>.

and/or =3D=3D=3D=3D=3D=3D=3D= O\'rourke'''


print('findall:')

for x in re.findall(d_pat,text):
    print(x)


print('split:')
for x in re.split(d_pat,text):
    print(x)
user8426627

2 Answers

1

From the documentation of re.findall:

If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.

Your regex has a group, namely the part in parentheses, so re.findall returns what that group captured instead of the whole match. If you want to display the entire match, put your regex in one big group (put parentheses around the whole thing) and then do print(x[0]) instead of print(x).
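
As a minimal sketch of the behaviour described above (the shortened `text` and the variable names are made up for illustration, not taken from the question), compare the pattern with its single group against the same pattern wrapped in an outer group:

```python
import re

# Shortened sample text, assumed for illustration only
text = "stagecoachmama@hotmail.com and www.capitolconnection.org"

# One capturing group: findall returns only what that group captured
# on its last repetition, not the whole match.
one_group = r'([a-zA-Z0-9]+\.+)+[a-zA-Z0-9]{1,3}'
last_captures = re.findall(one_group, text)
print(last_captures)   # ['hotmail.', 'capitolconnection.']

# Outer group added: there are now two groups, so each result is a tuple
# (outer group = whole match, inner group = last repetition).
two_groups = r'(([a-zA-Z0-9]+\.+)+[a-zA-Z0-9]{1,3})'
whole_matches = [groups[0] for groups in re.findall(two_groups, text)]
print(whole_matches)   # ['hotmail.com', 'www.capitolconnection.org']
```

The regex itself matches the full address in both cases; only what findall chooses to return differs.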

Felk
  • Can I turn off the behaviour where parentheses are treated as 'groups', and have normal regex functionality that sticks strictly to the DFA definition? – user8426627 Jun 06 '19 at 15:46
  • You can have non-capturing groups by prefixing them with `?:`, so in your case `(?:[a-zA-Z0-9]+\.+)+[a-zA-Z0-9]{1,3}` – Felk Jun 06 '19 at 15:47
  • Yes, that helps, ty. So I put this in all the parentheses and it works as it should. Fnk python defs, thanks to you :D – user8426627 Jun 06 '19 at 15:48
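
The non-capturing variant from the comments can be sketched as follows. The shortened `text` and the variable names are assumptions for illustration. With `(?:...)` the pattern has no capturing groups, so re.findall returns whole matches and re.split no longer inserts captured text into its result:

```python
import re

# Shortened sample text, assumed for illustration only
text = "stagecoachmama@hotmail.com and www.capitolconnection.org"

# (?:...) groups for repetition without capturing
pattern = re.compile(r'(?:[a-zA-Z0-9]+\.+)+[a-zA-Z0-9]{1,3}')

found = pattern.findall(text)
print(found)    # ['hotmail.com', 'www.capitolconnection.org']

# Without capturing groups, split returns only the text between matches
pieces = pattern.split(text)
print(pieces)   # ['stagecoachmama@', ' and ', '']

# Alternative that works even with capturing groups:
# finditer yields match objects, and group(0) is always the whole match.
capturing = re.compile(r'([a-zA-Z0-9]+\.+)+[a-zA-Z0-9]{1,3}')
full = [m.group(0) for m in capturing.finditer(text)]
print(full)     # ['hotmail.com', 'www.capitolconnection.org']
```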
0

I'm guessing that the expression has to be modified here, and that might be the problem. For instance, if we wish to match the desired patterns, we could start with an expression similar to:

([a-zA-Z0-9]+)\.

If we wish to have 1 to 3 characters after the ., we would expand it to:

([a-zA-Z0-9]+)\.([a-zA-Z0-9]{1,3})?

Test

# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility

import re

regex = r"([a-zA-Z0-9]+)\.([a-zA-Z0-9]{1,3})?"

test_str = ("hotmail.\n"
    "living.\n"
    "item.\n"
    "2.\n"
    "4S.\n"
    "hotmail.com\n"
    "living.org\n"
    "item.co\n"
    "2.321\n"
    "4S.123")

matches = re.finditer(regex, test_str, re.MULTILINE)

for match_num, match in enumerate(matches, start=1):

    print("Match {} was found at {}-{}: {}".format(match_num, match.start(), match.end(), match.group()))

    # Groups are numbered from 1; the optional group 2 is None when it did not take part in the match
    for group_num in range(1, len(match.groups()) + 1):
        print("Group {} found at {}-{}: {}".format(group_num, match.start(group_num), match.end(group_num), match.group(group_num)))

# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
Emma
  • 1
    No, the point is that the helper at https://regexr.com/ matches the full web address, while Python's re matches word + dot for some reason. – user8426627 Jun 06 '19 at 15:41
  • Can I somehow turn off the feature where whatever is in parentheses is treated as a 'group', and have normal regex functionality that sticks strictly to the DFA definition? – user8426627 Jun 06 '19 at 15:44