I'm generally curious why re.findall does such weird things as returning empty strings and tuples (what is that supposed to mean?). It seems that it does not treat parentheses () normally, and it also interprets | wrongly: ab|cd is read as (ab)|(cd), not as a(b|c)d like you would normally expect. Because of that I can't define the regex I need.
But in this example I see clearly wrong behaviour on a simple pattern:
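On the alternation point, a minimal sketch of how `|` appears to bind in Python's `re` module: it has the lowest precedence, so it splits the whole pattern unless a group limits its scope. The test string here is made up purely for illustration.

```python
import re

# Made-up test string, for illustration only.
test_string = 'ab cd abd acd'

# Without a group, | separates the entire left side from the entire right side:
# the pattern means "ab" OR "cd".
either = re.findall(r'ab|cd', test_string)

# A non-capturing group (?:...) limits the alternation to b-or-c,
# so the pattern means "a", then "b" or "c", then "d".
grouped = re.findall(r'a(?:b|c)d', test_string)

print(either)   # ['ab', 'cd', 'ab', 'cd']
print(grouped)  # ['abd', 'acd']
```

Using `(?:...)` rather than `(...)` keeps the grouping without changing what `re.findall` returns.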
([a-zA-Z0-9]+\.+)+[a-zA-Z0-9]{1,3}
which describes simple URLs like gskinner.com or www.capitolconnection.org, as you can see in the regex help at https://regexr.com/. With re.findall I get:
hotmail.
living.
item.
2.
4S.
that is, letters followed by just a dot. How can that be?
The full code, where I try to filter out junk from the text, is:
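A minimal sketch of what seems to explain this output, assuming the standard behaviour of `re.findall`: when the pattern contains one capturing group, it returns the text of that group rather than the whole match, and a repeated group keeps only its last repetition. With several groups it returns tuples. The `sample` string and variable names below are illustrative only.

```python
import re

# Illustrative input, shortened from the text in the question.
sample = "stagecoachmama@hotmail.com and www.capitolconnection.org"

# One capturing group: findall returns only the group's text,
# and only the last repetition of the repeated group.
capturing = r'([a-zA-Z0-9]+\.+)+[a-zA-Z0-9]{1,3}'
group_only = re.findall(capturing, sample)

# Non-capturing group (?:...): findall returns the whole match.
non_capturing = r'(?:[a-zA-Z0-9]+\.+)+[a-zA-Z0-9]{1,3}'
whole = re.findall(non_capturing, sample)

# Alternative: keep the original pattern and ask each match object for group 0.
full_matches = [m.group(0) for m in re.finditer(capturing, sample)]

# Two capturing groups: findall returns one tuple per match.
pairs = re.findall(r'([a-zA-Z0-9]+)\.([a-zA-Z0-9]{1,3})', "hotmail.com")

print(group_only)    # ['hotmail.', 'capitolconnection.']
print(whole)         # ['hotmail.com', 'www.capitolconnection.org']
print(full_matches)  # ['hotmail.com', 'www.capitolconnection.org']
print(pairs)         # [('hotmail', 'com')]
```

So the pattern itself matches what regexr shows; only the way `re.findall` reports matches with groups differs.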
import re
singles = r'[()\.\/$%=0-9,?!=; \t\n\r\f\v\":\[\]><]'
digits_str = singles + r'[()\-\.\/$%=0-9 \t\n\r\f\v\'\":\[\]]*'
#small_word = '[a-zA-Z0-9]{1,3}'
#junk_then_small_word = singles + small_word + '(' + singles + small_word + ')*'
email = singles + r'\S+@\S*'
http_str = r'[^\.]+\.+[^\.]+\.+([^\.]+\.+)+?'
http = '(http|https|www)' + http_str
web_address = r'([a-zA-Z0-9]+\.+)+[a-zA-Z0-9]{1,3}'
pat = email + '|' + digits_str
d_pat = re.compile(web_address)
text = '''"Lucy Gonzalez" test-defis-wtf <stagecoachmama@hotmail.com> on 11/28/2000 01:02:22 PM
http://www.living.com/shopping/item/item.jhtml?.productId=LC-JJHY-2.00-10.4S.I will send checks
directly to the vendor for any bills pre 4/20. I will fax you copies. I will also try and get the payphone transferred.
www.capitolconnection.org <http://www.capitolconnection.org>.
and/or =3D=3D=3D=3D=3D=3D=3D= O\'rourke'''
print('findall:')
for x in re.findall(d_pat,text):
    print(x)
print('split:')
for x in re.split(d_pat,text):
    print(x)
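For the split part, a hedged sketch of a variant that might be closer to the goal of filtering out web addresses: `re.split` also inserts the text of capturing groups into its result, so a non-capturing group gives a cleaner split, and `re.sub` removes the matches directly. The `line` string and variable names are illustrative and not part of the code above.

```python
import re

# Illustrative input only.
line = "see www.capitolconnection.org for details"

# With a capturing group, split puts the group's last repetition into the result.
with_group = re.split(r'([a-zA-Z0-9]+\.+)+[a-zA-Z0-9]{1,3}', line)

# With a non-capturing group, split returns only the text between matches.
web_address = re.compile(r'(?:[a-zA-Z0-9]+\.+)+[a-zA-Z0-9]{1,3}')
without_group = web_address.split(line)

# To drop the addresses from the text, sub with an empty replacement works too.
cleaned = web_address.sub('', line)

print(with_group)     # ['see ', 'capitolconnection.', ' for details']
print(without_group)  # ['see ', ' for details']
print(repr(cleaned))  # 'see  for details'
```

The substitution leaves the surrounding whitespace in place, so a follow-up step would be needed if doubled spaces matter.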