I am analysing web server logs. I use a regex to search for patterns and extract the relevant data. One thing I look at is the user agents that visited the web server, which I count. If a user visits from an iPhone, the user agent may also include "Mozilla". Two example entries:
57.55.39.83 - - [08/Mar/2020:18:52:38 -0700] "GET /Archive/Contentslist.htm HTTP/1.1" 200 9972 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 Mobile/11A465 Safari/9537.53 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" "redlug.com"
77.247.22.51 - - [08/Mar/2020:18:53:56 -0700] "GET /logs/access_130930.log HTTP/1.1" 404 73 "http://www.purevolume.com/adapaleno" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:17.0) Gecko/17.0 Firefox/17.0" "redlug.com"
How can I write a regex that counts Mozilla user agents only when "iPhone" does not appear in the string?
This is my attempt:
while (i != len(entries)):
    match = re.search(logPattern, entries[i])
    if (match):
        mozillaPattern = re.compile(r"([mM]ozilla+)(?!iPhone)")
        userAgent = match.group(7)
        mozillaMatch = re.search(mozillaPattern, userAgent)
        if (mozillaMatch):
            mozilla = mozilla + 1
    i = i + 1
output += "\nUser agents matching Mozilla (excl. iPhone): " + str(mozilla)
It looks like my regex ([mM]ozilla+)(?!iPhone) is wrong, because it still counts both entries even though the first one includes "iPhone" in its string. Do you have any hints for a newbie like me? Thanks, Chris
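For reference, here is a minimal, self-contained reproduction of the problem. The user agent strings are shortened from the two log lines above, and `count_matches` is a helper name made up for this sketch. The second pattern is only a guess at a fix, assuming the lookahead needs `.*` in order to scan the rest of the string:

```python
import re

# User agents shortened from the two log lines above.
user_agents = [
    "Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Mac OS X) AppleWebKit/537.51.1",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:17.0) Gecko/17.0 Firefox/17.0",
]


def count_matches(pattern, agents):
    """Count the strings in which the pattern matches at least once."""
    regex = re.compile(pattern)
    return sum(1 for ua in agents if regex.search(ua))


# Original pattern: the lookahead only inspects the text directly after
# "Mozilla" (here "/5.0 ..."), so it never reaches the word "iPhone".
original = count_matches(r"([mM]ozilla+)(?!iPhone)", user_agents)

# Possible fix (an assumption, not checked against the full log):
# ".*" lets the lookahead scan the remainder of the string.
candidate = count_matches(r"[mM]ozilla(?!.*iPhone)", user_agents)

print(original, candidate)
```

With these two strings the original pattern counts both entries, while the candidate pattern counts only the Firefox one.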