I am analysing web server logs. I use a regex to find patterns and extract the relevant data. One thing I look at is the user agents that visited the web server, which I count. If a user is on an iPhone, the UA string may also include "Mozilla".

57.55.39.83 - - [08/Mar/2020:18:52:38 -0700] "GET /Archive/Contentslist.htm HTTP/1.1" 200 9972 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 Mobile/11A465 Safari/9537.53 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" "redlug.com"
77.247.22.51 - - [08/Mar/2020:18:53:56 -0700] "GET /logs/access_130930.log HTTP/1.1" 404 73 "http://www.purevolume.com/adapaleno" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:17.0) Gecko/17.0 Firefox/17.0" "redlug.com"

How can I write a regex that counts a Mozilla user agent only when "iPhone" does not appear in the string?

This is my attempt:

while (i != len(entries)):
    match = re.search(logPattern, entries[i])
    if (match):
        mozillaPattern = re.compile(r"([mM]ozilla+)(?!iPhone)")
        userAgent = match.group(7)
        mozillaMatch = re.search(mozillaPattern, userAgent)
        if (mozillaMatch):
            mozilla = mozilla + 1
    i = i + 1

output += "\nUser agents matching Mozilla (excl. iPhone): " + str(mozilla)

It looks like my regex ([mM]ozilla+)(?!iPhone) is wrong, because it still counts both entries even though the first one contains "iPhone". Do you have any hints for a newbie like me? Thanks, Chris

chootbl
  • You did not let the regex match the text between the two words, use `r"[mM]ozilla(?!.*iPhone)"`. See [the answer](https://stackoverflow.com/a/39719452/3832970). – Wiktor Stribiżew Apr 03 '20 at 21:26
  • Hey chootbl, always avoid regex whenever possible (they're slow), in this case you can use two conditions for your strings. Also, `for i in entries` is easier to read than the `while` statement – Juan C Apr 03 '20 at 21:28
  • @WiktorStribiżew that worked. Thank you so much for your help – chootbl Apr 03 '20 at 21:35
  • @JuanC good point. I will take this into account and rework the code – chootbl Apr 03 '20 at 21:36
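
For illustration, here is a minimal sketch of the two approaches suggested in the comments, run against the user-agent strings from the two log lines in the question. The names `mozilla_no_iphone`, `user_agents`, `regex_count`, and `plain_count` are made up for this example, and the user-agent strings are assumed to have been extracted already (as `match.group(7)` does in the question). The original pattern fails because `(?!iPhone)` only checks the characters directly after "Mozilla", which are "/5.0"; adding `.*` lets the lookahead scan the rest of the string.

```python
import re

# Negative lookahead with .* rejects a match if "iPhone" appears
# anywhere after "Mozilla" in the string.
mozilla_no_iphone = re.compile(r"[mM]ozilla(?!.*iPhone)")

# User-agent strings copied from the two sample log lines
# (assumed to be already extracted from each log entry).
user_agents = [
    "Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Mac OS X) "
    "AppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 "
    "Mobile/11A465 Safari/9537.53 "
    "(compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:17.0) Gecko/17.0 Firefox/17.0",
]

# Approach 1: the corrected regex.
regex_count = sum(1 for ua in user_agents if mozilla_no_iphone.search(ua))

# Approach 2: plain substring checks, no regex needed.
plain_count = sum(
    1
    for ua in user_agents
    if ("Mozilla" in ua or "mozilla" in ua) and "iPhone" not in ua
)

print("User agents matching Mozilla (excl. iPhone):", regex_count)
print("Same count with substring checks:", plain_count)
```

Both counts come out as 1 for these two strings. The two approaches are not strictly equivalent: the lookahead only inspects text after "Mozilla", while the substring check looks at the whole string. For typical user-agent strings, where "Mozilla" comes first, the results should agree.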
