
I want to extract the bot name and its version from user-agent strings. I tried using the split function, but the user-agent format differs from one crawler to another. What is the best way to get my expected output? (Please note that I need a general solution.)

Input (user-agent strings)

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (compatible; AhrefsBot/4.0; +http://ahrefs.com/robot/)
msnbot/2.0b (+http://search.msn.com/msnbot.htm)

Expected output

Googlebot/2.1
AhrefsBot/4.0
msnbot/2.0b
Nilani Algiriyage

1 Answer


Try the following:

import re

lines = [
    'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
    'Mozilla/5.0 (compatible; AhrefsBot/4.0; +http://ahrefs.com/robot/)',
    'msnbot/2.0b (+http://search.msn.com/msnbot.htm)'
]

# Raw string so that \w is passed to the regex engine unchanged.
botname = re.compile(r'\w+bot/[.\w]+', flags=re.IGNORECASE)
for line in lines:
    matched = botname.search(line)
    if matched:
        print(matched.group())

This prints:

Googlebot/2.1
AhrefsBot/4.0
msnbot/2.0b

This assumes that every bot's agent name ends in "bot" (in any letter case) and is followed by "/" and a version.
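If some crawlers you care about do not have "bot/" in their name, the same idea can be widened with a keyword list. The following is only a minimal sketch: the keywords "crawler" and "spider" and the helper name extract_bot are assumptions chosen for illustration, not a complete list of how crawlers identify themselves.

```python
import re

# Hypothetical keyword list; extend it for the crawlers you need to detect.
BOT_TOKEN = re.compile(
    r'[\w.-]*(?:bot|crawler|spider)[\w.-]*/[\w.]+',
    flags=re.IGNORECASE,
)


def extract_bot(user_agent):
    """Return the first 'name/version' token that looks like a bot, or None."""
    matched = BOT_TOKEN.search(user_agent)
    return matched.group() if matched else None


if __name__ == '__main__':
    samples = [
        'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
        'msnbot/2.0b (+http://search.msn.com/msnbot.htm)',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    ]
    for ua in samples:
        print(extract_bot(ua))
```

No regex of this kind is fully general: a crawler whose name contains none of the listed keywords returns None, so the keyword list has to be maintained against the user agents that actually appear in your logs.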

falsetru