Python and Regular Expressions

Question

Good Day all,

I posted something similar earlier, so if you're running across this again, I apologize. This time around I'll be more specific, give you direct examples, and show exactly what I want. Basically, I need to make raw data look prettier:

str = "2011-06-1618:53:41222.222.2.22-somedomain.hi.comfw12192.10.215.11GET/965874/index.xls22233665588-0Mozilla/4.0 (compatible; MSI 5.5; Windows NT 5.1)'--55656-0.55-5874/659874540--"



more strings:
'2011-06-2150:36:1292.249.2.105-somedomain.hi.comfw12192.10.215.11GET/965874/ten.xls22233665588-0Mozilla/4.0 (compatible; MSI 6.0; Windows NT 5.1)'--55656-0.55-5874/659874540--'
'2011-01-1650:23:45123.215.2.215-somedomain.hi.comfw12192.10.215.11GET/123458/five.xls22233665588-0Mozilla/4.0 (compatible; MSI 7.0; Windows NT 5.1)'--55656-0.55-5874/659874540--'
'2011-02-1618:16:54129.25.2.119-thisdomain.hi.comfw12192.10.215.11GET/984745/two.xls22233665588-0Mozilla/4.0 (compatible; MSI 7.0; Windows NT 5.1)'--55656-0.55-5874/659874540--'
'2011-08-0525:22:16164.32.2.111-yourdomain.hi.comfw12192.10.215.11GET/85472/one.xls22233665588-0Mozilla/4.0 (compatible; MSI 8.0; Windows NT 5.1)'--55656-0.55-5874/659874540--'

IN THE DEBUGGER:

import re
str = "2011-06-1618:53:41222.222.2.22-somedomain.hi.comfw12192.10.215.11GET/965874/index.xls22233665588-0Mozilla/4.0 (compatible; MSI 5.5; Windows NT 5.1)'--55656-0.55-5874/659874540--"
domain = re.compile('^.*?(?=([fw].+?))')
domain.search(str).group()
'2011-06-1618:53:41222.222.2.22-somedomain.hi.com'
domain = domain.search(str).group()

To get the domain, I need to drop everything up to and including the dash (-) just before the domain name. I can find that prefix with this RE: ([0-9]{3,5}).([0-9]{1,3}.){2}[0-9]{1,3}[-] But I don't know how to say: find that value, then return everything AFTER it but BEFORE fw12.

At the end of the day, I want those strings to look like this, using a comma (, ) as the delimiter:

2011-08-05, 25:22:16, 164.32.2.111, yourdomain.hi.com, GET/85472/one.xls, Mozilla/4.0 (compatible; MSI 8.0; Windows NT 5.1)

In order to parse this, no matter what technology you use, you're going to need to have some way to distinguish the trailing part of the domain name from whatever follows it. Can you express in English how that can be done? Will the following text always be "fw12" and will the domain not have that string in it? — Peter Alfvin, Jun 26 '13 at 20:36
web logs without separators between fields? weird configuration :s — mdeous, Jun 26 '13 at 20:37
are all of the IP addresses in a specific range? if it's not, it may be hard to build a regex that would be aware of the end of the "fwXX" part, and the beginning of the IP address. — mdeous, Jun 26 '13 at 20:44
it looks very much like you have hour values over 24. that's a little unusual. there's 54 seconds too. wut?! — andrew cooke, Jun 26 '13 at 21:23

score 2 · Answer 1 · answered Jun 26 '13 at 21:34

Preferred but-maybe-not-possible Method

This looks like (as MatToufoutu pointed out) an Apache log file. If that is in fact the case, then you may be able to use apachelog or something similar to process it. You will need the LogFormat string from your Apache httpd.conf/apache2.conf file to use as the format. As I don't have yours, I just used the one provided in apachelog's documentation:

import apachelog

format = r'%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" '
log_line = """212.74.15.68 - - [23/Jan/2004:11:36:20 +0000] "GET /images/previous.png HTTP/1.1" 200 2607 "http://peterhi.dyndns.org/bandwidth/index.html" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2) Gecko/20021202" """

p = apachelog.parser(format)
data = p.parse(log_line)

You can then access the various parts of the log line by indexing data, which is a dictionary keyed by format directive:

print("%s, %s, %s, %s, %s" % (data['%t'], data['%h'], data['%{Referer}i'], data['%r'], data['%{User-Agent}i']))

to get the output

[23/Jan/2004:11:36:20 +0000], 212.74.15.68, http://peterhi.dyndns.org/bandwidth/index.html, GET /images/previous.png HTTP/1.1, Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2) Gecko/20021202

Using Regular Expressions

Alternatively, you could take your initial approach and use regular expressions to parse the line. The following should work. They're broken up into named groups so as to be easier to A) read B) edit C) understand:

import re


your_string = "2011-06-1618:53:41222.222.2.22-somedomain.hi.comfw12192.10.215.11GET/965874/index.xls22233665588-0Mozilla/4.0 (compatible; MSI 5.5; Windows NT 5.1)'--55656-0.55-5874/659874540--"

pattern = re.compile(r'(?P<date>\d{4}(?:-\d{2}){2})(?P<time>(?:\d{2}:?){3})(?P<ip_address1>(?:\d{1,3}\.?){4})-(?P<domain>[\w\.]+)fw12(?P<ip_address2>(?:\d{1,3}\.?){4})(?P<get>(?:GET/(?:\d+/)).*?)\d+-0(?P<user_agent>.*?)\'--.*$')
result = pattern.match(your_string)

You can then access the results with result.group('groupname'), like:

print("%s %s, %s, %s, %s, %s" % (result.group('date'), result.group('time'), result.group('ip_address1'), result.group('domain'), result.group('get'), result.group('user_agent')))

Which will print:

2011-06-16 18:53:41, 222.222.2.22, somedomain.hi.com, GET/965874/index.xls, Mozilla/4.0 (compatible; MSI 5.5; Windows NT 5.1)
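To get the fully comma-delimited layout the question asks for, and to run it over several of the sample lines, the same idea can be wrapped in a small helper. This is only a sketch under assumptions: `LOG_PATTERN`, `FIELDS`, and `format_line` are names made up for illustration, the inner groups are written as non-capturing `(?:...)`, and every line is assumed to carry the literal `fw12` marker between the domain and the second IP address.

```python
import re

# The pattern from this answer, split across lines for readability.
LOG_PATTERN = re.compile(
    r"(?P<date>\d{4}(?:-\d{2}){2})"
    r"(?P<time>(?:\d{2}:?){3})"
    r"(?P<ip_address1>(?:\d{1,3}\.?){4})-"
    r"(?P<domain>[\w.]+)fw12"  # assumption: 'fw12' always follows the domain
    r"(?P<ip_address2>(?:\d{1,3}\.?){4})"
    r"(?P<get>GET/\d+/.*?)\d+-0"
    r"(?P<user_agent>.*?)'--.*$"
)

# Fields to keep, in output order (hypothetical name, chosen for this sketch).
FIELDS = ("date", "time", "ip_address1", "domain", "get", "user_agent")


def format_line(line):
    """Return the selected fields joined by ', ', or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    return ", ".join(match.group(name) for name in FIELDS)


samples = [
    "2011-06-2150:36:1292.249.2.105-somedomain.hi.comfw12192.10.215.11GET/965874/ten.xls22233665588-0Mozilla/4.0 (compatible; MSI 6.0; Windows NT 5.1)'--55656-0.55-5874/659874540--",
    "2011-08-0525:22:16164.32.2.111-yourdomain.hi.comfw12192.10.215.11GET/85472/one.xls22233665588-0Mozilla/4.0 (compatible; MSI 8.0; Windows NT 5.1)'--55656-0.55-5874/659874540--",
]

for sample in samples:
    print(format_line(sample))
```

Returning None for a non-matching line, rather than raising, leaves it to the caller to decide whether to skip or log such lines.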

Since this method deals with regular expressions, I always like to add my little disclaimer:

You're parsing data. It falls on you and your judgment to decide how much tolerance, sanitization, and validation you require. You may need to modify the above to better suit your requirements, and to work properly with real-world data not included in your sample(s). Ensure you understand what the regular expressions are doing so that you know how this code is working.
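As one example of the validation mentioned above: the regular expression only checks the shape of the date and time, so values such as the over-24 hours noticed in the comments pass straight through. A minimal sketch of a stricter check, assuming the captured `date` and `time` groups are passed in as strings (`is_valid_timestamp` is a hypothetical helper, not part of any library):

```python
from datetime import datetime


def is_valid_timestamp(date_text, time_text):
    """Return True if the two strings form a real calendar date and clock time."""
    try:
        # strptime raises ValueError for impossible values such as hour 25.
        datetime.strptime(date_text + " " + time_text, "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return False
    return True


print(is_valid_timestamp("2011-06-16", "18:53:41"))  # True
print(is_valid_timestamp("2011-08-05", "25:22:16"))  # False
```

Whether such lines should be rejected, flagged, or passed through unchanged depends on what the data is for.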

score 0 · Answer 2 · answered Jun 26 '13 at 20:34

To separate the fields, I suggest this pattern (then join the captured groups with whatever delimiter you want):

(\d{4}-\d{2}-\d{2})(\d{2}:\d{2}:\d{2})(\d+(?:\.\d+){3})-([a-z.]+)fw\d+(?:\.\d+){3}(GET\/\d+\/[a-z.]+)[-\d]+([^'-]+)
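A short sketch of how that could look with Python's `re` module, using the first sample line from the question. The variable names are illustrative only. Note that `[a-z.]+` assumes lowercase-only domains and file names, and the pattern does not anchor on a literal `fw12`, only on `fw` followed by digits.

```python
import re

# The pattern above, unchanged, split across two lines for readability.
FIELD_PATTERN = re.compile(
    r"(\d{4}-\d{2}-\d{2})(\d{2}:\d{2}:\d{2})(\d+(?:\.\d+){3})-([a-z.]+)"
    r"fw\d+(?:\.\d+){3}(GET\/\d+\/[a-z.]+)[-\d]+([^'-]+)"
)

line = "2011-06-1618:53:41222.222.2.22-somedomain.hi.comfw12192.10.215.11GET/965874/index.xls22233665588-0Mozilla/4.0 (compatible; MSI 5.5; Windows NT 5.1)'--55656-0.55-5874/659874540--"

match = FIELD_PATTERN.search(line)
# groups() returns the six captured fields in order; join them with the delimiter.
joined = ", ".join(match.groups()) if match else None
print(joined)
```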

answered Jun 26 '13 at 20:34

Casimir et Hippolyte
