Extracting data using regex in python

Question

I have a string variable whose data looks something like this:

a:15:{s:6:"status";s:6:"Active";s:9:"checkdate";s:8:"20130807";s:11:"companyname";s:4:"test";s:11:"validdomain";s:19:"test";s:7:"md5hash";s:32:"501yd361fe10644ea1184412c3e89dce";s:7:"regdate";s:10:"2013-08-06";s:14:"registeredname";s:10:"TestName";s:9:"serviceid";s:1:"8";s:11:"nextduedate";s:10:"0000-00-00";s:12:"billingcycle";s:8:"OneTime";s:7:"validip";s:15:"xxx.xxx.xxx.xxx";s:14:"validdirectory";s:5:"/root";s:11:"productname";s:20:"SomeProduct";s:5:"email";s:19:"testmail@test.com";s:9:"productid";s:1:"1";}

I am trying to extract the quoted data into a dictionary as a key-value pair like so:

{"status":"Active","checkdate":20130807,.............}

I tried extracting it using the following:

tempkeyresults = re.findall('"(.*?)"([^"]+)</\\1>', localdata, flags=re.IGNORECASE)

I'm quite new to regex and I assume what I am trying to query translates to "find and extract all data between " and " and extract it before the next "..." However, this returns and empty string([]). Could someone tell me where I am wrong?

Thanks in advance

Do you have any types other than `s` (presumably for 'string')? Do you need to validate that the lengths are correct? How are embedded double quotes in the values handled? Is this a standard format? If so, is there not a standard package to extract the information? — Jonathan Leffler, Aug 07 '13 at 07:43

score 2 · Answer 1 · edited May 23 '17 at 12:20

How about this?

>>> import re
>>> s = 'a:15:{s:6:"status";s:6:"Active";s:9:"checkdate";s:8:"20130807";s:11:"companyname";s:4:"test";s:11:"validdomain";s:19:"test";s:7:"md5hash";s:32:"501yd361fe10644ea1184412c3e89dce";s:7:"regdate";s:10:"2013-08-06";s:14:"registeredname";s:10:"TestName";s:9:"serviceid";s:1:"8";s:11:"nextduedate";s:10:"0000-00-00";s:12:"billingcycle";s:8:"OneTime";s:7:"validip";s:15:"xxx.xxx.xxx.xxx";s:14:"validdirectory";s:5:"/root";s:11:"productname";s:20:"SomeProduct";s:5:"email";s:19:"testmail@test.com";s:9:"productid";s:1:"1";}'
>>> results = re.findall('"(\w+)"', s)
>>> dict(zip(*[iter(results)] * 2))
{'status': 'Active', 'companyname': 'test', 'validdomain': 'test', 'md5hash': '501yd361fe10644ea1184412c3e89dce', 'regdate': 'registeredname', 'TestName': 'serviceid', 'email': 'productid', 'billingcycle': 'OneTime', 'validip': 'validdirectory', '8': 'nextduedate', 'productname': 'SomeProduct', 'checkdate': '20130807'}

\w means "any word character" (letters, numbers, regardless of case, and underscore (_))
+ means 1 or more.
dict(zip(*[iter(results)] * 2)) is very well explained in this answer

And could you explain `zip(*[iter(results)] * 2)`. I am seeing this for the first time.. — rahuL, Aug 07 '13 at 07:48

zhangyangyu · Accepted Answer · 2013-08-07T07:57:31.857

This one, find all the words surrounding by quotes and then slices the list to mapping:

>>> res = re.findall('"(\w+)"', s)
>>> i = iter(res)
>>> dict(zip(*[i]*2))
{'status': 'Active', 'companyname': 'test', 'validdomain': 'test', 'md5hash': '501yd361fe10644ea1184412c3e89dce', 'regdate': 'registeredname', 'TestName': 'serviceid', 'email': 'productid', 'billingcycle': 'OneTime', 'validip': 'validdirectory', '8': 'nextduedate', 'productname': 'SomeProduct', 'checkdate': '20130807'}

Or use this one. This will use regex to find all the pairs(adjacent two):

>>> res = re.findall('"(\w+)"(?:.*?)"(\w+)"', s)
>>> res
[('status', 'Active'), ('checkdate', '20130807'), ('companyname', 'test'), ('validdomain', 'test'), ('md5hash', '501yd361fe10644ea1184412c3e89dce'), ('regdate', 'registeredname'), ('TestName', 'serviceid'), ('8', 'nextduedate'), ('billingcycle', 'OneTime'), ('validip', 'validdirectory'), ('productname', 'SomeProduct'), ('email', 'productid')]
>>> dict(res)
{'status': 'Active', 'companyname': 'test', 'validdomain': 'test', 'md5hash': '501yd361fe10644ea1184412c3e89dce', 'regdate': 'registeredname', 'TestName': 'serviceid', 'email': 'productid', 'billingcycle': 'OneTime', 'validip': 'validdirectory', '8': 'nextduedate', 'productname': 'SomeProduct', 'checkdate': '20130807'}

one more doubt - if the words have hyphens (-) or commas inbetween, how would we include that in the expression? — rahuL, Aug 07 '13 at 09:19

score 1 · Answer 3 · answered Aug 07 '13 at 08:13

You can do it without regular expressions.

parts = s.split('"')[1::2] # get all quoted text in a list
keys, values = parts[::2], parts[1::2] # take even and odd items (keys, values)
results = dict(zip(keys, values)) # turn it into a dict

results:

{'status': 'Active', 'companyname': 'test', 'validdomain': 'test', 'productid': '1', 'md5hash': '501yd361fe10644ea1184412c3e89dce', 'regdate': '2013-08-06', 'registeredname': 'TestName', 'email': 'testmail@test.com', 'serviceid': '8', 'nextduedate': '0000-00-00', 'billingcycle': 'OneTime', 'validip': 'xxx.xxx.xxx.xxx', 'productname': 'SomeProduct', 'checkdate': '20130807', 'validdirectory': '/root'}

Extracting data using regex in python

3 Answers3