
I'm pretty new to regular expressions but decided to use them to unserialize PHP arrays. Here's some background info:

I rewrote a database-backed company website in Django; the original was written in PHP. There is an M2M relation between companies and industries. The old model handled this with serialized PHP arrays, so I now have to migrate that data correctly. My first attempt was some splitting and cutting, and it was really ugly, so I decided to dive into regular expressions. Here is what I have now (it works fine):

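def unserialize_array(serialized_array):
    import re
    # Raw strings keep \d and \{ from being treated as string escapes.
    # Strip the array wrapper and the first/last quotes.
    matched_sub = re.search(r'^a:\d+:\{i:\d+;s:\d+:"(.*?)";\}$', serialized_array).group(1)
    # Replace every separator between two values with "? ", then split on it.
    industry_list = re.sub(r'";i:\d+;s:\d+:"', "? ", matched_sub).split("? ")
    new_dict = dict(enumerate(industry_list))
    return new_dict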

I was wondering however if I couldn't do all this with a single regular expression instead of two.

Fynn Becker
  • Possible duplicate of [Unserialize PHP data in python](http://stackoverflow.com/questions/5935501/unserialize-php-data-in-python) – Henrique Barcelos Jan 04 '16 at 13:14
  • No, decided against the use of packages like phpserialize because it does more than I need and would just cause unnecessary errors for example with the length of my array data as it contains German umlauts. – Fynn Becker Jan 04 '16 at 13:34
  • @FynnBecker: can you give some feedback on the solution I proposed? Is my assumption correct, or could you provide some example input? – Giuseppe Ricupero Jan 05 '16 at 21:20

2 Answers


I suggest using re.sub with a callback function, which can simply call unserialize_array recursively.
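As a rough illustration of the callback idea, here is a minimal sketch. It is not the recursive variant mentioned above, just one simple way to use a re.sub callback; the helper name `collect_strings`, the inner function `remember`, and the sample input are made up for this example. The callback is invoked once per match and stores the captured value.

```python
import re

def collect_strings(serialized_array):
    """Collect the quoted values via a re.sub callback (hypothetical helper)."""
    found = []

    def remember(match):
        # Called once for every match; keep the text between the quotes.
        found.append(match.group(1))
        return ""  # the substituted string itself is discarded

    re.sub(r's:\d+:"(.*?)";', remember, serialized_array)
    return dict(enumerate(found))

industries = collect_strings('a:2:{i:0;s:4:"Bank";i:1;s:5:"Autos";}')
print(industries)
```

Note that this simple pattern assumes no value contains the sequence `";`.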

ankhzet
  • Didn't know about that and just looked up how it works (python beginner as well), prettier for sure so thanks for that. However my question remains, is it somehow doable in one expression? – Fynn Becker Jan 04 '16 at 13:28

Update: the regex now also correctly handles escaped quotes \" (written \\" in the escaped form used below) and any other escape sequence, such as an escaped quote following an escaped backslash, \\\" (written \\\\\\").


I think, if I understood the structure of your input correctly, you can rewrite your function like this:

def unserialize_array(serialized_array):
    import re
    return dict(enumerate(re.findall(r'"((?:[^"\\]|\\.)*)"', serialized_array)))

Assumed input (since it is not specified explicitly in your question), written with doubled backslashes as in a Python string literal:

a:3:{i:1;s:9:"industry1\\\\\\"A\\"";i:2;s:9:"\\"industry2\\"";i:3;s:9:"industry3"}

Output

{0: 'industry1\\\"A\"', 1: '\"industry2\"', 2: 'industry3'}

(as displayed by Python, with backslashes doubled: {0: 'industry1\\\\\\"A\\"', 1: '\\"industry2\\"', 2: 'industry3'})
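To try this out, the sample can be written as a raw string so that the backslashes are taken literally. This is only a quick check under the assumed input format; the names `sample` and `result` are illustrative.

```python
import re

def unserialize_array(serialized_array):
    # Grab every double-quoted string, allowing backslash-escaped characters inside.
    return dict(enumerate(re.findall(r'"((?:[^"\\]|\\.)*)"', serialized_array)))

# Raw string: each backslash below is a real backslash in the data.
sample = r'a:3:{i:1;s:9:"industry1\\\"A\"";i:2;s:9:"\"industry2\"";i:3;s:9:"industry3"}'

result = unserialize_array(sample)
for index, value in result.items():
    print(index, value)
```

Using print on each value shows the actual characters, while printing the whole dict shows the repr with doubled backslashes.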

How it works

There is no need to match the entire structure of the serialized array, because we are only interested in the string contents. The regex "((?:[^"\\]|\\.)*)" consumes the string character by character: an ordinary character is taken as is, a backslash is taken together with the character that follows it, and the match ends at the first unescaped closing double quote ".

The capturing group ensures that the surrounding double quotes are left out of the final result.

Finally, re.findall returns all of the desired strings as a list in a single call.

This is a peculiarity of re.findall: if the pattern contains at least one capturing group, the result holds the group contents instead of the entire match. In fact, the docs state:

If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
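A small sketch of that behaviour, using a made-up input string (`text` and the variable names are only for illustration):

```python
import re

text = 's:4:"Bank";s:5:"Autos";'

# No group: findall returns the whole matches, quotes included.
no_group = re.findall(r'"[^"]*"', text)

# One group: findall returns only what the group captured.
one_group = re.findall(r'"([^"]*)"', text)

# Two groups: findall returns a list of tuples, one tuple per match.
two_groups = re.findall(r's:(\d+):"([^"]*)"', text)

print(no_group)
print(one_group)
print(two_groups)
```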

Giuseppe Ricupero
  • Perfect, that's what I was looking for. Didn't know about findall which fits my case really well and about using capture groups like that to remove the quotation marks. Also some confusion with thinking I had to match the full string in order for it to work. So thanks a lot for your answer, really helpful. – Fynn Becker Jan 06 '16 at 10:25