-2

I have the following code stored as a string variable in Python. How can I use regex with `re.findall('', text)` to extract the five 9-digit numbers (all starting with "305") listed under "attributeLookup" in the code below?

var PRO_META_JSON = {
    "attributeDefinition":{
        "defaultSku":305557121,
        "attributeListing":[{ 
            "label":"Finish",
                    "defaultIndex":0,
                    "options":[
                        "White::f33b4086",
                        "Beige::8e0900fa",
                        "Blue::3c3a4707",
                        "Orange::1d8cb503",
                        "Spring Green::dd5e599a"
                     ]
            }],
            "attributeLookup":[
            [0,305557121],
            [1,305557187],
            [2,305557696],
            [3,305557344],
            [4,305696435]
            ]
        }
    };
user994585
  • 3
    This looks like `Java` code, not `Python`. Did you read this from a file? You can use the `JSON` library in `Python`, then dig down to that key/value and search using native `Python` without any need for regex. – Cory Kramer May 04 '15 at 13:15
  • 4
    I think you should use the [JSON parser](https://docs.python.org/2/library/json.html) instead. – kennytm May 04 '15 at 13:16
  • 2
    @Cyber it's not Java or Python. It's Javascript, apparently inside a Python string. – Rob Grant May 04 '15 at 13:19
  • Is this the only code or this is just the sample? – vks May 04 '15 at 13:20
  • This looks like JavaScript. – Pedro Lobito May 04 '15 at 13:20
  • regex is a really bad idea here. http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – RickyA May 04 '15 at 13:22
  • @Cyber I'm scraping data out of the script tag of a webpage (using BeautifulSoup) and then attempting here to parse the data I need from it. I think you're right on using the `JSON library`, though I'll need to figure out how to do so. – user994585 May 04 '15 at 13:24
  • @user994585 check my or Julien's answers. – Rob Grant May 04 '15 at 13:30
  • @user994585 See the beginning of my answer to extract the relevant string that can be decoded by json module – Julien Spronck May 04 '15 at 13:45

4 Answers

2

You can just use the built-in `json` library to parse it. I've assumed you've already stripped the JavaScript (the `var PRO_META_JSON =` prefix and the trailing semicolon):

import json

input = """{
"attributeDefinition":{
    "defaultSku":305557121,
    "attributeListing":[{ 
        "label":"Finish",
                "defaultIndex":0,
                "options":[
                    "White::f33b4086",
                    "Beige::8e0900fa",
                    "Blue::3c3a4707",
                    "Orange::1d8cb503",
                    "Spring Green::dd5e599a"
                 ]
        }],
        "attributeLookup":[
        [0,305557121],
        [1,305557187],
        [2,305557696],
        [3,305557344],
        [4,305696435]
        ]
    }
}"""

data = json.loads(input)

# Get a list you can do stuff with. This gives you:
# [[0, 305557121], [1, 305557187], [2, 305557696], [3, 305557344], [4, 305696435]]
els = data['attributeDefinition']['attributeLookup']

for el in els:
    # Each el looks like: [0, 305557121]
    print(el[1])
Rob Grant
  • When I try to run `data = json.loads(input)`, I get a `TypeError: expected string or buffer` error. My variable still has the `var PRO_META_JSON =` at the beginning, so I'm guessing that's why. When I try to manipulate the string and remove the first "x" characters, I get a `TypeError: unhashable type` error. Any idea what I'm doing wrong? Thanks for the help! – user994585 May 04 '15 at 13:41
  • Hm. Just `print` out the content of input before you try and use `json.loads` on it, so you can have a look at it. You might spot something there (e.g. in your data, there is a _trailing semicolon_ you need to get rid of) – Rob Grant May 04 '15 at 13:43
  • Thanks. I tried using `input[-1:]` to remove the trailing semicolon, but it gives a `TypeError: unhashable type` error. How can I trim that semicolon from the data? – user994585 May 04 '15 at 13:52
  • @user994585 sorry to insist but have you not seen my answer? I think it does exactly what you're trying to do – Julien Spronck May 04 '15 at 13:56
  • @user994585 you're looking for `input[:-1]` to do that one operation, or you can use Julien's code to trim it all in one go. – Rob Grant May 04 '15 at 14:53
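The trimming discussed in these comments can also be done without counting characters. Below is a minimal sketch, assuming the scraped text is already a plain `str` that contains exactly one top-level `{...}` object. The names `extract_object` and `raw` are hypothetical, and `raw` is a shortened stand-in for the real script content:

```python
import json


def extract_object(text):
    """Return the substring from the first '{' to the last '}' (inclusive)."""
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no {...} block found in text")
    return text[start:end + 1]


# Shortened stand-in for the scraped script content (assumed shape).
raw = 'var PRO_META_JSON = {"attributeDefinition": {"attributeLookup": [[0, 305557121], [1, 305557187]]}};'

data = json.loads(extract_object(raw))
skus = [pair[1] for pair in data["attributeDefinition"]["attributeLookup"]]
print(skus)
```

Slicing between the braces drops both the `var ... =` prefix and the trailing semicolon in one step, and it tolerates extra whitespace or newlines around them.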
1

Here is a way to do it. First, parse your string to get the JSON object (everything inside the outermost braces). Then decode it with the `json` module and access what you need.

astr = '''var PRO_META_JSON = {
    "attributeDefinition":{
        "defaultSku":305557121,
        "attributeListing":[{ 
            "label":"Finish",
                    "defaultIndex":0,
                    "options":[
                        "White::f33b4086",
                        "Beige::8e0900fa",
                        "Blue::3c3a4707",
                        "Orange::1d8cb503",
                        "Spring Green::dd5e599a"
                     ]
            }],
            "attributeLookup":[
            [0,305557121],
            [1,305557187],
            [2,305557696],
            [3,305557344],
            [4,305696435]
            ]
        }
    };'''

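import re
import json

# Capture everything from the first '{' to the last '}' that precedes a ';'.
# A raw string avoids invalid-escape warnings for the backslashes.
pat = re.compile(r'^[^{]*(\{.*\});.*$', re.MULTILINE | re.DOTALL)
json_str = pat.match(astr).group(1)
d = json.loads(json_str)

for x in d['attributeDefinition']['attributeLookup']:
    print(x[1])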
# 305557121
# 305557187
# 305557696
# 305557344
# 305696435
Julien Spronck
  • Thanks for the help. When I use this, the `json_str` line gives me a `TypeError: expected string or buffer` error. I believe this is because the `astr` is actually a BeautifulSoup object and not a string. However, when I turn it into a string with `astr_string = str(astr)`, I get this error on the `json_str` line: `AttributeError: 'NoneType' object has no attribute 'group'`. Any advice? – user994585 May 04 '15 at 14:01
  • The second error occurs because the regular expression does not match your string. The snippet I wrote works with the string you provided, but if the string is slightly different (for example, an additional space or line at the end), it won't find a match. The regular expression probably needs adjusting to your exact string (which I cannot see). I made a small edit to my regular expression that should be more forgiving, but without seeing your exact string it is difficult to debug. – Julien Spronck May 04 '15 at 14:04
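As the comment above notes, a pattern that does not fit the exact string makes `match` return `None`, and calling `.group(1)` on `None` raises the `AttributeError`. Below is a more forgiving sketch, assuming the text has already been converted to a `str`. It uses `re.search` and checks the result before using it. `parse_embedded_json`, `OBJECT_RE` and `sample` are hypothetical names for illustration:

```python
import json
import re

# Greedy match from the first '{' to the last '}', across newlines.
OBJECT_RE = re.compile(r"\{.*\}", re.DOTALL)


def parse_embedded_json(text):
    match = OBJECT_RE.search(text)
    if match is None:
        # Fail with a readable message instead of an AttributeError.
        raise ValueError("no {...} block found; print repr(text) to inspect it")
    return json.loads(match.group(0))


# Stand-in input with stray whitespace around the assignment and semicolon.
sample = '\n  var PRO_META_JSON = {"attributeDefinition": {"attributeLookup": [[0, 305557121]]}} ;  \n'
data = parse_embedded_json(sample)
print(data["attributeDefinition"]["attributeLookup"][0][1])
```

Because the pattern does not anchor on the prefix or the semicolon, extra spaces or lines before and after the object do not prevent a match.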
0
string = '''var PRO_META_JSON = {
    "attributeDefinition":{
        "defaultSku":305557121,
        "attributeListing":[{ 
            "label":"Finish",
                    "defaultIndex":0,
                    "options":[
                        "White::f33b4086",
                        "Beige::8e0900fa",
                        "Blue::3c3a4707",
                        "Orange::1d8cb503",
                        "Spring Green::dd5e599a"
                     ]
            }],
            "attributeLookup":[
            [0,305557121],
            [1,305557187],
            [2,305557696],
            [3,305557344],
            [4,305696435]
            ]
        }
    };'''

import json
data = json.loads(string.split('=', 1)[1].strip(';'))
for d in data['attributeDefinition']['attributeLookup']:
    print(d[1])

Don't know why you want to use regex. Do you also take your car to visit your neighbour?

Stefan Pochmann
-5

In `re.findall`, match exactly nine digits (`[0-9]{9}`) in the part of the string that follows `attributeLookup`, like this:

re.findall('[0-9]{9}', PRO_META_JSON.split('attributeLookup')[1])

That said, parsing the data with the `json` module would still be better than running a regex over the raw string.

A really useful tester for Python regexes can be found here:

http://pythex.org/
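For a regex-only answer to the question as asked, here is a minimal sketch, assuming every lookup entry is written as an `[index, number]` pair and every wanted number starts with "305". The names `text` and `pair_re` are illustrative, and `text` is a shortened copy of the question's data:

```python
import re

text = '''var PRO_META_JSON = {
    "attributeDefinition":{
        "defaultSku":305557121,
        "attributeLookup":[
        [0,305557121],
        [1,305557187],
        [2,305557696],
        [3,305557344],
        [4,305696435]
        ]
    }
};'''

# Capture only the 9-digit number in each "[index, number]" pair.
pair_re = re.compile(r"\[\s*\d+\s*,\s*(305\d{6})\s*\]")
numbers = pair_re.findall(text)
print(numbers)
```

Requiring the surrounding `[index, ...]` brackets keeps `defaultSku` out of the results without splitting the string first. Note that `findall` returns strings, so convert with `int` if numbers are needed.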