1

I have some text like this to load: https://sites.google.com/site/iminside1/paste
I'd prefer to create a Python dictionary from it, but any object is OK. I tried pickle, json and eval, but didn't succeed. Can you help me with this?
Thanks!
The results:

a = open("the_file", "r").read()

json.loads(a)
ValueError: Expecting property name: line 1 column 1 (char 1)

pickle.loads(a)
KeyError: '{'

eval(a)
File "<string>", line 19
from: {code: 'DME', airport: "Домодедово", city: 'Москва', country: 'Россия', terminal: ''},
    ^
SyntaxError: invalid syntax
DominiCane
  • How didn't it work? Post the code you tried and how it failed. – Daenyth Aug 30 '10 at 15:35
  • Wait, are the keys really a wild mix of strings, plain identifiers and plain identifiers that happen to be keywords?? –  Aug 30 '10 at 15:54
  • If I understand you correctly, yes: all the keys are a wild mix of strings :) Maybe I need to quote them first? If so, how can I do that without breaking the quoted values? – DominiCane Aug 30 '10 at 16:00
  • It does sorta look like a pickle file to me. Try `f = open('the_file', 'r')` to open the file for reading, then `pickle.load(f)` to get the object named "data". – ewall Aug 30 '10 at 16:03
  • I've tried pickle before, see the result above. KeyError: '{' – DominiCane Aug 30 '10 at 16:08
  • Do you have any idea what data format this is in? (It's certainly not valid JSON.) – JanC Aug 30 '10 at 16:35
  • It's from web scraping. I think it was JSON once, but it has been changed a bit. I really don't know, but I need to extract the data from it somehow. – DominiCane Aug 30 '10 at 16:39
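
The two errors in the question have different causes. A minimal sketch, using a one-key sample that is an assumption rather than the real file: JSON requires double-quoted keys and strings, so the bare `code` key is rejected, while pickle is an unrelated serialization format and cannot read text like this at all.

```python
import json

# Hypothetical one-key sample in the same style as the scraped data.
loose = "{code: 'DME'}"

error = None
try:
    json.loads(loose)
except ValueError as exc:  # json.JSONDecodeError subclasses ValueError
    error = exc
print("rejected:", error)

# The same data written as strict JSON parses fine.
strict = json.loads('{"code": "DME"}')
print(strict)
```

So the job is to turn the text into something a strict parser accepts, or to use a parser that tolerates the looser syntax.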

4 Answers

4

Lifted almost straight from the pyparsing examples page:

# read text from web page
import urllib
page = urllib.urlopen("https://sites.google.com/site/iminside1/paste")
html = page.read()
page.close()

start = html.index("<pre>")+len("<pre>")+3 #skip over 3-byte header
end = html.index("</pre>")
text = html[start:end]
print text

# parse dict-like syntax    
from pyparsing import (Suppress, Regex, quotedString, Word, alphas, 
alphanums, oneOf, Forward, Optional, dictOf, delimitedList, Group, removeQuotes)

LBRACK,RBRACK,LBRACE,RBRACE,COLON,COMMA = map(Suppress,"[]{}:,")
integer = Regex(r"[+-]?\d+").setParseAction(lambda t:int(t[0]))
real = Regex(r"[+-]?\d+\.\d*").setParseAction(lambda t:float(t[0]))
string_ = Word(alphas,alphanums+"_") | quotedString.setParseAction(removeQuotes)
bool_ = oneOf("true false").setParseAction(lambda t: t[0]=="true")
item = Forward()

key = string_
dict_ = LBRACE - Optional(dictOf(key+COLON, item+Optional(COMMA))) + RBRACE
list_ = LBRACK - Optional(delimitedList(item)) + RBRACK
item << (real | integer | string_ | bool_ | Group(list_ | dict_ ))

result = item.parseString(text,parseAll=True)[0]
print result.data[0].dump()
print result.data[0].segments[0].dump(indent="  ")
print result.data[0].segments[0].flights[0].dump(indent="  -  ")
print result.data[0].segments[0].flights[0].flightLegs[0].dump(indent="  -  -  ")
for seg in result.data[6].segments:
    for flt in seg.flights:
        fltleg = flt.flightLegs[0]
        print "%(airline)s %(airlineCode)s %(flightNo)s" % fltleg,
        print "%s -> %s" % (fltleg["from"].code, fltleg["to"].code)

Prints:

[['index', 0], ['serviceClass', '??????'], ['prices', [3504, ...
- eTicketing: true
- index: 0
- prices: [3504, 114.15000000000001, 89.769999999999996]
- segments: [[['indexSegment', 0], ['stopsCount', 0], ['flights', ... 
- serviceClass: ??????
  [['indexSegment', 0], ['stopsCount', 0], ['flights', [[['index', 0], ...
  - flights: [[['index', 0], ['time', 'PT2H45M'], ['minAvailSeats', 9], ...
  - indexSegment: 0
  - stopsCount: 0
  -  [['index', 0], ['time', 'PT2H45M'], ['minAvailSeats', 9], ['flight...
  -  - flightLegs: [[['flightNo', '309'], ['eTicketing', 'true'], ['air... 
  -  - index: 0
  -  - minAvailSeats: 9
  -  - stops: []
  -  - time: PT2H45M
  -  -  [['flightNo', '309'], ['eTicketing', 'true'], ['airplane', 'Boe... 
  -  -  - airline: ?????????
  -  -  - airlineCode: UN
  -  -  - airplane: Boeing 737-500
  -  -  - availSeats: 9
  -  -  - classCode: I
  -  -  - eTicketing: true
  -  -  - fareBasis: IPROW
  -  -  - flightClass: ECONOMY
  -  -  - flightNo: 309
  -  -  - from:   -  -  [['code', 'DME'], ['airport', '??????????'], ... 
  -  -    - airport: ??????????
  -  -    - city: ??????
  -  -    - code: DME
  -  -    - country: ??????
  -  -    - terminal: 
  -  -  - fromDate: 2010-10-15
  -  -  - fromTime: 10:40:00
  -  -  - time: 
  -  -  - to:   -  -  [['code', 'TXL'], ['airport', 'Berlin-Tegel'], ... 
  -  -    - airport: Berlin-Tegel
  -  -    - city: ??????
  -  -    - code: TXL
  -  -    - country: ????????
  -  -    - terminal: 
  -  -  - toDate: 2010-10-15
  -  -  - toTime: 11:25:00
airBaltic BT 425 SVO -> RIX
airBaltic BT 425 SVO -> RIX
airBaltic BT 423 SVO -> RIX
airBaltic BT 423 SVO -> RIX

EDIT: fixed grouping and expanded output dump to show how to access individual key fields of results, either by index (within list) or as attribute (within dict).

drunkbn
PaulMcG
3

If you really have to load the bulls... this data is (see my comment), you're probably best off with a regex that adds the missing quotes. Something like r"([a-zA-Z_][a-zA-Z_0-9]*)\s*\:" to find things to quote and r"\'\1\'\:" as the replacement (off the top of my head, I have to test it first).

Edit: After some trouble with backreferences in Python 3.1, I finally got it working with these:

>>> import re
>>> pattern = r"([a-zA-Z_][a-zA-Z_0-9]*)\s*\:"
>>> test = '{"foo": {bar: 1}}'
>>> repl = lambda match: '"{}":'.format(match.group(1))
>>> eval(re.sub(pattern, repl, test))
{'foo': {'bar': 1}}
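
Once the keys are quoted, the small test string above is valid JSON, so `json.loads` can stand in for `eval`. A minimal sketch, assuming Python 3; `quote_bare_keys` is a hypothetical helper name, not part of any library. It only works when string values are double-quoted and contain no `identifier:` sequences, which the linked data (with its single-quoted values) does not guarantee.

```python
import json
import re

# Same key pattern as above: a bare identifier followed by a colon.
KEY_PATTERN = re.compile(r"([a-zA-Z_][a-zA-Z_0-9]*)\s*:")


def quote_bare_keys(text):
    """Wrap unquoted keys in double quotes (hypothetical helper)."""
    return KEY_PATTERN.sub(lambda m: '"{}":'.format(m.group(1)), text)


test = '{"foo": {bar: 1}}'
parsed = json.loads(quote_bare_keys(test))
print(parsed)
```

Keys that are already quoted are left alone because the closing quote sits between the identifier and the colon.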
  • Something wrong.. Trying your code with your example (TypeError: expected string or buffer): /usr/lib/python2.6/re.pyc in sub(pattern, repl, string, count) 149 a callable, it's passed the match object and must return 150 a replacement string to be used.""" --> 151 return _compile(pattern, 0).sub(repl, string, count) 152 153 def subn(pattern, repl, string, count=0): TypeError: expected string or buffer – DominiCane Aug 30 '10 at 16:26
  • Mixed `repl` and `string` argument order up, fixed it. –  Aug 30 '10 at 16:48
1

So far, with delnan's help and a little investigation, I can load it into a dict with eval:

import re

pattern = r"\b(?P<word>\w+):"  # a bare key followed by a colon
x = re.sub(pattern, r'"\g<word>":', open("the_file", "r").read())
y = x.replace("true", '"true"')  # a bare true is not a Python name
d = eval(y)

I'm still looking for a more efficient and maybe simpler solution. I'd rather not use "eval", for several reasons.
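
One way to avoid `eval` is `ast.literal_eval`, which accepts only Python literals and never executes code. It also accepts both single- and double-quoted strings, which is why a Python parser can read this data while `json.loads` cannot. A minimal sketch, assuming Python 3; `load_js_like` and the sample text are hypothetical, and the regexes will also rewrite any `word:` or `true`/`false`/`null` that happens to appear inside a string value (a URL, for instance).

```python
import ast
import re

# A bare identifier key followed by a colon. Requiring a leading letter
# or underscore leaves times such as '10:40:00' untouched.
KEY = re.compile(r"\b(?P<word>[A-Za-z_]\w*)\s*:")
# JavaScript literals and their Python equivalents.
LITERALS = {"true": "True", "false": "False", "null": "None"}
LITERAL = re.compile(r"\b(true|false|null)\b")


def load_js_like(text):
    """Parse a JavaScript-style object literal without eval (sketch)."""
    text = KEY.sub(r'"\g<word>":', text)
    text = LITERAL.sub(lambda m: LITERALS[m.group(1)], text)
    return ast.literal_eval(text)


sample = """{from: {code: 'DME', airport: "Домодедово", terminal: ''},
 eTicketing: true, prices: [3504, 114.15], fromTime: '10:40:00'}"""
data = load_js_like(sample)
print(data["from"]["code"], data["eTicketing"], data["fromTime"])
```

Unlike the `replace("true", '"true"')` step, this maps `true` to a real boolean instead of a string.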

Community
DominiCane
  • Well, it will hardly get more efficient than the built-in eval, but I understand. With the quoting fixed, I suppose it is valid JSON? –  Aug 30 '10 at 16:48
  • unfortunately not :O I still can not load it with json.loads() or pickle.loads().. Strange and confusing - only eval works, I don't understand why. (Shouldn't pickle work??) – DominiCane Aug 30 '10 at 17:02
0

An extension of DominiCane's version:

import json
import re

quote_keys_regex = re.compile(r'([\{\s,])(\w+)(:)')


def js_variable_to_python(js_variable):
    """Convert a javascript variable into JSON and then load the value"""
    # when in_string is not None, it contains the character that has opened the string
    # either simple quote or double quote
    in_string = None
    # cut the string:
    # r"""{ a:"f\"irst", c:'sec"ond'}"""
    # becomes
    # ['{ a:', '"', 'f\\', '"', 'irst', '"', ', c:', "'", 'sec', '"', 'ond', "'", '}']
    l = re.split(r'(["\'])', js_variable)
    # previous part (to check the escape character antislash)
    previous_p = ""
    for i, p in enumerate(l):
        # parse characters inside a ECMA string 
        if in_string:
            # we are in a JS string: replace the colon by a temporary character
            # so quote_keys_regex doesn't have to deal with colon inside the JS strings
            l[i] = l[i].replace(':', chr(1))
            if in_string == "'":
                # the JS string is delimited by simple quote.
                # This is not supported by JSON.
                # simple quote delimited string are converted to double quote delimited string
                # here, inside a JS string, we escape the double quote
                l[i] = l[i].replace('"', r'\"')

        # deal with delimiters and the escape character
        if not in_string and p in ('"', "'"):
            # we are not in string
            # but p is double or simple quote
            # that's the start of a new string
            # replace simple quote by double quote
            # (JSON doesn't support simple quote)
            l[i] = '"'
            in_string = p
            continue
        if p == in_string:
            # we are in a string and the current part MAY close the string
            if len(previous_p) > 0 and previous_p[-1] == '\\':
                # there is an antislash just before: the JS string continue
                continue
            # the current p close the string
            # replace simple quote by double quote
            l[i] = '"'
            in_string = None
        # update previous_p
        previous_p = p
    # join the string
    s = ''.join(l)
    # add quotes around the key
    # { a: 12 }
    # becomes
    # { "a": 12 }
    s = quote_keys_regex.sub(r'\1"\2"\3', s)
    # replace the surrogate character by a colon
    s = s.replace(chr(1), ':')
    # load the JSON and return the result
    return json.loads(s)

It deals only with int, null and string. I don't know about float.

Note the use of chr(1): the code doesn't work if this character appears in js_variable.
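
On the float question: the key-quoting regex only touches a word that is directly followed by a colon, and `json.loads` parses floats natively, so float values should pass through unchanged. A minimal check, assuming Python 3 and a made-up sample string:

```python
import json
import re

# Same key-quoting regex as in the function above.
quote_keys_regex = re.compile(r'([\{\s,])(\w+)(:)')

sample = '{ price: 114.15, count: 3, note: null }'
quoted = quote_keys_regex.sub(r'\1"\2"\3', sample)
values = json.loads(quoted)
print(quoted)
print(values)
```

Here `114.15` comes back as a Python float and `null` as `None`.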