I want to parse the following JavaScript, which I scraped from an HTML page:

  var ibmdebug = false; //indicates whether or not to display flash debug window
  if (qsParse.get("debug") == "true") {
    ibmdebug = true;
  }
  var matchStatsConfig = {
    courtId : "B",
    matchId : "5126",

matchStatus : "C"
,
    eventId : "MX",
    roundId : "1",
    dayMessage : "Day 5 Friday 7 July",
    relatedContentTags :  ['atpi200','wta316629','atpba79','wta316713'],
    team1 : {
      a : "atpi200",
      a_name : "D. Inglot",
      a_seed : "",
      b : "wta316629",
      b_name : "L. Robson",
      b_seed : ""
    },
    team2 : {
      a : "atpba79",
      a_name : "A. Begemann",
      a_seed : "",
      b : "wta316713",
      b_name : "N. Melichar",
      b_seed : ""
    }
  }

Based on this thread, I use the slimit package as follows, where js.text contains the JavaScript code as a string:

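from slimit import ast
from slimit.parser import Parser
from slimit.visitors import nodevisitor

data = js.text
parser = Parser()
tree = parser.parse(data)
# collect every assignment node into one flat {name: value} mapping
fields = {
    getattr(node.left, 'value', ''): getattr(node.right, 'value', '')
    for node in nodevisitor.visit(tree)
    if isinstance(node, ast.Assign)
}
print(fields)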

The content of fields looks as follows:

{
    'ibmdebug': 'true',
    'courtId': '"B"',
    'matchId': '"5126"',
    'matchStatus': '"C"',
    'eventId': '"MX"',
    'roundId': '"1"',
    'dayMessage': '"Day 5 Friday 7 July"',
    'relatedContentTags': '',
    'team1': '',
    'a': '"atpba79"',
    'a_name': '"A. Begemann"',
    'a_seed': '""',
    'b': '"wta316713"',
    'b_name': '"N. Melichar"',
    'b_seed': '""',
    'team2': ''
}

As you can see, it is not parsed correctly (only parts are correct). The array relatedContentTags remains empty, as do the team1 and team2 objects/dictionaries. Interestingly, the content of the team2 variable is there. I assume this is the case because the content of team1 is also parsed, but overwritten by the content of team2.
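The overwriting suspected above can be reproduced without slimit. The following is only a minimal sketch: the `pairs` list is a hypothetical, shortened stand-in for what a flat walk over the assignments might yield, not actual slimit output. It shows that a single flat dictionary keeps only the last value seen for a repeated key such as `a`, and that nested objects end up as the `''` default.

```python
# Hypothetical (name, value) pairs in the order a flat tree walk might yield them.
# Nested objects have no plain 'value', so getattr(..., 'value', '') falls back to ''.
pairs = [
    ("team1", ""),
    ("a", '"atpi200"'),
    ("a_name", '"D. Inglot"'),
    ("team2", ""),
    ("a", '"atpba79"'),        # same key as in team1: overwrites the earlier entry
    ("a_name", '"A. Begemann"'),
]

# Same shape as the dict comprehension in the question: one flat namespace.
flat = {name: value for name, value in pairs}
print(flat)
```

Because the keys of team1 and team2 share one namespace, only the team2 values survive, which matches the output shown in the question.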

My question is: How can I properly parse the initial JavaScript into a Python data structure (e.g. a dictionary)?

beta

1 Answer

Because the keys in my JavaScript object are not quoted, it is not valid JSON, but it can be parsed by demjson, as explained here or here.

My working code looks like this:

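import re
import demjson

# get only the "matchStatsConfig" variable (assumes nothing follows it in data)
script = re.search(r'var matchStatsConfig = .*$', data, re.DOTALL).group()

# strip the declaration and decode the object literal into a Python dict
config = demjson.decode(script.replace('var matchStatsConfig =', ''))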
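The regular expression above matches through to the end of the string, so it relies on nothing following the object in the scraped script. If more statements come after it, one option is to cut out just the balanced `{...}` first. The helper below is a sketch using only the standard library; `extract_object_literal` is a hypothetical name, and the `note` and `somethingElse` entries in the sample are invented for illustration. It does not handle JavaScript comments or regex literals that contain braces.

```python
def extract_object_literal(source, var_name):
    """Return the balanced {...} text assigned to `var_name`, or None."""
    start = source.find("var " + var_name)
    if start == -1:
        return None
    open_pos = source.find("{", start)
    if open_pos == -1:
        return None
    depth = 0
    quote = None  # current string delimiter while inside a string literal
    i = open_pos
    while i < len(source):
        ch = source[i]
        if quote:
            if ch == "\\":
                i += 1  # skip the escaped character
            elif ch == quote:
                quote = None
        elif ch in "\"'":
            quote = ch
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return source[open_pos:i + 1]
        i += 1
    return None  # braces never balanced


# Shortened sample; 'note' and 'somethingElse' are made up for this sketch.
sample = '''
var ibmdebug = false;
var matchStatsConfig = {
    courtId : "B",
    team1 : { a : "atpi200", a_name : "D. Inglot" },
    note : "a } inside a string"
};
var somethingElse = 1;
'''
literal = extract_object_literal(sample, "matchStatsConfig")
print(literal)
```

The returned text can then be handed to `demjson.decode` in place of the `script.replace(...)` result.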