I'm scraping data from a website using scrapy in Python.

The required data lies in a script tag as follows:

<script type="text/javascript">
getDetailsfrmBean("storePg","564","Berwyn, IL","7180 W CERMAK RD.","SPACE A1","","BERWYN","IL","US","60402","(708) 788-5097","{Monday-Saturday=10-9,sunday=11-6}","41.8507029","-87.8033709");
</script>

I can get this content using xpath as follows:

item['lat'] = tree.xpath('//script[@type="text/javascript"]/text()').extract()[0].encode('utf-8')
item['long'] = tree.xpath('//script[@type="text/javascript"]/text()').extract()[0].encode('utf-8')

Both fields then hold the full call string:

item['lat'] = 'getDetailsfrmBean("storePg","564","Berwyn, IL","7180 W CERMAK RD.","SPACE A1","","BERWYN","IL","US","60402","(708) 788-5097","{Monday-Saturday=10-9,sunday=11-6}","41.8507029","-87.8033709");'

item['long'] = 'getDetailsfrmBean("storePg","564","Berwyn, IL","7180 W CERMAK RD.","SPACE A1","","BERWYN","IL","US","60402","(708) 788-5097","{Monday-Saturday=10-9,sunday=11-6}","41.8507029","-87.8033709");'

But how can I parse these contents so that

item['lat'] is equal to "41.8507029"
item['long'] is equal to "-87.8033709"
item['city'] is equal to "BERWYN"
item['state'] is equal to "IL"

Any suggestions on how to solve this?

Zoe
Avinash Clinton
  • You could split the string by commas to get an array of values, then use the array to pick out the values you need. Keep in mind that each value is wrapped in double quotes, so you may need to remove those too. – Inus Saha Jun 20 '18 at 09:26

4 Answers

Since this call is also valid Python syntax, we can use the ast module. Plus the arguments are all string literals, which makes things simpler.

import ast

line = 'getDetailsfrmBean("storePg","564","Berwyn, IL","7180 W CERMAK RD.","SPACE A1","","BERWYN","IL","US","60402","(708) 788-5097","{Monday-Saturday=10-9,sunday=11-6}","41.8507029","-87.8033709");'

print([arg.value for arg in ast.parse(line).body[0].value.args])

Output:

['storePg', '564', 'Berwyn, IL', '7180 W CERMAK RD.', 'SPACE A1', '', 'BERWYN', 'IL', 'US', '60402', '(708) 788-5097', '{Monday-Saturday=10-9,sunday=11-6}', '41.8507029', '-87.8033709']

Explanation:

print([arg.value       # value of string literal (use arg.s on Python older than 3.8)
       for arg in
       ast.parse(line)
      .body            # module (list of statements)
       [0]             # first statement (an Expr node)
      .value           # expression (a Call)
      .args            # arguments to function call
       ])
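To get from that list to the four fields the question asks for, the arguments can be picked out by position. This is only a minimal sketch: the helper name parse_call_args and the FIELD_INDEX mapping are made up for illustration, and the positions (6, 7, 12, 13) are read off the single sample call in the question, so they are an assumption about how the site orders its arguments.

```python
import ast

def parse_call_args(js_call):
    """Return the literal arguments of one JS-style call that is also valid Python."""
    call = ast.parse(js_call.strip()).body[0].value
    # literal_eval accepts AST nodes, so this works for string-literal arguments
    return [ast.literal_eval(arg) for arg in call.args]

# Assumed positions, taken from the sample call in the question
FIELD_INDEX = {"city": 6, "state": 7, "lat": 12, "long": 13}

line = ('getDetailsfrmBean("storePg","564","Berwyn, IL","7180 W CERMAK RD.",'
        '"SPACE A1","","BERWYN","IL","US","60402","(708) 788-5097",'
        '"{Monday-Saturday=10-9,sunday=11-6}","41.8507029","-87.8033709");')

args = parse_call_args(line)
item = {name: args[index] for name, index in FIELD_INDEX.items()}
print(item)
```

If the site ever adds or reorders arguments, the indexes would need to be adjusted.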
Alex Hall

Try this with re

import re
temp_string = 'getDetailsfrmBean("storePg","564","Berwyn, IL","7180 W CERMAK RD.","SPACE A1","","BERWYN","IL","US","60402","(708) 788-5097","{Monday-Saturday=10-9,sunday=11-6}","41.8507029","-87.8033709");'
# list(...) is needed on Python 3, where filter() returns an iterator
split_list = list(filter(None, re.split(r'[, \-!?:"]+', temp_string)))
print(split_list)

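This produces the following output. Note that the pattern also splits on spaces and hyphens, so multi-word values such as "Berwyn, IL" are broken apart and the longitude loses its minus sign: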

['getDetailsfrmBean(', 'storePg', '564', 'Berwyn', 'IL', '7180', 'W', 'CERMAK', 'RD.', 'SPACE', 'A1', 'BERWYN', 'IL', 'US', '60402', '(708)', '788', '5097', '{Monday', 'Saturday=10', '9', 'sunday=11', '6}', '41.8507029', '87.8033709', ');']

Picked this up from the answer here: https://stackoverflow.com/a/23720594/5907969
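If the values need to stay intact, one possible variation is to match the quoted strings themselves rather than split on delimiters. This is a rough sketch, not part of the linked answer, and it assumes that no value contains an escaped double quote; the name quoted_values is chosen here for illustration.

```python
import re

temp_string = ('getDetailsfrmBean("storePg","564","Berwyn, IL","7180 W CERMAK RD.",'
               '"SPACE A1","","BERWYN","IL","US","60402","(708) 788-5097",'
               '"{Monday-Saturday=10-9,sunday=11-6}","41.8507029","-87.8033709");')

# Capture everything between each pair of double quotes, including empty strings
quoted_values = re.findall(r'"([^"]*)"', temp_string)
print(quoted_values)
```

Here the commas, spaces and the leading minus sign inside the quotes are kept, so the latitude and longitude are the last two elements of the list.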

Ash Sharma

You can use a simple regex to extract just the comma-separated quoted strings part:

import re

line = 'getDetailsfrmBean("storePg","564","Berwyn, IL","7180 W CERMAK RD.","SPACE A1","","BERWYN","IL","US","60402","(708) 788-5097","{Monday-Saturday=10-9,sunday=11-6}","41.8507029","-87.8033709");'

args_string = re.match(r'getDetailsfrmBean\((.+)\);$', line.strip()).group(1)
print(args_string)

Output:

"storePg","564","Berwyn, IL","7180 W CERMAK RD.","SPACE A1","","BERWYN","IL","US","60402","(708) 788-5097","{Monday-Saturday=10-9,sunday=11-6}","41.8507029","-87.8033709"

Then there are various ways to parse a list of strings from this kind of data:

import ast
import json
import csv

args_array = '[%s]' % args_string

assert (json.loads(args_array)
        == ast.literal_eval(args_array)
        == next(csv.reader([args_string]))
        == ['storePg', '564', 'Berwyn, IL', '7180 W CERMAK RD.', 'SPACE A1', '', 'BERWYN', 'IL', 'US', '60402',
            '(708) 788-5097', '{Monday-Saturday=10-9,sunday=11-6}', '41.8507029', '-87.8033709'])
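A page usually has several script tags, and re.match(...).group(1) raises AttributeError when the text does not match. A hedged sketch of one way to guard against that is to search every script text and parse the first one that contains the call. The function name find_bean_args and the shortened sample scripts are hypothetical; in the spider the list would come from the XPath query's .extract() result.

```python
import json
import re

# Non-greedy match up to the first ");" after the function name
CALL_RE = re.compile(r'getDetailsfrmBean\((.+?)\);', re.DOTALL)

def find_bean_args(script_texts):
    """Return the parsed arguments from the first script containing the call, else None."""
    for text in script_texts:
        match = CALL_RE.search(text)
        if match:
            # Wrap the comma-separated quoted strings in brackets to form a JSON array
            return json.loads('[%s]' % match.group(1))
    return None

# Shortened, made-up script contents for illustration
scripts = [
    'var unrelated = 1;',
    '\ngetDetailsfrmBean("storePg","564","Berwyn, IL");\n',
]
print(find_bean_args(scripts))
```

This assumes the arguments are plain JSON-compatible literals and that no value contains the sequence ");" itself.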
Alex Hall

I was about to write an answer covering both the ast method and re + json, but @Alex Hall was faster with the ast method, which IMHO is preferable. Another approach uses a simple regular expression and the json module; it also gives you a list and can scan multiple function calls in the same string:

import re
import json

# raw string so that \( and \) are not treated as invalid string escapes
fn_cutter = re.compile(r"getDetailsfrmBean\((.+?)\);")

for key in item:
  for i, match in enumerate(fn_cutter.findall(item[key])):
    print(key, i, ':', json.loads("[" + match + "]"))

This would save you some time when converting JSON objects to Python structures and when catching multiple method calls within the same value, but it will certainly not be able to handle anotherMethod(args) or ...value contained in JS method calls.
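As a partial workaround for the first limitation, the function name can be captured as well, so that any top-level call with JSON-compatible arguments is picked up. This is only a sketch under those assumptions: parse_calls, ANY_CALL_RE and the sample source string are invented for illustration, and calls nested inside arguments or arguments that are JS expressions are still not handled (they are skipped).

```python
import json
import re

# Capture the function name and its raw argument text, up to the first ");"
ANY_CALL_RE = re.compile(r'(\w+)\((.*?)\);')

def parse_calls(js_source):
    """Return (name, args) pairs for calls whose arguments parse as a JSON array."""
    calls = []
    for name, raw_args in ANY_CALL_RE.findall(js_source):
        try:
            args = json.loads('[' + raw_args + ']')
        except ValueError:
            continue  # arguments are not plain JSON literals, skip this call
        calls.append((name, args))
    return calls

# Made-up source containing three calls, the last with a non-JSON argument
source = 'getDetailsfrmBean("a","1");anotherMethod("b", 2);init(window.foo);'
print(parse_calls(source))
```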

wiesion