3

I am using Scrapy (Python) to scrape all the addresses from http://www.heteropharmacy.com/outlets.html. The City/Town drop-down list contains many cities. Whenever I select a city, new addresses are displayed.

However, no request is made to the server when I select a city. I checked with both Firebug Lite and the developer tools in Chrome, and neither showed any GET or POST request.

When I looked at the source code, I found this:

<script src="jScript/myScript.js" type="text/javascript"></script>

Clicking "jScript/myScript.js" takes me to http://www.heteropharmacy.com/jScript/myScript.js. This is a JavaScript file that contains the addresses for all the cities in the drop-down box. The addresses are stored inside an array.

My question is: how do I get the HTML that this JavaScript code produces, so that I can extract it using Scrapy? Or can I extract the data directly from the JavaScript file? I would appreciate any possible solution and am willing to use any API, not only Scrapy.

I searched the internet extensively, but I could only find solutions for cases where requests are made to the server.

alecxe

4 Answers

0

I would download the JavaScript file and execute it with a library that can run JS code, then read the result from there. As far as I can see, the code builds a JS array that you can extract once it has run.

This library for running JS code from Python may help: https://pypi.python.org/pypi/PyExecJS
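If installing a JS runtime is not an option, a lighter alternative may work when the array holds only plain double-quoted strings: treat the argument list of `new Array(...)` as a JSON array. This is a swapped-in technique, not JS execution, and the sample string and the `array_literal_to_list` helper below are made up for illustration. It will likely fail on JS-only escapes or single-quoted strings.

```python
import json
import re

# Hypothetical one-line sample, not taken from the real myScript.js.
sample_js = 'var cities = new Array("Hyderabad", "Chennai", "Visakhapatnam");'


def array_literal_to_list(js_source):
    """Parse the first `new Array(...)` call as if its arguments were JSON."""
    match = re.search(r'new Array\((.*?)\);', js_source, re.DOTALL)
    if match is None:
        return []
    # Wrapping the comma-separated string literals in [] yields a JSON array.
    return json.loads("[" + match.group(1) + "]")


cities = array_literal_to_list(sample_js)
print(cities)
```

For anything more dynamic than string literals, actually running the script as suggested above is the safer route.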

Fredrik Norling
0

The best way would be to use BeautifulSoup. First, convert the raw myScript.js file into an HTML document; you can then use that document to create the soup.

After the soup has been created, use a regex to extract the data that you want. Supposing your HTML is in html_doc:

from bs4 import BeautifulSoup

# html_doc is assumed to already hold the page's HTML as a string
soup = BeautifulSoup(html_doc, 'html.parser')
script = soup.find_all("script")

'script' will be a list of the script tags found in the document; the text of each tag can then be parsed using a regex. Hope this helps.
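As a rough illustration of what collecting script tags gives you, here is a sketch that uses the standard library's `html.parser` in place of BeautifulSoup (a deliberate swap so it runs without extra packages). The `ScriptCollector` class and the second script tag in the sample are hypothetical; only the first tag comes from the question. Note that a script tag with a `src` attribute carries no inline text, so for this site the data would still have to be downloaded from the resolved URL.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ScriptCollector(HTMLParser):
    """Collect external script URLs and inline script text from a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.external = []  # absolute URLs taken from <script src="...">
        self.inline = []    # text found between <script> and </script>
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            src = dict(attrs).get("src")
            if src:
                self.external.append(urljoin(self.base_url, src))

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script and data.strip():
            self.inline.append(data.strip())


# Hypothetical page fragment; the second tag is invented for the demo.
sample_html = """
<script src="jScript/myScript.js" type="text/javascript"></script>
<script type="text/javascript">var demo = new Array("a", "b");</script>
"""

collector = ScriptCollector("http://www.heteropharmacy.com/outlets.html")
collector.feed(sample_html)
print(collector.external)
print(collector.inline)
```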

Aditya
0

You can also fetch the file with urllib2 and then extract the data with regular expressions. This may be a little messy, but it works.

import urllib2
import re

url = 'http://www.heteropharmacy.com/jScript/myScript.js'
data = urllib2.urlopen(url).read()
# capture everything between "new Array(" and the closing ");"
add_data = re.findall(r'new Array\((.*?)\);', data, re.DOTALL)

The above code collects the contents of every array in the JavaScript file into the add_data list. You can then use re again to pull out the individual addresses. For example, the line below gives you all the Hyderabad addresses. This can be adapted to your requirements.

hyd_adds = re.findall('"(.*?)"', add_data[2])
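For readers on Python 3, where urllib2 was merged into urllib.request, the same two-step idea might look like the sketch below. The `fetch` and `extract_arrays` helpers, the latin-1 decoding, and the sample text are assumptions made for illustration; the sample only mimics the `new Array("...", "...");` shape described in this thread and is not the real file.

```python
import re
from urllib.request import urlopen


def fetch(url):
    """Download a text resource; latin-1 is an assumed encoding that never fails."""
    with urlopen(url) as response:
        return response.read().decode("latin-1")


def extract_arrays(js_source):
    """Return one list of quoted strings per `new Array(...)` call."""
    bodies = re.findall(r'new Array\((.*?)\);', js_source, re.DOTALL)
    return [re.findall(r'"(.*?)"', body, re.DOTALL) for body in bodies]


# Hypothetical excerpt standing in for fetch(url), so the sketch runs offline.
sample_js = '''
states_arr['CityA']= new Array("Outlet 1 # 111 # Street 1",
"Outlet 2 # 222 # Street 2");
states_arr['CityB']= new Array("Outlet 3 # 333 # Street 3");
'''

arrays = extract_arrays(sample_js)
print(arrays)
```

Both patterns are non-greedy, so an address that itself contains `);` or a double quote would be cut short; that seems unlikely for this data but is worth checking.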
Headrun
0

There are multiple options here:

  • use regular expressions to extract the data directly from the JavaScript
  • use a JavaScript parser to extract the data directly from the JavaScript (e.g. slimit - example here)
  • use the ScrapyJS package, with Splash rendering the JavaScript
  • let a real browser execute the JavaScript with the help of Selenium - the browser could be headless (like PhantomJS)

If you choose to use regular expressions, here is how you can build a dictionary of state -> list of pharmacies:

from pprint import pprint
import re

import requests


url = 'http://www.heteropharmacy.com/jScript/myScript.js'
with requests.Session() as session:
    response = session.get(url)

    pattern = re.compile(r"states_arr\['(\w+)'\]= new Array\((.*?)\);", re.MULTILINE | re.DOTALL)

    results = {state: [item.strip()[1:] for item in pharmacies.split('",')]
               for state, pharmacies in pattern.findall(response.content)}

    pprint(results)

Prints:

{'Chennai': ['Adambakkam # 044 22530209 # Opp. Murugan Temple, ; Brindavan Nager, ; Mohanpuri - 5th Street, ; Adambakkam, Chennai \x96 600 088',
             'Adambakkam - 2 # 044 - 22553195, 64540549 # No. 2 B, Ground Floor, Ganesh Nagar Main Road, ; Near NGO Colony Bus Stop, Telephone Colony, ; Adambakkam, Chennai - 600088.',
             'Allapakkam # 044- 64520024 # New No.131, Old No.10 M, ; Shop No. F, Alapakkam Main Road, ; Near Jeva Complex, Alapakkam, Chennai-16.',
             'Anna nagar # 044-26220891 # New No.1, AI Block, Second Street, ; Near Anna Adarsh College for Women, ; Shanthi Colony, Anna Nagar, ; Chennai- 600040.',
 ...
 'Visakhapatnam': ['Adarsh Nagar # 9247001943 # H. No. 3-352, Beside Andhra Bank, ; Near Manapuram Finance Ltd. Adarsha Nagar, ; Old Dairy Form, Visakhapatnam',
                   'B.C. Road, Gajuwaka # 0891 2546005  # D. No.13-6-14/1, ; Opp. Dr. T. Dhanalatha Hospital, ; B. C. Road, Gajuwaka, Visakhapatnam.',
                   'Chinawaltair # 0891-2546001, 6464501 # D.No: 6-5-3, Opp. Jaganadh Temple, ; China Waltair, Visakapatnam-17.',
                   'Marripalem #  9247000573 # D. No. 38-40-70,  Opp. Ramalingeswara Alayam, ; Marripalem Main Road, ; Marripalem, Visakhapatnam.',
                   'Muralinagar #  0891-6464507# D.No.39-8-9/5, ; Varma Complex, 48th Bus stop, ; Murali Nagar, Visakhapatnam',
                   'NRI Hospital # 0891-2714453, 6464506 # 50-27-16, Rammahon Chamber, ; Near NRI Hospital, ; Seethammadhara, Visakapatnam.',
                   'Pedawaltair # 0891-2546006 # H.No.8-1-97/2/2, ; Near Vishaka Eye Hospital, ; Pedawaltair junction, Vizag.',
                   'Ramnagar # 0891-2546002, 6464502 # D.No. 10-50-11/2, 1st Floor, ; Beside Care Hospital, Main Road, ; Ramnagar, Visakapatnam.',
                   'Seetammadhara # 0891-2713706, 6464504 # H.No: 55-14-109/1, ; Beside Sri Sivaramareddy Sweets, ; Opp to E- Seva kendram, ; Seetammadhara, Visakhapatnam."']}
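Each entry in the output above appears to follow a "name # phone # address" layout, with ";" separating address lines. Assuming that layout holds throughout, a follow-up sketch could split every entry into fields. The `parse_outlets` helper and the record keys are hypothetical, and the sample wraps two entries from the output above in an assumed `states_arr[...]` statement.

```python
import re

ARRAY_RE = re.compile(r"states_arr\['(\w+)'\]\s*=\s*new Array\((.*?)\);", re.DOTALL)


def parse_outlets(js_source):
    """Map each city to a list of {'name', 'phone', 'address'} records."""
    result = {}
    for city, body in ARRAY_RE.findall(js_source):
        outlets = []
        for entry in re.findall(r'"(.*?)"', body, re.DOTALL):
            parts = [part.strip() for part in entry.split("#", 2)]
            if len(parts) != 3:
                continue  # skip entries that do not fit the assumed layout
            name, phone, address = parts
            # Address lines are separated by ';' and often end with a comma.
            cleaned = (line.strip(" ,") for line in address.split(";"))
            outlets.append({"name": name, "phone": phone,
                            "address": [line for line in cleaned if line]})
        result[city] = outlets
    return result


# Two entries copied from the output above; the wrapper statement is assumed.
sample_js = '''states_arr['Visakhapatnam']= new Array("Pedawaltair # 0891-2546006 # H.No.8-1-97/2/2, ; Near Vishaka Eye Hospital, ; Pedawaltair junction, Vizag.",
"Muralinagar #  0891-6464507# D.No.39-8-9/5, ; Varma Complex, 48th Bus stop, ; Murali Nagar, Visakhapatnam");'''

outlets = parse_outlets(sample_js)
print(outlets)
```

Matching the quoted strings directly, rather than splitting on '",', also avoids the stray trailing quote visible on the last entry of each list.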
alecxe