I want to scan some websites and get the names and contents of all their JavaScript files. I tried Python requests with BeautifulSoup but wasn't able to get the script details and contents. Am I missing something?

I have tried a lot of methods, but I feel like I'm stumbling in the dark. This is the code I am trying:

import requests
from bs4 import BeautifulSoup
r = requests.get("http://www.marunadanmalayali.com/")
soup = BeautifulSoup(r.content, "html.parser")  # name the parser explicitly
Aravind Krishna
  • I tried requests with BeautifulSoup. I can't give a specific class name for scanning because it varies from site to site. Identifying files as JavaScript is itself my requirement. – Aravind Krishna Feb 29 '16 at 05:33
  • What's your code? Could you [edit] your question and add a [mcve] please? Do you mean getting the `src` from all the ` – Remi Guan Feb 29 '16 at 05:43
  • @KevinGuan I tried a lot of methods; I don't remember them all and they aren't worth writing here. I edited the question and wrote it up to the point where I believe my approach is clear. – Aravind Krishna Feb 29 '16 at 05:51
  • @AravindKrishna: Hmm... as I asked: what's the expected output then? Are you trying to get all the JavaScript code from that page? – Remi Guan Feb 29 '16 at 05:53
  • @KevinGuan All the JavaScript file names and contents, for example whether a jQuery script is used or not. Yes, all the JavaScript code as well as the names of the JavaScript files. – Aravind Krishna Feb 29 '16 at 05:55

2 Answers

You can get the URLs of all the linked JavaScript files with the code below:

l = [i.get('src') for i in soup.find_all('script') if i.get('src')] 
  • soup.find_all('script') returns a list of all the <script> tags in the page.

  • A list comprehension loops over every element in the list returned by soup.find_all('script').

  • Each i is a dict-like Tag object, so .get('src') returns the src attribute, or None if the tag has none. Tags without src are skipped; the rest are collected in a list (called l in the example).
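To see both kinds of script side by side, here is a minimal sketch on a made-up HTML snippet (the markup and the names external and inline are assumptions for illustration, not taken from the site). External scripts expose their URL through src, while inline scripts carry their code as the text of the tag:

```python
from bs4 import BeautifulSoup

# Hypothetical page: one external script and one inline script.
html = """
<html><head>
<script src="/js/jquery.cookie.js"></script>
<script>var greeting = "hello";</script>
</head><body></body></html>
"""

soup = BeautifulSoup(html, "html.parser")

external = []  # URLs of scripts loaded through a src attribute
inline = []    # source code of scripts embedded in the page itself
for tag in soup.find_all("script"):
    src = tag.get("src")  # None when the attribute is absent
    if src:
        external.append(src)
    elif tag.string:
        inline.append(tag.string.strip())

print(external)
print(inline)
```

So the src list only covers scripts that live in separate files; code written directly inside <script> tags has to be read from the tag itself.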

In this case, the output looks like this:

['http://adserver.adtech.de/addyn/3.0/1602/5506153/0/6490/ADTECH;loc=700;target=_blank;grp=[group]',
 'http://tags.expo9.exponential.com/tags/MarunadanMalayalicom/ROS/tags.js',
 'http://tags.expo9.exponential.com/tags/MarunadanMalayalicom/ROS/tags.js',
 'http://js.genieessp.com/t/057/794/a1057794.js',
 'http://ib.adnxs.com/ttj?id=5620689&cb=[CACHEBUSTER]&pubclick=[INSERT_CLICK_TAG]',
 'http://ib.adnxs.com/ttj?id=5531763',
 'http://advs.adgorithms.com/ttj?id=3279193&cb=[CACHEBUSTER]&pubclick=[INSERT_CLICK_TAG]',
 'http://xp2.zedo.com/jsc/xp2/fo.js',
 'http://www.marunadanmalayali.com/js/mnmads.js',
 'http://www.marunadanmalayali.com/js/jquery-2.1.0.min.js',
 'http://www.marunadanmalayali.com/js/jquery.hoverIntent.minified.js',
 'http://www.marunadanmalayali.com/js/jquery.dcmegamenu.1.3.3.js',
 'http://www.marunadanmalayali.com/js/jquery.cookie.js',
 'http://www.marunadanmalayali.com/js/swanalekha-ml.js',
 'http://www.marunadanmalayali.com/js/marunadan.js?r=1875',
 'http://www.marunadanmalayali.com/js/taboola_home.js',
 'http://d8.zedo.com/jsc/d8/fo.js']

My code missed some links because they're not actually in the HTML source.

You can see them in the console:

Chrome console

But they're not in the source:

HTML source

Usually, that's because these links are generated by JavaScript. The requests module doesn't run any JavaScript in the page like a real browser does; it only sends a request to get the HTML source.
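A small sketch of why this happens, assuming a made-up page in which an inline script injects a second script at runtime. It uses the standard library's html.parser instead of BeautifulSoup so it runs without extra packages; the class name ScriptSrcCollector and the file names are hypothetical:

```python
from html.parser import HTMLParser


class ScriptSrcCollector(HTMLParser):
    """Collect the src attribute of every <script> tag present in the markup."""

    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.srcs.append(src)


# Hypothetical page: the second script tag only exists after the
# inline JavaScript has been executed by a browser.
html = """
<html><body>
<script src="/js/static.js"></script>
<script>
  var s = document.createElement("script");
  s.src = "/js/injected.js";
  document.body.appendChild(s);
</script>
</body></html>
"""

collector = ScriptSrcCollector()
collector.feed(html)
print(collector.srcs)
```

Any static parser only sees /js/static.js here, because /js/injected.js is never written into the HTML that the server sends.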

If you also need them, you have to use another module that runs the JavaScript in the page, so that these links show up. For that, I'd suggest selenium, which drives a real browser and can therefore run the page's JavaScript.

For example (make sure that you have already installed selenium and a web driver for it):

from pprint import pprint

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()  # use the Chrome driver, for example
driver.get('http://www.marunadanmalayali.com/')

# page_source is the HTML after the browser has run the page's JavaScript
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()  # close the browser once the source has been read

l = [i.get('src') for i in soup.find_all('script') if i.get('src')]

pprint(l)
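The list holds URLs only, while the question also asks for file names and contents. A rough sketch of that last step, using only the standard library; the helper names script_name, fetch_text and collect_scripts are hypothetical, and the URLs are assumed to be absolute already (relative src values would first need to be resolved, for example with urllib.parse.urljoin):

```python
import posixpath
from urllib.parse import urlparse
from urllib.request import urlopen


def script_name(url):
    """Return the last path segment of a script URL (query string ignored)."""
    return posixpath.basename(urlparse(url).path)


def fetch_text(url, timeout=10):
    """Download a URL and decode the body as text."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def collect_scripts(urls, fetch=fetch_text):
    """Map each distinct URL to a (file name, content) pair.

    Content is None when the download fails.
    """
    scripts = {}
    for url in dict.fromkeys(urls):  # drop duplicates, keep order
        try:
            content = fetch(url)
        except (OSError, ValueError):
            content = None
        scripts[url] = (script_name(url), content)
    return scripts
```

With the list from above, scripts = collect_scripts(l) would download each file once, and a check such as any("jquery" in name.lower() for name, _ in scripts.values()) gives a first hint at whether jQuery is in use. Ad-server URLs often have no meaningful file name, so the name can be an empty or odd string for those.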
Remi Guan
  • Thanks a lot. I think my question now makes complete sense: "javascript file name and its contents in python _**perfectly**_". "Perfectly" should be rephrased as also getting the async scripts. – Aravind Krishna Feb 29 '16 at 10:38

You can use select with script[src], which only finds script tags that have a src attribute, so you don't need to call .get multiple times:

import requests
from bs4 import BeautifulSoup
r = requests.get("http://www.marunadanmalayali.com/")
soup = BeautifulSoup(r.content, "html.parser")  # name the parser explicitly

src = [sc["src"] for sc in soup.select("script[src]")]

You can also pass src=True to find_all to do the same:

src = [sc["src"] for sc in soup.find_all("script", src=True)]

Both give the same output:

['http://tags.expo9.exponential.com/tags/MarunadanMalayalicom/ROS/tags.js',
 'http://tags.expo9.exponential.com/tags/MarunadanMalayalicom/ROS/tags.js',
 'http://js.genieessp.com/t/052/954/a1052954.js',
 '//s3-ap-northeast-1.amazonaws.com/tms-t/marunadanmalayali-7219.js',
 'http://advs.adgorithms.com/ttj?id=3279193&cb=[CACHEBUSTER]&pubclick=[INSERT_CLICK_TAG]',
 'http://www.marunadanmalayali.com/js/mnmcombined1.min.js',
 'http://www.marunadanmalayali.com/js/mnmcombined2.min.js']
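Note that one entry in this output starts with // (a protocol-relative URL), so it cannot be requested as-is. A minimal sketch of turning such values into absolute URLs with the standard library's urljoin; page_url and the third, relative entry are assumptions added for illustration:

```python
from urllib.parse import urljoin

page_url = "http://www.marunadanmalayali.com/"  # the page the scripts came from

src = [
    # protocol-relative, taken from the output above
    "//s3-ap-northeast-1.amazonaws.com/tms-t/marunadanmalayali-7219.js",
    # already absolute, taken from the output above
    "http://www.marunadanmalayali.com/js/mnmcombined1.min.js",
    # hypothetical relative path, not from the real page
    "js/example.js",
]

# urljoin leaves absolute URLs alone and resolves the others against page_url
absolute = [urljoin(page_url, s) for s in src]
print(absolute)
```

The protocol-relative entry picks up the http scheme of the page, which matches how a browser would resolve it.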

Also, if you use selenium, you can pair it with PhantomJS for headless browsing. You don't need BeautifulSoup at all in that case; you can use the same CSS selector directly in selenium:

from selenium import webdriver

driver = webdriver.PhantomJS()
driver.get('http://www.marunadanmalayali.com/')

src = [sc.get_attribute("src") for sc in driver.find_elements_by_css_selector("script[src]")]
print(src)

This gives you all the links:

[u'https://pixel.yabidos.com/fltiu.js?qid=836373f5137373f5131353&cid=511&p=165&s=http%3a%2f%2fwww.marunadanmalayali.com%2f&x=admeta&nci=&adtg=96331&nai=',
 u'http://gum.criteo.com/sync?c=72&r=2&j=TRC.getRTUS',
 u'http://b.scorecardresearch.com/beacon.js',
 u'http://cdn.taboola.com/libtrc/impl.201-1-RELEASE.js',
 u'http://p165.atemda.com/JSAdservingMP.ashx?pc=1&pbId=165&clk=&exm=&jsv=1.84&tsv=2.26&cts=1459160775430&arp=0&fl=0&vitp=0&vit=&jscb=&url=&fp=0;400;300;20&oid=&exr=&mraid=&apid=&apbndl=&mpp=0&uid=&cb=54613943&pId0=64056124&rank0=1&gid0=64056124:1c59ac&pp0=&clk0=[External%20click-tracking%20goes%20here%20(NOT%20URL-encoded)]&rpos0=0&ecpm0=&ntv0=&ntl0=&adsid0=',
 u'http://cdn.taboola.com/libtrc/marunadanaalayali-network/loader.js',
 u'http://s.atemda.com/Admeta.js',
 u'http://www.google-analytics.com/analytics.js',
 u'http://tags.expo9.exponential.com/tags/MarunadanMalayalicom/ROS/tags.js',
 u'http://tags.expo9.exponential.com/tags/MarunadanMalayalicom/ROS/tags.js',
 u'http://js.genieessp.com/t/052/954/a1052954.js',
 u'http://s3-ap-northeast-1.amazonaws.com/tms-t/marunadanmalayali-7219.js',
 u'http://d8.zedo.com/jsc/d8/fo.js',
 u'http://z1.zedo.com/asw/fm/1185/7219/9/fm.js?c=7219&a=0&f=&n=1185&r=1&d=9&adm=&q=&$=&s=1936&l=%5BINSERT_CLICK_TRACKER_MACRO%5D&ct=&z=0.025054786819964647&tt=0&tz=0&pu=http%3A%2F%2Fwww.marunadanmalayali.com%2F&ru=&pi=1459160768626&ce=UTF-8&zpu=www.marunadanmalayali.com____1_&tpu=',
 u'http://cas.criteo.com/delivery/ajs.php?zoneid=308686&nodis=1&cb=38688817829&exclude=undefined&charset=UTF-8&loc=http%3A//www.marunadanmalayali.com/',
 u'http://ads.pubmatic.com/AdServer/js/showad.js',
 u'http://showads.pubmatic.com/AdServer/AdServerServlet?pubId=135167&siteId=135548&adId=600924&kadwidth=300&kadheight=250&SAVersion=2&js=1&kdntuid=1&pageURL=http%3A%2F%2Fwww.marunadanmalayali.com%2F&inIframe=0&kadpageurl=marunadanmalayali.com&operId=3&kltstamp=2016-3-28%2011%3A26%3A13&timezone=1&screenResolution=1024x768&ranreq=0.8869257988408208&pmUniAdId=0&adVisibility=2&adPosition=999x664',
 u'http://d8.zedo.com/jsc/d8/fo.js',
 u'http://z1.zedo.com/asw/fm/1185/7213/9/fm.js?c=7213&a=0&f=&n=1185&r=1&d=9&adm=&q=&$=&s=1948&l=%5BINSERT_CLICK_TRACKER_MACRO%5D&ct=&z=0.08655649935826659&tt=0&tz=0&pu=http%3A%2F%2Fwww.marunadanmalayali.com%2F&ru=&pi=1459160768626&ce=UTF-8&zpu=www.marunadanmalayali.com____1_&tpu=',
 u'http://advs.adgorithms.com/ttj?id=3279193&cb=[CACHEBUSTER]&pubclick=[INSERT_CLICK_TAG]',
 u'http://ib.adnxs.com/ttj?ttjb=1&bdc=1459160761&bdh=ZllBLkzcj2dGDVPeS0Sw_OTWjgQ.&tpuids=eyJ0cHVpZHMiOlt7InByb3ZpZGVyIjoiY3JpdGVvIiwidXNlcl9pZCI6Il9KRC1PUmhLX3hLczd1cUJhbjlwLU1KQ2VZbDQ2VVUxIn1dfQ==&view_iv=0&view_pos=664,2096&view_ws=400,300&view_vs=3&bdref=http%3A%2F%2Fwww.marunadanmalayali.com%2F&bdtop=true&bdifs=0&bstk=http%3A%2F%2Fwww.marunadanmalayali.com%2F&&id=3279193&cb=[CACHEBUSTER]&pubclick=[INSERT_CLICK_TAG]',
 u'http://www.marunadanmalayali.com/js/mnmcombined1.min.js',
 u'http://www.marunadanmalayali.com/js/mnmcombined2.min.js',
 u'http://pixel.yabidos.com/iftfl.js?ver=1.4.2&qid=836373f5137373f5131353&cid=511&p=165&s=http%3a%2f%2fwww.marunadanmalayali.com%2f&x=admeta&adtg=96331&nci=&nai=&nsi=&cstm1=&cstm2=&cstm3=&kqt=&xc=&test=&od1=&od2=&co=0&tps=34&rnd=3m17uji8ftbf']
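As far as I know, newer Selenium releases (4.x) no longer ship a PhantomJS driver or the find_elements_by_* helpers, so the snippet above may not run on a current install. A hedged sketch of the same idea with headless Chrome and the generic find_elements call; the function names script_srcs and main are hypothetical, and the --headless=new flag assumes a reasonably recent Chrome (older versions take plain --headless):

```python
def script_srcs(driver):
    """Return the src of every <script src=...> element in the loaded page."""
    # "css selector" is the string value of selenium's By.CSS_SELECTOR constant.
    elements = driver.find_elements("css selector", "script[src]")
    return [el.get_attribute("src") for el in elements]


def main(url="http://www.marunadanmalayali.com/"):
    # Imported here so script_srcs can be reused without selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # run Chrome without a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return script_srcs(driver)
    finally:
        driver.quit()  # always close the browser, even on errors
```

Calling print(main()) should produce a list comparable to the one above, although the exact ad and tracking URLs will differ from run to run.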
Padraic Cunningham