Getting attribute's value using BeautifulSoup

Question

I'm writing a Python script that parses a webpage and extracts the locations of its scripts. Let's say there are two scenarios:

<script type="text/javascript" src="http://example.com/something.js"></script>

and

<script>some JS</script>

I'm able to get the JS in the second scenario, that is, when the JS is written inline between the tags.

But is there any way to get the value of src in the first scenario (i.e. extract the src attribute of every script tag, such as http://example.com/something.js)?

Here's my code

#!/usr/bin/python

import requests 
from bs4 import BeautifulSoup

r  = requests.get("http://rediff.com/")
data = r.text
soup = BeautifulSoup(data)
for n in soup.find_all('script'):
    print(n)

Output : Some JS

Output Needed : http://example.com/something.js

If you are satisfied with one of the answers, please accept it. – Venkateshwaran Selvaraj, Nov 22 '13 at 11:05

score 26 · Accepted Answer · edited May 27 '15 at 04:55

This gets the src value of every <script> tag that has one; <script> tags without a src attribute are skipped.

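from bs4 import BeautifulSoup
from urllib.request import urlopen  # Python 3 standard library

url = "http://rediff.com/"
page = urlopen(url)
soup = BeautifulSoup(page.read(), "html.parser")

# src=True keeps only the <script> tags that have a src attribute
sources = soup.find_all('script', {"src": True})
for source in sources:
    print(source['src'])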

I get the following two src values as a result:

http://imworld.rediff.com/worldrediff/js_2_5/ws-global_hm_1.js
http://im.rediff.com/uim/common/realmedia_banner_1_5.js

I guess this is what you want. Hope this is useful.
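The same filter can be tried offline against a small inline HTML string. The following is only a minimal sketch, assuming Python 3 and the built-in "html.parser"; the sample markup and the variable names are illustrative and are not taken from rediff.com.

```python
from bs4 import BeautifulSoup

# Illustrative markup (an assumption): one external script, one inline script.
html = (
    '<script type="text/javascript" src="http://example.com/something.js"></script>'
    '<script>some JS</script>'
)

soup = BeautifulSoup(html, "html.parser")

# src=True matches only <script> tags that carry a src attribute,
# so the inline script is skipped.
sources = [tag["src"] for tag in soup.find_all("script", src=True)]

for src in sources:
    print(src)
```

Only the external script's URL ends up in sources, because the inline <script> has no src attribute to match.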

score 5 · Answer 2 · answered Sep 11 '13 at 05:16

Get 'src' from script node.

import requests 
from bs4 import BeautifulSoup

r  = requests.get("http://rediff.com/")
data = r.text
soup = BeautifulSoup(data)
for n in soup.find_all('script'):
    print("src:", n.get('src'))  # <== read the src attribute here

rajpy

I'm getting 'None' as output: src: None src: None ... However, if I do n.get('type') it shows the result "text/javascript". Why does this happen only with src? – aditya.gupta Sep 11 '13 at 05:38
Hmm, it should work; I tried it on my system. What is the output of 'n'? – rajpy Sep 11 '13 at 09:07
The output is 'None'. – aditya.gupta Sep 12 '13 at 04:06
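The thread never settles why every value came back as None. What can be shown is the general behaviour: .get() on a tag works like dict.get() and returns None whenever the attribute is absent, so any inline <script> prints "src: None" even if it has a type. The sketch below assumes Python 3 and the built-in "html.parser"; the markup is made up for illustration.

```python
from bs4 import BeautifulSoup

# Illustrative markup (an assumption): both tags have a type, only one has a src.
html = (
    '<script type="text/javascript" src="http://example.com/something.js"></script>'
    '<script type="text/javascript">some JS</script>'
)
soup = BeautifulSoup(html, "html.parser")

# .get() returns None for a tag that lacks the attribute.
values = [tag.get("src") for tag in soup.find_all("script")]
print(values)

# Drop the None entries to keep only real src values.
present = [value for value in values if value is not None]
print(present)
```

If a page contains only inline scripts, every entry in values would be None, which may be what the commenter observed.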

Ashok Fernandez · Answer 3 · 2013-09-11T10:33:11.907

This should work: find all the script tags, then check whether each one has a 'src' attribute. If it does, the URL of the JavaScript is in the src attribute; otherwise we assume the JavaScript is contained inside the tag.

#!/usr/bin/python

import requests 
from bs4 import BeautifulSoup

# Test HTML which has both cases
html = '<script type="text/javascript" src="http://example.com/something.js">'
html += '</script>  <script>some JS</script>'

soup = BeautifulSoup(html)

# Find all script tags 
for n in soup.find_all('script'):

    # Check if the src attribute exists, and if it does grab the source URL
    if 'src' in n.attrs:
        javascript = n['src']

    # Otherwise assume that the javascript is contained within the tags
    else:
        javascript = n.text

    print(javascript)

The output of this is:

http://example.com/something.js
some JS
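As a variation on the approach above (not the answer's own code), the two cases can also be collected into separate lists. This sketch swaps in a CSS attribute selector, select("script[src]"), for the external scripts and has_attr() for the inline ones; it assumes Python 3 and the built-in "html.parser", and the list names are hypothetical.

```python
from bs4 import BeautifulSoup

# Test HTML with both cases, mirroring the example above.
html = (
    '<script type="text/javascript" src="http://example.com/something.js"></script>'
    '  <script>some JS</script>'
)
soup = BeautifulSoup(html, "html.parser")

# CSS selector: <script> tags that have a src attribute.
external = [tag["src"] for tag in soup.select("script[src]")]

# Remaining <script> tags: take the code written between the tags.
inline = [
    tag.string
    for tag in soup.find_all("script")
    if not tag.has_attr("src")
]

print(external)
print(inline)
```

Keeping the lists separate makes it easy to download the external files while handling inline code differently.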
