96

I would like to get all the <script> tags in a document and then process each one based on the presence (or absence) of certain attributes.

E.g., for each <script> tag, if the attribute for is present do something; else if the attribute bar is present do something else.

Here is what I am doing currently:

outputDoc = BeautifulSoup(''.join(output))
scriptTags = outputDoc.findAll('script', attrs = {'for' : True})

But this way I only get the <script> tags that have the for attribute, and I lose the others (those without the for attribute).

LB40
  • 1
    "but the if ... in doesn't work"? What does that mean? Syntax error? What do you mean by "doesn't work"? Please be very specific on what's going wrong. – S.Lott Feb 16 '11 at 11:03
  • Do you want to test for the presence of an attribute in _any_ tag, _all_ tags or treat each occurrence of the tag separately? – Chinmay Kanchi Feb 16 '11 at 12:42

7 Answers

145

If I understand correctly, you just want all the script tags, and then to check each one for certain attributes?

scriptTags = outputDoc.findAll('script')
for script in scriptTags:
    if script.has_attr('some_attribute'):
        do_something()        
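
As a minimal sketch of the branching the question describes, the loop above can be extended with one has_attr check per attribute. The sample HTML, the attribute name bar, and the helper classify are assumptions made up for illustration; "html.parser" is the parser bundled with Python.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document: one tag per case.
html = """
<script for="window" event="onload">a()</script>
<script bar="1">b()</script>
<script src="c.js"></script>
"""

soup = BeautifulSoup(html, "html.parser")


def classify(script):
    """Return a label describing which attribute the tag carries."""
    if script.has_attr("for"):
        return "for"
    elif script.has_attr("bar"):
        return "bar"
    return "neither"


# A single find_all call; every tag is inspected exactly once.
labels = [classify(script) for script in soup.find_all("script")]
print(labels)
```

This keeps all <script> tags in one result list, so the tags without the for attribute are not lost.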
Sadık
Lucas S.
  • i'm unable to do something like: if 'some_attribute' in script ? , that's what I'm after, and I want to avoid calling findAll again and again... – LB40 Feb 16 '11 at 14:20
  • 5
    For checking for available attributes you must use python dict methods, eg: script.has_key('some_attribute') – Lucas S. Feb 16 '11 at 14:29
  • 1
    how do I check if the tag has any attributes? While tag.has_key('some_attribute') works fine, tag.keys() throws an exception ('NoneType' object is not callable). – Georg Pfolz Apr 08 '13 at 14:00
  • 1
    found it: tag.attrs is the dictionary! – Georg Pfolz Apr 08 '13 at 14:18
  • 12
    Please update this post, has_key is deprecated. Use has_attr instead. – RvdK Mar 31 '14 at 15:02
  • 3
    sadly, did not work for me. Maybe this way `soup_response.find('err').string is not None` can be used for other attributes too... – im_infamous Aug 25 '18 at 14:03
47

You don't need any lambdas to filter by attribute; you can simply pass some_attribute=True to find or find_all.

script_tags = soup.find_all('script', some_attribute=True)

# or

script_tags = soup.find_all('script', {"some-data-attribute": True})
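
The dictionary form matters for the attribute in the question: for is a reserved word in Python, so it cannot be written as a keyword argument, and names containing a hyphen cannot either. A small sketch, with made-up sample HTML and the hypothetical attribute data-role:

```python
from bs4 import BeautifulSoup

# Hypothetical sample document.
html = (
    '<script for="window"></script>'
    '<script data-role="x"></script>'
    '<script></script>'
)
soup = BeautifulSoup(html, "html.parser")

# soup.find_all("script", for=True) would be a SyntaxError,
# because "for" is a Python keyword. Use the attrs dict instead.
with_for = soup.find_all("script", attrs={"for": True})

# Hyphenated names are not valid keyword arguments either.
with_data = soup.find_all("script", attrs={"data-role": True})

print(len(with_for), len(with_data))
```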

Here are more examples with other approaches as well:

import bs4

# html is assumed to hold the markup as a string.
soup = bs4.BeautifulSoup(html, "html.parser")

# Find all with a specific attribute

tags = soup.find_all(src=True)
tags = soup.select("[src]")

# Find all meta with either name or http-equiv attribute.

soup.select("meta[name],meta[http-equiv]")

# Find all tags with either a name or a src attribute.

soup.select("[name], [src]")

# Find the first script with a src attribute.

tag = soup.find('script', src=True)
tag = soup.select_one("script[src]")

# Find all tags with a name attribute beginning with foo
# or a src beginning with /path.
# Values that are not plain identifiers (such as /path) must be quoted.
soup.select('[name^=foo], [src^="/path"]')

# Find all tags with a name attribute that contains foo
# or a src that contains whatever.
soup.select("[name*=foo], [src*=whatever]")

# Find all tags with a name attribute that ends with foo
# or a src that ends with whatever.
soup.select("[name$=foo], [src$=whatever]")

You can also use regular expressions with find or find_all:

import re
# starting with
soup.find_all("script", src=re.compile("^whatever"))
# contains
soup.find_all("script", src=re.compile("whatever"))
# ends with 
soup.find_all("script", src=re.compile("whatever$"))
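
To see the three patterns side by side, here is a small self-contained sketch. The sample HTML and the URLs in it are invented for illustration. Tags without a src attribute are simply skipped by these filters.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample document.
html = """
<script src="/static/app.js"></script>
<script src="https://cdn.example.com/lib.min.js"></script>
<script>inline()</script>
"""
soup = BeautifulSoup(html, "html.parser")

# The compiled pattern is applied with search(), so anchors decide
# whether it means "starts with", "contains" or "ends with".
starts = soup.find_all("script", src=re.compile(r"^/static/"))
contains = soup.find_all("script", src=re.compile(r"cdn"))
ends = soup.find_all("script", src=re.compile(r"\.min\.js$"))

for name, tags in [("starts", starts), ("contains", contains), ("ends", ends)]:
    print(name, [tag["src"] for tag in tags])
```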
mihow
Padraic Cunningham
  • 1
    I agree that this should be the accepted answer. I simplified the primary example to make it stand out more. – mihow Oct 30 '19 at 22:05
38

For future reference, has_key has been deprecated in Beautiful Soup 4. Use has_attr instead:

scriptTags = outputDoc.find_all('script')
for script in scriptTags:
    if script.has_attr('some_attribute'):
        do_something()
miah
20

If you only need to get tag(s) with attribute(s), you can use lambda:

soup = bs4.BeautifulSoup(YOUR_CONTENT)
  • Tags with attribute
tags = soup.find_all(lambda tag: 'src' in tag.attrs)

OR

tags = soup.find_all(lambda tag: tag.has_attr('src'))
  • Specific tag with attribute
tag = soup.find(lambda tag: tag.name == 'script' and 'src' in tag.attrs)
  • Etc ...

Thought it might be useful.
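
A function filter can also express absence, which is the other half of the original question. The following is only a sketch with invented sample HTML: it selects the <script> tags that do not carry a for attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document; the <p> shows that the tag name is checked too.
html = (
    '<script for="window"></script>'
    '<script src="a.js"></script>'
    '<p for="x"></p>'
)
soup = BeautifulSoup(html, "html.parser")

# The function is called once per tag and must return True to keep it.
without_for = soup.find_all(
    lambda tag: tag.name == "script" and not tag.has_attr("for")
)

print([tag.get("src") for tag in without_for])
```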

SomeGuest
3

You can check whether an attribute is present:

scriptTags = outputDoc.findAll('script', some_attribute=True)
for script in scriptTags:
    do_something()
Charles Ma
1

By using the pprint module you can examine the contents of an element.

from pprint import pprint

pprint(vars(element))

Using this on a bs4 element will print something similar to this:

{'attrs': {u'class': [u'pie-productname', u'size-3', u'name', u'global-name']},
 'can_be_empty_element': False,
 'contents': [u'\n\t\t\t\tNESNA\n\t'],
 'hidden': False,
 'name': u'span',
 'namespace': None,
 'next_element': u'\n\t\t\t\tNESNA\n\t',
 'next_sibling': u'\n',
 'parent': <h1 class="pie-compoundheader" itemprop="name">\n<span class="pie-description">Bedside table</span>\n<span class="pie-productname size-3 name global-name">\n\t\t\t\tNESNA\n\t</span>\n</h1>,
 'parser_class': <class 'bs4.BeautifulSoup'>,
 'prefix': None,
 'previous_element': u'\n',
 'previous_sibling': u'\n'}

To access an attribute (let's say the class list), use the following:

class_list = element.attrs.get('class', [])

You can filter elements using this approach:

for script in soup.find_all('script'):
    if script.attrs.get('for'):
        # ... Has a non-empty 'for' attr
        pass
    elif "myClass" in script.attrs.get('class', []):
        # ... Has class "myClass"
        pass
    else:
        # ... Do something else
        pass
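
One caveat with attrs.get: it tests the value, not the presence, so an attribute that exists with an empty value is treated as missing. A short sketch with an invented one-tag document shows the difference:

```python
from bs4 import BeautifulSoup

# Hypothetical tag whose "for" attribute is present but empty.
soup = BeautifulSoup('<script for=""></script>', "html.parser")
tag = soup.find("script")

value = tag.attrs.get("for")      # the empty string, which is falsy
present = "for" in tag.attrs      # membership test on the attrs dict
also_present = tag.has_attr("for")

print(repr(value), present, also_present)
```

If the goal is to branch on presence rather than on a non-empty value, the membership test or has_attr is probably the safer check.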
Adam Salma
1

A simple way to select just what you need.

outputDoc.select("script[for]")
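
The same selector syntax can cover the tags without the attribute by using :not(). This sketch assumes a Beautiful Soup version whose select() is backed by the Soup Sieve package (4.7.0 or later); the sample HTML and the attribute bar are made up.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document.
html = (
    '<script for="window"></script>'
    '<script bar="1"></script>'
    '<script src="a.js"></script>'
)
soup = BeautifulSoup(html, "html.parser")

with_for = soup.select("script[for]")
without_for = soup.select("script:not([for])")

# Mirrors "else if bar is present": has bar, but not for.
bar_only = soup.select("script[bar]:not([for])")

print(len(with_for), len(without_for), len(bar_only))
```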
Eat at Joes