24

I am trying to extract the meta description from fetched web pages, but I am running into BeautifulSoup's case sensitivity.

Some pages have <meta name="Description" and others have <meta name="description".

My problem is very similar to another question on Stack Overflow.

The only difference is that I can't use lxml; I have to stick with BeautifulSoup.

Nitin

6 Answers

18

You can give BeautifulSoup a regular expression to match attributes against. Something like

soup.findAll('meta', name=re.compile("^description$", re.I))

might do the trick. Cribbed from the BeautifulSoup docs.

Will McCutchen
  • Note, for this to work you'll also need to import regular expressions with this line at the top: `import re` – drpawelo Apr 30 '23 at 21:58
17

A regular expression? Now we have another problem.

Instead, you can pass in a lambda:

soup.findAll(lambda tag: tag.name.lower()=='meta',
    name=lambda x: x and x.lower()=='description')

(The `x and` guard avoids an exception when the tag has no name attribute.)

MikeyB
  • Using bs4, I'm getting "find_all() got multiple values for keyword argument 'name'" with that :/ – Joaolvcm Feb 20 '14 at 11:14
  • @Joaolvcm “You [can’t use](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#the-keyword-arguments) a keyword argument to search for HTML’s ‘name’ element, because Beautiful Soup uses the name argument to contain the name of the tag itself. Instead, you can give a value to ‘name’ in the attrs argument.” TL;DR: `soup.find_all(lambda tag: ..., {"name": lambda x: ...})`. – Alex Shpilkin Sep 21 '18 at 14:28
10

With a minor change it works: pass the pattern through attrs instead of as a name keyword argument.

soup.findAll('meta', attrs={'name':re.compile("^description$", re.I)})
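
For anyone who wants to try the attrs form end to end, here is a minimal sketch. It assumes bs4 (so find_all rather than findAll) and the built-in html.parser backend, which keeps lxml out of the picture; the sample HTML and variable names are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Made-up sample page: the attribute value is capitalised on purpose.
html = """
<html><head>
<meta name="Description" content="First page summary">
<meta name="keywords" content="a, b">
</head></html>
"""

# html.parser ships with Python, so lxml is not required.
soup = BeautifulSoup(html, "html.parser")

# re.I makes the match on the attribute value case-insensitive.
pattern = re.compile(r"^description$", re.I)

# The pattern goes through attrs because `name` is reserved for the tag name.
tags = soup.find_all("meta", attrs={"name": pattern})
contents = [tag.get("content") for tag in tags]
print(contents)
```

The anchors `^` and `$` matter here: without them the pattern would also match values such as "og:description".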
Nitin
7

With bs4 use the following:

soup.find('meta', attrs={'name': lambda x: x and x.lower()=='description'})
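
The same idea wrapped in a small function, as a sketch only: `find_meta_description` is a hypothetical helper name, and the html.parser backend is an assumption chosen to avoid lxml.

```python
from bs4 import BeautifulSoup


def find_meta_description(html):
    """Hypothetical helper: return the description meta tag's content, or None."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find(
        "meta",
        # bs4 calls the function with the attribute value, which is None when
        # the tag has no name attribute; `value and` guards against that case.
        attrs={"name": lambda value: value and value.lower() == "description"},
    )
    return tag.get("content") if tag else None


print(find_meta_description('<meta name="DESCRIPTION" content="Shouty page">'))
print(find_meta_description('<meta charset="utf-8">'))
```

Returning None when no tag is found keeps callers from having to catch an AttributeError on pages without a description.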
Emmanuel
2

Better still, use a CSS attribute = value selector with the `i` flag for case insensitivity:

soup.select('meta[name="description" i]')
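
A short runnable sketch of the selector approach follows. It assumes a bs4 version recent enough to bundle the soupsieve selector engine (4.7 or later, as far as I know) and uses made-up sample HTML; select_one is used because a page normally has a single description tag.

```python
from bs4 import BeautifulSoup

# Made-up sample: upper-case attribute name and value.
html = (
    '<head><meta NAME="DESCRIPTION" content="Shouty page">'
    '<meta name="author" content="someone"></head>'
)

soup = BeautifulSoup(html, "html.parser")

# The trailing `i` inside the brackets makes the value comparison case-insensitive.
tag = soup.select_one('meta[name="description" i]')
description = tag["content"] if tag else None
print(description)
```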
ashleedawg
QHarr
-6

Change the case of the HTML page source before parsing. Use functions such as string.lower() or string.upper().

Lucifer