
I'm trying to scrape META keywords and description tags from arbitrary websites. I obviously have no control over said websites, so I have to take what I'm given. They use a variety of casings for the tag and its attributes, which means I need to work case-insensitively. I can't believe the lxml authors are stubborn enough to insist on fully forced standards compliance when it excludes so much real-world use of their library.

I'd like to be able to say `doc.cssselect('meta[name=description]')` (or some XPath equivalent), but this will not catch `<meta name="Description" Content="...">` tags due to the capital D.

I'm currently using this as a workaround, but it's horrible!

for meta in doc.cssselect('meta'):
    name = meta.get('name')
    content = meta.get('content')

    if name and content:
        if name.lower() == 'keywords':
            keywords = content
        if name.lower() == 'description':
            description = content

It seems that the tag name `meta` is treated case-insensitively, but the attributes are not. It would be even more annoying if `meta` were case-sensitive too!

Mat

3 Answers


Attribute values are case-sensitive in XML/XHTML.

You can use an arbitrary regular expression to select an element, via the EXSLT extensions that lxml supports:

#!/usr/bin/env python
from lxml import html

doc = html.fromstring('''
    <meta name="Description">
    <meta name="description">
    <META name="description">
    <meta NAME="description">
''')
for meta in doc.xpath('//meta[re:test(@name, "^description$", "i")]',
                      namespaces={"re": "http://exslt.org/regular-expressions"}):
    print(html.tostring(meta, pretty_print=True).decode(), end='')

Output:

<meta name="Description">
<meta name="description">
<meta name="description">
<meta name="description">
jfs

lxml is an XML parser. XML is case-sensitive. You are parsing HTML, so you should use an HTML parser. BeautifulSoup is very popular. Its only drawback is that it can be slow.
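For example, a minimal sketch using BeautifulSoup 4 (the HTML and attribute values here are illustrative); matching the attribute value with a case-insensitive regular expression sidesteps the casing problem:

```python
import re

from bs4 import BeautifulSoup

page = '''
<meta name="Description" content="A page about cats">
<META NAME="keywords" CONTENT="cats, pets">
'''

soup = BeautifulSoup(page, 'html.parser')

# The parser lowercases attribute *names*; the attribute *value* is
# matched case-insensitively via the regular expression.
tag = soup.find('meta', attrs={'name': re.compile(r'^description$', re.I)})
print(tag['content'])  # -> A page about cats
```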

tripleee
Ned Batchelder
    `lxml.html`, `lxml.html.soupparser`, and `lxml.html.html5parser` provide HTML parsers. – jfs Nov 14 '09 at 13:42
    BeautifulSoup barfs on the markup in a lot of pages, particularly with Javascript containing strings with tags inside them. lxml does not, hence why I wanted to use lxml. – Mat Nov 14 '09 at 14:53
    @Mat: [Beautiful Soup 4 can use `lxml` as a parser](http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser). – jfs May 07 '13 at 00:28

You can use

doc.xpath("//meta[translate(@name,
    'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz')='description']")

It translates the value of the `name` attribute to lowercase and then compares it.
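A minimal runnable sketch of this approach (the document fragment and attribute values are illustrative):

```python
from lxml import html

doc = html.fromstring('<meta NAME="Description" content="hello">')

# translate() is XPath 1.0's only case-folding tool: map A-Z to a-z
# in the attribute value before comparing against 'description'.
matches = doc.xpath(
    "//meta[translate(@name, "
    "'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz')"
    "='description']")
print(matches[0].get('content'))  # -> hello
```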

Piotr Migdal
  • wouldn't you only need to translate the letter in 'description' (or whatever the value you're comparing to is)? ...`"//meta[translate(@name, 'DESCRIPTON', 'descripton')='description']"` – katy lavallee May 10 '18 at 21:22