XPath expression to get an optional element in Scrapy

Question

By optional, I mean the element may not exist.

I have a spider for GitHub and I am trying to get the primary language for a repo:

<div class="repository-lang-stats">
    <ol class="repository-lang-stats-numbers">
      <li>
          <a href="/scrapy/scrapy/search?l=python">
            <span class="color-block language-color" style="background-color:#3581ba;"></span>
            <span class="lang">Python</span>
            <span class="percent">99.1%</span>
          </a>
      </li>
      <li>
          <span class="other">
            <span data-lang="Other" class="color-block language-color"></span>
            <span class="lang">Other</span>
            <span class="percent">0.9%</span>
          </span>
      </li>
    </ol>
</div>

In the example above (the page source for this repo), I need to get "Python" from the first

<span class="lang">

My problem is that some repos, such as an empty one, have no

<span class="lang">

tag, or

<ol class="repository-lang-stats-numbers">

tag. How do I handle this?

Solved by adding code to the exception block. – gherkin, Apr 02 '14 at 02:45

score 1 · Accepted Answer · edited May 23 '17 at 11:50

I'd find the list of languages, take the first list item, and retrieve the first span with class lang inside it, skipping over any anchor tag in between (anchors seem to be missing for some low-frequency languages).

//ol[@class="repository-lang-stats-numbers"]/li[1]//span[@class="lang"]

An empty result will indicate that no language data is available.
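
As a rough, self-contained sketch of that behaviour, the snippet below runs the same expression with the standard library's ElementTree instead of Scrapy, so it can be tried without a spider. It assumes the fragment is well-formed XML, which real GitHub pages need not be. The function name primary_language and the empty_repo markup are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Same expression as above, written relative to the root element because
# ElementTree does not accept absolute paths.
LANG_XPATH = './/ol[@class="repository-lang-stats-numbers"]/li[1]//span[@class="lang"]'


def primary_language(fragment):
    """Return the first listed language, or None when the markup is absent."""
    root = ET.fromstring(fragment)
    node = root.find(LANG_XPATH)
    if node is None:
        # Nothing matched: the repo has no language statistics.
        return None
    return node.text


with_stats = """
<div class="repository-lang-stats">
  <ol class="repository-lang-stats-numbers">
    <li><a href="/scrapy/scrapy/search?l=python">
      <span class="lang">Python</span>
      <span class="percent">99.1%</span>
    </a></li>
    <li><span class="other">
      <span class="lang">Other</span>
      <span class="percent">0.9%</span>
    </span></li>
  </ol>
</div>
"""

# Hypothetical markup for a repo without language statistics.
empty_repo = '<div class="repository-content"></div>'

print(primary_language(with_stats))
print(primary_language(empty_repo))
```

In a Scrapy callback the equivalent would be to pass the original expression, with /text() appended, to response.xpath() and call .get() on the result (extract_first() in older releases), which gives None when nothing matches. No exception handling is needed for the missing case.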

Some remarks:

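To be more specific, you could prepend div[@class="repository-lang-stats"] as the first step of the path, but I don't think that is necessary.
Watch out: @class="lang" is an exact string comparison, so it will not match an element that carries additional classes (for example class="lang primary"). To match a single class token instead, use contains(concat(" ", normalize-space(@class), " "), " lang ").
To return only the text value, append /text() to the query.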

Anyway: GitHub offers an API that also lets you query repository languages. You are better off using it than scraping the site. APIs are fast, easy to use and stable; web sites are front-end code that changes often and will break your XPath queries.

You can query it by requesting a dedicated URI (for example https://api.github.com/repos/scrapy/scrapy/languages), which returns a JSON object mapping each language to its size in bytes. It can be easily parsed and sorted:

{
  "Shell": 1733,
  "Python": 1195439,
  "CSS": 9681
}
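
A minimal sketch of picking the primary language from such a response, using only the standard library. The helper names top_language and fetch_languages are made up for illustration. The fetch helper is not called here, and it assumes an unauthenticated request is acceptable for your rate limits.

```python
import json
from urllib.request import urlopen


def top_language(languages):
    """Return the language with the largest byte count, or None if empty."""
    if not languages:
        # An empty repo yields an empty JSON object.
        return None
    return max(languages, key=languages.get)


def fetch_languages(owner, repo):
    """Fetch the language mapping for a repo from the GitHub API."""
    url = f"https://api.github.com/repos/{owner}/{repo}/languages"
    with urlopen(url) as response:
        return json.load(response)


# The sample response shown above.
sample = json.loads('{"Shell": 1733, "Python": 1195439, "CSS": 9681}')

print(top_language(sample))
print(top_language({}))
```

With the sample above, top_language returns "Python", and an empty object gives None, which covers the empty-repo case without any special handling.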

score 0 · Answer 2 · answered Mar 29 '14 at 13:14

The XPath is div/ol/li/a/span[@class="lang"]/text(). It returns an empty result if anything along the path is missing.

Jonas Bötel
