By optional, I mean the element might not exist.

I have a spider for GitHub, and I am trying to get the primary language of a repo:

<div class="repository-lang-stats">
    <ol class="repository-lang-stats-numbers">
      <li>
          <a href="/scrapy/scrapy/search?l=python">
            <span class="color-block language-color" style="background-color:#3581ba;"></span>
            <span class="lang">Python</span>
            <span class="percent">99.1%</span>
          </a>
      </li>
      <li>
          <span class="other">
            <span data-lang="Other" class="color-block language-color"></span>
            <span class="lang">Other</span>
            <span class="percent">0.9%</span>
          </span>
      </li>
    </ol>
</div>

In the example above (from the page source of the scrapy/scrapy repo), I need to get "Python" from the first <span class="lang"> element.

My problem is that for some repos, such as an empty one, there is no <span class="lang"> tag, or no <ol class="repository-lang-stats-numbers"> tag at all. How do I handle this?

gherkin

2 Answers

1

I'd find the list of languages, take the first list item, and retrieve the first span with class lang inside it, skipping over any anchor tags (they seem to be missing for some low-frequency languages).

//ol[@class="repository-lang-stats-numbers"]/li[1]//span[@class="lang"]

An empty result will indicate that no language data is available.
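As a rough sketch of how the empty-result check might look in Python: the snippet below runs the query with the standard library's xml.etree.ElementTree on a trimmed copy of the markup from the question. ElementTree only supports a subset of XPath, hence the leading .// and the use of .text instead of /text(). The names primary_language, WITH_STATS and EMPTY_REPO, as well as the markup used for the empty repo, are made up for illustration. Real GitHub pages are HTML rather than well-formed XML, so in a spider you would run the same query through your scraper's own selector instead.

```python
import xml.etree.ElementTree as ET

# ElementTree supports only a subset of XPath, so the query is written
# relative to the root element (".//" instead of "//").
LANG_XPATH = ".//ol[@class='repository-lang-stats-numbers']/li[1]//span[@class='lang']"


def primary_language(markup):
    """Return the first listed language, or None when the element is absent."""
    root = ET.fromstring(markup)
    span = root.find(LANG_XPATH)  # find() returns None if nothing matches
    if span is None or span.text is None:
        return None
    return span.text.strip()


# Trimmed copy of the markup from the question.
WITH_STATS = """
<div class="repository-lang-stats">
  <ol class="repository-lang-stats-numbers">
    <li><a href="/scrapy/scrapy/search?l=python">
      <span class="lang">Python</span><span class="percent">99.1%</span>
    </a></li>
    <li><span class="other">
      <span class="lang">Other</span><span class="percent">0.9%</span>
    </span></li>
  </ol>
</div>
"""

# Hypothetical markup for a repo without language statistics.
EMPTY_REPO = '<div class="repository-content"></div>'

print(primary_language(WITH_STATS))  # Python
print(primary_language(EMPTY_REPO))  # None
```

The point is only that a missing element shows up as an empty result (None here), which you test for before reading the text.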

Some remarks:

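  • To be more specific, you could prepend div[@class="repository-lang-stats"] as the first axis step, but I don't think it will be necessary.
  • We're matching class attributes by exact string comparison, so watch out: an element that carries additional classes (for example class="lang foo") will not match.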
  • To return only the text value, append /text() to the query.
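To illustrate the remark about class attributes, here is a small sketch using Python's xml.etree.ElementTree. The class name highlighted and the helper spans_with_class are invented for the example. With a full XPath 1.0 engine, the usual idiom for a token match is contains(concat(' ', normalize-space(@class), ' '), ' lang '); ElementTree does not support that, so the sketch splits the attribute in Python instead.

```python
import xml.etree.ElementTree as ET


def spans_with_class(root, name):
    """Return <span> elements whose class attribute contains `name` as a token."""
    return [
        span for span in root.iter("span")
        if name in span.get("class", "").split()
    ]


# Hypothetical markup: the span carries a second class next to "lang".
doc = ET.fromstring('<li><span class="lang highlighted">Python</span></li>')

# Exact string comparison on the attribute: no match, because
# "lang highlighted" is not equal to "lang".
exact = doc.findall(".//span[@class='lang']")

# Token-based comparison: matches.
tokens = spans_with_class(doc, "lang")

print(len(exact))      # 0
print(tokens[0].text)  # Python
```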

That said, GitHub offers an API that also lets you query repository languages. It is better to use that than to scrape the site: APIs are fast, easy to use, and stable, whereas web pages are front-end code that changes often and will break your XPath queries.

You can query it by requesting a dedicated URI (for example https://api.github.com/repos/scrapy/scrapy/languages), which returns a JSON object that is easy to parse and sort:

{
  "Shell": 1733,
  "Python": 1195439,
  "CSS": 9681
}
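A minimal sketch of parsing and sorting such a response in Python, using only the standard library. The function names are invented for the example, and the handling of an empty object assumes that a repo without detected languages yields {} (I have not verified that for every case). fetch_languages makes a real network request and is subject to GitHub's rate limits for unauthenticated clients, so the demonstration at the bottom uses the sample response from above instead.

```python
import json
import urllib.request


def fetch_languages(owner, repo):
    """Fetch the language statistics for a repository (network request)."""
    url = f"https://api.github.com/repos/{owner}/{repo}/languages"
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def primary_language(languages):
    """Return the language with the largest value, or None if there is none."""
    if not languages:
        return None
    return max(languages, key=languages.get)


def ranked_languages(languages):
    """Return (language, value) pairs, largest value first."""
    return sorted(languages.items(), key=lambda item: item[1], reverse=True)


# The sample response shown above, parsed without touching the network.
sample = json.loads('{"Shell": 1733, "Python": 1195439, "CSS": 9681}')

print(primary_language(sample))  # Python
print(ranked_languages(sample))  # [('Python', 1195439), ('CSS', 9681), ('Shell', 1733)]
print(primary_language({}))      # None
```

With network access, primary_language(fetch_languages("scrapy", "scrapy")) would do the whole lookup in one line.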
Jens Erat
0

The XPath is div/ol/li/a/span[@class="lang"]/text(). It returns nothing if anything along the path is missing.
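A quick sketch of that behaviour in Python with xml.etree.ElementTree, which has no text() step, so the text is read from .text instead. The wrapping body element and the helper lang_texts are invented for the example. Note that because this path goes through the a element, a list item without a link also yields nothing.

```python
import xml.etree.ElementTree as ET

# The path from the answer, minus text(), evaluated relative to the
# parent of the <div>.
STRICT_PATH = "div/ol/li/a/span[@class='lang']"


def lang_texts(markup):
    """Return the text of every span matched by STRICT_PATH."""
    root = ET.fromstring(markup)
    return [span.text for span in root.findall(STRICT_PATH)]


linked = (
    '<body><div><ol><li><a href="/scrapy/scrapy/search?l=python">'
    '<span class="lang">Python</span></a></li></ol></div></body>'
)
# A list item without an <a>, like the "Other" entry in the question.
unlinked = (
    '<body><div><ol><li><span class="other">'
    '<span class="lang">Other</span></span></li></ol></div></body>'
)
# No <ol> at all, as for an empty repo.
empty = "<body><div></div></body>"

print(lang_texts(linked))    # ['Python']
print(lang_texts(unlinked))  # []
print(lang_texts(empty))     # []
```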

Jonas Bötel