
On the website http://www.apkmirror.com/apk/redditinc/reddit/reddit-1-5-5-release/reddit-1-5-5-android-apk-download/, I'm trying to extract the lines containing the Min: and Target: versions of Android (see screenshot below).

[Screenshot of the page showing the Min: and Target: Android version lines]

In the Scrapy shell, so far I've come up with the XPath expression

In [1]: android_version = response.xpath('//*[@title="Android version"]/following-sibling::*[@class="appspec-value"]')

so that chaining .//text() and extract() returns several lines, including the ones I want:

In [2]: android_version_text = android_version.xpath('.//text()').extract()

In [3]: android_version_text
Out[3]: 
[u'\n',
 u'Min: Android 4.0.3 (Ice Cream Sandwich MR1, API 15) ',
 u'\n',
 u'Target: Android 6.0 (Marshmallow, API 23)',
 u'\n']

I would now like to refine the XPath expression to get only the text nodes containing "Min:" or "Target:". Following the question "XPath contains(text(),'some string') doesn't work when used with node with more than one Text subnode", I've tried

In [7]: android_version.xpath('.//*[contains(text(), "Min:"]')

but this gives rise to a

ValueError: XPath error: Invalid expression in .//*[contains(text(), "Min:"]

How could I construct an XPath expression to get only the Min: line, for example?

Kurt Peek

1 Answer


Following https://blog.scrapinghub.com/2014/07/17/xpath-tips-from-the-web-scraping-trenches/, I came up with the following:

In [12]: android_min_version = response.xpath('//*[@title="Android version"]/following-sibling::*[@class="appspec-value"]//text()[starts-with(., "Min:")]')

In [13]: android_min_version.extract()
Out[13]: [u'Min: Android 4.0.3 (Ice Cream Sandwich MR1, API 15) ']

In short, to filter for the text you want, use an ordinary //text() step followed by the predicate [contains(., "target_string")], where "target_string" is the string you are searching for. (Here I used starts-with instead of contains.)
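As a minimal sketch, the same filter can be tried outside the Scrapy shell with lxml. The HTML snippet below is hypothetical: it only mimics the sibling structure that the XPath in the question implies, and is not copied from the actual page. The sketch also shows an `or` predicate that picks up both the Min: and Target: lines. Separately, the expression in the question fails to parse because its contains( call is never closed; the last query is a balanced variant of it that uses the //text() approach.

```python
from lxml import html

# Hypothetical markup: an element titled "Android version" followed by a
# sibling with class "appspec-value" that holds the two lines of interest.
SNIPPET = """
<div>
  <div class="appspec-icon" title="Android version"></div>
  <div class="appspec-value">
    <div>Min: Android 4.0.3 (Ice Cream Sandwich MR1, API 15) </div>
    <div>Target: Android 6.0 (Marshmallow, API 23)</div>
  </div>
</div>
"""

tree = html.fromstring(SNIPPET)

# Same base expression as in the question.
base = '//*[@title="Android version"]/following-sibling::*[@class="appspec-value"]'

# Select text nodes first, then filter each node on its own string value (.).
min_line = tree.xpath(base + '//text()[starts-with(., "Min:")]')

# Both lines in one query, combining two tests with "or".
both = tree.xpath(
    base + '//text()[starts-with(., "Min:") or starts-with(., "Target:")]'
)

# Balanced variant of the attempt from the question, using contains().
min_contains = tree.xpath(base + '//text()[contains(., "Min:")]')

# Strip surrounding whitespace left over from the markup.
min_clean = [t.strip() for t in min_line]
both_clean = [t.strip() for t in both]
min_contains_clean = [t.strip() for t in min_contains]

print(min_clean)
print(both_clean)
```

In Scrapy the same expressions should work unchanged with response.xpath(...).extract(), since only the XPath string differs from the session above.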

Kurt Peek