I'm using scrapy to try and scrape some data that I need off Google Scholar. Consider, as an example, the following link: http://scholar.google.com/scholar?q=intitle%3Apython+xpath

Now, I'd like to scrape all the titles off this page. The process that I am following is as follows:

scrapy shell "http://scholar.google.com/scholar?q=intitle%3Apython+xpath"

which gives me the scrapy shell, inside which I do:

>>> sel.xpath('//h3[@class="gs_rt"]/a').extract()

[
 u'<a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.122.4438&amp;rep=rep1&amp;type=pdf"><b>Python </b>Paradigms for XML</a>', 
 u'<a href="https://svn.eecs.jacobs-university.de/svn/eecs/archive/bsc-2009/sbhushan.pdf">NCClient: A <b>Python </b>Library for NETCONF Clients</a>', 
 u'<a href="http://hal.archives-ouvertes.fr/hal-00759589/">PALSE: <b>Python </b>Analysis of Large Scale (Computer) Experiments</a>', 
 u'<a href="http://i.iinfo.cz/r2/kd/xmlprague2007.pdf#page=53"><b>Python </b>and XML</a>', 
 u'<a href="http://www.loadaveragezero.com/app/drx/Programming/Languages/Python/">drx: <b>Python </b>Programming Language [Computers: Programming: Languages: <b>Python</b>]-loadaverageZero</a>', 
 u'<a href="http://www.worldcolleges.info/sites/default/files/py10.pdf">XML and <b>Python </b>Tutorial</a>', 
 u'<a href="http://dl.acm.org/citation.cfm?id=2555791">Zato\u2014agile ESB, SOA, REST and cloud integrations in <b>Python</b></a>', 
 u'<a href="ftp://ftp.sybex.com/4021/4021index.pdf">XML Processing with Perl, <b>Python</b>, and PHP</a>', 
 u'<a href="http://books.google.com/books?hl=en&amp;lr=&amp;id=El4TAgAAQBAJ&amp;oi=fnd&amp;pg=PT8&amp;dq=python+xpath&amp;ots=RrFv0f_Y6V&amp;sig=tSXzPJXbDi6KYnuuXEDnZCI7rDA"><b>Python </b>&amp; XML</a>', 
 u'<a href="https://code.grnet.gr/projects/ncclient/repository/revisions/efed7d4cd5ac60cbb7c1c38646a6d6dfb711acc9/raw/docs/proposal.pdf">A <b>Python </b>Module for NETCONF Clients</a>'
]

As you can see, this output is raw HTML that needs cleaning. I now have a good sense of how to clean this HTML up. The simplest way is probably to just use BeautifulSoup and try something like:

t = sel.xpath('//h3[@class="gs_rt"]/a').extract()
soup = BeautifulSoup(t)
text_parts = soup.findAll(text=True)
text = ''.join(text_parts)

This is based on an earlier SO question. A regexp-based version was suggested there, but I am guessing that BeautifulSoup will be more robust.
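Roughly what I have in mind is the following (a sketch, assuming t is the list of HTML fragments returned by extract() above; BeautifulSoup parses a single string, so each fragment is handled separately):

from bs4 import BeautifulSoup

# t is the list of HTML fragments from extract() above
# (one entry shown here, with the href shortened for illustration)
t = [u'<a href="http://..."><b>Python </b>Paradigms for XML</a>']

titles = []
for fragment in t:
    soup = BeautifulSoup(fragment)                    # parse one fragment at a time
    titles.append(u''.join(soup.findAll(text=True)))  # join all of its text nodes

print(titles)  # [u'Python Paradigms for XML']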

I'm a scrapy n00b and can't figure out how to embed this in my spider. I tried:

from scrapy.spider import Spider
from scrapy.selector import Selector
from bs4 import BeautifulSoup

from scholarscrape.items import ScholarscrapeItem

class ScholarSpider(Spider):
    name = "scholar"
    allowed_domains = ["scholar.google.com"]
    start_urls = [
        "http://scholar.google.com/scholar?q=intitle%3Apython+xpath"
    ]

    def parse(self, response):
        sel = Selector(response)
        item = ScholarscrapeItem()        
        t = sel.xpath('//h3[@class="gs_rt"]/a').extract()
        soup = BeautifulSoup(t)
        text_parts = soup.findAll(text=True)
        text = ''.join(text_parts)
        item['title'] = text
        return(item)

But that didn't quite work. Any suggestions would be helpful.


Edit 3: Based on suggestions, I have modified my spider file to:

from scrapy.spider import Spider
from scrapy.selector import Selector
from bs4 import BeautifulSoup

from scholarscrape.items import ScholarscrapeItem

class ScholarSpider(Spider):
    name = "dmoz"
    allowed_domains = ["sholar.google.com"]
    start_urls = [
        "http://scholar.google.com/scholar?q=intitle%3Anine+facts+about+top+journals+in+economics"
    ]

    def parse(self, response):
        sel = Selector(response)
        item = ScholarscrapeItem()        
        titles = sel.xpath('//h3[@class="gs_rt"]/a')

        for title in titles:
            title = item.xpath('.//text()').extract()
            print "".join(title)

However, I get the following output:

2014-02-17 15:11:12-0800 [scrapy] INFO: Scrapy 0.22.2 started (bot: scholarscrape)
2014-02-17 15:11:12-0800 [scrapy] INFO: Optional features available: ssl, http11
2014-02-17 15:11:12-0800 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'scholarscrape.spiders', 'SPIDER_MODULES': ['scholarscrape.spiders'], 'BOT_NAME': 'scholarscrape'}
2014-02-17 15:11:12-0800 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-02-17 15:11:13-0800 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-02-17 15:11:13-0800 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-02-17 15:11:13-0800 [scrapy] INFO: Enabled item pipelines:
2014-02-17 15:11:13-0800 [dmoz] INFO: Spider opened
2014-02-17 15:11:13-0800 [dmoz] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-02-17 15:11:13-0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-02-17 15:11:13-0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-02-17 15:11:13-0800 [dmoz] DEBUG: Crawled (200) <GET http://scholar.google.com/scholar?q=intitle%3Apython+xml> (referer: None)
2014-02-17 15:11:13-0800 [dmoz] ERROR: Spider error processing <GET http://scholar.google.com/scholar?q=intitle%3Apython+xml>
 Traceback (most recent call last):
   File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/base.py", line 1178, in mainLoop
     self.runUntilCurrent()
   File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/base.py", line 800, in runUntilCurrent
     call.func(*call.args, **call.kw)
   File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 368, in callback
     self._startRunCallbacks(result)
   File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 464, in _startRunCallbacks
     self._runCallbacks()
 --- <exception caught here> ---
   File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 551, in _runCallbacks
     current.result = callback(current.result, *args, **kw)
   File "/Users/krishnan/work/research/journals/code/scholarscrape/scholarscrape/spiders/scholar_spider.py", line 20, in parse
     title = item.xpath('.//text()').extract()
   File "/Library/Python/2.7/site-packages/scrapy/item.py", line 65, in __getattr__
     raise AttributeError(name)
 exceptions.AttributeError: xpath

2014-02-17 15:11:13-0800 [dmoz] INFO: Closing spider (finished)
2014-02-17 15:11:13-0800 [dmoz] INFO: Dumping Scrapy stats:
 {'downloader/request_bytes': 247,
  'downloader/request_count': 1,
  'downloader/request_method_count/GET': 1,
  'downloader/response_bytes': 108851,
  'downloader/response_count': 1,
  'downloader/response_status_count/200': 1,
  'finish_reason': 'finished',
  'finish_time': datetime.datetime(2014, 2, 17, 23, 11, 13, 196648),
  'log_count/DEBUG': 3,
  'log_count/ERROR': 1,
  'log_count/INFO': 7,
  'response_received_count': 1,
  'scheduler/dequeued': 1,
  'scheduler/dequeued/memory': 1,
  'scheduler/enqueued': 1,
  'scheduler/enqueued/memory': 1,
  'spider_exceptions/AttributeError': 1,
  'start_time': datetime.datetime(2014, 2, 17, 23, 11, 13, 21701)}
2014-02-17 15:11:13-0800 [dmoz] INFO: Spider closed (finished)


Edit 2: My original question was quite different, but I am now convinced that this is the right way to proceed. Original question (and first edit below):

I'm using scrapy to try and scrape some data that I need off Google Scholar. Consider, as an example, the following link:

http://scholar.google.com/scholar?q=intitle%3Apython+xpath

Now, I'd like to scrape all the titles off this page. The process that I am following is as follows:

scrapy shell "http://scholar.google.com/scholar?q=intitle%3Apython+xpath"

which gives me the scrapy shell, inside which I do:

>>> sel.xpath('string(//h3[@class="gs_rt"]/a)').extract()
[u'Python Paradigms for XML']

As you can see, this only selects the first title, and none of the others on the page. I can't figure out what I should modify my XPath to, so that I select all such elements on the page. Any help is greatly appreciated.


Edit 1: My first approach was to try:

>>> sel.xpath('//h3[@class="gs_rt"]/a/text()').extract()
[u'Paradigms for XML', u'NCClient: A ', u'Library for NETCONF Clients', 
 u'PALSE: ', u'Analysis of Large Scale (Computer) Experiments', u'and XML', 
 u'drx: ', u'Programming Language [Computers: Programming: Languages: ',
 u']-loadaverageZero', u'XML and ', u'Tutorial', 
 u'Zato\u2014agile ESB, SOA, REST and cloud integrations in ', 
 u'XML Processing with Perl, ', u', and PHP', u'& XML', u'A ', 
 u'Module for NETCONF Clients']

The problem with this approach is that if you look at the actual Google Scholar page, you will see that the first entry is actually 'Python Paradigms for XML' and not 'Paradigms for XML' as scrapy returns. My guess is that 'Python' is trapped inside <b> tags, which is why text() is not doing what we want it to do.
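A tiny self-contained check of this guess, using the parsel selector library on a simplified fragment:

from parsel import Selector

# simplified fragment modeled on the Scholar markup above
html = u'<h3 class="gs_rt"><a href="#"><b>Python </b>Paradigms for XML</a></h3>'
sel = Selector(text=html)

# /text() only matches the <a> element's direct text children,
# so the text inside the nested <b> is skipped:
print(sel.xpath('//h3[@class="gs_rt"]/a/text()').extract())
# ['Paradigms for XML']

# the descendant axis //text() picks up the <b> text too, but as a separate string:
print(sel.xpath('//h3[@class="gs_rt"]/a//text()').extract())
# ['Python ', 'Paradigms for XML']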

krishnan
  • Why are you casting it to string? I'd remove that first, or go for `//h3[@class="gs_rt"]/a/text()` – Wrikken Feb 17 '14 at 21:49
  • @Wrikken, that was my first attempt! However, take the first entry as an example. When I try your suggested approach, I get only 'Paradigms for XML' instead of 'Python Paradigms for XML'. I guess this is because 'Python' is trapped inside tags, and text() does not pick it up. Does that make any sense? (Question edited) – krishnan Feb 17 '14 at 21:57
  • Well, you can get the separate nodes with `//h3[@class="gs_rt"]/a//text()` of course, but I take it you want to cast the whole content of that `/a` to _one_ string? – Wrikken Feb 17 '14 at 22:07
  • @Wrikken, yes, that's getting closer to a solution. However, as you rightly said, I want them to be in the same string. – krishnan Feb 17 '14 at 22:09
  • Hm, I don't think XPath itself is really suited for it. I agree with Tomalak's remark it's likely better to find the `/a`'s and just get their text-content in application code. I say likely better: what I mean is I wouldn't know how to make XPath behave as you want ;) – Wrikken Feb 17 '14 at 22:15
  • Thanks @Wrikken, I've changed the question around to reflect my needs. Think it's clear enough? I've got some idea but not enough to get my code working. – krishnan Feb 17 '14 at 22:53

3 Answers

This is a really interesting and rather difficult question. The problem you're facing is that "Python" in the title is in bold, so it is treated as a node, while the rest of the title is simply text; therefore text() extracts only the plain textual content and not the content of the <b> node.

Here's my solution. First get all the links:

titles = sel.xpath('//h3[@class="gs_rt"]/a')

then iterate over them and select all the textual content of each node; in other words, join the <b> node with the text nodes for each child of the link:

for item in titles:
    title = item.xpath('.//text()').extract()
    print "".join(title)

This works because inside the for loop you are dealing with the textual content of each link's children, so you can join the matching pieces. Inside the loop, title will be, for instance, [u'Python ', u'Paradigms for XML'] or [u'NCClient: A ', u'Python ', u'Library for NETCONF Clients'].
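Wired into a spider, this could look roughly as follows (a sketch, assuming the ScholarscrapeItem from the question defines a title field). Note that it yields one item per title; assigning inside the loop and returning once would keep only the last title:

from scrapy.spider import Spider
from scrapy.selector import Selector

from scholarscrape.items import ScholarscrapeItem  # assumed to define a 'title' field

class ScholarSpider(Spider):
    name = "scholar"
    allowed_domains = ["scholar.google.com"]
    start_urls = [
        "http://scholar.google.com/scholar?q=intitle%3Apython+xpath"
    ]

    def parse(self, response):
        sel = Selector(response)
        # keep these as selectors -- do not call .extract() on the node-set
        for link in sel.xpath('//h3[@class="gs_rt"]/a'):
            item = ScholarscrapeItem()
            item['title'] = u''.join(link.xpath('.//text()').extract())
            yield item  # one item per title, instead of a single return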

Pawel Miech
  • Hi @Pawelmhm, thanks! I tried this, but I get the following error: `Traceback (most recent call last): File "", line 2, in File "/Library/Python/2.7/site-packages/scrapy/item.py", line 65, in __getattr__ raise AttributeError(name) AttributeError: xpath` Any suggestions? – krishnan Feb 17 '14 at 23:02
  • don't extract links, just use selectors, ```titles = sel.xpath('//h3[@class="gs_rt"]/a')``` and not ```titles = sel.xpath('//h3[@class="gs_rt"]/a').extract()``` – Pawel Miech Feb 17 '14 at 23:03
  • if by that you mean do `sel.xpath('//h3[@class="gs_rt"]/a')` instead of `sel.xpath('//h3[@class="gs_rt"]/a').extract()`, I did that. From what I can tell, I followed your code exactly – krishnan Feb 17 '14 at 23:05
  • Hey @Pawelmhm weird beans! Let me edit the question and put up the new spider file. Maybe I'm being silly about something obvious. (edits are in now) – krishnan Feb 17 '14 at 23:08
  • Hey @Pawelmhm, I figured out what was wrong -- I already had something called item :P your code works now when I invoke print. But if I try to assign it to a scrapy item, then I only get the very last line. ie. `f = open('test.txt','w') for thing in titles: title = thing.xpath('.//text()').extract() item['title'] = "".join(title) return item` gives me only `{'title': u'pymzML\u2014Python module for high-throughput bioinformatics on mass spectrometry data'}`. Do you know what I'm doing wrong? – krishnan Feb 17 '14 at 23:43
  • @krishnan Nothing. That's a Unicode escape sequence you see because Python will only print ANSI characters to screen during debugging. [Write them to a UTF-8-encoded file](http://stackoverflow.com/q/5483423/18771), for example, and they will appear as actual characters. – Tomalak Feb 18 '14 at 07:34
  • Sorry everyone, I grossly overcomplicated the affair, but in the end @Pawelmhm was 100% right. I just needed some time to understand that :-) – krishnan Feb 19 '14 at 20:40

The XPath string() function only returns the string representation of the first node you pass to it.

Just extract nodes normally, don't use string().

sel.xpath('//h3[@class="gs_rt"]/a').extract()

or

sel.xpath('//h3[@class="gs_rt"]/a/text()').extract()
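A minimal sketch of that first-node behaviour, using the parsel selector library on a simplified two-result fragment (not Scrapy-specific):

from parsel import Selector

# two simplified results modeled on the Scholar markup in the question
html = u'''
<h3 class="gs_rt"><a><b>Python </b>Paradigms for XML</a></h3>
<h3 class="gs_rt"><a>NCClient: A <b>Python </b>Library for NETCONF Clients</a></h3>
'''
sel = Selector(text=html)

# XPath 1.0 string() converts only the FIRST node of the node-set:
print(sel.xpath('string(//h3[@class="gs_rt"]/a)').extract())
# ['Python Paradigms for XML']

# selecting the nodes themselves returns every match:
print(len(sel.xpath('//h3[@class="gs_rt"]/a').extract()))
# 2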
Tomalak
  • Hey @Tomalak, I just added a comment to a question in the comments above that I think highlights why I cannot follow the text() approach. (Question edited) – krishnan Feb 17 '14 at 21:58
  • XPath selects *nodes*. Scrapy currently supports version 1.0, where you cannot do a whole lot more than that. Select the nodes you need (presumably the `a` elements) and process them in a second step. *(With the more versatile XPath 2.0 you could do `//h3[@class="gs_rt"]/a/string()`, but that won't work with scrapy.)* – Tomalak Feb 17 '14 at 22:03
  • Hey @Tomalak thanks. Unfortunately, I'm not quite sure what the right way to proceed now should be. text() is no good. I could extract without using text() but then I have all this HTML gunk that I don't know what to do with. Ideally, I don't want to clean this stuff on a case by case basis. Any suggestions for a neater way to do it? – krishnan Feb 17 '14 at 22:04
  • As I said. Extract the `a` elements and process them with Python. Getting their text value should not be too hard. – Tomalak Feb 17 '14 at 22:08
  • Sorry @Tomalak, I should have been clearer. I understand that is the way forward. However, I'm a python, xpath n00b and don't know what the most straight forward way to clean up is. Can you point me to the tools that might do the job? – krishnan Feb 17 '14 at 22:12
  • I currently don't have scrapy installed, so I can't give you the exact step-by-step solution. I suppose you will have to read the documentation and play around. `for node in sel.xpath('//h3[@class="gs_rt"]/a'):` is a good start, I think. – Tomalak Feb 17 '14 at 22:29
  • Thanks @Tomalak, I've changed the question around to reflect my needs. Think it's clear enough? I've got some idea but not enough to get my code working. – krishnan Feb 17 '14 at 22:52

The first entry is actually 'Python Paradigms for XML' and not 'Paradigms for XML' as Scrapy returns.

You need to use normalize-space(), which returns the whole string value of each node with whitespace normalized, whereas text() returns the individual text nodes, so text nested inside child elements such as <b> gets split off. Your initial XPath would then look like this:

sel.xpath('//h3[@class="gs_rt"]/a').xpath("normalize-space()").extract()

Example:

# HTML has been simplified

from parsel import Selector

html = '''
<a href="https://erdincuzun.com/wp-content/uploads/download/plovdiv_2018_01.pdf"><span class=gs_ctg2>[PDF]</span> erdincuzun.com</a>
<div class="gs_ri">
    <h3 class="gs_rt">
        <span class="gs_ctc"><span class="gs_ct1">[PDF]</span><span class="gs_ct2">[PDF]</span></span>
        <a href="https://erdincuzun.com/wp-content/uploads/download/plovdiv_2018_01.pdf">Comparison of <b>Python </b>libraries used for Web
            data extraction</a>
'''

selector = Selector(text=html)

# get() returns the first matching result as a string
print("Without normalize-space:\n", selector.xpath('//*[@class="gs_rt"]/a/text()').get())
print("\nWith normalize-space:\n", selector.xpath('//*[@class="gs_rt"]/a').xpath("normalize-space()").get())


"""
Without normalize-space:
 Comparison
        of 

With normalize-space:
 Comparison of Python libraries used for Web data extraction
"""

Actual code to get titles from Google Scholar organic results:

import scrapy

class ScholarSpider(scrapy.Spider):
    name = "scholar_titles"
    allowed_domains = ["scholar.google.com"]
    start_urls = ["https://scholar.google.com/scholar?q=intitle%3Apython+xpath"]

    def parse(self, response):
        for quote in response.xpath('//*[@class="gs_rt"]/a'):
            yield {
                "title": quote.xpath("normalize-space()").get()
            }

Run it:

$ scrapy runspider -O <file_name>.jl <file_name>.py 
  • -O stands for overwrite: it replaces the output file if it already exists, while -o appends to it.
  • jl is the JSON Lines file format.

Output with normalize-space():

{"title": "Comparison of Python libraries used for Web data extraction"}
{"title": "Approaching the largest 'API': extracting information from the internet with python"}
{"title": "News crawling based on Python crawler"}
{"title": "A survey on python libraries used for social media content scraping"}
{"title": "Design and Implementation of Crawler Program Based on Python"}
{"title": "DECEPTIVE SECURITY USING PYTHON"}
{"title": "Hands-On Web Scraping with Python: Perform advanced scraping operations using various Python libraries and tools such as Selenium, Regex, and others"}
{"title": "Python Paradigms for XML"}
{"title": "Using Web Scraping In A Knowledge Environment To Build Ontologies Using Python And Scrapy"}
{"title": "XML processing with Python"}

Output without normalize-space():

{"title": "Comparison of "}
{"title": "libraries used for Web data extraction"}
{"title": "Approaching the largest 'API': extracting information from the internet with "}
{"title": "News crawling based on "}
{"title": "crawler"}
{"title": "A survey on "}
{"title": "libraries used for social media content scraping"}
{"title": "Design and Implementation of Crawler Program Based on "}
{"title": "DECEPTIVE SECURITY USING "}
{"title": "Hands-On Web Scraping with "}
{"title": ": Perform advanced scraping operations using various "}
{"title": "libraries and tools such as Selenium, Regex, and others"}
{"title": "Paradigms for XML"}
{"title": "Using Web Scraping In A Knowledge Environment To Build Ontologies Using "}
{"title": "And Scrapy"}
{"title": "XML processing with "}

Alternatively, you can achieve this with the Google Scholar Organic Results API from SerpApi.

It's a paid API with a free plan. You don't have to figure out the extraction, maintain it over time, scale it, or bypass blocks from search engines, since that is already done for the end user.

Example code to integrate:

from serpapi import GoogleScholarSearch

params = {
  "api_key": "Your SerpApi API key",  # API Key
  "engine": "google_scholar",         # parsing engine
  "q": "intitle:python XPath",        # search query
  "hl": "en"                          # language
}

search = GoogleScholarSearch(params)  # where extraction happens on the SerpApi back-end
results = search.get_dict()           # JSON -> Python dictionary

for result in results["organic_results"]:
    title = result["title"]
    print(title)

Output:

Comparison of Python libraries used for Web data extraction
Approaching the largest 'API': extracting information from the internet with python
News crawling based on Python crawler
A survey on python libraries used for social media content scraping
Design and Implementation of Crawler Program Based on Python
DECEPTIVE SECURITY USING PYTHON
Hands-On Web Scraping with Python: Perform advanced scraping operations using various Python libraries and tools such as Selenium, Regex, and others
Python Paradigms for XML
Using Web Scraping In A Knowledge Environment To Build Ontologies Using Python And Scrapy
XML processing with Python

Disclaimer: I work for SerpApi.

Dmitriy Zub