I often want to partition a scraped HTML page by splitting it at specific nodes. How can this be done with Scrapy and Python?
Example
I want a function split() that splits an HTML response such as
<h1>Heading 1</h1>
<p>Paragraph for heading</p>
<h2>Section 1</h2>
<h3>Subsection 1</h3>
<h2>Section 2</h2>
<p>Paragraph 1 for Section 2</p>
<p>Paragraph 2 for Section 2</p>
<h3>Subsection 2</h3>
at each heading, resulting in the following output (shown here as a list of lists of strings, but other data structures, e.g. an iterator over SelectorList, would be fine too):
[
["<h1>Heading 1</h1>", "<p>Paragraph for heading</p>"],
["<h2>Section 1</h2>"],
["<h3>Subsection 1</h3>"],
["<h2>Section 2</h2>", "<p>Paragraph 1 for Section 2</p>", "<p>Paragraph 2 for Section 2</p>"],
["<h3>Subsection 2</h3>"]
]
As input, split() needs the response and the nodes specifying where to split it. Ideally, these nodes are given as a SelectorList, so that they can be specified via XPath, e.g.
split(response, response.xpath("//*[self::h1 or self::h2 or self::h3]"))
Could the approach from "XPath select all elements between two specific elements" be generalized to perform such a split at given nodes?
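To make the intended behaviour concrete, here is a rough sketch for the flat case shown above. It is only an illustration under assumptions, not a finished solution: it assumes all nodes of interest are siblings (selected here by a hypothetical `candidates` XPath that defaults to the children of `<body>`), and it assumes that two selectors pointing at the same node share the same underlying element object in `.root`, so that nodes can be matched by identity.

```python
def split(response, split_nodes, candidates="//body/*"):
    """Group sibling nodes into chunks, starting a new chunk at each split node.

    `response` only needs an .xpath() method; `split_nodes` is SelectorList-like,
    i.e. an iterable of selectors exposing .root and .get().
    `candidates` is a hypothetical parameter naming the flat list of nodes to group.
    """
    # Keep the split selectors alive so the ids of their roots stay valid.
    split_nodes = list(split_nodes)
    # Assumption: selectors for the same node share one underlying element in .root.
    boundaries = {id(sel.root) for sel in split_nodes}

    chunks = []
    for sel in response.xpath(candidates):
        # Open a new chunk at every split node, and also for any
        # leading content that precedes the first split node.
        if id(sel.root) in boundaries or not chunks:
            chunks.append([])
        chunks[-1].append(sel.get())
    return chunks
```

With this, the call `split(response, response.xpath("//*[self::h1 or self::h2 or self::h3]"))` is meant to produce the grouping shown earlier. It does not cope with split nodes nested at different depths, which is the part I would like to generalize.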