Split an HTML page at specific nodes using Scrapy

Question

I often want to partition a scraped HTML page by splitting it at specific nodes. How can this be done with Scrapy and Python?

Example

I want a function split() that splits an HTML response such as

<h1>Heading 1</h1>
<p>Paragraph for heading</p>
<h2>Section 1</h2>
<h3>Subsection 1</h3>
<h2>Section 2</h2>
<p>Paragraph 1 for Section 2</p>
<p>Paragraph 2 for Section 2</p>
<h3>Subsection 2</h3>

at each heading, producing the output below. It is shown here as a list of lists of strings, but other data structures (e.g. an iterator over SelectorList objects) would be fine, too:

[
  ["<h1>Heading 1</h1>", "<p>Paragraph for heading</p>"],
  ["<h2>Section 1</h2>"],
  ["<h3>Subsection 1</h3>"],
  ["<h2>Section 2</h2>", "<p>Paragraph 1 for Section 2</p>", "<p>Paragraph 2 for Section 2</p>"],
  ["<h3>Subsection 2</h3>"]
]

As input, split() needs the response and the nodes at which to split it. Ideally, these nodes are given as a SelectorList, so that they can be specified via XPath, e.g.

split(response, response.xpath("//*[self::h1 or self::h2 or self::h3]"))

Could the approach from "XPath select all elements between two specific elements" be generalized to perform such a split at given nodes?
