Difference in fetching response elements : Absolute and relative XPath in Scrapy using Firebug and XPath Checker extensions

Question

This is probably a very conceptual question, and Stack Overflow has a wealth of resources on Scrapy and building XPaths, but I did not find anything that answers this specifically, so I am asking.

While building my XPath expressions for Scrapy (in Python) using Firebug and XPath Checker (independently), I see two different ways to build my XPaths. I know that for a particular HTML hierarchy there can be many possible XPaths that extract the elements of interest. I also understand that you can generate either an absolute or a relative XPath (in FirePath).

More specifically:

Sample use case: trying to scrape a page on eBay

scrapy shell http://www.ebay.com/sch/Coats-Jackets-/57988/i.html

-- Using XPath Checker -- [Works OK, after removing tbody from the XPath]

XPath = id('ResultSetItems')/table/tbody/tr/td/div/div/div/div/div/h4/a/text()

hxs.select("id('ResultSetItems')/table/tr/td/div/div/div/div/div/h4/a/text()").extract()

-- Using relative path in FirePath -- [Works OK, after removing tbody from the XPath]

XPath = .//*[@id='ResultSetItems']/table[1]/tbody/tr/td[1]/div/div/div/div/div[2]/h4/a/@href

hxs.select(".//*[@id='ResultSetItems']/table[1]/tr/td[1]/div/div/div/div/div[2]/h4/a/@href").extract()

-- Using absolute path in FirePath -- [Does not work, even after removing tbody from the XPath]

XPath = html/body/div[5]/div[2]/div[3]/div[1]/div/div/div[2]/div/div[6]/div/table[1]/tbody/tr/td[1]/div/div/div/div/div[2]/h4/a/@href

hxs.select("html/body/div[5]/div[2]/div[3]/div[1]/div/div/div[2]/div/div[6]/div/table[1]/tr/td[1]/div/div/div/div/div[2]/h4/a/@href").extract()

(This returns nothing, even after removing tbody.)

Note that I get results only after I explicitly remove "tbody" from the XPath, and even that does not help for absolute paths generated via FirePath.
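The manual "remove tbody" step described above could be automated along the following lines. This is only a sketch: `strip_tbody` is a hypothetical helper, not part of Scrapy, and it assumes the raw page HTML really has no tbody elements. If the page source does contain tbody, stripping it would break an otherwise correct XPath.

```python
import re

# Hypothetical helper (not a Scrapy API): matches a "/tbody" location step,
# with or without a positional predicate such as tbody[1].
_TBODY_STEP = re.compile(r"/tbody(?:\[\d+\])?(?=/|$)")


def strip_tbody(xpath):
    """Return the XPath with every tbody step removed."""
    return _TBODY_STEP.sub("", xpath)


# XPath as copied from the browser extension (taken from the question).
browser_xpath = "id('ResultSetItems')/table/tbody/tr/td/div/div/div/div/div/h4/a/text()"
print(strip_tbody(browser_xpath))
```

The cleaned string is what would then be passed to hxs.select(...).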

Q1: Why do I need to remove "tbody", and are there other such elements, besides tbody, that Firefox inserts into the middle of the XPath and that I should remove before fetching responses (using hxs.select) or building my item pipeline?

A possible explanation I found: "Firefox, in particular, is known for adding <tbody> elements to tables. Scrapy, on the other hand, does not modify the original page HTML, so you won't be able to extract any data if you use <tbody> in your XPath expressions." Source: Firefox; see also: Parsing HTML with XPath, Python and Scrapy

Q2: When I use an absolute path from the FirePath pane, the query returns nothing even after removing tbody. Why is that?

Q3: Is there a best practice on which of the two, Firebug or XPath Checker, works better (read: more robust/consistent)? If yes, which one, and why?

Q4 (unrelated): Some people recommend disabling JavaScript in the browser while building XPaths. Is this related, and is disabling JavaScript a standard practice? What are the repercussions of not doing so while scraping, if any?

Related: Xpath Table Within Table; Parsing HTML with XPath, Python and Scrapy

score 1 · Answer 1 · edited May 23 '17 at 11:55

Q1

Adding the tbody tag is the browser's way of following the HTML4 specification:

<!ELEMENT TABLE - -
     (CAPTION?, (COL*|COLGROUP*), THEAD?, TFOOT?, TBODY+)>
<!ATTLIST TABLE                        -- table element --
  %attrs;                              -- %coreattrs, %i18n, %events --
  summary     %Text;         #IMPLIED  -- purpose/structure for speech output--
  width       %Length;       #IMPLIED  -- table width --
  border      %Pixels;       #IMPLIED  -- controls frame width around table --
  frame       %TFrame;       #IMPLIED  -- which parts of frame to render --
  rules       %TRules;       #IMPLIED  -- rulings between rows and cols --
  cellspacing %Length;       #IMPLIED  -- spacing between cells --
  cellpadding %Length;       #IMPLIED  -- spacing within cells --
  >

In other words, per the specification a tr element cannot be a direct child of table. A browser inserts tbody when it sees that it is missing. HTML5, on the other hand, allows tr directly inside table; browsers now keep inserting tbody only for backwards compatibility.
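The effect can be reproduced without a browser. The sketch below uses Python's standard-library xml.etree.ElementTree as a stand-in for Scrapy's selector, and RAW_HTML is a made-up, well-formed snippet (not the real eBay page) in which tr sits directly under table, as it would in the raw response. The same three path shapes should behave the same way when passed to hxs.select, though that is an assumption rather than something tested here.

```python
import xml.etree.ElementTree as ET

# Made-up markup standing in for the raw (un-normalised) page source:
# there is no tbody between table and tr.
RAW_HTML = """
<html><body>
  <div id="ResultSetItems">
    <table>
      <tr><td><h4><a href="/itm/1">Wool coat</a></h4></td></tr>
      <tr><td><h4><a href="/itm/2">Rain jacket</a></h4></td></tr>
    </table>
  </div>
</body></html>
"""

root = ET.fromstring(RAW_HTML)

# Path as a browser extension would report it (tbody included): no match.
with_tbody = root.findall(".//table/tbody/tr/td/h4/a")

# Same path with the tbody step removed: matches the raw markup.
without_tbody = root.findall(".//table/tr/td/h4/a")

# Using // between table and tr matches whether or not tbody is present.
either_way = root.findall(".//table//tr/td/h4/a")

print(len(with_tbody), len(without_tbody), len(either_way))
print([a.text for a in without_tbody])
```

Writing table//tr instead of table/tr is one way to avoid depending on whether the source contains tbody at all.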
