I am trying to set up a Scrapy selector to fetch data from a table on Trezor's supported coins page (https://trezor.io/coins/):

In [1]: import requests
   ...: from scrapy.selector import Selector
   ...: req = requests.get('https://trezor.io/coins/').text
   ...: xs = '//*[@id="content"]/tr'
   ...: sel = Selector(text=req).xpath(xs)

In [2]: sel.extract_first()
Out[2]: '<tr class="coin  " data-href="./#BTC" id="BTC"></tr>'

Shouldn't the selector return the tr element and everything inside it (in this case, six td elements with more inner elements)? When I try to access the td elements directly (with either xs = '//*[@id="content"]/tr[1]/td' or xs = '//*[@id="content"]/tr[1]/td[1]'), all I get is an empty list. I have also tried getting child nodes, but to no avail.

Compare this with extracting from Wikipedia's main page, where you get everything inside the specified container:

In [3]: req2 = requests.get('https://en.wikipedia.org/wiki/Main_Page').text
   ...: xd = '//*[@id="mp-welcomecount"]'
   ...: sel2 = Selector(text=req2).xpath(xd)

In [4]: sel2.extract_first()
Out[4]: '<div id="mp-welcomecount">\n<div id="mp-welcome">Welcome to <a href="/wiki/Wikipedia" title="Wikipedia">Wikipedia</a>,</div>\n<div id="mp-free">the <a href="/wiki/Free_content" title="Free content">free</a> <a href="/wiki/Encyclopedia" title="Encyclopedia">encyclopedia</a> that <a href="/wiki/Help:Introduction" title="Help:Introduction">anyone can edit</a>.</div>\n<div id="articlecount"><a href="/wiki/Special:Statistics" title="Special:Statistics">6,088,421</a> articles in <a href="/wiki/English_language" title="English language">English</a></div>\n</div>'

Why is it that in Trezor's case I only get the empty tr element, and how do I correct my code to return everything contained inside it?

manoelpqueiroz

1 Answer

Scrapy seems to mis-parse this page (there is an error around the closing tr tag). In the parsed tree there is no parent-child relationship between the tr and td elements; they all end up as siblings. Structure of the parsed page:

tr
td
 span
  img
td
 strong
 small
 a
td
 img
td
 img
td
 a
 a
 a
 a
td
 a
 a
tr
...

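Since the td elements are siblings of the tr elements, you could use the following XPath expression, which selects every td that follows the first coin row, to fetch all the data from the table: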

//tr[contains(@class,"coin")][1]/following-sibling::td

Output: 8364 nodes

Or look for a magic option in the Scrapy settings.
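Because the cells come back as one flat run of siblings, they still have to be regrouped per coin. Below is a minimal sketch of one way to do that, assuming the flattened structure shown above. It does not use Scrapy's API; it swaps in Python's standard-library html.parser and attaches each td to the most recent tr start tag, ignoring where the closing tr tag sits. The names RowGrouper, group_rows and SAMPLE are hypothetical, and SAMPLE is a made-up fragment that only imitates the page.

```python
from html.parser import HTMLParser


class RowGrouper(HTMLParser):
    """Attach the text of each <td> to the most recently opened <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []       # one [row_id, cells] pair per <tr> start tag
        self._in_td = False  # True while we are between <td> and </td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            # A new row starts; where its </tr> appears does not matter here.
            self.rows.append([dict(attrs).get("id"), []])
        elif tag == "td" and self.rows:
            # Open a new cell (a list of text pieces) in the current row.
            self.rows[-1][1].append([])
            self._in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        text = data.strip()
        if self._in_td and text:
            self.rows[-1][1][-1].append(text)


def group_rows(html):
    """Return a list of (row_id, [cell_text, ...]) tuples."""
    parser = RowGrouper()
    parser.feed(html)
    parser.close()
    return [(row_id, [" ".join(cell) for cell in cells])
            for row_id, cells in parser.rows]


# Hypothetical fragment imitating the flattened structure (not the real page).
SAMPLE = (
    '<tr class="coin" id="BTC"></tr>'
    '<td><strong>Bitcoin</strong> <small>BTC</small></td>'
    '<td><a href="#">Wallet</a></td>'
    '<tr class="coin" id="LTC"></tr>'
    '<td><strong>Litecoin</strong> <small>LTC</small></td>'
)

for row_id, cells in group_rows(SAMPLE):
    print(row_id, cells)
```

The same grouping idea could be applied to the nodes returned by the XPath expression above; the event-based parser is used here only to keep the sketch free of third-party dependencies.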

E.Wiest
  • I'm curious as to how you found out the nodes were siblings and not hierarchical. I assumed the structure was correct when using developer tools on my browser, why would it also display the relationship between `tr` and `td` incorrectly? – manoelpqueiroz May 30 '20 at 13:49
  • With Scrapy shell, you can use `view(response)`. This will create a temp file (`tmpxxxxx.html`) in the following folder (Windows): `C:/Users/yourusername/AppData/Local/Temp/`. You can then study the file with Notepad++ or similar tools. – E.Wiest May 30 '20 at 14:06