Objective

Scrape an HTML table from the Warframe wikia.

Background

I am trying to get the information from a table on the Warframe wikia, the Mods List table. To achieve this, I read the HTML-parser on Node.js topic and concluded that using YQL was my best option.

Code

Using Google Chrome Dev Tools and two Chrome extensions, CSS and XPath checker and XPath Helper, I was able to pinpoint the exact location of the table I am looking for with the following XPath query:

//*[@id="mw-content-text"]/div[33]/div/div[1]/table/tbody

Now, Chrome says this is the correct path, and the plugins I am using suggest it as well.

Problem

The problem is that when I use YQL, the JSON result is completely different from the table I am expecting. In fact, it returns a different table together with miscellaneous data.

I am baffled as to why this is happening. The wikia is a simple HTML page with little to no dynamic content, so I really can't understand why I am getting erroneous results.

What could the problem be?

Flame_Phoenix

1 Answer

Unfortunately, YQL does not work properly with pages whose content is loaded over time, as is the case with the wikia.

So, even though the XPath is correct, when Yahoo makes its first (and only) request, it receives incomplete HTML and never completes it.

To fix the issue, I decided instead to parse the HTML locally in my Node.js server using the request and cheerio npm packages.

The first package downloads the full page HTML, and the second parses it for the information I am looking for.

This is an effective solution that, instead of relying on a third-party tool, moves all the work to my server.

Hope this helps someone in the future!

Flame_Phoenix