
I'm parsing HTML with a regex in Node.js to extract a string. However, I was told in this post that this is not a good idea: Pull a specific string from an HTTP request in node.js

What are the more stable alternatives?

I'm new to programming, so links to tutorials would be very helpful. I have trouble understanding some of the documentation explanations.

mnort9
  • You've already been informed of the issue, but you should probably read [this](http://goo.gl/i8h6) to be fully informed. The basic issue is the theoretical "power" of the "machine" model behind regular expressions versus what is required to parse a language like HTML. It comes down to language/automata theory. – Pointy Apr 07 '12 at 22:24
  • See also: http://stackoverflow.com/questions/7372972/how-do-i-parse-a-html-page-with-node-js – HoLyVieR Apr 07 '12 at 22:58

1 Answer


node-htmlparser handles all of the heavy lifting of parsing HTML. On top of that, node-soupselect lets you use CSS-style selectors to find the particular element you're looking for.

However, I looked at your other question, and the question you should really be asking is not "how do I scrape this data from an HTML page", but rather "is there a better way to retrieve the data I'm looking for?" The USGS has APIs that provide their data in machine-readable form.

Here's the JSON object for the location you're interested in. To get the "most recent instantaneous value" for the elevation of the reservoir surface, you'd download that file, parse it into d with JSON.parse, and then:

var result;
for (var i = 0; i < d.value.timeSeries.length; i++) {
    var series = d.value.timeSeries[i];
    if (series.variable.variableName === 'Elevation of reservoir water surface above datum, ft') {
        // The last entry in the value array is the most recent reading.
        var readings = series.values[0].value;
        result = readings[readings.length - 1];
    }
}

result will now look like { dateTime: "2012-04-07T17:15:00.000-05:00", value: "1065.91" }.
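To make the nesting of that JSON easier to see, here is a minimal sketch of the same lookup wrapped in a function and run against a small hand-written sample. The helper name latestReading and the sample object are assumptions made for illustration: the sample copies only the fields the loop touches, and everything in it except the final reading is made up. The real USGS response contains many more fields.

```javascript
// Hypothetical helper (not part of any USGS client): returns the most recent
// reading for the time series whose variableName matches, or null if none does.
function latestReading(d, variableName) {
  var allSeries = d.value.timeSeries;
  for (var i = 0; i < allSeries.length; i++) {
    if (allSeries[i].variable.variableName === variableName) {
      var readings = allSeries[i].values[0].value;
      // Readings are assumed to be in chronological order, so the last is newest.
      return readings[readings.length - 1];
    }
  }
  return null;
}

// Trimmed, hand-written sample showing only the shape the lookup relies on.
// The first series and the earlier reading are invented for the example.
var sample = {
  value: {
    timeSeries: [
      {
        variable: { variableName: 'Some other variable' },
        values: [{ value: [{ dateTime: '2012-04-07T17:00:00.000-05:00', value: '1.0' }] }]
      },
      {
        variable: { variableName: 'Elevation of reservoir water surface above datum, ft' },
        values: [{
          value: [
            { dateTime: '2012-04-07T17:00:00.000-05:00', value: '1065.90' },
            { dateTime: '2012-04-07T17:15:00.000-05:00', value: '1065.91' }
          ]
        }]
      }
    ]
  }
};

var latest = latestReading(sample, 'Elevation of reservoir water surface above datum, ft');
console.log(latest.dateTime, latest.value);
```

Returning null when nothing matches lets the caller tell "no such series" apart from a real reading, which the bare loop leaves as undefined.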

josh3736
  • Do I define `var d = JSON.parse` and the for statement in my `http.get` callback? – mnort9 Apr 09 '12 at 22:06
  • `http.get(..., function(res) { ... });` will call your callback when it makes a connection and *begins receiving data* -- not when it is complete. You have to listen for data (`res.on('data', function(chunk) { ... });`) and buffer the incoming data, which you can then use to call `JSON.parse(bufferString)` once `res` emits `end`. [See here for example.](http://nodemanual.org/latest/nodejs_dev_guide/creating_http_requests.html) – josh3736 Apr 09 '12 at 22:24
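The buffering described in the last comment can be sketched roughly as follows. The helper name readJson is hypothetical, and the fake response object exists only so the example runs without a network connection. With a real request you would pass the res object that http.get hands to your callback instead.

```javascript
var EventEmitter = require('events').EventEmitter;

// Hypothetical helper: collects every 'data' chunk from a response-like
// stream and parses the accumulated text as JSON once 'end' fires.
function readJson(res, callback) {
  var body = '';
  res.on('data', function (chunk) {
    body += chunk; // a chunk may hold only part of the document
  });
  res.on('end', function () {
    var parsed;
    try {
      parsed = JSON.parse(body);
    } catch (err) {
      return callback(err);
    }
    callback(null, parsed);
  });
}

// With a real request it would look roughly like this (url is a placeholder):
//   http.get(url, function (res) {
//     res.setEncoding('utf8'); // deliver chunks as strings
//     readJson(res, function (err, d) { /* run the for loop on d here */ });
//   });

// Stand-in for a response that arrives in two pieces.
var fakeRes = new EventEmitter();
var demoResult = null;
readJson(fakeRes, function (err, d) {
  demoResult = err || d;
});
fakeRes.emit('data', '{"value":{"time');
fakeRes.emit('data', 'Series":[]}}');
fakeRes.emit('end');
console.log(demoResult);
```

Neither chunk is valid JSON on its own, which is why parsing has to wait for end rather than happen inside the data handler.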