
My problem lies in two parts, but I'm hoping solving one will fix the other. I've been trying to parse through a page and get all the comments found within a forum thread.

The comments are found using a RegEx pattern, and the idea is that whatever lies in each comment will be read into an array until there aren't any more comments left. Each comment div follows this format:

<div id="post_message_480683" style="margin-right:2px;"> something </div>

I'm trying to match up to "post_message_[some number]", since each number seems to be generated randomly, and then grab whatever lies inside that particular div. My first problem is that my RegEx just doesn't seem to be working; I've tried a few, but none yielded any results (except when I insert the post message number manually). Here's the code so far:

function GetPosts() {
  var posts = new Array(60);
  var url = "http://forums.blackmesasource.com/showthread.php?p=480683";
  var geturl = UrlFetchApp.fetch(url).getContentText().toString();
  var post_match = geturl.match(/<div id="post_message_(.+)" style="margin-right:2px;">(\w.+)<\/div>/m);
  Logger.log(post_match);
}
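For context, this is roughly the shape I'm aiming for: a global pattern plus an exec() loop so that every post on the page gets collected instead of only the first. The lazy [\s\S]+? body is my own guess and only holds if a post never contains a nested <div> (quoted posts may break it):

function GetPosts() {
  var url = "http://forums.blackmesasource.com/showthread.php?p=480683";
  var html = UrlFetchApp.fetch(url).getContentText();
  // The g flag lets exec() keep walking through the page, one post div per call;
  // the lazy [\s\S]+? stops at the first closing </div> after each opening tag.
  var pattern = /<div id="post_message_(\d+)" style="margin-right:2px;">([\s\S]+?)<\/div>/g;
  var posts = [];
  var match;
  while ((match = pattern.exec(html)) !== null) {
    posts.push(match[2]); // match[1] is the post number, match[2] is the post body
  }
  Logger.log(posts.length + " posts found");
  return posts;
}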

Edit: I initially tried getting this info via GAS's Xml.parse() method, but after grabbing the URL I just didn't know what to do, since suffixing

.getElement().getElement('div') (I also tried .getElements('div') and other variations with 'body' & 'html') 

would cause an error. Here is the last code attempt I tried before trying the RegEx route:

function TestArea() {
  var url = "http://forums.blackmesasource.com/showthread.php?p=480683";
  var geturl = UrlFetchApp.fetch(url).getContentText().toString();

  // after this point things stop making sense
  var parseurl = Xml.parse(geturl, true);
  Logger.log(geturl);

  // None of this makes sense because I don't know HOW!
  // The idea: store each cleaned-up message div in an array called posts
  // (usually it's no more than 50 per page), then
  // use a for loop to write each message into a row in a Google Spreadsheet
  for (var i = 0; i <= parseurl - 1; i++) {
    var display = parseurl[i];
    Logger.log(parseurl);
  }
}
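To make the end goal from those comments concrete, the spreadsheet-writing half I have in mind would be something like the sketch below. It assumes GetPosts() eventually returns an array of post strings; appendRow() on the active sheet is the standard GAS way to add one row at a time:

function WritePosts() {
  var posts = GetPosts(); // assumed to return an array of post strings
  var sheet = SpreadsheetApp.getActiveSpreadsheet().getActiveSheet();
  // Write each post into its own row, one column wide
  for (var i = 0; i < posts.length; i++) {
    sheet.appendRow([posts[i]]);
  }
}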

Thanks for reading!

Weej Jamal

1 Answer


In general, as the comment points out, be wary of parsing HTML with RegEx.

In my past personal experience, I've used Yahoo's YQL platform to run the HTML through, applying XPath on their service. It seems to work decently well for simple, reliable markup. You can then turn that into a JSON or XML REST service that you can grab via UrlFetch and work on that simplified response. No endorsement here, but this might be easier than bringing the full raw HTML down into Google Apps Script. See below for the YQL console. I also don't know what their quotas are; you should review that.
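To illustrate, here is a rough sketch only: the query.yahooapis.com endpoint, the html table, and the XPath below are assumptions you should verify in the YQL console (along with the quotas) before relying on them:

function GetPostsViaYQL() {
  var target = "http://forums.blackmesasource.com/showthread.php?p=480683";
  // Assumed YQL query: pull only the post divs out of the page via XPath
  var yql = 'select * from html where url="' + target + '" ' +
            'and xpath=\'//div[starts-with(@id,"post_message_")]\'';
  var endpoint = "http://query.yahooapis.com/v1/public/yql?format=json&q=" +
                 encodeURIComponent(yql);
  var response = UrlFetchApp.fetch(endpoint).getContentText();
  var data = JSON.parse(response);
  // The shape of data.query.results depends on the page markup;
  // log it first and adapt the extraction from there.
  Logger.log(data);
}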

Of course, the best option is to convince the site owner to provide an RSS feed or an API.

YQL console

Arun Nagarajan
  • Thanks so much for replying! I've never heard of YQL, so I'll need to look into it and learn it, I guess. So this takes care of the immediate issue, though I still would like to learn GAS. I admit the code I posted isn't the best, but I didn't know that RegEx and HTML was akin to summoning Cthulhu (@BoPersson's response); the code in my question was written while I was grasping. Initially I tried using GAS's own Xml.parse() but wasn't able to actually parse through the result, so I've edited the question to include my previous attempt without regex. Any feedback you have will help immensely. Thanks again! – Weej Jamal Oct 17 '12 at 11:19