
I have a web page that I need to extract information from.

There are multiple <article> tags that need to be cycled through (I need to extract content from within them). Each article tag has many attributes, "id", "class", etc.

I have no idea how to write the Regex that I require.

What I have so far is:

<article ([a-zA-Z\s"\S][^>]*)>

This matches every opening <article> tag along with its attributes, but I don't know how to capture the content WITHIN the tags.

I feel like I need to write regex similar to: "get everything within <article ([a-zA-Z\s"\S][^>]*)> until you reach the next </article> tag.", but have no idea how to do that...
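A minimal sketch of that "everything until the next </article>" idea in JavaScript. It assumes articles are never nested and that no attribute value contains a `>` character; the function name `matchArticles` and the shape of the returned objects are hypothetical, chosen only for illustration. The lazy `[\s\S]*?` is what stops the match at the first closing tag rather than the last one.

```javascript
// Hypothetical helper: collects the attribute string and inner content
// of every <article>...</article> pair found in an HTML string.
function matchArticles(html) {
  // Group 1: raw attribute text of the opening tag.
  // Group 2: everything up to the nearest </article> (lazy match).
  var pattern = /<article\b([^>]*)>([\s\S]*?)<\/article>/gi;
  var results = [];
  var match;
  // With the "g" flag, exec() resumes after the previous match on each call.
  while ((match = pattern.exec(html)) !== null) {
    results.push({
      attributes: match[1].trim(),
      content: match[2]
    });
  }
  return results;
}
```

A pattern like this is fragile on real-world markup (nested articles, comments, or scripts containing the tag text would confuse it), so it is only a reasonable fit for HTML with a known, regular structure.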

Thanks for your input

user1298740
  • It's important to let us know what language you're doing this in - each language has its own regular expression idiosyncrasies. You mentioned below that it's browser JavaScript? Are you using any frameworks, or importing any libraries? Underscore, jQuery, Angular, etc.? – FrankieTheKneeMan Nov 29 '14 at 21:53
  • Sorry about that, I'm using jQuery, however I'm not rendering the HTML so I don't believe I can access the articles as selectors. – user1298740 Nov 29 '14 at 22:22
  • That's where you're wrong, actually - if you have it in a string, you can parse it into a jquery document very easily: [Check Out this documentation](http://api.jquery.com/jQuery/#jQuery2), and [this documentation](http://api.jquery.com/find/). I think what you're looking for is: `$(htmlString).find('article').each(function(index, element) { /* Do Work */});` – FrankieTheKneeMan Nov 29 '14 at 23:19

3 Answers


Regex? Please reconsider. From one of your comments: "I was building this for a Chrome Extension so it was being done with JavaScript." Then I suggest you use the browser's built-in XML DOM parser.

To load XML from a string variable xmlText (if the markup is HTML that is not well-formed XML, pass "text/html" instead of "text/xml"):

var parser = new DOMParser();
var xmlDoc = parser.parseFromString(xmlText, "text/xml");

To load XML from a separate XML file:

var xhttp = new XMLHttpRequest();
xhttp.open("GET", "articles.xml", false);
xhttp.send();
var xmlDoc = xhttp.responseXML;

This yields a convenient object structure that you can navigate through.

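var articles = xmlDoc.getElementsByTagName('article');
for (var i = 0; i < articles.length; i++) {
    var article = articles[i];
    var id = article.getAttribute('id');
    // "class" is a reserved word in JavaScript, so use a different variable name.
    var className = article.getAttribute('class');
    // nodeValue is null for element nodes; textContent returns the text inside the element.
    var content = article.textContent;
    // ...
}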
Ruud Helderman

Depending on your programming language, you can probably find an HTML parsing library. If you cannot, you could use a library that parses XML loosely (a parser that doesn't require a fully valid XML document). You could then get a list of article elements and process them individually. With an HTML parser you can probably read out the attributes as well!

If the above does not work, you could split the text on </article>, then split each resulting piece on <article and take the second element of that array. You can then split that on > and you will be left with the element's attributes at the first index and the content at the second. If anybody finds a regex solution that answers this question better, please let me know!
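The splitting steps above can be sketched roughly as follows. This is an illustration under assumptions, not a robust parser: it assumes articles are not nested and that attribute values contain no `>` character, and the function name `splitArticles` is hypothetical. One deviation from the description: instead of splitting on every `>`, it cuts at the first `>` only, because the article content usually contains further tags.

```javascript
// Hypothetical helper: extracts attributes and inner content of each
// <article> using only string splitting, no regex and no DOM.
function splitArticles(html) {
  var results = [];
  // Every chunk except the last one ends right before a closing </article>.
  var chunks = html.split('</article>');
  for (var i = 0; i < chunks.length - 1; i++) {
    // Index 1 holds the text that follows the opening "<article".
    var afterOpen = chunks[i].split('<article')[1];
    if (afterOpen === undefined) continue; // no opening tag in this chunk
    // The first ">" ends the opening tag; the rest is the article content.
    var gt = afterOpen.indexOf('>');
    if (gt === -1) continue; // malformed opening tag
    results.push({
      attributes: afterOpen.slice(0, gt).trim(),
      content: afterOpen.slice(gt + 1)
    });
  }
  return results;
}
```

Calling `splitArticles(htmlString)` returns an array of `{ attributes, content }` objects, one per article, with `attributes` left as a raw string that would still need parsing to get at individual values such as "id".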

Hope it helps.

Pim

Pim de Witte
  • Thanks for your response, but I was building this for a Chrome Extension so it was being done with JavaScript. I may decide to create a PHP page that I can pass HTML to, which will then return all of the articles. – user1298740 Nov 29 '14 at 21:39

Normally, I hate when people give this answer, but: jQuery can do that for you! Since you're already using jQuery, take advantage of the secondary functionality of the jQuery function to parse the HTML string into a series of DOM nodes. You can then use the find function to query the descendants of your top node. Your final code will wind up looking something like this:

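// .find() only searches descendants, so wrap the parsed nodes in a
// container; otherwise top-level <article> elements would be missed.
$('<div>')
    .append($.parseHTML(htmlString))
    .find('article')
    .each(function(index, article) {
        // Extract information from $(article).
    });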
FrankieTheKneeMan