
I'm trying to extract data from the following HTML using regex and need some help.

This is the text that I have saved to a variable after fetching it from a website.

[ '<td bgcolor="#2D2F34">&nbsp;<font color="#999999">Name: </font><a href="site.php?page=send&sendto=Username"><font color="#999999">Username</font></a>&nbsp;&nbsp;&nbsp;</td>', '<td bgcolor="#2D2F34">&nbsp;<font color="#999999">Crew: </font><a href="site.php?page=crewprofile&id=2120"><font color="#999999">My Crew</font></a> </td>', '<td bgcolor="#2D2F34">&nbsp;<font color="#999999">Wealth: Rich</font></td>', '<td bgcolor="#2D2F34">&nbsp;<font color="#999999">Rank: Hitman</td>', '<td bgcolor="#2D2F34">&nbsp;<font color="#999999">Status: Alive ( </font><font color=green>Online</font><font color="#999999"> )</font><tr><td bgcolor="#2D2F34">&nbsp;<font color="#999999">Messages sent: 3</font></td>', '<td bgcolor="#2D2F34">&nbsp;<font color="#999999">Messages received: 1</font></td>' ]

This text can also consist of more or fewer tags, as this is fetched from a website where each 'profile' is different.

What I'd like it to return is

Name: Username   
Crew: My Crew   
Wealth: Rich   
Rank: Hitman
Status: Alive ( Online )
Messages sent: 3
Messages received: 1

All help is appreciated! Thanks

Roko C. Buljan
  • 1
    Use an HTML parser, not regex ([obligatory link](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#1732454)). There are several for Node.js. – T.J. Crowder Aug 26 '18 at 16:38
  • To improve your question, providing the full superset of any expected string matches and the values you want them to have would be beneficial. Also, as @T.J.Crowder already said, using an HTML parser would be much more efficient as well. – AlexanderGriffin Aug 26 '18 at 16:44
  • @BhojendraRauniyar please read: *`"...after fetching it from a website"`* - so there's still hope – Roko C. Buljan Aug 26 '18 at 16:47
  • @RokoC.Buljan, ah, I didn't notice. – Bhojendra Rauniyar Aug 26 '18 at 16:49
  • @BhojendraRauniyar As Roko said, this is the response after fetching a table from a website. – Jan Henning Aug 26 '18 at 16:49
  • @JanHenning, yeah, but you shouldn't rely on such a ghost website. – Bhojendra Rauniyar Aug 26 '18 at 16:51
  • @BhojendraRauniyar why? Who said the site is built using ` `? It might be a completely W3C compliant website (just, clearly, an old one). – Roko C. Buljan Aug 26 '18 at 16:51
  • @RokoC.Buljan It doesn't have to be HTML5, but old-fashioned markup like font, marquee, etc. is really worthless; they may update their website at any time, and parsing data from their site is something of a headache. – Bhojendra Rauniyar Aug 26 '18 at 16:54
  • @BhojendraRauniyar He's trying to fetch an external website's content using Node.js. I cannot follow any of your comments. They make no sense. – Roko C. Buljan Aug 26 '18 at 16:57
  • 1
    @JanHenning As suggested, it's never good to parse HTML with regex. If in Node, take a look at https://www.npmjs.com/package/jsdom and simply retrieve `textContent` out of Elements - I'd just advise to first create an in-memory `<table>` and `<tr>` as wrappers, appending your `<td>` strings as elements - before trying to get the content.
    – Roko C. Buljan Aug 26 '18 at 16:58
  • 1
    @RokoC.Buljan Thank you, will have a look. Sorry for the troubles. – Jan Henning Aug 26 '18 at 16:59

1 Answer


You could use a DocumentFragment to extract the desired data from the <td> elements.
For Node, take a look at a helper library such as jsdom (https://www.npmjs.com/package/jsdom).

const td = [ '<td bgcolor="#2D2F34">&nbsp;<font color="#999999">Name: </font><a href="site.php?page=send&sendto=Username"><font color="#999999">Username</font></a>&nbsp;&nbsp;&nbsp;</td>', '<td bgcolor="#2D2F34">&nbsp;<font color="#999999">Crew: </font><a href="site.php?page=crewprofile&id=2120"><font color="#999999">My Crew</font></a> </td>', '<td bgcolor="#2D2F34">&nbsp;<font color="#999999">Wealth: Rich</font></td>', '<td bgcolor="#2D2F34">&nbsp;<font color="#999999">Rank: Hitman</td>', '<td bgcolor="#2D2F34">&nbsp;<font color="#999999">Status: Alive ( </font><font color=green>Online</font><font color="#999999"> )</font></td>', '<td bgcolor="#2D2F34">&nbsp;<font color="#999999">Messages sent: 3</font></td>', '<td bgcolor="#2D2F34">&nbsp;<font color="#999999">Messages received: 1</font></td>' ];

const tr = document.createElement("tr");
const table = document.createElement("table");
const frag = document.createDocumentFragment(); // Minimal Document wrapper

tr.innerHTML = td.join("");
table.appendChild(tr);
frag.appendChild(table);

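const data = [...frag.querySelectorAll("td")].reduce((ob, cell) => {
  // Split "Label: value" at the first colon; re-join the rest so colons inside the value survive
  const a = cell.textContent.split(':');
  ob[a[0].trim()] = a.slice(1).join(":").trim();
  return ob;
}, {});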

console.log( data );
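
The `split(':')` / `slice(1).join(":")` pair in the reducer above is there so that only the first colon separates the label from the value. Here is a minimal sketch of that step in isolation. The helper name `splitLabel` and the "Last seen" input are hypothetical, made up for illustration, and not part of the profile data:

```javascript
// Hypothetical helper (not from the answer's code): split a cell's text into
// [label, value] at the first colon only.
function splitLabel(text) {
  const parts = text.split(":");
  const label = parts[0].trim();
  // Re-join everything after the first colon so colons inside the value survive.
  const value = parts.slice(1).join(":").trim();
  return [label, value];
}

// "\u00a0" is the character that &nbsp; becomes in textContent; trim() removes it.
console.log(splitLabel("\u00a0Rank: Hitman"));   // label "Rank", value "Hitman"
console.log(splitLabel("Last seen: 16:38"));     // label "Last seen", value "16:38"
```

A plain `text.split(":")[1]` would have returned only `" 16"` for the second input, which is why the remainder is joined back together.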

PS:

Note: in your array you had a </font><tr><td where it should be </font></td>', '<td - which I fixed above (I didn't have to, since it was parsed correctly anyway). So first make sure you're getting a well-formed HTML array at least.

This is exactly why parsing HTML with regex is a bad idea. Even with the above mistake, the HTML is parsed correctly-ish by a real parser, whereas extracting the contents strictly with regex would fail outright.
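
To make that concrete, here is a rough sketch of one naive regex approach (an assumption chosen for illustration; other patterns would fail in other ways) applied to the malformed entry from the question. Deleting every tag has no notion of cell boundaries, so the two cells collapse into one string:

```javascript
// The malformed entry from the question: a second cell is opened with <tr><td
// before the first cell is closed.
const malformed =
  '<td bgcolor="#2D2F34">&nbsp;<font color="#999999">Status: Alive ( </font>' +
  '<font color=green>Online</font><font color="#999999"> )</font>' +
  '<tr><td bgcolor="#2D2F34">&nbsp;<font color="#999999">Messages sent: 3</font></td>';

// Naive approach: delete every tag, turn &nbsp; into a space, collapse whitespace.
const stripped = malformed
  .replace(/<[^>]+>/g, "")
  .replace(/&nbsp;/g, " ")
  .replace(/\s+/g, " ")
  .trim();

console.log(stripped); // "Status: Alive ( Online ) Messages sent: 3"
```

Splitting that string at the first colon would give a "Status" value of "Alive ( Online ) Messages sent: 3", and the "Messages sent" field would be lost entirely.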


Using jsdom for Node, your code should look like this:

const jsdom = require("jsdom");
const { JSDOM } = jsdom;

const td = ['<td bgcolor="#2D2F34">&nbsp;<font color="#999999">Name: </font><a href="site.php?page=send&sendto=Username"><font color="#999999">Username</font></a>&nbsp;&nbsp;&nbsp;</td>', '<td bgcolor="#2D2F34">&nbsp;<font color="#999999">Crew: </font><a href="site.php?page=crewprofile&id=2120"><font color="#999999">My Crew</font></a> </td>', '<td bgcolor="#2D2F34">&nbsp;<font color="#999999">Wealth: Rich</font></td>', '<td bgcolor="#2D2F34">&nbsp;<font color="#999999">Rank: Hitman</td>', '<td bgcolor="#2D2F34">&nbsp;<font color="#999999">Status: Alive ( </font><font color=green>Online</font><font color="#999999"> )</font></td>', '<td bgcolor="#2D2F34">&nbsp;<font color="#999999">Messages sent: 3</font></td>', '<td bgcolor="#2D2F34">&nbsp;<font color="#999999">Messages received: 1</font></td>'];

const dom = new JSDOM(`<table><tr>${td.join("")}</tr></table>`);
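const doc = dom.window.document; // a full Document here, not a fragment

// Build { label: value } from each cell's text, splitting at the first colon
const data = [...doc.querySelectorAll("td")].reduce((ob, cell) => {
    const a = cell.textContent.split(':');
    ob[a[0].trim()] = a.slice(1).join(":").trim();
    return ob;
}, {});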

console.log( data );
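
The question asked for "Label: value" lines rather than an object. As a minimal sketch, assuming `data` has the shape produced above, the lines can be rebuilt with `Object.entries`. The `sample` object below is a hand-written stand-in using the values from the question's expected output, so the snippet runs without jsdom:

```javascript
// Stand-in for the `data` object built above (values taken from the
// question's expected output).
const sample = {
  "Name": "Username",
  "Crew": "My Crew",
  "Wealth": "Rich",
  "Rank": "Hitman",
  "Status": "Alive ( Online )",
  "Messages sent": "3",
  "Messages received": "1",
};

// Object.entries keeps insertion order for string keys like these,
// so the lines come out in the same order as the cells.
const lines = Object.entries(sample).map(([label, value]) => `${label}: ${value}`);

console.log(lines.join("\n"));
```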
Roko C. Buljan