I need to extract every URL on the twitter.com domain from a JS string of HTML code and store them in an array. I know I need a regex, something like (https?:\/\/(.+?\.)?twitter\.com(\/[A-Za-z0-9\-\._~:\/\?#\[\]@!$&'\(\)\*\+,;\=]*)?). My problem is that I don't know which JS method applies it to the string, although I have searched for it.

My project partner is populating a Google Sheets table, which I save locally as an HTML file. I fetch that file from a separate HTML page and log it to the console, as shown below. My end goal is to collect the Twitter profile links he entered across multiple columns into a JS array for later use.

fetch('Directory.html').then(function (response) {
    return response.text();
}).then(function (html) {
    console.log(html);
}).catch(function (err) {
    console.warn('Ooga booga.', err);
});

Any insight is appreciated. I love this community, blessings to you all.

Edit

Following a comment below, I've implemented this code, yet the Chromium console prints the entire document as if nothing were being filtered. Why is this? I initially tried it without the forward slashes (/) before and after the regex content, but the console then complained of an unexpected : (colon) token. Why is that?

fetch('Directory.html').then(function (response) {
    // The API call was successful!
    return response.text();
}).then(function (html) {
    // This is the HTML from our response as a text string
    console.log(html);
}).catch(function (err) {
    // There was an error
//  console.warn('Something went wrong.', err);
});
const paragraph = html;
const regex = /(https?:\/\/(.+?\.)?twitter\.com(\/[A-Za-z0-9\-\._~:\/\?#\[\]@!$&'\(\)\*\+,;\=]*)?)/;
const found = paragraph.match(regex);

console.log(found);
  • You can start [here](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/match). **But**: Using regex to parse HTML or JavaScript is highly problematic and generally frowned upon. For example, how can you be sure the string you have matched is not within a comment or within a quoted string? You need something more powerful than what a JavaScript regex provides for doing that. – Booboo May 31 '21 at 20:03
  • @Booboo Thank you for the lead. I'm sure I can handle it from here. In this specific case that issue won't arise, but for future reference, what would be better suited? Maybe another language completely? – Howard Crane May 31 '21 at 20:58
  • [Here](https://stackoverflow.com/questions/9540218/a-javascript-parser-for-dom) are some ideas. – Booboo May 31 '21 at 21:23

1 Answer

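Showing my own work here. Many thanks to @Booboo. The matching code now sits inside the second .then callback, where html is actually defined; in my edit above it ran outside the promise chain, where html does not exist.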

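fetch('Directory.html').then(function (response) {
    // The API call was successful; read the body as text
    return response.text();
}).then(function (html) {
    // This is the HTML from our response as a text string
    const paragraph = html;
    const regex = /(https?:\/\/(.+?\.)?twitter\.com(\/[A-Za-z0-9\-\._~:\/\?#\[\]@!$&'\(\)\*\+,;\=]*)?)/g;
    // With the g flag, match returns an array of every match (or null if none)
    const found = paragraph.match(regex);
    console.log(found);
});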

I used a library called csi.js to fetch an external HTML document.

const paragraph = html is probably redundant; html could be passed to match directly.

const regex = defines the pattern for the text I want (URLs such as "https://twitter.com/"), with the g flag to get all instances in the string instead of just the first.
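To make the effect of the g flag concrete, here is a minimal sketch on a made-up string. The sample HTML and the shortened pattern are assumptions for illustration only; they are not the data or the regex from my project.

```javascript
// Hypothetical sample input; the real string comes from response.text().
const sampleHtml =
    '<a href="https://twitter.com/alice">A</a> ' +
    '<a href="https://mobile.twitter.com/bob">B</a>';

// Shortened illustrative pattern (an assumption, not the regex used above).
const withoutG = sampleHtml.match(/https?:\/\/(?:[\w-]+\.)?twitter\.com\/\w+/);
const withG = sampleHtml.match(/https?:\/\/(?:[\w-]+\.)?twitter\.com\/\w+/g);

// Without g: a single match object; index 0 holds the first URL only.
console.log(withoutG[0]); // "https://twitter.com/alice"

// With g: a plain array containing every match in the string.
console.log(withG); // both URLs, in document order
```

Without the g flag the result also carries capture groups and an index, which is why logging it looks different from the plain array of strings you get with g.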

const found = searches the string for the matches and stores them as an array (or null if there are none).

console.log prints the result to the browser's console.
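For later reuse, the matching step could be pulled into a small helper. This is only a sketch under a few assumptions: extractTwitterUrls is a hypothetical name, the subdomain group is tightened to [\w-]+ because (.+?\.) can run across unrelated text on the same line, and duplicates are dropped, which may or may not be what you want.

```javascript
// Hypothetical helper (not part of any library): returns the unique
// twitter.com URLs found in an HTML string, or an empty array if none.
function extractTwitterUrls(html) {
    // Assumption: subdomains consist only of word characters and hyphens.
    const regex = /https?:\/\/(?:[\w-]+\.)*twitter\.com(?:\/[A-Za-z0-9\-._~:\/?#\[\]@!$&'()*+,;=]*)?/g;

    // matchAll yields one match object per hit; m[0] is the full URL.
    const urls = Array.from(html.matchAll(regex), (m) => m[0]);

    // A Set keeps only the first occurrence of each URL, in order.
    return [...new Set(urls)];
}

// Made-up table fragment standing in for the exported sheet.
const demoHtml =
    '<td><a href="https://twitter.com/alice">x</a></td>' +
    '<td>https://www.twitter.com/bob?lang=en</td>' +
    '<td><a href="https://twitter.com/alice">dup</a></td>' +
    '<td>https://example.com/nottwitter</td>';

console.log(extractTwitterUrls(demoHtml));
```

Inside the second .then callback this would be called as const found = extractTwitterUrls(html); and, unlike match, it never returns null.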