From an HTML string with comments like `<!-- paragraph {"className":"123"} -->`, extract only the characters after `className`

Question

There is a string like this

var string =
`<!-- paragraph {"className":"123"} -->
<p>abc</p>
<!-- /paragraph -->
 
<!-- paragraph {"className":"456"} -->
<p>cde</p>
<!-- /paragraph -->
 
<!-- paragraph {"className":"789"} -->
<p>fgh</p>
<!-- /paragraph -->`
 
const regex = /"className":"(.+)"/g;

string = string.replace(regex, "");
console.log(string);

I want to delete all characters except those following the className.

In other words, I want to make it look like 123456789 in the end.

If you can tell me a better way, I would appreciate your advice.

Any method other than replace is fine.

`.replace(/\D/gm, "")`? Or could there be digits elsewhere that you don't want to keep? — T.J. Crowder, Apr 07 '22 at 09:18

Sebastian Simon · Answer 1 · 2022-04-07T10:04:17.140

.+ is a greedy pattern, so the match only stops at the right place because . does not match a line break (without the s flag). This might not be robust.
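To illustrate the greediness concern, here is a minimal sketch. The sample string with an extra "id" key is a hypothetical input, not something from the question; it only shows what could go wrong if more text followed className on the same line.

```javascript
// Hypothetical input: a second key follows className on the same line.
const sample = '<!-- paragraph {"className":"123","id":"x"} -->';

// Greedy: .+ runs to the last double quote on the line.
const greedy = /"className":"(.+)"/.exec(sample)[1];

// Negated character class: stops at the first closing quote.
const bounded = /"className":"([^"]+)"/.exec(sample)[1];

console.log(greedy);  // 123","id":"x
console.log(bounded); // 123
```

A negated character class like [^"]+ is one common way to bound the match; it assumes the value itself never contains an escaped double quote.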

Your string is an HTML string, so an HTML parser such as DOMParser is a more appropriate starting point. Since HTML comments can be placed anywhere in HTML, the HTML parser will automatically place these contents in different parts of the document; wrap the string in a <body>…</body> to make sure everything ends up in one consistent spot. You can later access the contents via .body.childNodes.

Next, use Array.from to convert the list of Nodes into a proper Array and filter it

by nodeType to get only the HTML comment nodes (using the static properties on Node), and
by textContent to get only those comments starting with paragraph and not those starting with /paragraph (using trim and startsWith).

Then map over the resulting comment nodes to get their text contents.

Now it’s a bit unclear what the format is. Is it always one word (with no spaces), followed by a single space, followed by the {…} structure? Can there be multiple {…} structures? Can there be something after the {…} structure? You’ll have to figure this out for yourself and refine any regex, but I’m going to assume that the paragraph…/paragraph thing is analogous to HTML tags, which would mean that the first space is followed by the {…} structure. However, I’m not going to assume that these {"className":"123"} structures are always going to be free of spaces.

Splitting only by the first space, discarding the text before, and keeping the rest can be achieved by splitting by all spaces, taking everything from index 1, and merging everything by a space again: .split(" ").slice(1).join(" ").
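As a small sketch of that splitting step in isolation: commentText is a hypothetical example of one comment's trimmed text, with a space added inside the {…} structure to show that later spaces survive.

```javascript
// Text of one comment node, as it would look after trim().
// The space after the colon is an assumption made for illustration.
const commentText = 'paragraph {"className": "123"}';

// Drop the first word, keep the rest, including any later spaces.
const payload = commentText.split(" ").slice(1).join(" ");

console.log(payload); // {"className": "123"}
```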

The intermediate result is:

[
  "{\"className\":\"123\"}",
  "{\"className\":\"456\"}",
  "{\"className\":\"789\"}"
]

These are JSON strings. Use JSON.parse (in the existing map) to parse everything and access the className property.

Now you have this intermediate result:

[
  "123",
  "456",
  "789"
]

Joining it all with .join("") results in the desired "123456789" string.
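The last two steps can be tried on their own, without a DOM, as a rough sketch. The payloads array simply hard-codes the first intermediate result shown above.

```javascript
// The JSON strings from the first intermediate result.
const payloads = [
  '{"className":"123"}',
  '{"className":"456"}',
  '{"className":"789"}'
];

// Parse each JSON string and keep only its className property.
const classNames = payloads.map((payload) => JSON.parse(payload).className);

// Concatenate the values without a separator.
const joined = classNames.join("");

console.log(joined); // 123456789
```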

Full code

const string =
`<!-- paragraph {"className":"123"} -->
<p>abc</p>
<!-- /paragraph -->
 
<!-- paragraph {"className":"456"} -->
<p>cde</p>
<!-- /paragraph -->
 
<!-- paragraph {"className":"789"} -->
<p>fgh</p>
<!-- /paragraph -->`,
  result = Array.from(new DOMParser().parseFromString(`<body>${string}</body>`, "text/html")
    .body
    .childNodes)
      .filter(({ nodeType, textContent }) => nodeType === Node.COMMENT_NODE && textContent.trim().startsWith("paragraph"))
      .map(({ textContent }) => JSON.parse(textContent.trim()
        .split(" ")
        .slice(1)
        .join(" "))
          .className)
            .join("");

console.log(result);

score 0 · Accepted Answer · answered Apr 07 '22 at 09:24

Use match, then strip the parts you don't want from each match.

var string =
`<!-- paragraph {"className":"123"} -->
<p>abc</p>
<!-- /paragraph -->
 
<!-- paragraph {"className":"456"} -->
<p>cde</p>
<!-- /paragraph -->
 
<!-- paragraph {"className":"789"} -->
<p>fgh</p>
<!-- /paragraph -->`
 
const regex = /"className":"(.+)"/g;
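// With the g flag, match returns every full match (not the capture groups).
let arr = string.match(regex);
// Strip the leading `"className":"` and the trailing quote from each match.
arr = arr.map(e => e.replace(`"className":"`, '').slice(0, -1));
let str = arr.join('');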
console.log(str);

score 0 · Answer 3 · answered Apr 07 '22 at 09:48

What you have tried is replacing, whereas what you need is to find the matches.

Your attempt calls string.replace(regex, ""), which only deletes the matched text.
Passing a callback instead, string.replace(regex, function(match, g1) {/*...*/}), gives you access to the captured group.
Two ways of collecting it, as a string and as an array, are shown below.

var string =
`<!-- paragraph {"className":"123"} -->
<p>abc</p>
<!-- /paragraph -->
 
<!-- paragraph {"className":"456"} -->
<p>cde</p>
<!-- /paragraph -->
 
<!-- paragraph {"className":"789"} -->
<p>fgh</p>
<!-- /paragraph -->`
 
const regex = /"className":"(.+)"/g;


var answer="";
var answerArray=[];
string.replace(regex, function(match, g1) {
/*
g1 is the first capture group (the className value).
Collect it into a string or an array; both are shown below.
*/
answer+=g1;
answerArray.push(g1);
});
console.log(answer);
console.log(answerArray);
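If the goal is only to collect the captured groups, String.prototype.matchAll may read more naturally than calling replace for its side effects. This is a sketch of that alternative, not part of the approach above; the html variable is a shortened copy of the question's string, and the regex uses [^"]+ instead of .+ as an assumption that values contain no double quotes.

```javascript
const html = `<!-- paragraph {"className":"123"} -->
<p>abc</p>
<!-- /paragraph -->

<!-- paragraph {"className":"456"} -->
<p>cde</p>
<!-- /paragraph -->`;

// matchAll requires the g flag; index 1 of each match is the first capture group.
const captures = Array.from(
  html.matchAll(/"className":"([^"]+)"/g),
  (m) => m[1]
);

console.log(captures.join("")); // 123456
```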
