How to get only the text inside div tags and the src of an img tag using a regular expression in JavaScript

Question

I have a string generated in this format:

'fdffddf<div><br> <div><img style="max-width: 7rem;" src="folder/myimg.jpg"><div> <br></div></div><div><div> <br></div></div></div>'.

I want a regular expression that returns, in an array, just the text content (without the div tags) and the source of the image.

Example in my case: [ 'fdffddf', 'folder/myimg.jpg' ]

I tried this:

let str = 'fdffddf<div><br> <div><img style="max-width: 7rem;" src="folder/myimg.jpg"><div> <br></div></div><div><div> <br></div></div></div>'
console.log('only content without div tag and src image only without img tag : ',str.match(/<img [^>]*src="[^"]*"[^>]*>/gm)[0])

It doesn't work: I only get the whole img tag.

How can I do this, please?

Experts always advise NOT to parse HTML with regex; you should use tools/languages which understand HTML well, IMHO. — RavinderSingh13, Oct 25 '22 at 13:06
[Why not to use a regex?](https://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not) - The safe way would be to use `DOMParser()` to parse the markup into actual `HTMLElements` and then extract the relevant parts. — Andreas, Oct 25 '22 at 13:10
Hi @Andreas, would it be possible to do it only in JavaScript with a regular expression, or even two regular expressions, to arrive at the final result? — peace3106, Oct 25 '22 at 13:15
Maybe. Depends on the content of `str`. But I wouldn't do it that way - unless the content of `str` is super-simple and not provided by the user or any external resource. Just use a `DOMParser()`... — Andreas, Oct 25 '22 at 13:19
Hi @Andreas, more simply, if possible, can you give me an example for my case where I just retrieve the content inside or outside the div tag with a regular expression in JavaScript? — peace3106, Oct 25 '22 at 13:24

Peter Thoeny · Accepted Answer · 2022-10-26T17:28:11.207

Please keep in mind that using regex to parse HTML is error-prone. If you want to play it safe, it is better to use an HTML parser.
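For comparison, here is a minimal sketch of the parser route (not part of the original answer). It assumes a browser environment where `DOMParser` is available, and the helper name `extractWithDomParser` is hypothetical. It walks the parsed body in document order and collects non-blank text nodes and img `src` attributes.

```javascript
// Sketch only: relies on the browser's DOMParser, so the usage part
// at the bottom is skipped in environments without it (e.g. plain Node.js).
// extractWithDomParser is a hypothetical helper name.
function extractWithDomParser(html) {
  const doc = new DOMParser().parseFromString(html, 'text/html');
  // Visit text nodes and elements under <body> in document order.
  const walker = doc.createTreeWalker(
    doc.body,
    NodeFilter.SHOW_TEXT | NodeFilter.SHOW_ELEMENT
  );
  const result = [];
  while (walker.nextNode()) {
    const node = walker.currentNode;
    if (node.nodeType === Node.TEXT_NODE) {
      const text = node.nodeValue.trim();
      if (text) result.push(text); // skip whitespace-only text nodes
    } else if (node.nodeName === 'IMG' && node.hasAttribute('src')) {
      result.push(node.getAttribute('src')); // attribute value as written
    }
  }
  return result;
}

const html = 'fdffddf<div><br> <div><img style="max-width: 7rem;" src="folder/myimg.jpg"><div> <br></div></div><div><div> <br></div></div></div>';
if (typeof DOMParser !== 'undefined') {
  console.log(extractWithDomParser(html));
}
```

In a browser console this should log the same two items the question asks for. `getAttribute('src')` is used rather than the `src` property because the property would resolve the path against the page URL.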

Here is a regex approach to extract the img src and to strip the remaining text of HTML:

const str = 'fdffddf<div><br> <div><img style="max-width: 7rem;" src="folder/myimg.jpg"><div> <br></div></div><div><div> <br></div></div></div>';
let arr = str
  .replace(/<img\b.*?src="([^"]*).*?>/, '$1') // extract img src
  .split(/<\/?[a-z]\w*\b[^>]*>/i)  // split on HTML tags
  .filter(s => s.trim()); // filter out empty items and spaces only items
console.log(arr);

Output:

[
  "fdffddf",
  "folder/myimg.jpg"
]

Explanation of .replace() regex:

<img -- start of tag with tag name
\b -- word boundary
.*? -- non-greedy scan until:
src=" -- literal src=" text
([^"]*) -- capture group with everything not a double quote
.*? -- non-greedy scan until:
> -- end of tag
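To make the effect of this step visible on its own, the following sketch logs the intermediate string that `.replace()` produces before any splitting happens. The variable names `html` and `replaced` are only illustrative.

```javascript
const html = 'fdffddf<div><br> <div><img style="max-width: 7rem;" src="folder/myimg.jpg"><div> <br></div></div><div><div> <br></div></div></div>';

// Same regex as in the answer: the whole <img ...> tag is swapped
// for the contents of capture group 1 (the src value).
const replaced = html.replace(/<img\b.*?src="([^"]*).*?>/, '$1');

console.log(replaced);
// fdffddf<div><br> <div>folder/myimg.jpg<div> <br></div></div><div><div> <br></div></div></div>
```

Note that the regex has no `g` flag, so only the first img tag in the string is replaced.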

Explanation of .split() regex:

< -- start of tag
\/? -- optional slash (end tag)
[a-z]\w* -- tag name: single alpha char followed by 0+ word chars
\b -- word boundary after tag name
[^>]* -- scan over anything not end of tag
> -- end of tag
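The replace-then-split idea above handles a single img tag. As a hedged variation (not the answer's code), the sketch below scans the string once with `matchAll` and one alternation, so it also copes with several img tags and with text directly next to an image. The helper name `tokenizeTextAndSrc` and the second sample input are hypothetical, and the pattern still assumes a double-quoted `src` attribute.

```javascript
// Hypothetical helper: one pass over the string with three alternatives.
//   1. an <img> tag with a double-quoted src  -> capture group 1
//   2. any other tag                          -> ignored
//   3. a run of text between tags             -> capture group 2
function tokenizeTextAndSrc(html) {
  const token = /<img\b[^>]*?\ssrc="([^"]*)"[^>]*>|<\/?[a-z][^>]*>|([^<]+)/gi;
  const result = [];
  for (const m of html.matchAll(token)) {
    if (m[1] !== undefined) {
      result.push(m[1]); // img src value
    } else if (m[2] !== undefined && m[2].trim()) {
      result.push(m[2].trim()); // non-blank text
    }
  }
  return result;
}

const sample = 'fdffddf<div><br> <div><img style="max-width: 7rem;" src="folder/myimg.jpg"><div> <br></div></div><div><div> <br></div></div></div>';
console.log(tokenizeTextAndSrc(sample));
// [ 'fdffddf', 'folder/myimg.jpg' ]

console.log(tokenizeTextAndSrc('a<img src="1.png">b<img alt="x" src="2.png">'));
// [ 'a', '1.png', 'b', '2.png' ]
```

This is still regex-based, so the usual caveats about comments, scripts, single-quoted attributes and malformed markup apply.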

score -1 · Answer 2 · answered Oct 25 '22 at 15:03

You can do it with the following regex:

/^(?<!<)(\b[^<>]+\b)(?!>).*(?<=")(\b.+\b)(?=")/

This regex uses two capturing groups, one for the string at the beginning and one for the image source.

Try the following code:

const string = 'fdffddf<div><br> <div><img style="max-width: 7rem;" src="folder/myimg.jpg"><div> <br></div></div><div><div> <br></div></div></div>';
const regex = /^(?<!<)(\b[^<>]+\b)(?!>).*(?<=")(\b.+\b)(?=")/;
const match = string.match(regex);
console.log(`text: ${match[1]} - source: ${match[2]}`);
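One caveat worth checking before relying on this pattern: as far as I can tell, the greedy `.*` makes the second group capture the last double-quoted value in the string, which is the `src` only when no other quoted attribute follows it. The sketch below uses a hypothetical input where `alt` comes after `src`, and adds a null guard because `match()` returns `null` when the string does not start with plain text.

```javascript
const regex = /^(?<!<)(\b[^<>]+\b)(?!>).*(?<=")(\b.+\b)(?=")/;

// Hypothetical input (not from the question): alt follows src.
const reordered = 'abc<div><img src="a.jpg" alt="pic"></div>';
const match = reordered.match(regex);

// Guard against null before reading the capture groups.
const source = match ? match[2] : null;
console.log(source); // pic (the alt value, not the src)

// A string that starts with a tag does not match at all.
console.log('<div>abc</div>'.match(regex)); // null
```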
