I have a string generated in this format:

'fdffddf<div><br> <div><img style="max-width: 7rem;" src="folder/myimg.jpg"><div> <br></div></div><div><div> <br></div></div></div>'.

I want a regular expression that returns, in an array, only the text content (without the div tags) and the source of the image.

Example in my case: [ 'fdffddf', 'folder/myimg.jpg' ]

I tried this:

let str = 'fdffddf<div><br> <div><img style="max-width: 7rem;" src="folder/myimg.jpg"><div> <br></div></div><div><div> <br></div></div></div>'
console.log('only content without div tag and src image only without img tag : ',str.match(/<img [^>]*src="[^"]*"[^>]*>/gm)[0]) 

It doesn't work. I get only the img tag.

How can I do this?

Peter Thoeny
peace3106
  • Experts always advise NOT to parse HTML with regex; you should use tools/languages that understand HTML well, IMHO. – RavinderSingh13 Oct 25 '22 at 13:06
  • [Why not to use a regex?](https://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not) - The safe way would be to use `DOMParser()` to parse the markup into actual `HTMLElements` and then extract the relevant parts. – Andreas Oct 25 '22 at 13:10
  • Hi @Andreas, would it be possible to do it only in JavaScript, with one regular expression or even two, to arrive at the final result? – peace3106 Oct 25 '22 at 13:15
  • Maybe. Depends on the content of `str`. But I wouldn't do it that way - unless the content of `str` is super-simple and not provided by the user or any external resource. Just use a `DOMParser()`... – Andreas Oct 25 '22 at 13:19
  • Hi @Andreas, more simply, if possible, can you give me an example for my case where I just retrieve the content inside or outside the div tag with a regular expression in JavaScript? – peace3106 Oct 25 '22 at 13:24

2 Answers

Please keep in mind that using regex to parse HTML is error-prone. If you want to play it safe, it is better to use an HTML parser.
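For comparison, a minimal sketch of the parser route might look like the following. It assumes the code runs in a browser, where `DOMParser` is available (it is not a Node.js global, hence the guard), and the helper name `extractWithDomParser` is made up for illustration.

```javascript
// Hypothetical helper: returns [text, ...image sources] for an HTML fragment.
function extractWithDomParser(html) {
  // Parse the fragment into a detached HTML document
  const doc = new DOMParser().parseFromString(html, 'text/html');
  // All text nodes of the body, concatenated and trimmed
  const text = doc.body.textContent.trim();
  // The raw src attribute of every <img>, in document order
  const sources = Array.from(doc.querySelectorAll('img'), img => img.getAttribute('src'));
  return [text, ...sources];
}

// Only run the demo where DOMParser exists (browsers)
if (typeof DOMParser !== 'undefined') {
  const html = 'fdffddf<div><br> <div><img style="max-width: 7rem;" src="folder/myimg.jpg"><div> <br></div></div><div><div> <br></div></div></div>';
  console.log(extractWithDomParser(html));
}
```

Note one difference in behavior: this sketch joins all text into a single first item, whereas the regex approach yields one array item per text fragment.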

Here is a regex approach to extract the img src and to strip the remaining text of HTML:

const str = 'fdffddf<div><br> <div><img style="max-width: 7rem;" src="folder/myimg.jpg"><div> <br></div></div><div><div> <br></div></div></div>';
let arr = str
  .replace(/<img\b.*?src="([^"]*).*?>/, '$1') // extract img src
  .split(/<\/?[a-z]\w*\b[^>]*>/i)  // split on HTML tags
  .filter(s => s.trim()); // filter out empty items and spaces only items
console.log(arr);

Output:

[
  "fdffddf",
  "folder/myimg.jpg"
]

Explanation of .replace() regex:

  • <img -- start of tag with tag name
  • \b -- word boundary
  • .*? -- non-greedy scan until:
  • src=" -- literal src=" text
  • ([^"]*) -- capture group with everything not a double quote
  • .*? -- non-greedy scan until:
  • > -- end of tag
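To see the capture group on its own, here is a small sketch that applies the same pattern to a single tag. It also shows why `match()` with the `g` flag returns only whole tags, and how `matchAll()` can collect the captured `src` values instead. The variable names are just for illustration.

```javascript
const imgSrcRegex = /<img\b.*?src="([^"]*).*?>/;
const imgTag = '<img style="max-width: 7rem;" src="folder/myimg.jpg">';

// Without the g flag, match() returns the full match plus capture groups
const single = imgTag.match(imgSrcRegex);
console.log(single[0]); // the whole <img ...> tag
console.log(single[1]); // folder/myimg.jpg

// With the g flag, match() returns only the full matches (no groups)
const globalRegex = new RegExp(imgSrcRegex.source, 'g');
const wholeTags = imgTag.match(globalRegex);
console.log(wholeTags.length); // 1, and the item is the whole tag

// matchAll() keeps the groups for every match
const sources = [...imgTag.matchAll(globalRegex)].map(m => m[1]);
console.log(sources); // [ 'folder/myimg.jpg' ]
```

Since the `.replace()` call in the answer has no `g` flag, it only rewrites the first `<img>` tag; a string with several images would need the `g` flag.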

Explanation of .split() regex:

  • < -- start of tag
  • \/? -- optional slash (end tag)
  • [a-z]\w* -- tag name: single alpha char followed by 0+ word chars
  • \b -- word boundary after tag name
  • [^>]* -- scan over anything not end of tag
  • > -- end of tag
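A short sketch of what the split step produces before filtering, using a smaller made-up input so the intermediate array is easy to read:

```javascript
const tagRegex = /<\/?[a-z]\w*\b[^>]*>/i;

// Splitting on tags leaves empty and whitespace-only pieces behind
const pieces = '<b>a</b> <i>b</i>'.split(tagRegex);
console.log(pieces); // [ '', 'a', ' ', 'b', '' ]

// s.trim() is '' (falsy) for those pieces, so filter() drops them
const kept = pieces.filter(s => s.trim());
console.log(kept); // [ 'a', 'b' ]

// A "<" that is not followed by a letter is not treated as a tag
const untouched = 'a < b'.split(tagRegex);
console.log(untouched); // [ 'a < b' ]
```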
Peter Thoeny

You can do it with the following regex:

/^(?<!<)(\b[^<>]+\b)(?!>).*(?<=")(\b.+\b)(?=")/

This regex uses two capturing groups, one for the string at the beginning and one for the image source.

Try the following code:

const string = 'fdffddf<div><br> <div><img style="max-width: 7rem;" src="folder/myimg.jpg"><div> <br></div></div><div><div> <br></div></div></div>';
const regex = /^(?<!<)(\b[^<>]+\b)(?!>).*(?<=")(\b.+\b)(?=")/;
const match = string.match(regex);
console.log(`text: ${match[1]} - source: ${match[2]}`);
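Note that `match()` returns `null` when the pattern does not match, so `match[1]` would throw on input that does not start with text. A guarded variant could look like this sketch; the function name `extractTextAndSource` is hypothetical.

```javascript
// Hypothetical wrapper around the regex above, returning null on no match.
function extractTextAndSource(input) {
  const regex = /^(?<!<)(\b[^<>]+\b)(?!>).*(?<=")(\b.+\b)(?=")/;
  const match = input.match(regex);
  // match is null if the input does not start with text or has no quoted value
  return match ? [match[1], match[2]] : null;
}

const sample = 'fdffddf<div><br> <div><img style="max-width: 7rem;" src="folder/myimg.jpg"><div> <br></div></div><div><div> <br></div></div></div>';
console.log(extractTextAndSource(sample));        // [ 'fdffddf', 'folder/myimg.jpg' ]
console.log(extractTextAndSource('<div></div>')); // null

// Caveat: the second group is the LAST quoted value, not necessarily src
console.log(extractTextAndSource('text<img src="a.jpg" alt="x">')); // [ 'text', 'x' ]
```

The last call shows that the pattern relies on `src` being the final quoted attribute in the string, which holds for the string in the question but not in general.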