
I have some test data in the following format:

"lorem ipsum <img src='some_url' class='some_class' /> lorem ipsum <img src='some_url' class='some_class' /> ipsum <img src='some_url' class='some_class' />"

My goal is to identify all the image tags, along with their source URLs and CSS classes, and store them together with the remaining text in an ordered array like this:

["lorem ipsum", {imageObject1}, "lorem ipsum", {imageObject2}, "ipsum", {imageObject3}]

For this, I tried to write a sample regex:

var regex = /(.*(<img\s+src=['"](.+)['"]\s+(class=['"].+['"])?\s+\/>)+?.*)+/ig

When I run this regex against the sample text, I get:

regex.exec(sample_text) => [0:"lorem ipsum <img src='some_url1' class='some_class1' /> lorem ipsum <img src='some_url2' class='some_class2' /> ipsum <img src='some_url3' class='some_class3' />"
1:"lorem ipsum <img src='some_url1' class='some_class1' /> lorem ipsum <img src='some_url2' class='some_class2' /> ipsum <img src='some_url3' class='some_class3' />"
2:"<img src='some_url3' class='some_class3' />"
3:"some_url3"
4:"class='some_class3'"]

How can I, in JavaScript, transform the sample HTML text into an array of tagged HTML objects with their attributes?

Andrew Cheong
Harshit Laddha
  • You need to use `String.prototype.match()` – user2226755 Oct 22 '17 at 08:11
  • Well, it's a bit complicated. You could try using a regex inside `.split()` to split the input into the shape of your desired output. You could then run your regex over each part of the resulting array to extract the data you need. – GottZ Oct 22 '17 at 08:13
  • https://stackoverflow.com/a/1732454/1682509 – Reeno Oct 22 '17 at 08:13
  • well.. maybe you even want to use DOM operations to do this. i don't see a reason as to why you are even trying to mess with html through regex – GottZ Oct 22 '17 at 08:17
  • I wanted to create dynamic PDFs with PDFMake.js, which requires document object definitions in a strict format like the one I specified above. That is why I tried to parse the HTML content with a regex. `split` seems like a good option to try; I had completely forgotten about it. But I believe DOMParser would be a good fit for my task here, so I will try that as well. – Harshit Laddha Oct 22 '17 at 08:32
  • use DOM operations for that. seriously. regex is the wrong way. – GottZ Oct 22 '17 at 08:37
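
The `.split()` idea from the comments above can be sketched roughly as follows. This is only an illustration, not a general HTML parser: it assumes simple `<img ...>` tags with quoted `src` and `class` attributes, and `splitOnImages` is a hypothetical helper name that does not come from the thread.

```javascript
// Sketch of the split-then-extract idea from the comments.
// Assumption: simple, well-formed <img ...> tags with quoted attributes.
function splitOnImages(text) {
  // The capturing group makes split() keep the matched tags in the result.
  const parts = text.split(/(<img\b[^>]*>)/i);
  const result = [];
  for (const part of parts) {
    if (/^<img\b/i.test(part)) {
      // Pull out the two attributes of interest; either may be absent.
      const src = /\bsrc=(['"])(.*?)\1/i.exec(part);
      const cls = /\bclass=(['"])(.*?)\1/i.exec(part);
      result.push({ src: src ? src[2] : null, class: cls ? cls[2] : null });
    } else if (part.trim() !== "") {
      // Keep non-empty text, trimmed of surrounding whitespace.
      result.push(part.trim());
    }
  }
  return result;
}

const sample = "lorem ipsum <img src='some_url1' class='some_class1' /> lorem ipsum <img src='some_url2' class='some_class2' /> ipsum <img src='some_url3' class='some_class3' />";
console.log(splitOnImages(sample));
```

Because the order of the pieces comes from `split()` itself, text and image objects stay interleaved in their original order. Anything beyond this simple tag shape is better handled with DOM operations, as the other comments point out.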

1 Answer


Do not use regular expressions to parse HTML. Use a DOMParser to parse the string, then use CSS queries to get the images from the DOM; this is much more reliable and easier to read.

var html = "lorem ipsum <img src='some_url' class='some_class' /> lorem ipsum <img src='some_url' class='some_class' /> ipsum <img src='some_url' class='some_class' />"

var nodes = new DOMParser().parseFromString(html, "text/html").body.childNodes

That gets you almost what you wanted (you just need to filter out any empty Text nodes).
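
As a rough sketch of that filtering step, the child nodes could be turned into the ordered array from the question along these lines. `nodesToParts` is a hypothetical helper name, the `{src, class}` object shape is an assumption about what an "image object" should hold, and elements other than `<img>` are simply skipped here.

```javascript
// Convert a list of DOM-like nodes into ["text", {src, class}, ...].
// Only nodeType, nodeName, nodeValue and getAttribute() are used.
function nodesToParts(nodes) {
  const TEXT_NODE = 3;    // same value as Node.TEXT_NODE
  const ELEMENT_NODE = 1; // same value as Node.ELEMENT_NODE
  const parts = [];
  for (const node of Array.from(nodes)) {
    if (node.nodeType === TEXT_NODE) {
      const text = node.nodeValue.trim();
      if (text !== "") parts.push(text); // drop whitespace-only text nodes
    } else if (node.nodeType === ELEMENT_NODE && node.nodeName === "IMG") {
      parts.push({
        src: node.getAttribute("src"),
        class: node.getAttribute("class"),
      });
    }
    // Other elements are ignored in this sketch.
  }
  return parts;
}

// Browser-only usage; DOMParser is not available in plain Node.js.
if (typeof DOMParser !== "undefined") {
  const html = "lorem ipsum <img src='some_url' class='some_class' /> ipsum";
  const body = new DOMParser().parseFromString(html, "text/html").body;
  console.log(nodesToParts(body.childNodes));
}
```

Walking `childNodes` in document order is what keeps the text and the images interleaved correctly, with no regex involved.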

Or do something a little more precise, like this, in case the HTML contains more than just images and text:

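var images = new DOMParser().parseFromString(html, "text/html").querySelectorAll("img")
// Pair each image with the node just before it. A flat array (rather than a Map)
// keeps the order and does not collapse entries whose preceding text is identical.
var array = [...images].flatMap(img => [img.previousSibling && img.previousSibling.nodeValue, img])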
Touffy