Description

I'm attempting to extract URLs and/or CDATA from XML. My current solution works, but it only returns the first element. How do I return multiple elements with this specific regex?

The XML is in the form of:

<MediaFile>
https://some_url.com/file.mp4
</MediaFile>
<MediaFile>
https://some_url2.com/file.mp4
</MediaFile>

and

<MediaFile>
<!CDATA some data here with spaces sometimes>
</MediaFile>
...etc

What I'm trying to achieve

In my example, there are 3 MediaFile tags, and I'm trying to extract the content of each one: 2 URLs and 1 CDATA section. The final output should look something like

1st url https://example1.com/file.mp4
2nd url https://example2.com/file.mp4
3rd url <!CDATA some data example>

What I've tried:

link to regex101

const data = `<MediaFile delivery="progressive" width="640" height="360" type="video/mp4" bitrate="397" scalable="false" maintainAspectRatio="false">https://example1.com/file.mp4</MediaFile><MediaFile delivery="progressive" width="1024" height="576" type="video/mp4" bitrate="1280" scalable="false" maintainAspectRatio="false">https://example2.com/file.mp4</MediaFile><MediaFile delivery="progressive" width="1024" height="576" type="video/mp4" bitrate="1280" scalable="false" maintainAspectRatio="false"><!CDATA some data example></MediaFile>`;

const regex = /<MediaFile[^>]*type="video\/mp4"[^>]*>([\s\S]*?)<\/MediaFile>/gm;

const res = regex.exec(data);

console.log('1st url', res[1]);
console.log('2nd url', res[2]);
console.log('3rd url', res[3]);
kemicofa ghost
  • Possible duplicate of [How can I match multiple occurrences with a regex in JavaScript similar to PHP's preg\_match\_all()?](https://stackoverflow.com/questions/520611/how-can-i-match-multiple-occurrences-with-a-regex-in-javascript-similar-to-phps) – MonkeyZeus Sep 20 '19 at 13:54
  • It is not possible to reliably parse XML with a regular expression. It's the wrong tool for this job. Why not use an XML parser and save yourself a headache? – spender Sep 20 '19 at 13:57
  • @spender An XML parser doesn't work for that specific kind of XML. As these are external XMLs, I have no control over what kind of XML I'll get. – kemicofa ghost Sep 20 '19 at 14:30
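
For reference, the attempt in the question fails because `exec` returns a single match per call: `res[1]` is the first capture group of that match, and `res[2]` and `res[3]` would be further capture groups, which this pattern does not have. Below is a minimal sketch of collecting every match with the same regex. It assumes an environment that supports `String.prototype.matchAll` (ES2020), and it shortens the `MediaFile` attributes for readability.

```javascript
// Shortened sample input (assumption: attributes trimmed down from the question's data).
const data = `<MediaFile type="video/mp4">https://example1.com/file.mp4</MediaFile><MediaFile type="video/mp4">https://example2.com/file.mp4</MediaFile><MediaFile type="video/mp4"><!CDATA some data example></MediaFile>`;

// Same pattern as in the question; matchAll requires the g flag.
const regex = /<MediaFile[^>]*type="video\/mp4"[^>]*>([\s\S]*?)<\/MediaFile>/g;

// matchAll yields one match object per occurrence; index 1 is the capture group.
const contents = Array.from(data.matchAll(regex), (m) => m[1].trim());

contents.forEach((content, i) => console.log(`match ${i + 1}:`, content));
```

In older environments, the same result can be obtained by calling `regex.exec(data)` in a `while` loop until it returns `null`, since a global regex resumes from its `lastIndex` on each call.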

2 Answers

You can try to parse it.

    const data = `<MediaFile delivery="progressive" width="640" height="360" type="video/mp4" bitrate="397" scalable="false" maintainAspectRatio="false">https://example1.com/file.mp4</MediaFile><MediaFile delivery="progressive" width="1024" height="576" type="video/mp4" bitrate="1280" scalable="false" maintainAspectRatio="false">https://example2.com/file.mp4</MediaFile><MediaFile delivery="progressive" width="1024" height="576" type="video/mp4" bitrate="1280" scalable="false" maintainAspectRatio="false"><!CDATA some data example></MediaFile>`;
    
    const parser = new DOMParser();
    const xmlDoc = parser.parseFromString(data,"text/html");
    
    console.log(xmlDoc.getElementsByTagName("MediaFile")[0].innerHTML);
    console.log(xmlDoc.getElementsByTagName("MediaFile")[1].innerHTML);
    console.log(xmlDoc.getElementsByTagName("MediaFile")[2].innerHTML);
Roman Panevnyk

It is probably better not to use regular expressions here, and to parse the data with document.querySelectorAll() instead:

const data = `<MediaFile delivery="progressive" width="640" height="360" type="video/mp4" bitrate="397" scalable="false" maintainAspectRatio="false">https://example1.com/file.mp4</MediaFile><MediaFile delivery="progressive" width="1024" height="576" type="video/mp4" bitrate="1280" scalable="false" maintainAspectRatio="false">https://example2.com/file.mp4</MediaFile><MediaFile delivery="progressive" width="1024" height="576" type="video/mp4" bitrate="1280" scalable="false" maintainAspectRatio="false"><!CDATA some data example></MediaFile>`;

var o=document.createElement('div');o.innerHTML=data.replace(/<!CDATA/g,'!CDATA');
var arr=Array.from(o.querySelectorAll('MediaFile'))
             .map(el=>el.innerHTML.replace('!CDATA','<!CDATA')
                                  .replace('&gt;','>'))

console.log(arr.join('\n'));

With a little extra effort, you can mask the <!CDATA ... > sections with a replace() before creating the DOM element, then restore them afterwards by applying .replace('!CDATA','<!CDATA').replace('&gt;','>') to the .innerHTML strings of the MediaFile elements.
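
The mask and restore steps can also be written as two small string helpers, which makes them easy to check without a DOM. This is only a sketch: the names `maskCdata` and `unmaskCdata` are hypothetical, and it assumes each `<!CDATA ... >` section is the entire content of its `MediaFile` element.

```javascript
// Hypothetical helper: drop the leading "<" so the HTML parser
// treats the section as plain text instead of markup.
function maskCdata(xml) {
  return xml.replace(/<!CDATA/g, '!CDATA');
}

// Hypothetical helper: restore a masked section read back from innerHTML.
function unmaskCdata(inner) {
  // Strings that were never masked (such as URLs) pass through unchanged.
  if (!inner.startsWith('!CDATA')) return inner;
  // innerHTML serializes the closing ">" as "&gt;", so restore it at the end.
  return '<' + inner.replace(/&gt;$/, '>');
}

console.log(maskCdata('<MediaFile><!CDATA some data example></MediaFile>'));
console.log(unmaskCdata('!CDATA some data example&gt;'));
console.log(unmaskCdata('https://example1.com/file.mp4'));
```

Restricting the restore step to strings that start with `!CDATA` avoids rewriting a legitimate `&gt;` that might appear elsewhere, for example inside a URL.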

Carsten Massmann