Regex for XML document

Question

I am trying to come up with a regex for an XML document, which is essentially a DASH MPD file. The document has an AdaptationSet tag, which in turn can contain multiple Representation tags, as shown below. I need to match every Representation tag whose bandwidth attribute is greater than a specified input value, e.g. 2000000 or 4000000 in the sample below. I came up with the following regex, but it doesn't handle the case where the attributes span multiple lines, as in the Representation with id=1.

RANGE in the regex can take any value from 1 to 9 and can be assumed to be an integer ready to be substituted into the regex. RANGE followed by six digits determines the bandwidth value to match: 1000000, 2000000, 3000000 and so on, for RANGE equal to 1, 2, 3 respectively.

regex:

<[Rr]epresentation.*?[Bb]andwidth="0?[%(RANGE)]\d{6}"[\s\S]*?[Rr]epresentation>

    <AdaptationSet segmentAlignment="true" maxWidth="1280" maxHeight="720" maxFrameRate="24" par="16:9">
     <Representation id="1" 
        mimeType="video/mp4" 
        codecs="avc1.4d401f" 
        width="512" 
        height="288" 
        frameRate="24" 
        sar="1:1" 
        startWithSAP="1" 
        bandwidth="1000000">
        <SegmentTemplate timescale="12288" duration="61440" media="BBB_512_640K_video_$Number$.mp4" startNumber="1" initialization="BBB_512_640K_video_init.mp4" />
      </Representation>
      <Representation id="2" mimeType="video/mp4" codecs="avc1.4d401f" width="512" height="288" frameRate="24" sar="1:1" startWithSAP="1" bandwidth="2000000">
        <SegmentTemplate timescale="12288" duration="61440" media="BBB_512_640K_video_$Number$.mp4" startNumber="1" initialization="BBB_512_640K_video_init.mp4" />
      </Representation>
      <Representation id="3" mimeType="video/mp4" codecs="avc1.4d401f" width="768" height="432" frameRate="24" sar="1:1" startWithSAP="1" bandwidth="4000000">
        <SegmentTemplate timescale="12288" duration="61440" media="BBB_768_1440K_video_$Number$.mp4" startNumber="1" initialization="BBB_768_1440K_video_init.mp4" />
      </Representation>
    </AdaptationSet>

It might be a little easier using a command line XPath tool like the ones mentioned here - https://stackoverflow.com/a/15461774/356887 — xdhmoore, May 02 '20 at 07:03
2000000 is 7 digits. `0?[%(RANGE)]\d{6}` makes little sense if you need to match any 1+ digits. Replace `0?[%(RANGE)]\d{6}` with `\d+`. See https://regex101.com/r/MZXNqN/1 — Wiktor Stribiżew, May 02 '20 at 09:48
@WiktorStribiżew suggested solution doesn't check for the logic required for bandwidth value like i mentioned above, say i wanted to match only Representation having bandwidth higher than 7000000. This solution matches for any number of digits for bandwidth. — Rajan Kalra, May 02 '20 at 11:24
So, all you need is a regex that matches any number greater than 2000000? — Wiktor Stribiżew, May 02 '20 at 11:26
This number is going to be defined by RANGE variable, RANGE can have a value anything b/w 1-9, which inturn will make the match to be made for 1000000 or 2000000 or 3000000 and so on. — Rajan Kalra, May 02 '20 at 11:29
How are you going to provide the range, or so to speak the logic for which `<Representation>` tag to match? — , May 02 '20 at 12:05
@Mandy8055 It will be given by the user, thus will be dynamic. Cannot hard code it. This value will come for each mpd on the fly. — Rajan Kalra, May 02 '20 at 12:31
@xdhmoore wish I had a choice but I am constrained to use regex only. — Rajan Kalra, May 02 '20 at 12:32
It means we need to construct the regex after the user input. How you'll get the input i.e. the input format? — , May 02 '20 at 12:32
@Mandy8055 we can assume the RANGE variable to have an integer value which can directly be used inside regex — Rajan Kalra, May 02 '20 at 13:10
You mean you can’t use XPath or you can’t use command line tools? What language are you using, python? — xdhmoore, May 02 '20 at 20:29
"constrained to use regex only". That is a really strange constraint. Why can't you use an XML library, such as ElementTree? — mzjn, May 03 '20 at 05:47
i am not a python developer. you can select representation tag which has bandwidth 2000000 or 4000000. ()(?<=[24]000000">)[\n\s<]+.+[\n\s<]+.Representation> — Muhammad Numan, May 07 '20 at 08:17
@MuhammadNuman your solution is matching till the end of start tag, however I need to match the complete Representation i.e until the ending Representation tag. . . . — Rajan Kalra, May 07 '20 at 20:01
@MuhammadNuman pls refer https://regex101.com/r/MZXNqN/2. Here Group1 should be matching until the ending tag of this particular Representation i.e — Rajan Kalra, May 07 '20 at 20:11
()(?<=[24]000000">)[\n\s<]+.+?<\/Representation> can you try this now? — Muhammad Numan, May 08 '20 at 06:10
I believe the canonical answer to this question is https://stackoverflow.com/a/1732454/683329 — Jiří Baum, May 11 '20 at 07:37
@MuhammadNuman thanks for your solution, it helped me modify my solution with minimal changes. check it out here https://regex101.com/r/MmUkzc/9. Can you add your solution below, would like to make it accepted answer. — Rajan Kalra, May 12 '20 at 06:17

xdhmoore · Answer 1 · 2020-05-07T18:07:21.757

Update

I would recommend the ElementTree version at the bottom. But here's a regex version as requested:

import re

txt = """
 <AdaptationSet segmentAlignment="true" maxWidth="1280" maxHeight="720" maxFrameRate="24" par="16:9">
     <Representation id="1" 
        mimeType="video/mp4" 
        codecs="avc1.4d401f" 
        width="512" 
        height="288" 
        frameRate="24" 
        sar="1:1" 
        startWithSAP="1" 
        bandwidth="1000000">
        <SegmentTemplate timescale="12288" duration="61440" media="BBB_512_640K_video_$Number$.mp4" startNumber="1" initialization="BBB_512_640K_video_init.mp4" />
      </Representation>
      <Representation id="2" mimeType="video/mp4" codecs="avc1.4d401f" width="512" height="288" frameRate="24" sar="1:1" startWithSAP="1" bandwidth="2000000">
        <SegmentTemplate timescale="12288" duration="61440" media="BBB_512_640K_video_$Number$.mp4" startNumber="1" initialization="BBB_512_640K_video_init.mp4" />
      </Representation>
      <Representation id="3" mimeType="video/mp4" codecs="avc1.4d401f" width="768" height="432" frameRate="24" sar="1:1" startWithSAP="1" bandwidth="4000000">
        <SegmentTemplate timescale="12288" duration="61440" media="BBB_768_1440K_video_$Number$.mp4" startNumber="1" initialization="BBB_768_1440K_video_init.mp4" />
      </Representation>
    </AdaptationSet>
"""

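threshold = 2000000  # user-supplied minimum; named to avoid shadowing the built-in input()

# Chunk out each Representation element together with its children.
reps = re.findall(r'<\s*representation(?:\s*\w+="[^"]*")*\s*>.*?<\/\s*representation\s*>',
    txt, flags=re.IGNORECASE | re.DOTALL)

# Read each bandwidth as an integer and keep the elements above the threshold.
for rep in reps:
    bandwidth = int(re.search(r'bandwidth="([^"]*)"', rep, flags=re.IGNORECASE).group(1))
    if bandwidth > threshold:
        print(rep)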

I think it's easier to do it in a couple of steps:

1. Chunk out the Representation elements one by one. The regex above does that, but you could probably replace the attribute-matching part (the part in the non-capturing group, (?:\s*\w+="[^"]*")*\s*>) with something simpler like [^>]*?>, since you just need the whole Representation element and its children. To break down the full regex:
   - <\s* - matches < followed by zero or more whitespace characters
   - representation - matches representation, obviously. The IGNORECASE flag makes sure this matches case variations
   - (?:\s*\w+="[^"]*")* - matches zero or more attributes of the form blah_blah="value123", including whitespace around them. The (?: means it's a non-capturing group, so it isn't available via the Python group() method afterwards. It's just there for the sake of repetition, i.e. zero or more attributes, or (?:...)*. Again, since you don't need the attribute matching here, this could be simplified to something like [^>]*?>, but it works for me.
   - \s*> - optional whitespace followed by >
   - .*? - a bunch of content inside the element (including newlines due to the DOTALL flag), but with non-greedy matching so we stop at the first close tag we encounter and don't match a later one.
   - <\/\s*representation\s*> - close tag, with optional whitespace
2. Once we have each "representation" element, we can pull out the bandwidth into a first-class Python integer to make it easy to compare to the input.
3. Filter based on the value of bandwidth.
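The simplification mentioned in the first step might look roughly like the sketch below. The shortened sample, the names REP_PATTERN and BW_PATTERN, and the helper representations_above are illustrative assumptions rather than part of the code above. The sketch assumes no attribute value contains a > character and that no Representation element is self-closing.

```python
import re

# Shortened stand-in for the MPD snippet in the question (assumed sample data).
SAMPLE = """
<AdaptationSet>
  <Representation id="1"
      bandwidth="1000000">
    <SegmentTemplate media="a.mp4" />
  </Representation>
  <Representation id="2" bandwidth="2000000">
    <SegmentTemplate media="b.mp4" />
  </Representation>
  <Representation id="3" bandwidth="4000000">
    <SegmentTemplate media="c.mp4" />
  </Representation>
</AdaptationSet>
"""

# [^>]* replaces the attribute-by-attribute group; a negated class also
# matches newlines, so attributes spread over several lines are covered.
# \b stops the name from matching a longer tag name.
REP_PATTERN = re.compile(
    r'<\s*representation\b[^>]*>.*?</\s*representation\s*>',
    flags=re.IGNORECASE | re.DOTALL,
)
BW_PATTERN = re.compile(r'\bbandwidth="(\d+)"', flags=re.IGNORECASE)


def representations_above(xml_text, threshold):
    """Return the Representation chunks whose bandwidth exceeds threshold."""
    selected = []
    for rep in REP_PATTERN.findall(xml_text):
        found = BW_PATTERN.search(rep)
        if found and int(found.group(1)) > threshold:
            selected.append(rep)
    return selected


matches = representations_above(SAMPLE, 2000000)
print(len(matches))
```

With a strict greater-than comparison and a threshold of 2000000, only the element with id="3" is selected from this sample.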

I think it's easier to pull out the bandwidth into an integer and compare it with the input than to try to work out an integer comparison within the regex itself.

Also note that if there are no (or more than 1) instances of the bandwidth attribute, the code doesn't handle that. There are probably other brittle aspects...
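One way the missing or repeated attribute case could be handled is sketched here. The helper bandwidth_of and both pattern names are hypothetical. The sketch looks only at the opening tag, so a bandwidth attribute on a child element is not picked up by mistake, and it returns None instead of raising when the value is absent, repeated, or not an integer.

```python
import re

# Opening tag of the chunk: everything from the first < up to the first >.
OPEN_TAG = re.compile(r'<[^>]*>')
BW_ATTR = re.compile(r'\bbandwidth\s*=\s*"([^"]*)"', flags=re.IGNORECASE)


def bandwidth_of(rep):
    """Return the bandwidth of one Representation chunk, or None if unusable."""
    opening = OPEN_TAG.match(rep.lstrip())
    if opening is None:
        return None
    values = BW_ATTR.findall(opening.group(0))
    if len(values) != 1:
        # Attribute missing or present more than once.
        return None
    try:
        return int(values[0])
    except ValueError:
        # Attribute present but not an integer.
        return None


print(bandwidth_of('<Representation id="2" bandwidth="2000000"><X/></Representation>'))
```

The filtering loop can then skip any chunk for which the helper returns None rather than crashing on it.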

And here's the version using ElementTree. The reason this is in general better is that you aren't depending on your own ability to parse out the particulars of all possible combinations of XML grammar. Using a library means they've already thought out all that stuff and all you have to match are small pieces like the names of elements and attributes, so the code is less likely to break. But maybe this is a homework question or something...

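import xml.etree.ElementTree as ET

threshold = 4000  # minimum bandwidth; named to avoid shadowing the built-in input()
tree = ET.parse('content.xml')
root = tree.getroot()
# iter() finds Representation elements at any depth, not only direct children of the root
nodes = [n for n in root.iter('Representation') if int(n.attrib['bandwidth']) >= threshold]
print(nodes)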
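A complete MPD file normally declares the DASH namespace (urn:mpeg:dash:schema:mpd:2011) on its root element, in which case a plain 'Representation' lookup finds nothing. The sketch below is one possible way around that: it compares local tag names only. The inline MPD_TEXT document and the helpers local_name and representations_above are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Minimal namespaced MPD used only for this illustration (assumed data).
MPD_TEXT = """<MPD xmlns="urn:mpeg:dash:schema:mpd:2011">
  <Period>
    <AdaptationSet>
      <Representation id="1" bandwidth="1000000"/>
      <Representation id="2" bandwidth="2000000"/>
      <Representation id="3" bandwidth="4000000"/>
    </AdaptationSet>
  </Period>
</MPD>
"""


def local_name(tag):
    """Strip a leading '{namespace}' from an ElementTree tag name."""
    return tag.rsplit('}', 1)[-1]


def representations_above(root, threshold):
    """Return Representation elements whose bandwidth exceeds threshold."""
    return [
        el for el in root.iter()
        if local_name(el.tag) == 'Representation'
        # A missing bandwidth attribute is treated as 0 here.
        and int(el.get('bandwidth', 0)) > threshold
    ]


root = ET.fromstring(MPD_TEXT)
ids = [el.get('id') for el in representations_above(root, 2000000)]
print(ids)
```

Comparing local names keeps the lookup independent of whether the document declares a namespace at all.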

score 0 · Answer 2 · edited May 11 '20 at 07:36

Try this more robust RegEx

Input:
range 1 - 9

Output:
bw[0] contains whole open to close element
bw[2] contains bandwidth

>>> import re
>>>
>>> range = "2"
>>>
>>> regx = r"(?s)(<[Rr]epresentation(?=\s)(?=(?:[^>\"']|\"[^\"]*\"|'[^']*')*?(?<=\s)[Bb]andwidth\s*=\s*(?:(['\"])\s*0*([" + \
...        range + \
...        r"-9]\d{6}|[1-9]\d{7,17})\s*\2))(?=(\s+(?:\".*?\"|'.*?'|[^>]*?)+>))\4(?<!/>).*?</[Rr]epresentation\s*>)"
>>>
>>> txt = """
...  <AdaptationSet segmentAlignment="true" maxWidth="1280" maxHeight="720" maxFrameRate="24" par="16:9">
...      <Representation id="1"
...         mimeType="video/mp4"
...         codecs="avc1.4d401f"
...         width="512"
...         height="288"
...         frameRate="24"
...         sar="1:1"
...         startWithSAP="1"
...         bandwidth="1000000">
...         <SegmentTemplate timescale="12288" duration="61440" media="BBB_512_640K_video_$Number$.mp4" startNumber="1" initialization="BBB_512_640K_video_init.mp4" />
...       </Representation>
...       <Representation id="2" mimeType="video/mp4" codecs="avc1.4d401f" width="512" height="288" frameRate="24" sar="1:1" startWithSAP="1" bandwidth="2000000">
...         <SegmentTemplate timescale="12288" duration="61440" media="BBB_512_640K_video_$Number$.mp4" startNumber="1" initialization="BBB_512_640K_video_init.mp4" />
...       </Representation>
...       <Representation id="3" mimeType="video/mp4" codecs="avc1.4d401f" width="768" height="432" frameRate="24" sar="1:1" startWithSAP="1" bandwidth="4000000">
...         <SegmentTemplate timescale="12288" duration="61440" media="BBB_768_1440K_video_$Number$.mp4" startNumber="1" initialization="BBB_768_1440K_video_init.mp4" />
...       </Representation>
...     </AdaptationSet>
... """
>>>
>>> bands = re.findall( regx, txt )
>>> for bw in bands:
...     print ( bw[2] + " : " )
...     print ( bw[0] )
...     print ( "" )
...
2000000 :
<Representation id="2" mimeType="video/mp4" codecs="avc1.4d401f" width="512" height="288" frameRate="24" sar="1:1" startWithSAP="1" bandwidth="2000000">
        <SegmentTemplate timescale="12288" duration="61440" media="BBB_512_640K_video_$Number$.mp4" startNumber="1" initialization="BBB_512_640K_video_init.mp4" />
      </Representation>

4000000 :
<Representation id="3" mimeType="video/mp4" codecs="avc1.4d401f" width="768" height="432" frameRate="24" sar="1:1" startWithSAP="1" bandwidth="4000000">
        <SegmentTemplate timescale="12288" duration="61440" media="BBB_768_1440K_video_$Number$.mp4" startNumber="1" initialization="BBB_768_1440K_video_init.mp4" />
      </Representation>

>>>
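The numeric part of the pattern above can be looked at in isolation. The sketch below rebuilds just that alternation for a given leading digit and checks a few values. The helper name at_least_millions is hypothetical. Note that the comparison it encodes is "greater than or equal to", which is why 2000000 appears in the output above for a range of 2.

```python
import re


def at_least_millions(first_digit):
    """Build a pattern for integers >= first_digit * 1,000,000 (first_digit 1-9)."""
    if not 1 <= first_digit <= 9:
        raise ValueError("first_digit must be between 1 and 9")
    # Either a 7-digit number whose first digit is first_digit or higher,
    # or any number with 8 to 18 digits; leading zeros are tolerated.
    return re.compile(r'0*(?:[%d-9]\d{6}|[1-9]\d{7,17})' % first_digit)


pattern = at_least_millions(2)
for value in ("999999", "1000000", "2000000", "4000000", "10000000"):
    print(value, bool(pattern.fullmatch(value)))
```

Comparing digit strings this way only works because the threshold is a single digit followed by six zeros; an arbitrary threshold would need a longer alternation or an integer comparison in code.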

this solution looks to be working with a quick look, let me try out some more tests. — Rajan Kalra, May 07 '20 at 20:15
Thanks for your solution Edward, it worked for me. However on looking it closely, I found that the accepted answer uses less no of steps. Check it out below: your solution: https://regex101.com/r/MmUkzc/11 accepted solution: https://regex101.com/r/MmUkzc/10 — Rajan Kalra, May 12 '20 at 06:26
I wont use accepted, is that what you ask ? https://regex101.com/r/6aR8oM/1 — , May 12 '20 at 18:11

score 0 · Answer 3 · answered May 10 '20 at 07:10

This builds on your code and Edward's, but I don't recommend parsing XML directly with regex.

n = '4'
reg = r'<[Rr]epresentation.*?[Bb]andwidth="([' + n + r'-9]\d{6}|\d{8})[\d]*"[\s\S]*?</[Rr]epresentation>'

Here is an example using SimplifiedDoc.

from simplified_scrapy import SimplifiedDoc
html = '''Your xml'''
doc = SimplifiedDoc(html)
n = '4'
Representations = doc.selects('Representation|representation').containsReg(r'([' + n + r'-9]\d{6}|\d{8})[\d]*', attr='bandwidth')
print(Representations)

Result:

[{'id': '3', 'mimeType': 'video/mp4', 'codecs': 'avc1.4d401f', 'width': '768', 'height': '432', 'frameRate': '24', 'sar': '1:1', 'startWithSAP': '1', 'bandwidth': '4000000', 'tag': 'Representation', 'html': '\n        <SegmentTemplate timescale="12288" duration="61440" media="BBB_768_1440K_video_$Number$.mp4" startNumber="1" initialization="BBB_768_1440K_video_init.mp4" />\n    '}]

score 0 · Accepted Answer · answered May 12 '20 at 06:20

You can use this regex:

<[Rr]epresentation[^>]*?[Bb]andwidth="0?[2-9]\d{6}"[\s\S]*?[Rr]epresentation>

https://regex101.com/r/MmUkzc/9
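As a rough illustration of how this pattern could be built from the RANGE value and applied in Python, here is a sketch. The function build_pattern and the shortened sample are assumptions. One limitation is visible from the pattern itself: it expects exactly seven digits (optionally preceded by a single 0), so a bandwidth of 10000000 or more is not matched as written.

```python
import re

# Shortened stand-in for the MPD snippet in the question (assumed sample data).
SAMPLE = """
<AdaptationSet>
  <Representation id="1"
      bandwidth="1000000">
    <SegmentTemplate media="a.mp4" />
  </Representation>
  <Representation id="2" bandwidth="2000000">
    <SegmentTemplate media="b.mp4" />
  </Representation>
  <Representation id="3" bandwidth="4000000">
    <SegmentTemplate media="c.mp4" />
  </Representation>
</AdaptationSet>
"""


def build_pattern(range_digit):
    """Substitute a RANGE value (1-9) into the regex from this answer."""
    return re.compile(
        r'<[Rr]epresentation[^>]*?[Bb]andwidth="0?[%d-9]\d{6}"'
        r'[\s\S]*?[Rr]epresentation>' % range_digit
    )


matches = build_pattern(2).findall(SAMPLE)
print(len(matches))
```

Because [^>]*? cannot cross the end of the opening tag, an element whose own bandwidth is too low is not accidentally paired with a later element's attribute.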

Muhammad Numan
