
I'm looking to parse a bunch of link tags and output two specific parts.

<a href='/mysite/test/sample2/_layouts/ListEdit.aspx?List={2A1D7816-6AC1-4B3B-B9E9-9EEF1B31F812}' onclick='GoToLink(this);return false;'>Customize &quot;Sample List&quot;</a>

I need to capture the guid '2A1D7816-6AC1-4B3B-B9E9-9EEF1B31F812' and part of the tag content, in this case 'Sample List'.

I can use the following to capture each of them in separate lists:

For guid: [a-fA-F0-9]{8}-([a-fA-F0-9]{4}-){3}[a-fA-F0-9]{12}
For tag content: (?<=Customize &quot;)((.*)(?=&quot;))

However, I can't seem to combine these so that the tag content and the guid are guaranteed to come from the same node.

Any help would be appreciated.

2 Answers


Description

<a\b[^>]*List=[{]([a-fA-F0-9]{8}-(?:[a-fA-F0-9]{4}-){3}[a-fA-F0-9]{12})[}][^>]*>[^<]*((?:\bCustomize\b\s(["']|&quot;))(.*)\3)[^<]*</a>

This ensures the guid is taken from the List={...} value inside the opening tag, and that the anchor text of the same tag contains your quoted value. The regex assumes the sample list name is surrounded by the same opening and closing quote (", ' or &quot;).


Groups for sample string

Sample

Groups

0: <a href='/mysite/test/sample2/_layouts/ListEdit.aspx?List={2A1D7816-6AC1-4B3B-B9E9-9EEF1B31F812}' onclick='GoToLink(this);return false;'>Customize &quot;Sample List&quot;</a>
1: (2A1D7816-6AC1-4B3B-B9E9-9EEF1B31F812)
2: (Customize &quot;Sample List&quot;)
3: (&quot;)
4: (Sample List)
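As a rough sketch of how these groups could be read in code, here is the same pattern applied with Python's re module. The choice of Python and the helper name extract_pairs are assumptions for illustration only; group 1 is the guid and group 4 is the quoted name, so each tuple comes from a single anchor tag.

```python
import re

# The pattern from above, copied unchanged and split over two raw strings.
LINK_RE = re.compile(
    r'''<a\b[^>]*List=[{]([a-fA-F0-9]{8}-(?:[a-fA-F0-9]{4}-){3}[a-fA-F0-9]{12})[}]'''
    r'''[^>]*>[^<]*((?:\bCustomize\b\s(["']|&quot;))(.*)\3)[^<]*</a>'''
)

sample = (
    "<a href='/mysite/test/sample2/_layouts/ListEdit.aspx"
    "?List={2A1D7816-6AC1-4B3B-B9E9-9EEF1B31F812}' "
    "onclick='GoToLink(this);return false;'>"
    "Customize &quot;Sample List&quot;</a>"
)


def extract_pairs(html):
    """Return one (guid, name) tuple per matching anchor tag (hypothetical helper)."""
    return [(m.group(1), m.group(4)) for m in LINK_RE.finditer(html)]


print(extract_pairs(sample))
```

Because both values are captured by one match object, they cannot come from different links.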

Disclaimer

There are some edge cases this will not handle, but provided your input text is similar to your sample you should be fine. If it is not, you should really use an HTML parser instead.
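For comparison, here is a minimal sketch of the parser route, assuming Python 3 and its standard-library html.parser module. The class name ListLinkParser and the two small patterns are hypothetical and only illustrate the idea: read the guid from the href attribute, collect the text of the same anchor, and pair them when the tag closes.

```python
import re
from html.parser import HTMLParser

# Hypothetical helper patterns: guid inside List={...}, and text inside double quotes.
GUID_RE = re.compile(r'List=\{([0-9a-fA-F]{8}-(?:[0-9a-fA-F]{4}-){3}[0-9a-fA-F]{12})\}')
QUOTED_RE = re.compile(r'"([^"]*)"')


class ListLinkParser(HTMLParser):
    """Collect (guid, quoted name) pairs, one per <a> tag that has both."""

    def __init__(self):
        # convert_charrefs=True decodes &quot; to " before handle_data is called.
        super().__init__(convert_charrefs=True)
        self.pairs = []
        self._guid = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href') or ''
            match = GUID_RE.search(href)
            self._guid = match.group(1) if match else None
            self._text = []

    def handle_data(self, data):
        # Only keep text while inside an anchor that carried a guid.
        if self._guid is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._guid is not None:
            match = QUOTED_RE.search(''.join(self._text))
            if match:
                self.pairs.append((self._guid, match.group(1)))
            self._guid = None


sample = (
    "<a href='/mysite/test/sample2/_layouts/ListEdit.aspx"
    "?List={2A1D7816-6AC1-4B3B-B9E9-9EEF1B31F812}' "
    "onclick='GoToLink(this);return false;'>"
    "Customize &quot;Sample List&quot;</a>"
)

parser = ListLinkParser()
parser.feed(sample)
parser.close()
print(parser.pairs)
```

The parser deals with attribute quoting and entity decoding, so the remaining patterns only have to describe the guid and the quoted name.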

Ro Yo Mi

I don't know Perl, so I can't write this script in Perl right now. This is in Python, and it should be pretty straightforward to see what happens here. If you know Perl, I'm positive you can translate this script to Perl. I hope you appreciate the effort.

This script first searches for all the links, then for each link it searches for the guid and part of the tag content.

import re

sample_str = """
<a href='/mysite/test/sample2/_layouts/ListEdit.aspx?List={2A1D7816-6AC1-4B3B-B9E9-9EEF1B31F812}' onclick='GoToLink(this);return false;'>Customize &quot;Sample List&quot;</a>
bla bla
<a href='/mysite/test/sample2/_layouts/ListEdit.aspx?List={21M31F46-937B-88B3-U7Z1-99DFJZ9N249A}' onclick='GoToLink(this);return false;'>Another &quot;This is it&quot;</a>
"""

# Find each complete anchor tag first, then pull both values out of that same tag.
links = re.findall(r'<a .*?</a>', sample_str)

for link in links:
    print('link:')
    print('    ' + link)
    print('list:')
    print('    ' + re.search(r'List=\{([^}]*)\}', link).group(1))
    print('quoted text:')
    print('    ' + re.search(r'>[^<]*&quot;([^<]+)&quot;[^<]*</a>', link).group(1))
    print('')

The output for this script will be:

link:
    <a href='/mysite/test/sample2/_layouts/ListEdit.aspx?List={2A1D7816-6AC1-4B3B-B9E9-9EEF1B31F812}' onclick='GoToLink(this);return false;'>Customize &quot;Sample List&quot;</a>
list:
    2A1D7816-6AC1-4B3B-B9E9-9EEF1B31F812
quoted text:
    Sample List

link:
    <a href='/mysite/test/sample2/_layouts/ListEdit.aspx?List={21M31F46-937B-88B3-U7Z1-99DFJZ9N249A}' onclick='GoToLink(this);return false;'>Another &quot;This is it&quot;</a>
list:
    21M31F46-937B-88B3-U7Z1-99DFJZ9N249A
quoted text:
    This is it

If you have Python installed, you can run the script from the command line with python scriptname.py.
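One thing to watch with the per-link approach: re.search returns None when a link has no List={...} part or no quoted text, and calling .group(1) on None raises AttributeError. A hedged sketch of a variant that skips such links and returns the pairs instead of printing them follows. The function name pairs_from_links and the second link in the sample are made up for illustration.

```python
import re

LINK_RE = re.compile(r'<a .*?</a>')
LIST_RE = re.compile(r'List=\{([^}]*)\}')
TEXT_RE = re.compile(r'>[^<]*&quot;([^<]+)&quot;[^<]*</a>')


def pairs_from_links(text):
    """Return (list id, quoted text) tuples, skipping links that lack either part."""
    pairs = []
    for link in LINK_RE.findall(text):
        list_match = LIST_RE.search(link)
        text_match = TEXT_RE.search(link)
        # Both searches run on the same link, so the values always belong together.
        if list_match and text_match:
            pairs.append((list_match.group(1), text_match.group(1)))
    return pairs


sample_str = """
<a href='/mysite/test/sample2/_layouts/ListEdit.aspx?List={2A1D7816-6AC1-4B3B-B9E9-9EEF1B31F812}' onclick='GoToLink(this);return false;'>Customize &quot;Sample List&quot;</a>
<a href='/other'>No list here</a>
"""

print(pairs_from_links(sample_str))
```

The second link is dropped silently here; logging or raising instead may be preferable if every link is expected to match.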

gitaarik