I'm looking for a way to extract the text between the tags in a Python script I'm working on. I've been able to use regex testers to isolate the piece that I want, but the re.search method doesn't work in my code, so I'm limited to using re.sub along with split to get the information I'm after.

I've tried using re.search and it returns an error, so I've been using the re.sub method instead:

 import re
 sub = re.sub('<.*?>', ' ', line)  # replace every tag with a space
 sub = sub.split()  # split the remaining text on whitespace

sample string:

 <CellValue Index="0"><FormattedValue>System Managed Accounts 
 Group</FormattedValue><Value>System Managed Accounts Group</Value> 
 </CellValue>

The code above pulls the data from the right place but doesn't return all of it: it stops at the first space. How can I modify it to get the entire text between the tags?

  • It gave me `['System', 'Managed', 'Accounts', 'Group', 'System', 'Managed', 'Accounts', 'Group']` on my system. That seems right. What is your current output and expected output? – Akaisteph7 Aug 06 '19 at 14:41
  • Why aren't you using a proper XML/HTML parser? – Andrej Kesely Aug 06 '19 at 14:44
  • Have you taken a look at https://stackoverflow.com/questions/1912434/how-do-i-parse-xml-in-python for simple XML parsing solutions? – PaSTE Aug 06 '19 at 14:44
  • I misdiagnosed the issue, my apologies. I'm able to get the items, but they are all separate entries in a list. What I want is each of the unique items in a single string, separated by commas. – amatthew382 Aug 06 '19 at 14:46
  • I'm not using a parser because the XML I'm working with is huge and isn't compliant with the XML standard. I've basically already written a parser, and this is the only part that isn't outputting what I need. – amatthew382 Aug 06 '19 at 14:47

1 Answer

I usually prefer re.findall() to re.match() for this purpose, since it returns every match in the string as a list.

Something you might not realize is that you can use parentheses in a regex to denote a "capturing group". When the pattern contains a single group, re.findall() returns only the text matched inside that group, and everything outside it is left out of the result. Some examples:

import re

sample = '<CellValue Index="0"><FormattedValue>System Managed Accounts Group</FormattedValue><Value>System Managed Accounts Group</Value>  </CellValue>'

insideTags = re.findall(r'<(.*?)>', sample)
# ['CellValue Index="0"', 'FormattedValue', '/FormattedValue', 'Value', '/Value', '/CellValue']

openingTagsOnly = re.findall(r'<([^/]*?)>', sample)
# ['CellValue Index="0"', 'FormattedValue', 'Value']

betweenTags = re.findall(r'<.*?>([^<>]*?)</.*?>', sample)
# ['System Managed Accounts Group', 'System Managed Accounts Group']

If you're parsing HTML/XML you really should be using a module like Beautiful Soup (see why regex cannot parse HTML/XML). But for the very simple example you provided, my last example works: it gets whatever is between the closest pair of opening and closing tags that has no other tags in between.
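As a rough sketch of how the last pattern could be applied to the sample from the question, where the text wraps across a line break: the helper name `text_between_tags` and the whitespace clean-up step are my own assumptions, not part of the original code. The negated character class `[^<>]` also matches newlines, so no extra flag should be needed for this input.

```python
import re

# Same pattern as the betweenTags example above; the constant and
# function names here are hypothetical, chosen only for illustration.
TEXT_BETWEEN_TAGS = re.compile(r'<.*?>([^<>]*?)</.*?>')

def text_between_tags(markup):
    """Return the text of each innermost tag pair, with whitespace collapsed."""
    values = []
    for raw in TEXT_BETWEEN_TAGS.findall(markup):
        cleaned = " ".join(raw.split())  # collapse line breaks and runs of spaces
        if cleaned:                      # skip pairs that hold only whitespace
            values.append(cleaned)
    return values

# The sample string from the question, including its line breaks.
sample = '''<CellValue Index="0"><FormattedValue>System Managed Accounts 
 Group</FormattedValue><Value>System Managed Accounts Group</Value> 
 </CellValue>'''

print(text_between_tags(sample))
```

Each value comes back as one whole string rather than as separate words, which is the difference from the re.sub plus split approach in the question.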

Green Cloak Guy
  • I appreciate the explanation. I'm still new to regex in general and I've never attempted to use beautiful soup. I'll definitely look into that in the future. Can you tell me if there's any flag in regex or method I could use to limit the return string to unique values? – amatthew382 Aug 06 '19 at 15:09
  • @amatthew382 I don't think so, but you could also then filter the returned group to remove duplicates by doing `betweenTags = list(set(betweenTags))`, though that comes at the expense of possibly disrupting the order – Green Cloak Guy Aug 06 '19 at 15:17
  • Thank you! That worked perfectly. Do you have any suggestions for further learning? I'm decent with the basics, but OOP and beyond are still mysterious to me. I'm trying to get into the infosec / development world. – amatthew382 Aug 06 '19 at 15:21
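On the `list(set(betweenTags))` suggestion in the comments above: if the original order matters, one possible alternative is `dict.fromkeys`, since dict keys keep insertion order in Python 3.7 and later. The sketch below also joins the result into one comma-separated string, which is what the asker described wanting. The variable names and the 'Other Group' entry are made up for illustration.

```python
# Example data: the first two entries mirror the findall output in the answer;
# 'Other Group' is an invented value so the ordering is visible.
between_tags = [
    'System Managed Accounts Group',
    'System Managed Accounts Group',
    'Other Group',
]

# dict keys keep insertion order (Python 3.7+), so the first occurrence
# of each value is kept and later duplicates are dropped.
unique_in_order = list(dict.fromkeys(between_tags))

# One comma-separated string of the unique values.
joined = ", ".join(unique_in_order)
print(joined)
```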