1

I have an XML document where I wish to extract certain text contained in specific tags such as-

<title>Four-minute warning</title>
<categories>
<category>Nuclear warfare</category>
<category>Cold War</category>
<category>Cold War military history of the United Kingdom</category>
<category>disaster preparedness in the United Kingdom</category>
<category>History of the United Kingdom</category>
</categories>

<bdy>
some text
</bdy>

In this toy example, if I want to extract all the text contained in tags by using the following Regular Expression code in Python 3-

# Python 3 code using RE-
file = open("some_xml_file.xml", "r")
xml_doc = file.read()
file.close()

title_text = re.findall(r'<title>.+</title>', xml_doc)

if title_text:
    print("\nMatches found!\n")
    for title in title_text:
        print(title)
else:
    print("\nNo matches found!\n\n")

It gives me the text within the XML tags ALONG with the tags. An example of a single output would be-

<title>Four-minute warning</title>

My question is, how should I frame the pattern within the re.findall() or re.search() methods so that and tags are skipped and all I get is the text between them.

Thanks for your help!

Arun
  • 2,222
  • 7
  • 43
  • 78
  • 1
    [Don't use regex to parse XML.](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – 3limin4t0r Nov 09 '18 at 22:12
  • I think I am forced to use regex to parse the XML file as the XML file contains more than a single root node/element (document root). As a result, ElementTree throws and error. – Arun Nov 09 '18 at 23:27
  • You could read the file as sting and wrap the contents into a root tag. `valid_xml = f'{xml_file_contents}'`. Then use the result as input for ElementTree. – 3limin4t0r Nov 09 '18 at 23:51
  • @Arun, Johan is telling you not to use regular expressions to parse XML because XML is not a regular language. You can assume your language is regular (and you'll get a valid regexp) only in case you never process any `` tag inside a pair of `<title>...` tags, which **is permitted by XML**. On other side, XML syntax is too complex to get a simple regexp to isolate all cases of possible `` tags (e.g. `<title xmlns:blabla="...">`) – Luis Colorado Nov 15 '18 at 10:06

1 Answers1

1

Just use a capture group in your regex (re.findall() takes care of the rest in this case). For example:

import re

s = '<title>Four-minute warning</title>'

title_text = re.findall(r'<title>(.+)</title>', s)

print(title_text[0])
# OUTPUT
# Four-minute warning
benvc
  • 14,448
  • 4
  • 33
  • 54
  • 2
    @mypetlion you are right to comment for OP's benefit or future readers that regex is often not the best tool for parsing XML unless you have a fairly complete knowledge of how the input XML is constructed. Otherwise, check out [ElementTree](https://docs.python.org/3/library/xml.etree.elementtree.html) or something similar. – benvc Nov 09 '18 at 22:02