
I have a bunch of code in a text file on my computer. I'm interested in two different kinds of patterns in the file. They are:

<string>objectiwant1 <string2>objectiwant2</string2></string>

and

<string>objectiwant1 </string>

The first one would return [(objectiwant1, objectiwant2)] (with more tuples if they exist) while the second one would return [(objectiwant1, None)].

I'm trying to create a regular expression and the flawed code I have so far looks something like this:

regularexpression = r'<string>(.*) <string2>(.*)</string2>'

I'm using `re.findall(regularexpression, file)` to return the data, which gives me what I want only if both string and string2 are present. Using:

regularexpression = r'<string>(.*) (<string2>(.*)</string2>)|(</string>)'

returns everything within the larger parentheses, sometimes twice, as opposed to only the data within (.*). The larger parentheses are necessary to separate the statements I want to compare with the OR operator.

I'm wondering whether there is something I could use in place of those parentheses that wouldn't cause re.findall to output data twice and output so much data at once.

I'm also wondering whether there is a way to use regex to output data if a statement is not fulfilled (so if the objectiwant2 doesn't exist, I get to choose what the output is).

Thank you in advance.

agf
eltb
    Possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) –  Nov 11 '15 at 18:41

1 Answer


You want a non-capturing group zero or one times:

>>> import re
>>> regular_expression = r'<string>(.*) (?:<string2>(.*)</string2>)?</string>'
>>> re.findall(regular_expression,
...            "<string>objectiwant1 <string2>objectiwant2</string2></string>")
[('objectiwant1', 'objectiwant2')]
>>> re.findall(regular_expression,
...            "<string>objectiwant1 </string>")
[('objectiwant1', '')]
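
To make the difference concrete, here is a small sketch comparing a plain (capturing) group with the non-capturing `(?:...)` form. `re.findall` returns one tuple entry per capturing group, so plain parentheses around the optional part add an extra entry. The variable names are made up for illustration.

```python
import re

text = "<string>objectiwant1 <string2>objectiwant2</string2></string>"

# Plain parentheses around the optional part create a third capturing group,
# so findall also reports the whole <string2>...</string2> chunk.
capturing = r'<string>(.*) (<string2>(.*)</string2>)?</string>'
with_extra = re.findall(capturing, text)

# (?:...) groups without capturing, so only the two (.*) groups are reported.
non_capturing = r'<string>(.*) (?:<string2>(.*)</string2>)?</string>'
without_extra = re.findall(non_capturing, text)

print(with_extra)
print(without_extra)
```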
agf
  • You probably also want the `(.*)` to be non-greedy, otherwise this won't work when there are multiple tags on the same line. – Paulo Almeida Jul 30 '13 at 01:20
  • @PauloAlmeida Depends on the input, but that's probably a sane default. – agf Jul 30 '13 at 01:21
  • Thank you, this works perfectly. Is it possible to change the '' to output None instead of an empty string, though, like in search()? – eltb Jul 30 '13 at 01:35
  • @eltb Not sure why you need `None`, but you can always do `[tuple(group or None for group in match) for match in re.findall(regular_expression, "<string>objectiwant1 </string>")]` if you want to replace the empty strings. – agf Jul 30 '13 at 03:53
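
The two points raised in the comments (non-greedy matching and getting `None` instead of `''`) can be combined in one sketch. It assumes the same tag layout as in the question. `find_pairs` and the sample `line` are hypothetical names used only for illustration. It relies on `Match.groups()`, which reports `None` for a group that did not take part in the match, whereas `re.findall` reports an empty string.

```python
import re

# Non-greedy (.*?) stops each group at the first place the rest of the
# pattern can match, so several tags on one line are matched separately.
pattern = re.compile(r'<string>(.*?) (?:<string2>(.*?)</string2>)?</string>')

def find_pairs(text):
    # groups() yields None for the optional group when it did not participate.
    return [m.groups() for m in pattern.finditer(text)]

# Two tags on the same line: one with <string2>, one without.
line = ("<string>a <string2>b</string2></string>"
        "<string>c </string>")
print(find_pairs(line))
```

If a different placeholder is wanted, `m.groups(default)` accepts a default value to use in place of `None`.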