I'm trying to match only the xs:element tags that contain minOccurs. As seen below, some of them have both search criteria on one line, while others span multiple lines. Is there a way of selecting them using grep and regular expressions?

<xs:element name="shipto">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="name" type="xs:string"/>
      <xs:element name="address" type="xs:string"/>
      <xs:element name="city" minOccurs="1" type="xs:string"/>
      <xs:element name="country" 
               minOccurs="1" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

The correct output should be as follows:

<xs:element name="city" minOccurs="1" type="xs:string"/>
<xs:element name="country" 
               minOccurs="1" type="xs:string"/>
stuff22

2 Answers

I advise against parsing XML using regex. It is too complicated to match tags with end-tags in a robust way.

There is a command line tool "xpath" using XML::XPath in Perl (Ubuntu package libxml-xpath-perl). Example:

xpath -e '//*[@minOccurs=1]' file.xml

Output

-- NODE --
<xs:element name="city" minOccurs="1" type="xs:string" />
-- NODE --
<xs:element name="country" minOccurs="1" type="xs:string" />
Jonathan Leffler
David Andersson

Assuming well-formed XML (i.e. no unescaped > inside attribute values), you can probably do this:

<xs:element[^>]+?\sminOccurs\s*=[^>]+>

However, I'm not sure this will work with grep, since grep matches individual lines by default, so you may need to write a Perl script or something similar to do it.

(Note: if you somehow have attribute values that contain whitespace followed by minOccurs=, then you'd need to get cleverer. Since this appears to be address data, I'm assuming that's unlikely, and manually removing any false matches that happen to occur isn't going to be a problem.)
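
A minimal sketch of how the pattern above might be run with GNU grep, assuming a build with PCRE support (`-P`). The `-z` flag makes grep treat the whole file as a single record, so `[^>]` and `\s` can run across newlines, and `-o` prints only the matched text. The file name `sample.xsd` and its contents are made up for illustration, reduced from the schema in the question.

```shell
# Hypothetical sample file, reduced from the question's schema.
cat > sample.xsd <<'EOF'
<xs:sequence>
  <xs:element name="name" type="xs:string"/>
  <xs:element name="city" minOccurs="1" type="xs:string"/>
  <xs:element name="country"
           minOccurs="1" type="xs:string"/>
</xs:sequence>
EOF

# -P: Perl-compatible regex
# -z: records are NUL-separated, so the whole file is searched at once
# -o: print only the matched text
# With -z each match is NUL-terminated, so turn NULs into newlines.
matches=$(grep -Pzo '<xs:element[^>]+?\sminOccurs\s*=[^>]+>' sample.xsd | tr '\0' '\n')
printf '%s\n' "$matches"
```

This should print the `city` element and the two-line `country` element, and skip the `name` element. It is still a text match rather than a parse, so the caveats about well-formed input apply.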

Peter Boughton
  • Using grep -P, this worked for the element that was broken up between two lines, but it didn't match the first line, where minOccurs was on the same line as the xs:element. – stuff22 Aug 28 '11 at 18:58
  • Hmmm, not sure what's up with that. Since `-P` uses Perl then all that syntax is definitely valid, and should work. :/ I guess you could try `]+>` but that may be being too restrictive. – Peter Boughton Aug 28 '11 at 19:06
  • Phew! You had me baffled for a while. :) – Peter Boughton Aug 28 '11 at 19:30
  • Peter, how would a regex look if I wanted to match an xs:element that *didn't* have *minOccurs* in it? – stuff22 Aug 29 '11 at 16:40
  • This should do it: `])+>` – Peter Boughton Aug 29 '11 at 21:23
  • Thanks! I also posted the question here. http://stackoverflow.com/questions/7223119/selecting-text-spanning-multiple-lines-using-grep-and-regular-expressions/7223154#7223154 – stuff22 Aug 29 '11 at 23:33
  • Also use the -Pzo flags for grep. More explained here: http://stackoverflow.com/questions/3717772/regex-grep-for-multi-line-search-needed/7167115#7167115 – stuff22 Aug 29 '11 at 23:46
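
One common way to express "an xs:element without minOccurs" is a negative lookahead checked at every character inside the tag. The sketch below is an assumption for illustration and is not necessarily the pattern posted in the comments; it again assumes GNU grep with `-P` support, and the file name `sample_no_min.xsd` is hypothetical.

```shell
# Hypothetical sample file, reduced from the question's schema.
cat > sample_no_min.xsd <<'EOF'
<xs:sequence>
  <xs:element name="name" type="xs:string"/>
  <xs:element name="address" type="xs:string"/>
  <xs:element name="city" minOccurs="1" type="xs:string"/>
  <xs:element name="country"
           minOccurs="1" type="xs:string"/>
</xs:sequence>
EOF

# (?!\sminOccurs\s*=) is tested before each character consumed inside the tag,
# so the match fails as soon as a whitespace-prefixed minOccurs= comes next.
# -z lets the pattern span lines; -o prints only the matches (NUL-terminated).
without_min=$(grep -Pzo '<xs:element(?:(?!\sminOccurs\s*=)[^>])+>' sample_no_min.xsd | tr '\0' '\n')
printf '%s\n' "$without_min"
```

With this input the `name` and `address` elements should be printed, while `city` and the multi-line `country` element are skipped.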