
I want to use a regular expression to get the airline code between the <AirlineCode> and </AirlineCode> tags.

I only want the values of the <AirlineCode> tags that are within <Flight> tags. There are more <AirlineCode> tags outside them, and I don't want the airline values from those.

I tried the regex below, but it returns all airline codes regardless of where they appear. Please help.

        var regex = new Regex(@"<AirlineCode>(.*?)</AirlineCode>", RegexOptions.IgnoreCase);

        Match m = regex.Match("<PNRViewRS><AirGroup><Flight CnxxIndicator=\"N\"><Arrival></Arrival><Carrier><AirlineCode>DL</AirlineCode></Carrier></Flight><Flight CnxxIndicator=\"N\"><Arrival></Arrival><Carrier><AirlineCode>AA</AirlineCode></Carrier></Flight></AirGroup></PNRViewRS>");
        int matchCount = 0;
        while (m.Success)
        {
            Console.WriteLine("Match" + (++matchCount));
            for (int i = 1; i <= 2; i++)
            {
                Group g = m.Groups[i];
                //do stuff...
            }
            m = m.NextMatch();
        }
Laguna
  • Why not use `XDocument`? – Hossein Narimani Rad Apr 17 '13 at 18:29
  • Any reason you can't use an XML parser? LINQ to XML (`XDocument`), for instance? – Oded Apr 17 '13 at 18:29
  • It would be easier with XDocument and XPath, but unfortunately that is out of the question due to circumstances. – Laguna Apr 17 '13 at 18:30
  • Only regular expressions are allowed :( – Laguna Apr 17 '13 at 18:30
  • That is a very strange limitation to be put under. Can you explain it further? Why must it be regex? – Oded Apr 17 '13 at 18:32
  • One of the most-voted answers on this site is http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags. Take a look and rethink your limitations. – Steve Apr 17 '13 at 18:33
  • This is essentially not a valid question for SO. Only code that is commonly acceptable (i.e. without SQL injection) should be posted as an answer. You are asking for something that is really bad practice and hard, or theoretically impossible, to do properly. Please try to find bad solutions yourself. – Alexei Levenkov Apr 17 '13 at 18:33
  • @AlexeiLevenkov I think it's a valid question: it's something lots of people keep trying to do, and having an answer somewhere which explains *why* it's a bad idea would be useful. Unfortunately, none of the answers to the already linked potential duplicate (*particularly* the first one) actually explain why it can't be done. There are a number of answers saying "you can't do it" without a reason, and lots of attempted regexes that clever people have picked holes in, but nothing which actually explains, fundamentally, what the problem is. – Philip Kendall Apr 17 '13 at 18:42
  • @PhilipKendall *The* duplicate question ("RegEx match open tags except XHTML...") has several real answers, beyond the funny one, that go into detail about why it is hard. So if one is really interested, it is a good read with a lot of links. – Alexei Levenkov Apr 17 '13 at 18:48
  • You know, if enough experts say "don't do it", then that really _is_ the answer. – John Saunders Apr 17 '13 at 18:49
  • You're going to need to match across multiple lines using a single regex; this is typically not the default behavior of regexes. – David R Tribble Apr 17 '13 at 21:27

1 Answer


In general, it's a bad idea to try parsing XML with regular expressions. The reason is that regular expressions are insufficiently expressive for nested markup, even with back references and such. The questions linked in the comments are worth reading to understand why this is generally a bad idea.

That said, you can be successful if you know for certain the format of your file, and if you're willing to do a little non-regex parsing as well.

In your situation, you have essentially:

<Flight>
    <AirlineCode>
    </AirlineCode>
</Flight>
<AirlineCode>
</AirlineCode>
<Flight>
    <AirlineCode>
    </AirlineCode>
</Flight>

And you want all of the <AirlineCode> tags that occur within <Flight> tags.

The way to approach this problem is to extract the <Flight> tags and their contents with one regex, and then use another regex to extract the <AirlineCode> tags from those extracted <Flight> tags. Don't try to do it in a single regular expression. You will not succeed.
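
As a rough illustration of the two-pass idea, here is a minimal sketch. The class name `FlightAirlineCodes` and the method name `Extract` are made up for this example, and it assumes that <Flight> elements are never nested or self-closing and that the tags carry no namespace prefixes:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class FlightAirlineCodes
{
    // Pass 1: capture the contents of each <Flight ...>...</Flight> element.
    // [^>]* allows attributes such as CnxxIndicator="N"; Singleline lets '.'
    // cross line breaks. Assumes <Flight> is never nested or self-closing.
    private static readonly Regex FlightRegex = new Regex(
        @"<Flight\b[^>]*>(.*?)</Flight>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Pass 2: the pattern from the question, run only on the text of one flight.
    private static readonly Regex AirlineCodeRegex = new Regex(
        @"<AirlineCode>(.*?)</AirlineCode>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static List<string> Extract(string xml)
    {
        var codes = new List<string>();
        foreach (Match flight in FlightRegex.Matches(xml))
        {
            // Group 1 holds everything between <Flight ...> and </Flight>.
            string flightBody = flight.Groups[1].Value;
            foreach (Match code in AirlineCodeRegex.Matches(flightBody))
            {
                codes.Add(code.Groups[1].Value.Trim());
            }
        }
        return codes;
    }
}
```

Calling `FlightAirlineCodes.Extract(xml)` on the string from the question should yield DL and AA, and an <AirlineCode> that sits outside any <Flight> element is skipped because the second pattern never sees it.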

If your data really is that simple, then this will work. I won't say that I recommend this approach. There are too many things that can go wrong. Data formats have a distressing tendency to change, and that fragile regex solution is likely to break if the format changes even a little bit. An XML parser solution will be much more robust.
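
For comparison, if the regex-only restriction is ever lifted, the parser route suggested in the comments might look something like the sketch below. It uses LINQ to XML (`XDocument`); the class name `XmlAirlineCodes` and the method name `Extract` are invented for this example, and it assumes the document is well-formed and uses no XML namespaces:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, for illustration only.
public static class XmlAirlineCodes
{
    public static List<string> Extract(string xml)
    {
        // Parse throws an XmlException if the input is not well-formed XML.
        XDocument doc = XDocument.Parse(xml);

        // Take every <AirlineCode> that has a <Flight> ancestor.
        // Element names are case-sensitive here, unlike the IgnoreCase regex.
        return doc.Descendants("Flight")
                  .Descendants("AirlineCode")
                  .Select(e => e.Value.Trim())
                  .ToList();
    }
}
```

Here the nesting is handled by the parser, so changes in whitespace, attribute order, or intermediate elements such as <Carrier> do not affect the result.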

Jim Mischel