Good Day,

Is there an alternative way to get everything inside a tag using regex? Here is my code:

   MatchCollection matches = Regex.Matches(chek, "<bib-parsed>([^\000]*?)</bib-parsed>");

Here is the sample input:

   <bib-parsed>
   <cite>
   <pubinfo>
   <pub-year><i>1984</i></pub-year>
   <pub-place>Albuquerque</pub-place>
   <pub-name>Maxwell Museum of Anthropology and the University of New Mexico Press        </pub-name>
   </pubinfo>
   <bkinfo>
   <btl>The Galaz Ruin: A Prehistoric Mimbres Village in Southwestern New Mexico</btl>
   </bkinfo>
   </bib-parsed>

The sample above is matched, but when there are 0's inside the pub-year, as in "2001", the match fails. Is there an alternative for this? Thanks.

  • 7
    Noooooooooooooooooooooooooooo! http://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not, obligatory: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 I sense the pit opening... – Mitch Wheat Oct 18 '13 at 02:01

1 Answer

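It appears your input is valid XML. If so, use the XML parsers in either System.Xml or System.Xml.Linq instead of a regex. They are extremely fast. Note that XDocument.Parse expects a single root element, and that e.Value returns the concatenated text content of an element, not its markup. For an input string containing multiple chunks like your example under one root, using the System.Xml.Linq namespace objects: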

// Requires: using System.Linq; using System.Xml.Linq;
var bibChunks = XDocument.Parse(yourXmlString)
                         .Descendants("bib-parsed")
                         .Select(e => e.Value); // text content of each <bib-parsed>

foreach (string chunk in bibChunks) {
    // do stuff
}
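If the markup inside each chunk is needed rather than just its text, something like the following might work. It is only a sketch: it assumes the input is well-formed (every opened tag, such as `<cite>`, is also closed), and the `BibXml` class, the `InnerXml` method, and the `<root>` wrapper element are hypothetical names introduced here so that several sibling chunks can be parsed at once.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

static class BibXml
{
    // Hypothetical helper: returns the inner markup of every <bib-parsed> element.
    public static List<string> InnerXml(string fragments)
    {
        // Wrap the fragments in an assumed <root> element, because the parser
        // accepts only one top-level element.
        XElement root = XElement.Parse("<root>" + fragments + "</root>");

        return root.Descendants("bib-parsed")
                   // Concatenating the child nodes serializes them back to markup.
                   .Select(e => string.Concat(e.Nodes()))
                   .ToList();
    }

    static void Main()
    {
        string input =
            "<bib-parsed><pub-year><i>2001</i></pub-year></bib-parsed>" +
            "<bib-parsed><pub-place>Albuquerque</pub-place></bib-parsed>";

        foreach (string chunk in InnerXml(input))
        {
            Console.WriteLine(chunk);
        }
    }
}
```

The serialized markup may be re-indented by the XML writer, so it is not guaranteed to be byte-for-byte identical to the input.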

That's all there is to it.
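As for why the original pattern breaks on "2001": C# has no octal escapes, so in a regular string literal "\000" is the escape \0 (the NUL character) followed by two literal 0 characters. The class [^\000] therefore excludes the digit 0 as well. If a regex is still preferred, a minimal sketch of one alternative is to let `.` match newlines with RegexOptions.Singleline. The `BibRegex` class and `Extract` method below are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class BibRegex
{
    // Hypothetical helper: returns the raw text between each <bib-parsed> pair.
    public static List<string> Extract(string input)
    {
        var results = new List<string>();

        // Singleline makes '.' match '\n' too; the lazy '*?' stops at the
        // first closing tag instead of the last one.
        MatchCollection matches = Regex.Matches(
            input,
            "<bib-parsed>(.*?)</bib-parsed>",
            RegexOptions.Singleline);

        foreach (Match m in matches)
        {
            results.Add(m.Groups[1].Value);
        }
        return results;
    }

    static void Main()
    {
        string chek = "<bib-parsed>\n<pub-year><i>2001</i></pub-year>\n</bib-parsed>";

        foreach (string chunk in Extract(chek))
        {
            Console.WriteLine(chunk);
        }
    }
}
```

This does not handle nested <bib-parsed> elements or attributes on the tag, which is one reason the parser approach above is usually the safer choice.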

Joshua Honig