I need to read an XML file that does not conform to the XML rules, so I have to repair it before I can parse it as XML. It contains characters such as "&" and "<" inside the element content.

<MAT>
<MATERIAL><MATNR>2286303</MATNR><BESTELTXT>Parts for something & something else</BESTELTXT><WERKS>Material exist out of<1 something</WERKS>
</MAT>

This is what I have so far. I read the file into a string and then do this:

            text = Regex.Replace(text, @"\s&\s", " &amp; ");
            text = Regex.Replace(text, @"[<]\d+", "&lt;");

After that I write the text back to a file and read that file as XML.

The problem with the "<" replacement is that it removes the number that follows, which I need to keep. I also don't know whether this performs well. Will it work with very large files? And it only matches these two cases, but what if more cases turn up in the future? Isn't there a general way to convert the predefined entities to their XML form?

PS: I know this should be handled when the XML file is created, but it comes from a third party and they can't change it.

Bram V
  • They can't change it? Then change that vendor – Thomas Weller Nov 24 '16 at 15:07
  • They're not supplying you a valid XML file, so you can't expect to read it as one. – Jamiec Nov 24 '16 at 15:10
  • @Thomas yeah that's very funny, but usually it's not the developer's place to decide that. – CodeCaster Nov 24 '16 at 15:11
  • @Jamiec I have to agree with you, but we don't have a choice, so like CodeCaster says, it's not my place to decide. I asked them and let them know, but the answer was that it wasn't possible, so I'm stuck with it. – Bram V Nov 25 '16 at 08:55
  • @BramV see I disagree 100% with codecaster. It *is* the responsibility of the developer IMO. If I am paid to be a developer, then someone is paying for my knowledge & experience. If that tells me they're using a shitty vendor, then darn right I'll make it clear to whoever is paying me. – Jamiec Nov 25 '16 at 09:27

1 Answer

You could try this:

text = Regex.Replace(text, @"(\s+)&(\s+)", "$1&amp;$2");
text = Regex.Replace(text, @"[<](\d+)", "&lt;$1");
  • The first change is \s to \s+, so that & is matched even when it is surrounded by more than one whitespace character. The + still requires at least one.
  • The second change is \d+ to (\d+). The parentheses create a capture group, so the replacement can use $1, which contains the matched number. The same applies to (\s+). If a pattern has more than one group, they are numbered in order: $1, $2, and so on.
  • To improve matching performance you can add RegexOptions.Compiled to your Regex, for example text = Regex.Replace(text, @"(\s+)&(\s+)", "$1&amp;$2", RegexOptions.Compiled); note that compiling costs some extra time the first time the pattern is used.

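Also, if you want to replace every &, remove the (\s+) groups from the pattern and the $1 and $2 references from the replacement. Be aware that this would also re-escape any & that already starts an entity such as &amp;.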
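For the more general case asked about in the question, one possible heuristic is to escape every & that does not already start an entity reference, and every < that cannot start a tag. The sketch below is only an illustration under those assumptions: the class name XmlRepair and the method name EscapeStrayMarkup are made up for this example, and the second pattern will still miss a stray < that is directly followed by a letter (for example "a <b").

```csharp
using System.Text.RegularExpressions;

public static class XmlRepair
{
    // '&' that does NOT start a predefined entity (&amp; &lt; &gt; &quot; &apos;)
    // or a numeric character reference (&#38; or &#x26;).
    private static readonly Regex BareAmpersand = new Regex(
        @"&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);)",
        RegexOptions.Compiled);

    // '<' that is NOT followed by a character that can start a tag,
    // a closing tag, a comment/CDATA section, or a processing instruction.
    private static readonly Regex BareLessThan = new Regex(
        @"<(?![A-Za-z_/!?])",
        RegexOptions.Compiled);

    // Hypothetical helper: returns the text with stray '&' and '<' escaped.
    public static string EscapeStrayMarkup(string text)
    {
        text = BareAmpersand.Replace(text, "&amp;");
        return BareLessThan.Replace(text, "&lt;");
    }
}
```

Because the Regex instances are created once and reused, the cost of RegexOptions.Compiled is paid only one time. Calling XmlRepair.EscapeStrayMarkup(text) would replace the two Regex.Replace lines above.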

Badro Niaimi
  • Instead of "try this" explain what you changed. – CodeCaster Nov 24 '16 at 15:11
  • and why you changed it – Thomas Weller Nov 24 '16 at 15:12
  • @CodeCaster is that enough? I didn't explain because those are basic things in the Regex world – Badro Niaimi Nov 24 '16 at 15:22
  • If the OP were well-versed in the basic things in the Regex world, they wouldn't have to ask this question, would they? But yes, that's a nice explanation you added, have an upvote. – CodeCaster Nov 24 '16 at 15:27
  • that's a very good point, thank you. – Badro Niaimi Nov 24 '16 at 15:30
  • @BadroNiaimi Thanks for this! It does the job perfectly! But I still have one unanswered question: "What if more cases turn up in the future? Isn't there a general way to convert those (ALL) predefined entities to their XML form?" And one new question: right now I read the file into a string with File.ReadAllText and write it back afterwards. Does this perform well for large files, or are there better options? – Bram V Nov 25 '16 at 09:18
  • @BramV the best way to read an XML file in C# is to use [XmlSerializer](https://msdn.microsoft.com/fr-fr/library/system.xml.serialization.xmlserializer(v=vs.110).aspx). With this class you can serialize (write) and deserialize (read) your data to and from XML files. If you read your XML file with ReadAllText, everything is treated as a string, so you have to handle these characters yourself: `[" ' < > & ] ` – Badro Niaimi Nov 25 '16 at 11:46
  • Yeah, I know how XmlSerializer works and I was using it, but I think I will replace it with XmlReader so I can read one object at a time. I guess that is better for big files? I use ReadAllText to load the invalid XML so I can run the Regex we just discussed on it and turn it into a valid XML file. But I guess loading a big file into a string and writing it back afterwards isn't a good solution. – Bram V Nov 25 '16 at 14:46
  • If you have a big file you could read it line by line; take a look at this [topic](http://stackoverflow.com/questions/37725050/reading-and-writing-very-large-text-files-in-c-sharp). You can also see this [topic](http://stackoverflow.com/questions/4500659/performance-xmlserializer-vs-xmlreader-vs-xmldocument-vs-xdocument) to get a clear idea of the performance of each of XmlSerializer, XmlReader, XmlDocument and XDocument. – Badro Niaimi Nov 25 '16 at 15:39
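
The line-by-line idea from the last comment could look roughly like the sketch below. It is an illustration only: the class name LargeFileRepair, the method name RepairFile, and the fixLine callback are made up for this example, and it assumes that no pattern needs to match across a line break (the \s in the & pattern can also match a newline, so an & at the very start or end of a line would be treated differently than with ReadAllText).

```csharp
using System;
using System.IO;

public static class LargeFileRepair
{
    // Streams inputPath to outputPath one line at a time, so the whole file
    // is never held in memory. fixLine is a hypothetical callback that
    // repairs a single line, for example by applying the Regex replacements.
    public static void RepairFile(string inputPath, string outputPath,
                                  Func<string, string> fixLine)
    {
        using (var writer = new StreamWriter(outputPath))
        {
            // File.ReadLines enumerates lazily, unlike File.ReadAllLines.
            foreach (string line in File.ReadLines(inputPath))
            {
                writer.WriteLine(fixLine(line));
            }
        }
    }
}
```

The repaired file can then be opened with XmlReader, which also reads forward-only without loading the whole document. StreamWriter writes UTF-8 by default here, so a file in another encoding would need an explicit Encoding argument on both the read and the write side.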