
I have been working on this for almost a day now, but I'm not able to remove all the newlines, tabs, and carriage returns between ">" and "<".

This is a sample XML file I'm reading:

                <Consequence_Note>
                    <Text>In some cases, integer coercion errors can lead to exploitable buffer
                        overflow conditions, resulting in the execution of arbitrary
                        code.</Text>
                </Consequence_Note>

and this

<Consequence_Scope>Availability</Consequence_Scope>
                    <Consequence_Technical_Impact>DoS: resource consumption
                        (CPU)</Consequence_Technical_Impact>

My goal is to remove all the newlines, tabs, and carriage returns between the opening and closing tags (that is, between > and <). So far I can only remove the \n, \t, and \r characters between ">" and "<" when there is nothing else between them. I can't remove them when there are other characters between the two tags.

I need help writing a regular expression that will remove all the newlines, tabs, and carriage returns between ">" and "<".

For example:

                <Consequence_Technical_Impact>DoS: resource consumption
                    (CPU)</Consequence_Technical_Impact>

What I would like to have is:

<Consequence_Technical_Impact>DoS: resource consumption (CPU)</Consequence_Technical_Impact>

This is my code (I'm reading from an XML file):

String file = @"C:\Documents and Settings\YYC\Desktop\cwec_v2.1\cwec_v2.1.xml";
var lines = File.ReadAllText(file);
var replace = Regex.Replace(lines, @">([\r\n\t])*?<", "><");
File.WriteAllText(file, replace);  
yyc2001
  • You should use an XML parser. – SLaks Feb 24 '12 at 00:38
  • @SLaks I wish I could use an XML parser. The XML I'm reading is a very large file, and its format changes once every three months, so it is not a great idea to parse it. – yyc2001 Feb 24 '12 at 00:45
  • That doesn't make any difference. You can use LINQ to XML to easily handle arbitrary formats, and it will be much easier and more reliable than a regex. – SLaks Feb 24 '12 at 00:48
  • Indeed. If this is such a large file, then using a Regex means loading the entire file into memory to do its work. LinqToXml will retain the chunked reading effect that is so beneficial for large files, and it is the more sensible approach. You can use a regex to do what you're planning, but it's a very brute-force approach. Also, why does this file change so much? That goes against the benefit of using XML in the first place, so perhaps there is a better encoding you can use ... – Russ Clarke Feb 24 '12 at 00:57
  • Your example doesn't match the pattern, that's why it's not working. There is no `>` followed by *just* spaces/newline/etc and then a `<`, there is text inside. – mathematical.coffee Feb 24 '12 at 01:15
  • Are you sure that removing all CRs, LFs and tabs is what you want to do? The examples suggest that you really mean to change all runs of whitespace to single spaces. (That would be simpler, because you can safely do that inside tags as well, so you don't have to check if you're in a tag or not.) By the way, `>` and `<` are not tags. `<Text>` and `</Text>` are. – Mr Lister Feb 24 '12 at 08:03

2 Answers

Don't parse HTML/XML with regular expressions (see "RegEx match open tags except XHTML self-contained tags")!

Use an XML reader for XML, or HtmlAgilityPack (or some other HTML tool) for HTML.

XML/HTML documents are complex enough that a regular expression cannot be relied on to do the job correctly. It will work in some cases, but not in general.

TcKs

If you first read the document using an XmlReader with IgnoreWhitespace enabled, it will drop the insignificant whitespace (including newlines) between tags. Then you can simply write it back out using an XmlWriter with the appropriate settings.

See: http://msdn.microsoft.com/en-us/library/system.xml.xmlreadersettings.ignorewhitespace.aspx
See: http://msdn.microsoft.com/en-us/library/system.xml.xmlwritersettings.newlinehandling.aspx
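As a rough sketch of the parser-based route, here is one way it might look with LINQ to XML (which the comments suggest, and which reads through an XmlReader internally) rather than a hand-written XmlReader/XmlWriter loop. The class and method names are hypothetical. Note two assumptions: whitespace inside a text node such as "resource consumption\n (CPU)" is significant to the parser and is kept, so the sketch collapses it explicitly; and XDocument loads the whole document into memory, which may or may not be acceptable for a very large file.

```csharp
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class XmlWhitespace
{
    // Hypothetical helper: collapses whitespace runs inside text nodes only.
    public static string CollapseTextWhitespace(string xml)
    {
        // With the default LoadOptions, Parse discards the insignificant
        // whitespace (indentation, newlines) between elements.
        XDocument doc = XDocument.Parse(xml);

        // Materialize the node list before modifying any node.
        foreach (XText text in doc.DescendantNodes().OfType<XText>().ToList())
        {
            if (text is XCData) continue; // leave CDATA sections untouched
            text.Value = Regex.Replace(text.Value, @"\s+", " ");
        }

        // DisableFormatting writes the tree without re-indenting it.
        return doc.ToString(SaveOptions.DisableFormatting);
    }
}
```

Calling `doc.ToString()` without `SaveOptions.DisableFormatting` would instead re-indent the output with one element per line, which may be closer to the layout wanted in the question.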

A regex alternative can probably be built, but it will still have many issues with XML containing CDATA sections, comments, and other constructs that make XML hard to parse to begin with. If your XML is very structured, machine generated, and unchanging, you could create a regex to fix it, but then you might also be able to fix the generator. The simplest regex that might work:

\s{2,} 

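replace with a single space (shown here in brackets only so that it is visible; the actual replacement string is just the space):

[ ]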

That replaces any run of two or more whitespace characters with a single space. There is no need to treat whitespace inside tags differently; that's what the XmlReader should do by default anyway.
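A minimal sketch of that replacement in C#, assuming the whole file has already been read into a string as in the question. The class and method names are made up for illustration.

```csharp
using System.Text.RegularExpressions;

public static class WhitespaceRegex
{
    // Hypothetical helper: replaces every run of two or more whitespace
    // characters (spaces, tabs, \r, \n) with a single space.
    public static string CollapseRuns(string input)
    {
        return Regex.Replace(input, @"\s{2,}", " ");
    }
}
```

Applied to the question's code, this would be `File.WriteAllText(file, WhitespaceRegex.CollapseRuns(File.ReadAllText(file)))`. Two caveats follow from the pattern itself: a single newline with no indentation after it is left alone, and the indentation between adjacent tags also collapses to one space, so the whole document ends up on one line.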

jessehouwing