2

I have to read an XML file that has no root element and extract the data it contains. The file has many elements like these:

<DocumentElement>
  <LOG_x0020_ParityRate>
    <DATE>12/09/2017 - 00:00</DATE>
    <CHANNELNAME>ParityRate</CHANNELNAME>
    <SQL>update THROOMDISP set ID_HOTEL = '104', ID_ROOM = '920', NUM = '3', MYDATA = '20171006' where id_hotel =104 and id_room ='920' and MYDATA ='20171006'</SQL>
    <ID_HOTEL>104</ID_HOTEL>
    <TYPEREQUEST>updateTHROOMDISP(OK)</TYPEREQUEST>
  </LOG_x0020_ParityRate>
</DocumentElement><DocumentElement>
  <LOG_x0020_ParityRate>
    <DATE>12/09/2017 - 00:00</DATE>
    <CHANNELNAME>ParityRate</CHANNELNAME>
    <SQL>update THROOMDISP set ID_HOTEL = '105', ID_ROOM = '923', NUM = '1', MYDATA = '20171006' where id_hotel =105 and id_room ='923' and MYDATA ='20171006'</SQL>
    <ID_HOTEL>105</ID_HOTEL>
    <TYPEREQUEST>updateTHROOMDISP(OK)</TYPEREQUEST>
  </LOG_x0020_ParityRate>
</DocumentElement><DocumentElement>
  <LOG_x0020_ParityRate>
    <DATE>12/09/2017 - 00:00</DATE>
    <CHANNELNAME>ParityRate</CHANNELNAME>
    <SQL>update THROOMDISP set ID_HOTEL = '104', ID_ROOM = '920', NUM = '3', MYDATA = '20171007' where id_hotel =104 and id_room ='920' and MYDATA ='20171007'</SQL>
    <ID_HOTEL>104</ID_HOTEL>
    <TYPEREQUEST>updateTHROOMDISP(OK)</TYPEREQUEST>
  </LOG_x0020_ParityRate>
</DocumentElement><DocumentElement>

I tried to read it as a string, manually add opening and closing tags, and parse it as an XDocument, but the file also contains some badly formed tags, like these:

</DocumentElement>
<TYPEREQUEST>updateTHROOMPRICE(OK)</TYPEREQUEST>

These tags don't match any opening tag, so XDocument.Parse throws an exception on the resulting string. The file has millions of rows, so I can't read it line by line, or the iteration will last for hours. How can I get rid of all these badly formed tags and parse the document?

Kalamarico
Alfredo Torre
  • correct your XML? – SᴇM Sep 14 '17 at 13:44
  • Basically, you're not trying to read an XML file. You're trying to read a file which is a bit like XML, but not quite. I would strongly advise you to work upstream to find out what's meant to be creating XML (but failing to do so) and get that fixed if *at all* possible. – Jon Skeet Sep 14 '17 at 13:47
  • I receive this XML and I have no control over how the file is created – Alfredo Torre Sep 14 '17 at 13:48
  • If you don't want to change the format to be XML then try writing a parser for your XML-like format. You can use something like [Sprache](https://github.com/sprache/Sprache). – Dustin Kingen Sep 14 '17 at 13:52
  • Thanks @Romoku, I'll give it a read – Alfredo Torre Sep 14 '17 at 13:54

3 Answers

2

Your XML is simply not well formed, which often happens when XML data is merged together. It has multiple elements at the root level, so use an XmlReader with ConformanceLevel.Fragment, as below:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;


namespace ConsoleApplication4
{
    class Program
    {
        const string FILENAME = @"c:\temp\test.xml";
        static void Main(string[] args)
        {
            XmlReaderSettings settings = new XmlReaderSettings();
            settings.ConformanceLevel = ConformanceLevel.Fragment;
            XmlReader reader = XmlReader.Create(FILENAME,settings);
            while (!reader.EOF)
            {
                try
                {
                    if (reader.Name != "LOG_x0020_ParityRate")
                    {
                        reader.ReadToFollowing("LOG_x0020_ParityRate");
                    }
                    if (!reader.EOF)
                    {
                        XElement parityRate = (XElement)XElement.ReadFrom(reader);

                        ParityRate newLog = new ParityRate();
                        ParityRate.logs.Add(newLog);
                        // HH is the 24-hour clock; the log contains times such as 00:00
                        newLog.date = DateTime.ParseExact((string)parityRate.Element("DATE"), "MM/dd/yyyy - HH:mm", System.Globalization.CultureInfo.InvariantCulture);
                        newLog.name = (string)parityRate.Element("CHANNELNAME");
                        newLog.sql = (string)parityRate.Element("SQL");
                        newLog.hotel = (int)parityRate.Element("ID_HOTEL");
                    }
                }
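                catch (Exception ex)
                {
                    // After an XmlException the reader is in ReadState.Error and cannot
                    // advance, so report the problem and stop instead of looping forever.
                    Console.WriteLine(ex.Message);
                    break;
                }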
            }
        }
    }
    public class ParityRate
    {
        public static List<ParityRate> logs = new List<ParityRate>();

        public DateTime date { get; set; }
        public string name { get; set; }
        public string sql { get; set; }
        public int hotel { get; set; }
    }
}
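
As the comments under this answer discuss, an XmlReader stays on the bad node once it has thrown. One possible workaround, which is not part of the original answer, is to cut the file into individual records as text and parse each record on its own, so that a malformed record only costs that one record. The sketch below assumes the opening and closing `LOG_x0020_ParityRate` tags each sit on their own line, as in the sample; `RecordScanner` and `ReadRecords` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Linq;

static class RecordScanner
{
    const string Open = "<LOG_x0020_ParityRate>";
    const string Close = "</LOG_x0020_ParityRate>";

    // Yields every record that parses cleanly; malformed records are skipped.
    public static IEnumerable<XElement> ReadRecords(TextReader input)
    {
        StringBuilder buffer = null;   // non-null while inside a record
        string line;
        while ((line = input.ReadLine()) != null)
        {
            if (line.Contains(Open))
            {
                // A new record starts; any unfinished record is dropped.
                buffer = new StringBuilder();
            }
            if (buffer == null) continue;   // wrappers and stray tags between records

            buffer.AppendLine(line);
            if (!line.Contains(Close)) continue;

            XElement record = null;
            try
            {
                record = XElement.Parse(buffer.ToString());
            }
            catch (XmlException)
            {
                // Badly formed record: ignore it and carry on with the next one.
            }
            buffer = null;
            if (record != null) yield return record;
        }
    }
}
```

It could be driven with `foreach (XElement r in RecordScanner.ReadRecords(File.OpenText(path)))`, reading the fields with `r.Element("DATE")` and so on as in the code above. Only one record is held in memory at a time, which matters for a file with millions of rows.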
jdweng
  • Thanks, I tried your code and it looks great. The problem is that I get an XmlException "unexpected closing tags at row 69690" when it hits a badly formed row with `EQUEST>` – Alfredo Torre Sep 14 '17 at 14:55
  • The code you posted had one extra tag at the end that I deleted : . I thought it was just a copy error you made when you added the XML to the post. You can add an exception handler so the code continues after getting the exception. – jdweng Sep 14 '17 at 15:04
  • How can I move to the next row in case of an exception? I tried MoveToNextElement, ReadToNextSibling, and Skip in the catch block, but it stays on the badly formed row – Alfredo Torre Sep 14 '17 at 15:17
  • I modified the code to add an exception handler to continue after an exception – jdweng Sep 14 '17 at 15:44
  • Wrong solution. Fix the xml. – David Heffernan Sep 15 '17 at 06:45
  • David : You are absolutely wrong again. Log files often use XML format. You do not want to keep adding new log data to the file by opening the file and adding to a root. It takes a lot of time. Instead you just append new data to the end of the file, which is much quicker but gives multiple roots. I've seen files like this all the time. Why do you think the XmlReader settings have an option for fragments? – jdweng Sep 15 '17 at 06:57
  • Appending properly isn't hard. – David Heffernan Sep 15 '17 at 07:59
  • It is very time consuming to open and close large files in Windows. When you are logging you do not want to spend time saving results. – jdweng Sep 15 '17 at 08:04
  • Nope. Open file. Seek to end. Read back from end. Append. No problems. – David Heffernan Sep 15 '17 at 21:39
  • Big waste of time and processing speed that isn't needed. Hundreds of thousands of applications don't do that. Especially when log files get into millions of lines and hundreds of megabytes. And it will create unnecessary fragmentation of the disk. – jdweng Sep 16 '17 at 01:45
  • No it won't. It's no different from what you propose. Other than keeping the file as valid xml. You clearly don't understand what I am saying. Doesn't matter. – David Heffernan Sep 16 '17 at 03:56
  • You don't understand reality. The OP has no control over the format of the input. The OP has a standard XML file that has no issues. Why are you asking the OP to do something they cannot do? – jdweng Sep 16 '17 at 09:11
  • @AlfredoTorre, it doesn't move to the next node because after the exception ReadState = Error, and you can't change it because it's a read-only property. – Alexan Sep 20 '17 at 19:11
  • Alfredo : Are you sure? Even though it has ReadState equal to Error, it is not EOF. So in the next loop the code should move to reader.ReadToFollowing("LOG_x0020_ParityRate");. If it doesn't find the LOG.... then it will get EOF and exit the while loop. If what you said is true then the code will never exit the while loop. – jdweng Sep 21 '17 at 00:05
  • @jdweng, yes, it never exits the loop. – Alexan Sep 21 '17 at 18:21
  • Can you check two things? 1) Is reader == null? 2) What is reader.Name? The only thing I can see to keep the code in the loop is if reader becomes null or there is an invalid reader.Name. – jdweng Sep 21 '17 at 19:19
  • I have one more theory. Your StreamReader will default to ASCII encoding, which will ignore any non-printable characters. I don't think the XmlReader will do the same. So if the file contained a null (0x00) it could account for the error. So Name would be the string 0x00, which may be giving the error. – jdweng Sep 21 '17 at 19:34
  • @jdweng, reader.Name is exactly this bad tag, in my case time without closing > – Alexan Sep 21 '17 at 20:30
  • Ever wilder hypotheses, and yet never contemplating the obvious. Namely that the asker speaks the truth. – David Heffernan Sep 22 '17 at 05:20
  • I was finally able to reproduce the problem by leaving out a ">". The root problem is that ReadToFollowing() wasn't moving the reader forward because the code was already at the tag, and ReadFrom() wasn't advancing the reader due to the node having an error. It wasn't obvious until I was able to duplicate the issue. Now trying to find a fix. The OP's conclusion that the reader didn't move due to state = error was wrong. – jdweng Sep 22 '17 at 06:13
  • The first comment to your answer explains that. – David Heffernan Sep 22 '17 at 07:01
1

I found a way to solve my problem: I gave up reading the file as XML and read it with a StreamReader instead, looking for the text I need, so I don't have to fight against the XML format.

using (StreamReader strReader = File.OpenText(path))
{
    while (!strReader.EndOfStream)
    {
        string line = strReader.ReadLine();
        if (line.Contains("<LOG_x0020_ParityRate>"))
        {
            // The five fields follow the opening tag, one per line, in a fixed order.
            // getTagText (not shown) returns the text between the tags on a line.
            string data_ = getTagText(strReader.ReadLine());
            string channelName_ = getTagText(strReader.ReadLine());
            string sql_ = getTagText(strReader.ReadLine());
            string idHotel_ = getTagText(strReader.ReadLine());
            string type_ = getTagText(strReader.ReadLine());
        }
    }
}
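
The answer does not show `getTagText`. A minimal sketch of what such a helper might look like follows, assuming each field sits on a single line in the form `<TAG>text</TAG>`; the wrapper class name `TagText` is hypothetical, and the helper does not decode entities such as `&amp;`.

```csharp
using System;

static class TagText
{
    // Hypothetical helper: returns the text between the end of the opening tag
    // and the start of the closing tag, e.g. "104" for "<ID_HOTEL>104</ID_HOTEL>".
    // Returns null when the line is null or does not have that shape.
    public static string getTagText(string line)
    {
        if (line == null) return null;

        int start = line.IndexOf('>');                               // end of the opening tag
        if (start < 0) return null;

        int end = line.LastIndexOf("</", StringComparison.Ordinal);  // start of the closing tag
        if (end <= start) return null;

        return line.Substring(start + 1, end - start - 1);
    }
}
```

Returning null for a line that is not a complete field lets the caller detect a badly formed record (for example a stray `</DocumentElement>`) and skip it.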
Alfredo Torre
0

You can try to use XmlParser:

A Roslyn-inspired full-fidelity XML parser with no dependencies and a simple Visual Studio XML language service.

It parses even badly formed XML.

Alexan