I have a file of roughly 215 MB, and its data is in the following format:

<Something>
....
 <Document>
  </Document>
...
</Something>

Now I want to fetch the chunks of data that sit between the <Document> and </Document> tags. I created the regex below:

@"<DOCUMENT></DOCUMENT>"

How can I extract these chunks from such a large file?

I tried using StreamReader, but I am not sure what the best and fastest way is.

Neel
  • It looks like `XML`, so you could use `XmlReader` for this. See this link: https://msdn.microsoft.com/en-us/library/cc189056(v=vs.95).aspx – daniel59 Mar 17 '16 at 07:31
  • It's not an XML file; the format is only similar to XML. – Neel Mar 17 '16 at 07:32
  • Then you could still use `XmlReader` or similar classes. They do not depend on the file extension. – daniel59 Mar 17 '16 at 07:33
  • So, what does the file look like? From the part you showed, one would assume it's XML, but you say it's not. Please give a better sample of your data. – derpirscher Mar 17 '16 at 07:37
  • To add to what CoreDeveloper wrote... If `var xdoc = new XmlDocument(); xdoc.Load("yourfilename");` works without throwing an `Exception` then it is an xml document. If it is an xml document, then you can begin working with `XmlReader` to read it piece by piece. – xanatos Mar 17 '16 at 08:03
  • No, they would never be on the same line @HenkHolterman; there are many things in between – Neel Mar 17 '16 at 08:07
  • It's a dissem file @xanatos – Neel Mar 17 '16 at 08:08
  • It's a combination of many other files @HenkHolterman, and it has an XML-like format – Neel Mar 17 '16 at 08:16
  • Have you tried just reading the file line by line and extracting the chunks between those delimiters? – Lasse V. Karlsen Mar 17 '16 at 09:11
  • Yes, I already tried, but some information gets dropped: when reading line by line, a single line is not enough for the data I want, and when reading in chunks, the format of the data gets broken up. @LasseV.Karlsen – Neel Mar 17 '16 at 09:14
  • That comment makes little sense to be honest. I've posted an answer, please let us know if that would be sufficient or not. – Lasse V. Karlsen Mar 17 '16 at 09:16
  • Possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Aron Mar 18 '16 at 08:42
  • It's not what I was looking for, so please remove it @Aron – Neel Mar 18 '16 at 08:49
  • @CoreDeveloper You really aren't helping yourself. Your question is clearly a "How do I extract data from a large XML file". Until such time as you edit your question to show how it is "different" from that question, you will continue to receive these answers. – Aron Mar 18 '16 at 08:53

1 Answer

Here is a simple piece of code that would do what you want.

It will:

  1. Read the file line by line (so it won't load all 215 MB into memory)
  2. Gather up each chunk and "output" it when the end of that chunk is reached

Here is the code:

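using System.Collections.Generic;
using System.IO;

bool inDocument = false;         // true while between <Document> and </Document>
var chunk = new List<string>();  // lines collected for the current document

// File.ReadLines enumerates the file lazily, one line at a time
foreach (var line in File.ReadLines(@"D:\Temp\largefile.txt"))
{
    // Only a line that, once trimmed, is exactly the tag counts as a delimiter
    switch (line.Trim())
    {
        case "<Document>":
            inDocument = true;
            break;

        case "</Document>":
            inDocument = false;
            if (chunk.Count > 0)
            {
                // Output chunk (process it here, before it is cleared for reuse)
                chunk.Clear();
            }
            break;

        default:
            // Lines outside a <Document> block are ignored
            if (inDocument)
                chunk.Add(line);
            break;
    }
}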
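
If you would rather consume the chunks one at a time from calling code, the same loop can be wrapped in an iterator. This is only a sketch under the same assumption as above (each tag sits alone on its own line); the class name `DocumentChunks` and the method name `Read` are made up for illustration and are not part of any library.

```csharp
using System.Collections.Generic;

static class DocumentChunks
{
    // Hypothetical helper: lazily yields the lines found between each
    // <Document> ... </Document> pair. Empty documents are skipped,
    // matching the Count > 0 check in the loop above.
    public static IEnumerable<List<string>> Read(IEnumerable<string> lines)
    {
        bool inDocument = false;
        var chunk = new List<string>();

        foreach (var line in lines)
        {
            switch (line.Trim())
            {
                case "<Document>":
                    inDocument = true;
                    chunk = new List<string>();   // start a fresh chunk
                    break;

                case "</Document>":
                    inDocument = false;
                    if (chunk.Count > 0)
                    {
                        yield return chunk;       // hand the finished chunk to the caller
                        chunk = new List<string>();
                    }
                    break;

                default:
                    if (inDocument)
                        chunk.Add(line);
                    break;
            }
        }
    }
}
```

Passing `File.ReadLines(@"D:\Temp\largefile.txt")` as the argument keeps the reading lazy, so only the current chunk is held in memory. Each chunk is a new list, so the caller can keep it after moving on to the next one.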
Lasse V. Karlsen