Fetching data from large files

Question

I have a file of about 215 MB with data in the following format:

<Something>
....
 <Document>
  </Document>
...
</Something>

Now I want to fetch the chunks of data that sit between the Document tags. I created the following regex:

@"<DOCUMENT></DOCUMENT>"

How can I extract these chunks from such a large file?

I tried using StreamReader, but I am not sure what the best and fastest way is.

It looks like `XML`, so you could use `XmlReader` for this. See this link: https://msdn.microsoft.com/en-us/library/cc189056(v=vs.95).aspx – daniel59, Mar 17 '16 at 07:31
Then you could also use the `XmlReader` or similar classes. It is independent of the file extension – daniel59, Mar 17 '16 at 07:33
So, what does the file look like? From the part you showed, one would assume it's XML, but you say it's not. Give a better sample of your data – derpirscher, Mar 17 '16 at 07:37
To add to what CoreDeveloper wrote... If `var xdoc = new XmlDocument(); xdoc.Load("yourfilename");` works without throwing an `Exception`, then it is an XML document. If it is an XML document, then you can begin working with `XmlReader` to read it piece by piece. – xanatos, Mar 17 '16 at 08:03
No, they would never be on the same line @HenkHolterman; there are many things in between – Neel, Mar 17 '16 at 08:07
It's a combination of many other files @HenkHolterman, and it has an XML-like format – Neel, Mar 17 '16 at 08:16
Have you tried just reading the file line by line and extracting the chunks between those delimiters? – Lasse V. Karlsen, Mar 17 '16 at 09:11
Yes, I already tried that, but some information gets dropped: when reading line by line, a single line is not enough for the data I want, and when reading in chunks, the format of the data gets broken up. @LasseV.Karlsen – Neel, Mar 17 '16 at 09:14
That comment makes little sense, to be honest. I've posted an answer, please let us know if that would be sufficient or not. – Lasse V. Karlsen, Mar 17 '16 at 09:16
Possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Aron, Mar 18 '16 at 08:42
@CoreDeveloper You really aren't helping yourself. Your question is clearly a "How do I extract data from a large XML file". Until such time as you edit your question to show how it is "different" from that question, you will continue to receive these answers. – Aron, Mar 18 '16 at 08:53
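
For reference, the `XmlReader` suggestion from the comments above could be sketched roughly as follows. This only applies if the file really is well-formed XML, which the asker disputes; `XmlDocumentChunks` and its `Read` method are hypothetical names made up for this illustration, not part of any library.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class XmlDocumentChunks
{
    // Hypothetical helper: streams through the XML and yields the inner
    // markup of every <Document> element, one string per element.
    // Assumes well-formed XML; XmlReader throws XmlException otherwise.
    public static IEnumerable<string> Read(TextReader input)
    {
        using (var reader = XmlReader.Create(input))
        {
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "Document")
                {
                    // ReadInnerXml also moves the reader past </Document>,
                    // so Read() must not be called again in this iteration.
                    yield return reader.ReadInnerXml();
                }
                else
                {
                    reader.Read();
                }
            }
        }
    }
}
```

It could be called as `XmlDocumentChunks.Read(File.OpenText(path))`, where `path` is the location of the large file. The loop checks the current node before advancing because a `<Document>` element that directly follows another one would otherwise be skipped.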

score 1 · Accepted Answer · answered Mar 17 '16 at 09:15

Here is a simple piece of code that would do what you want.

It will:

- Read the file line by line (so it won't load all 215 MB into memory)
- Gather up each chunk separately and "output" it when the end of that chunk is reached

Here is the code:

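using System.Collections.Generic;
using System.IO;

bool inDocument = false;          // true while we are between <Document> and </Document>
var chunk = new List<string>();   // lines of the chunk currently being collected

// File.ReadLines streams the file lazily, one line at a time
foreach (var line in File.ReadLines(@"D:\Temp\largefile.txt"))
{
    switch (line.Trim())
    {
        case "<Document>":
            // Start of a chunk: begin collecting lines
            inDocument = true;
            break;

        case "</Document>":
            // End of a chunk: hand it off, then reset for the next one
            inDocument = false;
            if (chunk.Count > 0)
            {
                // Output chunk
                chunk.Clear();
            }
            break;

        default:
            // Keep the line only if it lies inside a Document section
            if (inDocument)
                chunk.Add(line);
            break;
    }
}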
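
As a rough illustration of what the "// Output chunk" step might look like, the same loop can be wrapped in an iterator that hands back each chunk as a single string. This is only a sketch: `DocumentChunks` and `Read` are hypothetical names, and, unlike the loop above, it also clears the buffer when a new `<Document>` line is seen, which is an added assumption about how an unclosed section should be handled.

```csharp
using System;
using System.Collections.Generic;

public static class DocumentChunks
{
    // Hypothetical helper: yields the text between each <Document> and
    // </Document> line as one string per chunk, joined with newlines.
    public static IEnumerable<string> Read(IEnumerable<string> lines)
    {
        bool inDocument = false;
        var chunk = new List<string>();

        foreach (var line in lines)
        {
            switch (line.Trim())
            {
                case "<Document>":
                    inDocument = true;
                    chunk.Clear(); // assumption: discard leftovers of an unclosed section
                    break;

                case "</Document>":
                    if (inDocument && chunk.Count > 0)
                        yield return string.Join(Environment.NewLine, chunk);
                    inDocument = false;
                    chunk.Clear();
                    break;

                default:
                    if (inDocument)
                        chunk.Add(line);
                    break;
            }
        }
    }
}
```

With `using System.IO;` in scope, it could be used as `foreach (var text in DocumentChunks.Read(File.ReadLines(@"D:\Temp\largefile.txt"))) { ... }`. Because the iterator is lazy, only one chunk is held in memory at a time.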

answered Mar 17 '16 at 09:15

Lasse V. Karlsen


OK, I will try. By the way, will it work if I have many "Document" sections in the file? – Neel Mar 17 '16 at 09:19
It will "// Output chunk" once for each such chunk of data, it will not combine them all into one big one. – Lasse V. Karlsen Mar 17 '16 at 09:20
You mean that every time I reach "// Output chunk" I get one chunk, and I can add it to a string, right? – Neel Mar 17 '16 at 09:26
You can do whatever you want with it. If you want all those lines as one big string you can do `string.Join(Environment.NewLine, chunk)`. – Lasse V. Karlsen Mar 17 '16 at 09:27
I don't understand your question. If you get everything between `<Document>` and `</Document>`, isn't that what you wanted? – Lasse V. Karlsen Mar 17 '16 at 09:57
