
My code is below. It uses too much memory, and the file is currently 700 MB in .txt format.

StringBuilder dogs = new StringBuilder();
string line;
using (StreamReader str = new StreamReader(file))
{
    while ((line = str.ReadLine()) != null)
    {
        dogs.AppendLine(line);
    }
}

Can anyone suggest a data type I can store the file contents in? After reading the file I have to perform operations on the data and then write it out again, line by line, in CSV format.

Raman Singh
  • Do you really need to have it complete on memory? You could start reading it line by line and do whatever you want on it and write it back without reading it complete. Why do you want it all at a time? – fernando.reyes Nov 28 '14 at 14:20
  • What sort of file is this? Is this file maintained internally? 700mb of a file is huge. Break into multiple smaller files and then read them. – Azhar Khorasany Nov 28 '14 at 14:22
  • I have two files and have to compare them. Suppose, for example, that both files contain emails, and I have to remove from the first file all emails that are present in the second file. The second file may be 1 GB in size, and the first email file is always more than 1 GB. That's why I have to keep the second file in memory for comparisons – Raman Singh Nov 28 '14 at 14:22
  • I doubt that you do need to have it all in memory at once – David Heffernan Nov 28 '14 at 14:26
  • Is there any solution you can provide, Mr. David Heffernan? – Raman Singh Nov 28 '14 at 14:33
  • Well, we don't know any details to be providing solutions, but it doesn't sound as though you need to have it all in memory at once. However, you clearly think differently. – David Heffernan Nov 28 '14 at 14:40
  • Take one scenario: you have to compare two files that contain only emails and produce a new file containing only the emails that are not present in the second file. What would you do in this case, Mr. David Heffernan? – Raman Singh Nov 28 '14 at 14:54
  • Are the emails in both files sorted somehow? – Ferruccio Nov 28 '14 at 15:42
  • not sorted at all but in random fashion – Raman Singh Nov 28 '14 at 15:46

5 Answers


For your scenario with emails I would strongly recommend using a SQL database.

You could read and parse the first file into a database table line by line and then use SQL queries to look up the emails from the second file. Or you could even parse both files into separate tables and use a single SQL query to find the matching records.

If you do not want to bother with SQL Server or MS Access, I would recommend using SQLite and the sqlite-net ORM library.
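
Here is a minimal sketch of that idea, using the Microsoft.Data.Sqlite package directly rather than the sqlite-net ORM mentioned above; the file paths, table names and database file name are placeholder assumptions.

using System.IO;
using Microsoft.Data.Sqlite;

class EmailDiff
{
    static void Main()
    {
        using (var con = new SqliteConnection("Data Source=emails.db"))
        {
            con.Open();
            Exec(con, "CREATE TABLE IF NOT EXISTS file1 (email TEXT)");
            Exec(con, "CREATE TABLE IF NOT EXISTS file2 (email TEXT)");

            Load(con, "file1", @"C:\test\file1.txt");
            Load(con, "file2", @"C:\test\file2.txt");

            // write out the emails of file1 that do not appear in file2
            using (var cmd = con.CreateCommand())
            using (var writer = new StreamWriter(@"C:\test\result.txt"))
            {
                cmd.CommandText =
                    "SELECT email FROM file1 WHERE email NOT IN (SELECT email FROM file2)";
                using (var reader = cmd.ExecuteReader())
                    while (reader.Read())
                        writer.WriteLine(reader.GetString(0));
            }
        }
    }

    static void Exec(SqliteConnection con, string sql)
    {
        using (var cmd = con.CreateCommand()) { cmd.CommandText = sql; cmd.ExecuteNonQuery(); }
    }

    static void Load(SqliteConnection con, string table, string path)
    {
        // insert the lines of one file into one table, inside a single
        // transaction so the bulk insert stays reasonably fast
        using (var tx = con.BeginTransaction())
        using (var cmd = con.CreateCommand())
        using (var sr = new StreamReader(path))
        {
            cmd.Transaction = tx;
            cmd.CommandText = "INSERT INTO " + table + " (email) VALUES ($e)";
            var p = cmd.CreateParameter();
            p.ParameterName = "$e";
            cmd.Parameters.Add(p);

            string line;
            while ((line = sr.ReadLine()) != null)
            {
                p.Value = line;
                cmd.ExecuteNonQuery();
            }
            tx.Commit();
        }
    }
}

Adding an index on the email columns (CREATE INDEX ... ON file2(email)) before running the final query makes the NOT IN lookup much cheaper on files of this size.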

xakpc

I would suggest reading the file line by line, processing the data in each line, and writing it to another file stream. That way you do not need to have the complete data in memory.

If you do need data from past lines to process the current line, or if you need to go through all the lines to extract some information, then I would suggest saving each line into a database, processing/updating the rows in the database, and finally retrieving them again to prepare the CSV file.
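
A minimal sketch of that streaming approach, assuming the input/output paths are placeholders and the transformation step is whatever turns one input line into one CSV record:

using System.IO;

class StreamingCopy
{
    static void Main()
    {
        using (var reader = new StreamReader(@"C:\test\input.txt"))
        using (var writer = new StreamWriter(@"C:\test\output.csv"))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // placeholder: transform the line into a CSV record here
                writer.WriteLine(line);
            }
        }
    }
}

Memory usage stays roughly constant no matter how large the input file is, because only one line is held at a time.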

Morbia
  • Is saving a 1 GB file in a database a convenient way to do this? After saving it I have to compare it with another file, which would be read line by line with a StreamReader – Raman Singh Nov 28 '14 at 14:28
  • If you have some unique identifier for each row, comparing will be much faster in the DB. – Morbia Nov 28 '14 at 14:32
  • Is it faster to save each line in the DB than to keep it in some data type in memory? – Raman Singh Nov 28 '14 at 14:38
  • It won't be faster to save in the DB than to have it in memory, but you will not have any out-of-memory issues, even in the future when those files grow. – Morbia Nov 28 '14 at 14:55
  • OK, I got your point. Can an MS Access database handle 2 GB of data? – Raman Singh Nov 28 '14 at 14:57
  • If you can have a unique identifier for each email, then you can store only that identifier from one file in a List in memory and then check that List for duplicates while reading the other file. That way you are not storing the full data in memory. – Morbia Nov 28 '14 at 14:58
  • That's what I am trying, but because the file is huge (745 MB), the List consumes more than 1.5 GB of RAM and keeps rising, so I ended the process – Raman Singh Nov 28 '14 at 15:13
  • Even if the List of just the identifiers goes to 1.5 GB, I would go for a DB and use SQL Server with SqlBulkCopy. – Morbia Nov 28 '14 at 15:18
  • OK, thanks, I will try that... and thanks for spending time on this conversation – Raman Singh Nov 28 '14 at 15:21

On a 64 bit system with enough RAM, this should be fine:

List<string> dogs = new List<string>();
string line;
using (StreamReader str = new StreamReader(file))
{
    while ((line = str.ReadLine()) != null)
    {
        dogs.Add(line);
    }
}
Henrik

Here is a brute-force version. The downside is that you iterate over all of file2's lines for every line in file1, but you would be doing that in memory too. The best solution is to import the files into an RDBMS where you can use indexes.

Is this a one-off exercise? What about using a file diff tool such as WinDiff or Beyond Compare?

Or how about this: .bat file to compare two text files and output the difference

using System.IO;

class Program
{
    static void Main(string[] args)
    {
        string line1;
        string line2;

        using (var fileout = new StreamWriter(@"C:\test\matched.txt"))
        {
            using (var file1 = new StreamReader(@"C:\test\file1.txt"))
            {
                while ((line1 = file1.ReadLine()) != null)
                {
                    // re-open file2 and scan it from the start for every line of file1
                    using (var file2 = new StreamReader(@"C:\test\file2.txt"))
                    {
                        while ((line2 = file2.ReadLine()) != null)
                        {
                            if (line1 == line2)
                            {
                                // the line exists in both files
                                fileout.WriteLine(line1);
                            }
                        }
                    }
                }
            }
        }
    }
}
hollystyles
  • OK, thanks for the suggestion. But because the file size is so large, what I am doing now is sorting the second file while reading all of its data, which means fewer comparisons when the first file's data is read and compared against it. That's why I can't read the second file again and again; that would consume too much time – Raman Singh Nov 28 '14 at 16:09

When you read in the emails from the comparison file, instead of storing the contents of each email, you could compute a hash value for each email and store that instead.

Now when you read emails from the other file, you again compute a hash value for each email and search your list of hashes from the previous pass. If the hash is located, you know that the email was present in the first file.

Since hashes tend to be a lot smaller than the original text (a SHA-1 hash, for example, is 20 bytes), the collection of hashes should easily fit in RAM.

The following example assumes that emails are stored one per line of text.

using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;
using System.Text;

// byte[] comparisons in a List use reference equality, so store each hash
// as a Base64 string in a HashSet, which compares by value and looks up in O(1)
var exclude = new HashSet<string>();

var sha1 = new SHA1CryptoServiceProvider();

// read exclusion emails
using (var sr = new StreamReader("exclude-file")) {
    string email;
    // assume one email per line of text
    while ((email = sr.ReadLine()) != null) {
        exclude.Add(Convert.ToBase64String(sha1.ComputeHash(Encoding.UTF8.GetBytes(email))));
    }
}

// read emails
using (var sr = new StreamReader("email-file")) {
    string email;
    // again, assume one email per line of text
    while ((email = sr.ReadLine()) != null) {
        if (exclude.Contains(Convert.ToBase64String(sha1.ComputeHash(Encoding.UTF8.GetBytes(email))))) {
            // exclusion file contains email
        } else {
            // exclusion file does not contain email
        }
    }
}
Ferruccio