I want to create a program using .NET to read or search data in a 20 GB CSV file.

Is there any way to do it?

My Code For Search

```csharp
string search = txtBoxSearch.Text;
string pathOnly = Path.GetDirectoryName(csvPath);
string fileName = Path.GetFileName(csvPath);

string sql = @"SELECT F1 AS StringID, F2 AS StringContent FROM [" + fileName + "] WHERE F2 LIKE '%" + search + "%'";

using (OleDbConnection connection = new OleDbConnection(
    @"Provider=Microsoft.ACE.OLEDB.12.0;Data Source=" + pathOnly +
    ";Extended Properties=\"Text;HDR=No\""))
using (OleDbCommand command = new OleDbCommand(sql, connection))
using (OleDbDataAdapter adapter = new OleDbDataAdapter(command))
{
    DataTable dataTable = new DataTable();
    adapter.Fill(dataTable);
    dataTable.Columns.Add("MatchTimes", typeof(System.Int32));

    foreach (DataRow row in dataTable.Rows)
    {
        row["MatchTimes"] = Regex.Matches(row["StringContent"].ToString(), search).Count;
    }

    GridViewResult.DataSource = dataTable;
    GridViewResult.DataBind();
}
```

My Code for Generating the CSV File

```csharp
int records = 100000;

File.AppendAllLines(csvPath,
    (from r in Enumerable.Range(0, records)
     let guid = Guid.NewGuid()
     let stringContent = GenerateRandomString(256000)
     select $"{guid},{stringContent}"));
```

  • _Is there any way to do it?_ It seems to me that you have done it. What is your problem with the code shown? – Steve Nov 17 '20 at 14:54
  • Uh, your CSV has only 2 columns and doesn't have commas `,` in the data? So a simple string operation should be enough instead of a full `OleDbConnection`. And you should add only the lines that match to your grid. You can even use the String.Split [overload](https://learn.microsoft.com/en-us/dotnet/api/system.string.split?view=net-5.0#System_String_Split_System_Char___System_Int32_) where you specify the expected number of results, so you can ignore everything after the needed column – xdtTransform Nov 17 '20 at 15:03
  • @Steve I got an error: System.Data.OleDb.OleDbException: 'The query cannot be completed. Either the size of the query result is larger than the maximum size of a database (2 GB), or there is not enough temporary storage space on the disk to store the query result.' Because the CSV file is too huge – Player 2nd Nov 17 '20 at 15:06
  • Yeah, never gonna happen... lol. Well, maybe, but not really. How many rows is 20 GB? You would need to import this into a DB, or into something better at doing index-based searches. Otherwise you would have to read it as a stream... somehow... 20 GB is a lot; that's millions of rows, 200 million+ – Seabizkit Nov 17 '20 at 15:09
  • Yes! Because why OleDb for a simple CSV? I mean, a simple `var lines = File.ReadLines(fileName); foreach (var line in lines) { /* split on comma, take the 2nd element, index [1] */ /* regex match */ /* add to the source */ }` – xdtTransform Nov 17 '20 at 15:09
  • `var lines = File.ReadLines(fileName);` @xdtTransform you will run out of memory before that completes, well, unless you have like 64 GB... even so, I wonder how long it would take to read 20 GB into memory... – Seabizkit Nov 17 '20 at 15:10
  • @Seabizkit, pretty sure it implements a StreamReader and gives back an IEnumerable, with a fixed buffer of 1024. It's the [File.ReadLines method](https://learn.microsoft.com/en-us/dotnet/api/system.io.file.readlines?view=net-5.0), not Read**All**Lines – xdtTransform Nov 17 '20 at 15:11
  • @xdtTransform apologies you are correct – Seabizkit Nov 17 '20 at 15:14
  • If the number of matches is reasonable, something like this may be enough: `var separators = new[] { ',' }; var matches = File.ReadLines(fileName).Select(line => line.Split(separators, 2)[1]).Where(content => compiledRegex.IsMatch(content)).ToList();`. But if you do more than one search per day, it's time to get a database – xdtTransform Nov 17 '20 at 15:18
  • @Player2nd you need to explain the use case for this data better. Do you need to read it just one time? Do you need to search it for just a single string? The best solution depends on how you are supposed to use the data. You could simply read and split as suggested, but if you need to do anything more complex, you should load everything into a database and use its functions to index that information – Steve Nov 17 '20 at 15:18

1 Answer

This really depends on exactly how you're searching. If you're just doing a single search, you could simply read the file one line at a time and do a string comparison on each line. If you do this, do not load the whole thing into memory; process it one line at a time.
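
As a minimal sketch of that streaming approach, assuming (as the comments point out) exactly two columns with no embedded commas; the `CsvSearch` class and its members are illustrative names, not part of the question's code:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

static class CsvSearch
{
    // Streams the CSV one line at a time and returns (id, match count)
    // for every row whose content contains the search text.
    public static List<(string Id, int MatchTimes)> Search(string csvPath, string search)
    {
        // Escape the search text so it is matched literally,
        // mirroring the LIKE '%...%' in the question's query.
        var regex = new Regex(Regex.Escape(search), RegexOptions.Compiled);
        var results = new List<(string Id, int MatchTimes)>();

        // File.ReadLines is lazy, so memory use stays flat
        // no matter how large the file is.
        foreach (var line in File.ReadLines(csvPath))
        {
            // Split on the first comma only; the rest of the line is content.
            var parts = line.Split(new[] { ',' }, 2);
            if (parts.Length < 2) continue;

            int count = regex.Matches(parts[1]).Count;
            if (count > 0)
                results.Add((parts[0], count));
        }

        return results;
    }
}
```

A list like this can then be bound to the grid in place of the `DataTable`, and only the matching rows ever stay in memory.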

If you have access to the "full" edition of SQL Server, you could do a BULK INSERT. If you don't, though (e.g. you're using one of the Express editions), you might run into the maximum database size. In this case, I've never tried this, but you could try SQLite. In theory at least, a SQLite database can handle multiple terabytes. Be sure to insert a large number of records in each transaction, though; if you commit after each insert, your performance will be absolutely wretched. Also, be sure that you're not creating an in-memory database, or you'll just run out of memory again.
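
For the SQLite route, here is a rough sketch using the Microsoft.Data.Sqlite package; the `Rows` table, the parameter names, and the 10,000-row batch size are all assumptions for illustration, not anything from the question:

```csharp
using System.IO;
using Microsoft.Data.Sqlite; // NuGet package: Microsoft.Data.Sqlite

static class CsvToSqlite
{
    // Streams the CSV into an on-disk SQLite database, committing in
    // large batches so each row doesn't pay for its own transaction.
    public static void Import(string csvPath, string dbPath, int batchSize = 10000)
    {
        // File-based database; ":memory:" would just run out of RAM again.
        using var connection = new SqliteConnection($"Data Source={dbPath}");
        connection.Open();

        using (var create = connection.CreateCommand())
        {
            create.CommandText = "CREATE TABLE IF NOT EXISTS Rows (Id TEXT, Content TEXT)";
            create.ExecuteNonQuery();
        }

        var transaction = connection.BeginTransaction();
        using var insert = connection.CreateCommand();
        insert.Transaction = transaction;
        insert.CommandText = "INSERT INTO Rows (Id, Content) VALUES ($id, $content)";
        var id = insert.Parameters.Add("$id", SqliteType.Text);
        var content = insert.Parameters.Add("$content", SqliteType.Text);

        int n = 0;
        foreach (var line in File.ReadLines(csvPath))
        {
            // Split on the first comma only, matching the file's two-column layout.
            var parts = line.Split(new[] { ',' }, 2);
            if (parts.Length < 2) continue;

            id.Value = parts[0];
            content.Value = parts[1];
            insert.ExecuteNonQuery();

            // Commit every batchSize rows; per-row commits are wretchedly slow.
            if (++n % batchSize == 0)
            {
                transaction.Commit();
                transaction = connection.BeginTransaction();
                insert.Transaction = transaction;
            }
        }

        transaction.Commit();
    }
}
```

Once the data is loaded, the search itself becomes an ordinary parameterized `SELECT ... WHERE Content LIKE ...` query against the file-backed database.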