I'm storing what basically amounts to log data in CSV files. Each line has the format <datetime>,<val1>,<val2>, etc. However, the log files are stored by account ID and month, so a query that spans months or account IDs has to retrieve multiple files.

I'd like to be able to query it with LINQ, so that I could call logFiles.Where(o => o.Date > 1-1-17 && o.Date < 4-1-17). I suppose I'll need something that examines the date range in that query, notices that it spans 4 months, and then examines only the files in that date range.

Is there any way to do this that does not involve getting my hands very dirty with a custom IQueryable LINQ provider? I can go down that rabbit hole if necessary, but I want to make sure it's the right rabbit hole first.

Slothario
  • Sounds like you have enough data that you'd really benefit from having it in a database rather than in flat files. Emulating a database's ability to efficiently search large and complex datasets spread across many different files is not an easy problem. You're better off not trying to solve it yourself when there are good solutions already out there. – Servy Jul 23 '18 at 14:51
  • That's beyond my control -- my boss has decided to go for flat files because there's a large amount of data, very few reads, and we want to cut down the costs of running our database in Azure. – Slothario Jul 23 '18 at 14:55
  • There being a large amount of data is precisely what makes managing this in flat files a problem. If it were small, you could use an easy-to-write but inefficient solution. Since it isn't, you need to be very careful about lots of things to end up with a solution that's actually reasonable. – Servy Jul 23 '18 at 14:59
  • Can I have a little bit more information about how the log files are named and how the directories are laid out? – Drag and Drop Jul 23 '18 at 15:01
  • Getting all files in a directory with a .csv extension that were modified between two dates is not a big deal, even if it's not efficient. But if you start to filter and merge the results from the selected files to build your result, it starts to look like a mess. If you also have to handle edits, stop everything and use a DB. – Drag and Drop Jul 23 '18 at 15:02
  • @DragandDrop No, but finding which records within giant log files have dates that fall within the given date range *is* a big deal. You don't want to be traversing the entirety of a month long log file with tons of data just to find all of the records on the 30th of the month. – Servy Jul 23 '18 at 15:07
  • The information you have given is probably not sufficient. It is not really clear what you would like to do. Is this a matter of filtering the files in the file system or the content of the files? A sample would be nice. There are LINQ to CSV libraries. – Cetin Basoz Jul 23 '18 at 15:10
  • @DragandDrop I'm actually storing the CSVs as blobs, and giving them the filename ///log.txt, splitting them into files that are 1 MB or less. – Slothario Jul 23 '18 at 16:19

1 Answer

If you want to filter both on the log file name and on the log file contents in the same Where expression, I don't see a solution without a custom IQueryable LINQ provider, because that's exactly the use case for them: To access data in a smart way based on the expressions used in the LINQ query.

That said, it might be worth using a multi-step approach as a compromise:

  1. Use LINQ to restrict the log files to be searched,
  2. read the files and
  3. use LINQ for further searching.

Example:

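// Step 1: restrict the set of log files (note the four-digit year; new DateTime(17, 1, 1) would be the year 17 AD)
IEnumerable<LogFile> files = LogFiles.Where(f => f.Date > new DateTime(2017, 1, 1) && f.AccountID == 4711);
// Step 2: read the selected files
IEnumerable<LogData> data = ParseLogFiles(files);
// Step 3: filter the parsed rows
IEnumerable<LogData> filteredData = data.Where(d => d.val1 == 42 && d.val2 > 17);
LogData firstMatch = filteredData.FirstOrDefault();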
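
Step 1 assumes that LogFiles has one entry per stored file. Since the files are split by account ID and month, the months touched by a date range can be listed up front. The following is only a minimal sketch: LogFileCatalog and MonthsInRange are hypothetical names, and the end date is assumed to be exclusive.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of any library.
public static class LogFileCatalog
{
    // Yields the first day of every calendar month that overlaps
    // the range [start, end). The end date is treated as exclusive.
    public static IEnumerable<DateTime> MonthsInRange(DateTime start, DateTime end)
    {
        // Snap to the first day of the month that contains 'start'.
        var month = new DateTime(start.Year, start.Month, 1);
        while (month < end)
        {
            yield return month;
            month = month.AddMonths(1);
        }
    }
}
```

Each (account ID, month) pair then identifies one candidate group of files; how such a pair maps to a stored file name depends on the naming scheme and is left out here.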

If you implement ParseLogFiles (a) with deferred execution and (b) as an extension method on IEnumerable<LogFile>, the resulting code will look-and-feel very similar to pure LINQ:

var filteredData = LogFiles.
    Where(f => f.Date > new DateTime(2017, 1, 1) && f.AccountID == 4711).
    ParseLogFiles().
    Where(d => d.val1 == 42 && d.val2 > 17);

// If ParseLogFiles uses deferred execution, the following line won't read
// more log files than required to get the first matching row:
var firstMatch = filteredData.First();

It's a bit more work than having it all in one single LINQ query, but it saves you from having to implement your own LINQ provider.
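
One way ParseLogFiles could be written to satisfy both conditions is as an iterator method that uses yield return. This is a rough sketch under several assumptions: the LogFile and LogData shapes, the FilePath property and the integer column types are all made up for illustration, and the files are read from the local file system with File.ReadLines. Blobs would need a different lazy line reader.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Linq;

// Hypothetical types; member names follow the snippets above where possible.
public sealed class LogFile
{
    public int AccountID { get; set; }
    public DateTime Date { get; set; }      // assumed: month the file covers
    public string FilePath { get; set; }    // assumed: location of the CSV
}

public sealed class LogData
{
    public DateTime Timestamp { get; set; }
    public int val1 { get; set; }
    public int val2 { get; set; }
}

public static class LogFileExtensions
{
    // Iterator method: no file is opened until the result is enumerated,
    // and reading stops as soon as the consumer stops asking for rows.
    public static IEnumerable<LogData> ParseLogFiles(this IEnumerable<LogFile> files)
    {
        foreach (LogFile file in files)
        {
            // File.ReadLines streams the file line by line,
            // unlike File.ReadAllLines, which loads it completely.
            foreach (string line in File.ReadLines(file.FilePath))
            {
                if (string.IsNullOrWhiteSpace(line)) continue;

                // Assumes plain <datetime>,<val1>,<val2> rows without quoting.
                string[] parts = line.Split(',');
                yield return new LogData
                {
                    Timestamp = DateTime.Parse(parts[0], CultureInfo.InvariantCulture),
                    val1 = int.Parse(parts[1], CultureInfo.InvariantCulture),
                    val2 = int.Parse(parts[2], CultureInfo.InvariantCulture)
                };
            }
        }
    }
}
```

Because both loops are lazy, a call such as First() that finds its match in an early file never opens the later ones.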

Heinzi