
I'm trying to make a script that will search a log file for lines that fall within a certain date/time range.

I tried the solution on this page:

Filter log file entries based on date range

That solution works just fine, but it takes a while to complete. Are there any other methods for performing this search that might yield results faster? I'm not being OCD about the speed in this case; it's just that I am searching through some syslog files that contain several gigabytes of data each, so any time shaved off this search would be fantastic. grep with a regex came to mind, but I'm not sure whether it would make much of a difference.

Here is the log format that is used in the log files:

2014-12-31T23:59:33-05:00 device logdata
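For reference, since the timestamps are ISO 8601 they sort correctly as plain strings, so a full-scan baseline can skip per-line date parsing entirely. A minimal sketch (the file name and range boundaries are made up):

```shell
# Tiny stand-in log (the real files are the multi-GB syslogs):
printf '%s\n' \
  '2014-12-31T21:59:59-05:00 fw1 drop' \
  '2014-12-31T22:30:00-05:00 fw1 accept' \
  '2014-12-31T23:00:01-05:00 fw1 drop' > sample.log

# ISO 8601 timestamps compare lexicographically, so a plain string
# comparison selects the range without any mktime() parsing:
awk '$1 >= "2014-12-31T22:00" && $1 < "2014-12-31T23:00"' sample.log

# A regex anchored to the hour is the grep equivalent:
grep -E '^2014-12-31T22:' sample.log
```

Both commands still read the whole file; they only avoid the cost of parsing every timestamp.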

asked by lacrosse1991

3 Answers


The lines are sorted, so you can use the look command. It should be much faster than awk or grep, because it uses a binary search.
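A quick sketch of how that looks (the log contents and range boundaries below are made up; with a real syslog you would point `look` at the file directly):

```shell
# look(1) binary-searches a *sorted* file for lines beginning with a
# given prefix, so it touches only O(log n) of the file.
# Build a tiny sorted demo log first:
cat > demo.log <<'EOF'
2014-12-31T22:10:05-05:00 fw1 drop tcp
2014-12-31T23:59:33-05:00 fw1 accept udp
2015-01-01T00:00:02-05:00 db1 checkpoint
EOF

# All entries from hour 23 of 2014-12-31 (prefix match):
look '2014-12-31T23:' demo.log

# An arbitrary range: prefix-search the day, then filter the small
# result lexicographically (ISO 8601 timestamps sort as strings):
look '2014-12-31T' demo.log |
    awk '$1 >= "2014-12-31T22:30" && $1 < "2015-01-01"'
```

The same invocations against `/var/log/syslog` work as long as the file really is sorted.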

– bwt
  • Actually, if one is using log-aggregation (collecting from different machines), the timestamps are not necessarily in sorted order. Even with NTP. – Thomas Dickey Jul 21 '15 at 20:39
  • True, but then there is probably little you can do to avoid scanning the whole file, which seems to be a disk-bound task, in which case grep, awk or something else does not really matter. Maybe sort the files once (if they are searched multiple times)? But I don't know if this is feasible with multi-gigabyte files – bwt Jul 22 '15 at 09:02
  • thanks! I was initially under the impression that the log lines were in order, but for some reason they are not (random lines are out of order; I'll have to ask someone about that, as I do not manage the log server). In the end I found that the fastest solution was grepping for the device I want and then extracting the specific time range afterwards (to cut down on the work awk needs to do). I'm calling mktime on each line in awk so that I can avoid issues if a specific timestamp does not exist, but that takes quite some time when I run it against the entire log file. – lacrosse1991 Jul 23 '15 at 02:58

If you are really after a performance-optimized solution, forget tools that process the whole log file. Provided the log files are sorted by time, you do not need to scan the whole file: write a simple script/program that uses the bisection method to find the borders of the time interval and then prints everything in between.
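The idea can be sketched in plain shell (bash). This is only an illustration under several assumptions: each line starts with an ISO 8601 timestamp (so timestamps compare as strings), the file is sorted, and no line is longer than 4 KiB. The function name and the example range are made up.

```shell
# timegrep FILE START END — binary-search a sorted log for a time range.
# START is inclusive, END is exclusive.
timegrep() {
    file=$1 start=$2 end=$3
    lo=0
    hi=$(wc -c < "$file")                      # file size in bytes
    # Narrow [lo, hi) down to a byte offset just before the first match.
    while [ $((hi - lo)) -gt 4096 ]; do
        mid=$(( (lo + hi) / 2 ))
        # Timestamp of the first *complete* line after offset $mid
        # (line 1 of the dd output is usually a partial line).
        ts=$(dd if="$file" bs=1 skip="$mid" count=4096 2>/dev/null |
             sed -n '2{s/ .*//p;q;}')
        if [[ "$ts" < "$start" ]]; then lo=$mid; else hi=$mid; fi
    done
    # Sequential scan of the small remainder, stopping once past END.
    dd if="$file" bs=1 skip="$lo" 2>/dev/null | {
        [ "$lo" -eq 0 ] || read -r _           # drop the partial first line
        awk -v s="$start" -v e="$end" '$1 >= e { exit } $1 >= s { print }'
    }
}

# Example invocation (made-up range):
# timegrep /var/log/syslog 2014-12-31T22:00 2014-12-31T23:00
```

`dd` with `bs=1` keeps the sketch portable but is slow; with GNU dd, a larger block size plus `iflag=skip_bytes` makes the reads much faster.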

– Zaboj Campula