
I have some log files that are written in log4cpp format.

By the nature of log4cpp, each file is sorted by the datetime at the beginning of each line.

Assume the format looks like this:

2012-09-02 17:17:36.891 This is line 1 in file 2   
...
2013-08-05 14:17:35.344 This is line 607082 in file 2
2013-08-05 14:17:36.891 This is line 607083 in file 2
...
2013-09-05 14:27:36.891 This is line 934594 in file 2

Now I am writing a program to parse these files and quickly locate a line.

For example, if I run

./my_program -start_time "2013-08-05 14:17:36" file_2.txt

I expect this program to return 607083 as the result.

Also, -start_time can be given at other granularities, like "2013-08-05 14:17:35.899" or "2013-08-15", but I still expect the nearest result.

I can traverse the file line by line and compare the timestamp at the beginning of each line (just using string comparison), but that takes O(N) time. I already implemented this and found it really slow when there are millions of lines to skip at the beginning.
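A stripped-down sketch of that linear approach (the function name is just illustrative; the fixed-width "YYYY-MM-DD HH:MM:SS.mmm" prefix sorts lexicographically, so a plain string comparison is enough):

    #include <fstream>
    #include <string>

    // Return the 1-based number of the first line whose timestamp prefix is
    // >= start_time, or 0 if no line qualifies.  O(N) in the number of lines.
    long find_line_linear(const std::string &path, const std::string &start_time)
    {
        std::ifstream in(path);
        std::string line;
        long line_no = 0;
        while (std::getline(in, line)) {
            ++line_no;
            // The fixed-width timestamp sorts lexicographically, so comparing
            // the first start_time.size() characters as strings is enough.
            if (line.compare(0, start_time.size(), start_time) >= 0)
                return line_no;
        }
        return 0;   // start_time is past the last line in the file
    }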

I am wondering if I can use binary search for this. I think it is the best way to return the nearest result, and it only takes O(log N) time.

Yifei
  • You should profile your program and find out what the slow part is. I expect (if the file is large) that most of the time is spent reading the file, not comparing string values. That will point out how to optimize (bigger buffer on reads, etc.). – Hogan Oct 31 '13 at 14:48
  • Makes sense to me. Split the file size in half, seek to that position, find the nearest \n, parse the date and time -> choose which half to keep digging into. If the date and time keep increasing from line to line, you will eventually find what you are searching for. – Anton Oct 31 '13 at 14:49
  • @Anton The lines aren't sorted by time. – this Oct 31 '13 at 14:49
  • Also, why are you making a custom program -- just use awk. – Hogan Oct 31 '13 at 14:49
  • @self, why not? In a log file entries are likely to be sorted naturally. (If nobody played tricks with time settings while writing logs) – Anton Oct 31 '13 at 14:52
  • @Anton Look at the example in the question, not sorted. – this Oct 31 '13 at 14:53
  • @self, first line is from year 2012, others are from 2013 – Anton Oct 31 '13 at 14:54
  • @Hogan Yes, you are right. But I think I still need binary search for this. For example, say I have a file that contains millions of lines but the line I want is at the very end; I don't want to spend several minutes traversing the whole file. It would be nice to have a way to get to the location in the blink of an eye. – Yifei Oct 31 '13 at 14:54
  • You can use awk and tail – Hogan Oct 31 '13 at 14:56
  • @Anton My mistake, I didn't notice that. – this Oct 31 '13 at 14:57
  • @self Sorry, I didn't mention it in the question. This file is generated by log4cpp and it's sorted by time. I've edited the question. – Yifei Oct 31 '13 at 14:57
  • @Hogan This is a module of a bigger app, not just a command-line utility. Also, if I run awk looking for "2013-08-05 14:17:36" in a file that has no line starting with "2013-08-05 14:17:36" but a bunch of lines starting with "2013-08-05 14:17:37", awk won't locate the nearest line in that case. – Yifei Oct 31 '13 at 15:06

2 Answers


Yes, you can. It's a log ordered by date. Why not take the first and last lines, which should give you the earliest and the most recent dates?

You can write a function that converts a date into seconds. Go to the middle of your log, check whether your date is bigger or smaller, and keep halving from there (binary search).
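For example, here is a rough sketch of that bisection over byte offsets with a std::ifstream (function names are illustrative; it keeps the plain string comparison from the question instead of converting dates to seconds, since the fixed-width timestamp prefix sorts lexicographically):

    #include <fstream>
    #include <ios>
    #include <string>

    // Read the first complete line that starts strictly after byte offset `pos`,
    // and report where that line begins.  Returns false when no such line exists.
    static bool next_line_after(std::ifstream &in, std::streamoff pos,
                                std::string &line, std::streamoff &line_start)
    {
        in.clear();                      // reset a possible earlier EOF state
        in.seekg(pos, std::ios::beg);
        if (!std::getline(in, line))     // discard the (partial) line containing pos
            return false;
        line_start = in.tellg();         // the next complete line starts here
        return static_cast<bool>(std::getline(in, line));
    }

    // Return the byte offset of the first line whose timestamp prefix is
    // >= start_time, or -1 if no line qualifies.
    std::streamoff find_start_offset(const std::string &path, const std::string &start_time)
    {
        std::ifstream in(path, std::ios::binary);
        if (!in)
            return -1;

        // Special case: the very first line already qualifies.
        std::string line;
        if (!std::getline(in, line))
            return -1;
        if (line.compare(0, start_time.size(), start_time) >= 0)
            return 0;

        in.clear();
        in.seekg(0, std::ios::end);
        std::streamoff lo = 0, hi = in.tellg();
        std::streamoff start = -1;

        // Bisection over byte offsets.  Predicate: "the first complete line after
        // offset mid is >= start_time" (treated as true past the last line).
        // Because the file is sorted this is monotone, so O(log file_size) probes suffice.
        while (lo < hi) {
            std::streamoff mid = lo + (hi - lo) / 2;
            if (!next_line_after(in, mid, line, start) ||
                line.compare(0, start_time.size(), start_time) >= 0)
                hi = mid;        // qualifying region starts at or before mid
            else
                lo = mid + 1;    // the line after mid is still too early
        }

        // lo now sits inside the line just *before* the match; one more probe
        // skips that line and lands on the matching one.
        if (!next_line_after(in, lo, line, start))
            return -1;           // nothing in the file is >= start_time
        return start;
    }

Note that this returns the byte offset of the first qualifying line rather than its ordinal number; recovering 607083 itself would still mean counting newlines up to that offset.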

Hope this helps, and I hope my explanation of how this works is clear.

artud2000
  • Yeah, that's what I am thinking. But in C++ I find it difficult to manipulate the position with seekg and tellg. Does it support the + operator? – Yifei Oct 31 '13 at 15:01
  • Not sure if it supports the + operator, but you can move backward and forward from the current location using seekg. It will certainly be tricky to manage the location that way. Let me think about how you can manage the positions in the file and I'll get back to you. – artud2000 Oct 31 '13 at 15:15
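Following up on the comments above, a small illustration of moving the get position around with seekg/tellg: std::streampos does accept + with a std::streamoff, and seekg can also seek relative to the current position (the file name is taken from the question, the offsets are arbitrary):

    #include <fstream>
    #include <ios>
    #include <iostream>

    int main()
    {
        std::ifstream in("file_2.txt", std::ios::binary);  // file name from the question

        in.seekg(0, std::ios::end);            // absolute seek to the end
        std::streamoff size = in.tellg();      // file size in bytes

        in.seekg(size / 2, std::ios::beg);     // absolute seek to the middle
        in.seekg(10, std::ios::cur);           // relative seek: 10 bytes forward

        std::streampos here = in.tellg();
        in.seekg(here + std::streamoff(100));  // streampos + streamoff is allowed

        std::cout << "now at byte " << static_cast<std::streamoff>(in.tellg()) << '\n';
    }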

If you are running under Unix/POSIX, you can mmap() the whole file and operate on memory (avoiding lseek() and friends).

So you get a 'char *logbuffer = mmap(...)' pointer and can perform your binary search there.
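A minimal POSIX sketch of that idea (names are illustrative and error handling is trimmed); the same bisection over byte offsets as in the other answer then runs directly on the returned buffer, with memchr()/memcmp() instead of stream seeks:

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <cstring>
    #include <string>

    // Map the whole log read-only; returns nullptr on failure.
    const char *map_log(const char *path, size_t &size)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return nullptr;
        struct stat st;
        if (fstat(fd, &st) < 0) { close(fd); return nullptr; }
        size = static_cast<size_t>(st.st_size);
        void *p = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);                    // the mapping stays valid after close()
        return p == MAP_FAILED ? nullptr : static_cast<const char *>(p);
    }

    // One probe of the bisection on the mapped buffer: does the first complete
    // line starting strictly after byte offset `mid` have a timestamp prefix
    // >= start_time?  No seeks, just pointer arithmetic on logbuffer.
    bool probe(const char *logbuffer, size_t size, size_t mid, const std::string &start_time)
    {
        const void *nl = std::memchr(logbuffer + mid, '\n', size - mid);
        if (!nl)
            return true;              // mid is inside the last line: past the end
        const char *line = static_cast<const char *>(nl) + 1;
        size_t left = size - static_cast<size_t>(line - logbuffer);
        if (left < start_time.size())
            return true;              // only a truncated tail remains
        return std::memcmp(line, start_time.c_str(), start_time.size()) >= 0;
    }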

ensc
  • Thanks for your answer! But this is a sub-problem of a bigger app; I have to do this operation on a bunch of big files at the same time, so I want to do it on the fly, not load them all into memory :-) – Yifei Oct 31 '13 at 15:28
  • mmap() is cheap; you are allocating only some internal management objects. So reading into e.g. a `char line[8192]` probably consumes more memory than mmap()'ing a terabyte-sized file. – ensc Oct 31 '13 at 15:39