
My file consists of logs. Every line is a log entry whose first column is a timestamp, and all lines in the file are sorted by that timestamp. I have to find where a given timestamp occurs in the file, which can be around 10 GB in size. I could check sequentially, line by line. Is there a more efficient way to find it?

Edit: I'm thinking of applying binary search. But how should I go about applying binary search to a file? Can I use the RandomAccessFile class and seek with a file pointer? If so, how can I find the start of the line my pointer lands in, so that I can read the timestamp of that log entry? Thanks.

Sample log in the file: 2020-01-31T20:12:38.1234Z,field1,field2,etc,.....\n

1 Answer


Option 1 (fastest):

If possible, create another file that acts as an index for the log file while the log is being generated. For each line, it could record the byte offset at which the line starts and the offset at which it ends (or, equivalently, the line's length in bytes). You could even split this up into multiple index files.

// 1 is line id, 0 is byte start index, 12 is end index 
1 0 12 
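
As a rough illustration of such an index, here is a minimal sketch that scans a file once and produces one `id start end` entry per line. The class and method names (`LineIndexBuilder`, `buildIndex`) are made up for this example, and it assumes `\n` line endings and an exclusive end offset that does not count the newline. That is one possible reading of the format above, not a fixed convention.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

final class LineIndexBuilder {

    /**
     * Returns one "id start end" entry per line.
     * Assumption: end is exclusive and does not include the trailing '\n'.
     */
    static List<String> buildIndex(Path logFile) throws IOException {
        List<String> entries = new ArrayList<>();
        try (InputStream in = new BufferedInputStream(Files.newInputStream(logFile))) {
            long offset = 0;     // number of bytes consumed so far
            long lineStart = 0;  // offset of the current line's first byte
            long id = 1;         // 1-based line id
            int b;
            while ((b = in.read()) != -1) {
                offset++;
                if (b == '\n') {
                    entries.add(id + " " + lineStart + " " + (offset - 1));
                    id++;
                    lineStart = offset;
                }
            }
            if (offset > lineStart) {
                // Last line has no trailing newline.
                entries.add(id + " " + lineStart + " " + offset);
            }
        }
        return entries;
    }
}
```

For a file of around 10 GB you would probably stream the entries to the index file rather than collect them in a list, or store the timestamp alongside each offset so the index itself can be searched.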

Option 2:

A good solution would be a binary search. Because the lines are sorted by timestamp, it needs only about log2(n) line reads instead of n, so it should be significantly faster than a linear search. The idea is to compare the timestamp you're seeking with the middle element (line): if it is earlier, continue with the left half of the file's bytes; if it is later, continue with the right half; if it is equal, you have found the line.
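
A minimal sketch of this with `RandomAccessFile`, searching over byte offsets rather than line numbers. The names (`LogBinarySearch`, `firstLineAtOrAfter`, `lineStart`, `timestampOf`) are hypothetical. It assumes `\n`-terminated lines, that the timestamp is everything before the first comma, and that all timestamps have the same fixed-width UTC format as the sample, so that plain string comparison matches chronological order. To find the line a random offset lands in, it walks backwards to the previous `\n`.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

final class LogBinarySearch {

    /**
     * Returns the byte offset of the first line whose timestamp is >= target,
     * or the file length if every line is earlier than target.
     */
    static long firstLineAtOrAfter(RandomAccessFile file, String target) throws IOException {
        long lo = 0;             // always the start of a line
        long hi = file.length(); // every line starting at or after hi is >= target
        while (lo < hi) {
            long mid = lo + (hi - lo) / 2;
            long start = lineStart(file, mid, lo);
            file.seek(start);
            String line = file.readLine();     // consumes the line and its terminator
            long next = file.getFilePointer(); // start of the following line
            if (timestampOf(line).compareTo(target) < 0) {
                lo = next;   // this line and everything before it is too early
            } else {
                hi = start;  // this line qualifies; keep looking to its left
            }
        }
        return lo;
    }

    /** Walks backwards from pos to the start of the line containing it, never past floor. */
    static long lineStart(RandomAccessFile file, long pos, long floor) throws IOException {
        while (pos > floor) {
            file.seek(pos - 1);
            if (file.read() == '\n') {
                break; // the previous byte ends a line, so pos starts one
            }
            pos--;
        }
        return pos;
    }

    /** Assumption: the timestamp is the text before the first comma. */
    static String timestampOf(String line) {
        int comma = line.indexOf(',');
        return comma < 0 ? line : line.substring(0, comma);
    }
}
```

To use it, open the file with `new RandomAccessFile(path, "r")`, call `firstLineAtOrAfter`, then `seek` to the returned offset and `readLine()` to check whether that line's timestamp equals the target. Only a handful of lines are ever read, so the 10 GB file is never loaded into memory.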

Jason
  • I definitely second that. But how can I apply binary search with line numbers? How can I access a specific line of a file by its line number? – surya phani teja Jul 15 '20 at 13:21
  • How to do this without loading 10gb into memory? I don't think OP has 15-20 GB RAM (10gb file+object headers+...) – JCWasmx86 Jul 15 '20 at 13:22
  • You could easily load chunks of the file at a time instead of the entire file @JCWasmx86. – Jason Jul 15 '20 at 13:23
  • Lines are most likely terminated by the byte value `10`, which is `\n`, also known as newline. You can take the middle as roughly file size / 2, seek until a `\n` is found, and then read and parse the line that starts there. – Jason Jul 15 '20 at 13:25
  • @suryaphaniteja Updated to provide another option. – Jason Jul 15 '20 at 13:27
  • @Jason Thanks for your input. Unfortunately, I need to process files that have already been generated. I don't have control over them while they are being generated. – surya phani teja Jul 15 '20 at 13:30
  • If that is the case, likely a binary search with concurrent threads would be the best answer, to my knowledge. – Jason Jul 15 '20 at 13:31
  • You can assume that most lines are of similar length, right? So you can just seek to the mid (in bytes) and forward to the next newline. – Felix Jul 15 '20 at 13:35
  • Yup, that's what I'm suggesting. – Jason Jul 15 '20 at 13:43
  • Thanks codeflush.dev and @jason. If anyone could provide a code snippet showing how to calculate the offset and seek until the \n, it would be really helpful for me. I'm new to working with files. – surya phani teja Jul 15 '20 at 13:49
  • You're better off doing the research yourself or asking another question with a small snippet of the file format. – Jason Jul 15 '20 at 13:53