I'm trying to read through a huge text file, approximately 10 gigabytes, and I want to find the last occurrence of a string.

For example, below is a sample of 5 lines; the 2nd and 5th are the same string.
I want to take the last one, as it is the latest, and output it to a text file using StreamReader.

Am I better off using Regex, or using LastIndexOf, to determine whether it is the last occurrence?

I have a lot of these searches to do, so I was planning to create some kind of array of search strings and search from the bottom up to improve performance.

Can someone point me in the right direction?

GET/a/users/115656WindowsNT6.1;Trident
GET/a/users/126692MSIE7.0;WindowsNT6.1
GET/a/users/77562WindowsNT6.1;WOW64;Tr
GET/a/users/35650WindowsNT6.1;WOW64;Tr
GET/a/users/126692MSIE7.0;WindowsNT6.2
moffeltje
vbvirg20
  • 2
    You say you have 10 gigs of these data. I'd use an sqlite-based solution, reading in the data into a DB, and then getting just distinct values. I would not be using any regex here, *unless* you also want to get specific patterns from those strings. Since the lines are identical, it makes no sense getting just the last occurrence (last = first). – Wiktor Stribiżew May 27 '15 at 14:35
  • See http://bytes.com/topic/net/answers/109316-reading-large-text-file-line-line-backwards for a possible solution. – Tony Hinkle May 27 '15 at 14:36
  • 1
    Really you're looking at reading the file backwards and checking lines as you go. Reading backwards is difficult though. See this question: http://stackoverflow.com/questions/452902/how-to-read-a-text-file-reversely-with-iterator-in-c-sharp – Jon Egerton May 27 '15 at 14:36
  • Yes, you can use a datastore-based solution. It will be faster and more efficient than using regex. – karthik manchala May 27 '15 at 14:42
  • Try a search for "file random access." If your records are of fixed record length it should be pretty easy to get the last N records and copy those to a string. – rheitzman May 27 '15 at 16:54
  • How do you identify the substring you want to find? In your example, lines 2 and 5 differ in the last character. – 1010 May 28 '15 at 02:36

1 Answer

I believe File.ReadLines is one of the best ways to read large files. According to MSDN:

The ReadLines and ReadAllLines methods differ as follows: When you use ReadLines, you can start enumerating the collection of strings before the whole collection is returned; when you use ReadAllLines, you must wait for the whole array of strings be returned before you can access the array. Therefore, when you are working with very large files, ReadLines can be more efficient.

Based on this, I wrote the following code; I hope it helps:

' Where and ToList come from System.Linq. ReadLines already yields strings, so no OfType call is needed.
Dim myList As List(Of String) = IO.File.ReadLines("MyLargFile.txt").Where(Function(s) s.Contains("126692MSIE7")).ToList()

This returns a list of the lines that contain the search string.

Output :

myList(0) = "GET/a/users/126692MSIE7.0;WindowsNT6.1" 
myList(1) = "GET/a/users/126692MSIE7.0;WindowsNT6.2"

And of course, if you only need the last matching line, you can use LastOrDefault, which returns Nothing instead of throwing when there are no matches:

Dim last As String = myList.LastOrDefault()
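
Since the question mentions having many searches to run, building a full list per search string may be more work than needed. Below is a minimal sketch, not a definitive implementation, of one way to handle it: read the file once and, for each search term, remember only the most recent line that contained it. The module name LastMatchSketch and the function name FindLastMatches are hypothetical names chosen for illustration; the sketch assumes plain substring matching is enough (no Regex) and that String.Contains, which is ordinal and case-sensitive, is the comparison you want.

```vbnet
Imports System.Collections.Generic
Imports System.IO

Module LastMatchSketch
    ' Hypothetical helper: returns, for each search term, the last line in the
    ' file that contains it. Terms that never match are absent from the result.
    Function FindLastMatches(path As String, terms As IEnumerable(Of String)) As Dictionary(Of String, String)
        Dim lastSeen As New Dictionary(Of String, String)()

        ' File.ReadLines enumerates lazily, so the whole file is never held in memory.
        For Each line As String In File.ReadLines(path)
            For Each term As String In terms
                If line.Contains(term) Then
                    ' Overwrite any earlier match: the line nearest the end of the file wins.
                    lastSeen(term) = line
                End If
            Next
        Next

        Return lastSeen
    End Function
End Module
```

To write the results out, the values of the returned dictionary could be passed to File.WriteAllLines, for example File.WriteAllLines("LastMatches.txt", FindLastMatches("MyLargFile.txt", {"126692MSIE7"}).Values), where "LastMatches.txt" is an assumed output file name. This still scans all 10 GB from the top; reading the file backwards, as suggested in the comments, could stop earlier but is harder to get right.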
Top Systems