I need to search a pretty large text file for a particular string. It's a build log with about 5000 lines of text. What's the best way to go about doing that? Using regex shouldn't cause any problems, should it? I'll go ahead and read blocks of lines and use a simple find.
-
5000 lines? That is not 'pretty large' :-) – eumiro Oct 08 '10 at 20:29
-
Blocks of lines? It sounds like your optimization is costing more than it's saving (for only a 5000 line file...). You're not concatenating strings in a loop, are you? :) – JoshD Oct 08 '10 at 20:34
-
What is 'pretty large'? @eumiro – OuuGiii Jul 25 '17 at 09:33
-
@OuuGiii a file that is larger than your RAM, so you cannot read it at once. – eumiro Jul 25 '17 at 09:41
9 Answers
If it is "pretty large" file, then access the lines sequentially and don't read the whole file into memory:
with open('largeFile', 'r') as inF:
for line in inF:
if 'myString' in line:
# do_something

You could do a simple find:
f = open('file.txt', 'r')
lines = f.read()
answer = lines.find('string')
A simple find will be quite a bit quicker than regex if you can get away with it.
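For comparison, a minimal sketch of the same lookup done with the `re` module (reusing the `lines` variable from above; `re.escape` is used because the pattern is assumed to be a plain literal, not a real regex):

import re

match = re.search(re.escape('string'), lines)
# match.start() is the index of the first hit; use -1 when there is no match
answer = match.start() if match else -1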

-
I just tried this code now, and I'm printing answer to find out what it is. Why is it that when the string is not found, answer is equal to -1, but when it is found, the answer can be lots of different numbers? – Mark O'Sullivan Aug 28 '14 at 14:45
-
@MarkO'Sullivan the `find` command returns the index of the first match. -1 means no match; other values are the start index – Chen A. Nov 14 '17 at 09:44
-
This code is inefficient. The `f.read()` will load the whole file into memory, which is useless and slow when dealing with very large files. Better to iterate on a per-line basis instead (using generators or a simple for loop) – Chen A. Nov 14 '17 at 09:45
-
@Vinny Iterating on per-line basis doesn’t work if the string you’re looking for spans multiple lines. This answer might not be efficient in memory, but it’s probably the best one if your file is not that big (and 5000 lines is *not* a big file :) ). – bfontaine May 10 '18 at 15:55
-
@bfontaine Regarding the "best" answer, I don't think this is the best one for the question asked. If this is supposedly for large files (as the title implies, and the reason web engine searches will arrive; we are not merely restricted to the 5000 line specification by the OP, but should seek to optimize ourselves as a resource), then [laurasia's answer below](https://stackoverflow.com/a/4937035/8117067) is ideal, being both fast and memory efficient. After all, [going into swap](https://serverfault.com/q/48486) will slow your program to a screeching halt. – Graham Jul 21 '19 at 12:11
The following function works for text files and binary files (it only returns the position as a byte count, though). It has the benefit of finding strings even when they overlap a line or buffer boundary and would therefore not be found when searching line- or buffer-wise.
import os

def fnd(fname, s, start=0):
    with open(fname, 'rb') as f:
        fsize = os.path.getsize(fname)
        bsize = 4096
        buffer = None
        if start > 0:
            f.seek(start)
        overlap = len(s) - 1
        while True:
            # step back a little so a match spanning two reads is still found
            if (f.tell() >= overlap and f.tell() < fsize):
                f.seek(f.tell() - overlap)
            buffer = f.read(bsize)
            if buffer:
                pos = buffer.find(s)
                if pos >= 0:
                    return f.tell() - (len(buffer) - pos)
            else:
                return -1
The idea behind this is:
- seek to a start position in file
- read from the file into a buffer (the search string has to be smaller than the buffer size), but if not at the beginning, step back len(s) - 1 bytes to catch the string if it started at the end of the last read buffer and continued into the next one
- return position or -1 if not found
I used something like this to find signatures of files inside larger ISO9660 files; it was quite fast and did not use much memory. You can also use a larger buffer to speed things up.
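A minimal usage sketch (the file name and signature bytes are placeholders; note that the needle has to be a bytes object because the file is opened in binary mode):

pos = fnd('image.iso', b'CD001')
if pos >= 0:
    print('signature found at byte offset', pos)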
-
What is "s" supposed to represent? Oh, perhaps it's the string you're trying to find? Yes, I see it now. – harperville Feb 21 '14 at 20:06
-
I created [an answer inspired by this one](https://stackoverflow.com/a/57133315/8117067). – Graham Jul 21 '19 at 12:44
This is a multiprocessing example of file text searching. TODO: how to stop the processes once the text has been found, and how to report the line number reliably?
import multiprocessing, os, time

NUMBER_OF_PROCESSES = multiprocessing.cpu_count()

def FindText(host, file_name, text):
    file_size = os.stat(file_name).st_size
    m1 = open(file_name, "r")

    # work out the file size to divide up the line counting between workers
    chunk = (file_size // NUMBER_OF_PROCESSES) + 1
    lines = 0
    line_found_at = -1

    seekStart = chunk * host
    seekEnd = chunk * (host + 1)
    if seekEnd > file_size:
        seekEnd = file_size

    if host > 0:
        # skip the (probably partial) first line of this chunk
        m1.seek(seekStart)
        m1.readline()

    line = m1.readline()
    while len(line) > 0:
        lines += 1
        if text in line:
            # found the line
            line_found_at = lines
            break
        if m1.tell() > seekEnd or len(line) == 0:
            break
        line = m1.readline()
    m1.close()
    return host, lines, line_found_at

# Function run by worker processes
def worker(input, output):
    for host, file_name, text in iter(input.get, 'STOP'):
        output.put(FindText(host, file_name, text))

def main(file_name, text):
    t_start = time.time()
    # Create queues
    task_queue = multiprocessing.Queue()
    done_queue = multiprocessing.Queue()

    # submit the file to open and the text to find
    print('Starting', NUMBER_OF_PROCESSES, 'searching workers')
    for h in range(NUMBER_OF_PROCESSES):
        task_queue.put((h, file_name, text))

    # Start worker processes
    for _i in range(NUMBER_OF_PROCESSES):
        multiprocessing.Process(target=worker, args=(task_queue, done_queue)).start()

    # Get and print results
    results = {}
    for _i in range(NUMBER_OF_PROCESSES):
        host, lines, line_found = done_queue.get()
        results[host] = (lines, line_found)

    # Tell child processes to stop
    for _i in range(NUMBER_OF_PROCESSES):
        task_queue.put('STOP')
        # print("Stopping Process #%s" % _i)

    total_lines = 0
    for h in range(NUMBER_OF_PROCESSES):
        if results[h][1] > -1:
            print(text, 'found at line', total_lines + results[h][1], 'in', time.time() - t_start, 'seconds')
            break
        total_lines += results[h][0]

if __name__ == "__main__":
    main(file_name='testFile.txt', text='IPI1520')

I'm surprised no one mentioned mapping the file into memory: mmap
With this you can access the file as if it were already loaded into memory, and the OS will take care of paging it in and out as needed. Also, if you do this from two independent processes and they map the file "shared", they will share the underlying memory.
Once mapped, it will behave like a bytearray. You can use regular expressions, find, or any of the other common methods.
Beware that this approach is a little OS-specific and will not be automatically portable.
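A minimal sketch of this approach (the file name and search string are placeholders; the file is mapped read-only and searched as bytes):

import mmap

with open('largeFile', 'rb') as f:
    # length 0 maps the whole file; the OS pages it in lazily
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        pos = mm.find(b'myString')  # byte offset of the first match, or -1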

If there is no way to tell where the string will be (first half, second half, etc.), then there is really no optimized way to do the search other than the built-in "find" function. You could reduce the I/O time and memory consumption by not reading the file in one shot, but in 4 KB blocks (which is usually the size of a hard-disk block), as sketched below. This will not make the search faster, unless the string is in the first part of the file, but in any case it will reduce memory consumption, which might be a good idea if the file is huge.
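A minimal sketch of that idea (the function name and parameters are made up for illustration; it searches bytes, so the needle should be a bytes object, and it keeps a short overlap so a match spanning two blocks is not missed):

def contains(fname, needle, bsize=4096):
    tail = b''
    with open(fname, 'rb') as f:
        while True:
            block = f.read(bsize)
            if not block:
                return False
            if needle in tail + block:
                return True
            # keep the last len(needle) - 1 bytes to catch boundary-spanning matches
            tail = block[-(len(needle) - 1):] if len(needle) > 1 else b''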

-
Depends on how huge. If it's around 1MB, I'd expect this way to be slower than loading the whole thing because of the latency for each read of all 256 blocks. If anything, I'd prefer a larger size of chunk to read each time. Perhaps a test... – JoshD Oct 08 '10 at 20:09
-
The latency may indeed be more, but not necessarily; the important thing is to read a multiple of the physical block size so as not to waste read data. Indeed I'd not call a 1 MB text file "huge"; I was thinking of something along the lines of a few hundred megabytes. I agree with you 100% that if the file is less than, say, 10 or even 50 MB it is not worth reading it in chunks. – Bitgamma Oct 08 '10 at 20:18
I like Javier's solution. I did not try it, but it sounds cool!
For reading through an arbitrarily large text when you want to know whether a string exists, or to replace it, you can use Flashtext, which is faster than regex with very large files.
Edit:
From the developer page:
>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> # keyword_processor.add_keyword(<unclean name>, <standardised name>)
>>> keyword_processor.add_keyword('Big Apple', 'New York')
>>> keyword_processor.add_keyword('Bay Area')
>>> keywords_found = keyword_processor.extract_keywords('I love Big Apple and Bay Area.')
>>> keywords_found
>>> # ['New York', 'Bay Area']
Or when extracting the offset:
>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> keyword_processor.add_keyword('Big Apple', 'New York')
>>> keyword_processor.add_keyword('Bay Area')
>>> keywords_found = keyword_processor.extract_keywords('I love big Apple and Bay Area.', span_info=True)
>>> keywords_found
>>> # [('New York', 7, 16), ('Bay Area', 21, 29)]
Limitation:
I want to point out that this solution is NOT the optimal solution for the given question. For the given question, `in` from eumiro's solution (under the caveat given by @bfontaine in the respective comment) is definitely the best answer.
`flashtext` is a powerful solution if you want to find all (!) occurrences of a string in a given text, which is something `in` cannot do (and is not made to do).

-
Try to provide a minimalistic example in your answers too so more people may get help from your answer. – Zeeshan Adil Sep 12 '18 at 05:39
This is entirely inspired by laurasia's answer above, but it refines the structure.
It also adds some checks:
- It will correctly return `0` when searching an empty file for the empty string. In laurasia's answer, this is an edge case that will return `-1`.
- It also pre-checks whether the goal string is larger than the buffer size, and raises an error if this is the case.
In practice, the goal string should be much smaller than the buffer for efficiency, and there are more efficient methods of searching if the size of the goal string is very close to the size of the buffer.
def fnd(fname, goal, start=0, bsize=4096):
    if bsize < len(goal):
        raise ValueError("The buffer size must be larger than the string being searched for.")
    with open(fname, 'rb') as f:
        if start > 0:
            f.seek(start)
        overlap = len(goal) - 1
        while True:
            buffer = f.read(bsize)
            pos = buffer.find(goal)
            if pos >= 0:
                return f.tell() - len(buffer) + pos
            if not buffer:
                return -1
            f.seek(f.tell() - overlap)

5000 lines isn't big (well, depends on how long the lines are...)
Anyway: assuming the string will be a word and will be separated by whitespace...
lines = open(file_path, 'r').readlines()
str_wanted = "whatever_youre_looking_for"
for i in range(len(lines)):
    l1 = lines[i].split()
    for p in range(len(l1)):
        if l1[p] == str_wanted:
            pass  # found it
            # i is the file line, lines[i] is the full line, etc.

-
l1=lines.split() AttributeError: 'list' object has no attribute 'split' – misguided Jul 03 '13 at 01:51