0

Is there a way to search text files in Python for a phrase without having to use for loops and if statements such as:

for line in file:
    if myphrase in line:
        do_something()

This seems like a very inefficient way to go through the file, since, if I understand correctly, it runs iteratively rather than in parallel. Is re.search a more efficient way to do it?

Lamma
  • Have you considered regexes? It'd probably still work linearly (i.e. a single thread scanning over the text as a series), but any loops would be hidden from you. Loops aren't as bad as all that; in fact, they're kind of what computers do best. Alternatively, you could force the parallelisation of a series of threads working on different portions of your file, but then you'd have to manage their synchronisation and cross-talk, which might outweigh the benefit of what you're trying to do in the first place. – Thomas Kimber Feb 28 '20 at 14:15
  • If you have any efficiency issues, maybe you should consider pre-processing of some kind. It could help you. – Yanirmr Feb 28 '20 at 14:17
  • @ThomasKimber I have looked at them but have heard they can also be slow on large files. It just strikes me as odd that there is no parallelisation method here. – Lamma Feb 28 '20 at 14:21
  • I doubt that regular expressions will save you any time. Some have pointed out that it's faster to read the contents of the file all at once, which is true for reasonably sized files. If it's a large file, however, reading the whole file into memory will also cause a performance hit. I personally would take the simple approach and do exactly what you have done here. It's convenient because it splits by newlines, which presumably are not part of your phrase. If you have to optimize by reading in larger chunks, you have to beware that you might be splitting your phrase at the chunk boundaries – TallChuck Feb 28 '20 at 14:22
  • @TallChuck I will give both a go and see how they perform. What would you recommend for extracting information between two phrases in a file? – Lamma Feb 28 '20 at 14:24
  • 2
    [Just gonna drop this here](https://stackoverflow.com/a/4901653/8805293) – Hampus Larsson Feb 28 '20 at 14:24
  • It depends on how large "large" is - I regularly see workflows like this running on files with > 1 million rows. – Thomas Kimber Feb 28 '20 at 14:26
  • @HampusLarsson good link, assuming the full string is loaded into memory (TLDR -- `in` operator is much faster than regex for simple searches) – TallChuck Feb 28 '20 at 14:26
  • @TallChuck If your bottleneck is actually reading the stuff from disk, then I don't think that you should be focused on what form of string-parsing should be used. – Hampus Larsson Feb 28 '20 at 14:28
  • the utility of regex is really in searching for patterns. If you are looking for an exact string match, it's probably overkill – TallChuck Feb 28 '20 at 14:28
  • @HampusLarsson totally agree – TallChuck Feb 28 '20 at 14:28
  • I am looking for exact string matches as markers in the file, so I can then go, for example, 1, 3 and 6 lines up and store that information. But to get these markers I need to search between two constant phrases within the file, for example "below me" and "above me", between which there would be strings of numbers that are the marker locations. – Lamma Feb 28 '20 at 14:31

3 Answers

4

Reading a sequential file (e.g. a text file) is always going to be a sequential process. Unless you can store it in separate chunks or skip ahead somehow, it will be hard to do any parallel processing.

What you could do is separate the inherently sequential reading process from the searching process. This requires that the file content be naturally separated into chunks (e.g. lines) across which the search is not intended to find a result.

The general structure would look like this (a code sketch follows the list):

  • initiate a list of processing threads with input queues
  • read the file line by line and accumulate chunks of lines up to a given threshold
  • when the threshold or the end of file is reached, add the chunk of lines to the next processing thread's input queue
  • wait for all processing threads to be done
  • merge results from all the search threads.
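
A minimal sketch of that structure, assuming an exact-match search; the phrase, the file name big.txt, the chunk size and the thread count are all hypothetical illustration values:

import threading
from queue import Queue

myphrase = "needle"   # hypothetical search phrase
NUM_THREADS = 4       # arbitrary illustration values
CHUNK_LINES = 10_000

def worker(in_queue, out_list):
    # Consume chunks of lines until a None sentinel arrives.
    while True:
        chunk = in_queue.get()
        if chunk is None:
            break
        out_list.extend(line for line in chunk if myphrase in line)

# One input queue and one result list per processing thread.
queues = [Queue() for _ in range(NUM_THREADS)]
results = [[] for _ in range(NUM_THREADS)]
threads = [threading.Thread(target=worker, args=(q, r))
           for q, r in zip(queues, results)]
for t in threads:
    t.start()

# Sequential reader: accumulate lines and hand off chunks round-robin.
with open("big.txt") as f:
    chunk, i = [], 0
    for line in f:
        chunk.append(line)
        if len(chunk) >= CHUNK_LINES:
            queues[i % NUM_THREADS].put(chunk)
            chunk, i = [], i + 1
    if chunk:                          # leftover lines at end of file
        queues[i % NUM_THREADS].put(chunk)

for q in queues:
    q.put(None)                        # signal each worker to stop
for t in threads:
    t.join()

matches = [line for r in results for line in r]   # merge all results

Note that in CPython the GIL prevents pure-Python threads from running a CPU-bound search truly in parallel; you would need multiprocessing for that, which only adds to the trouble described next.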

In this era of solid state drives and fast memory busses, you would need some pretty compelling constraining factors to justify going to that much trouble.

You can figure out your minimum processing time by measuring how long it takes to read (without processing) all the lines in your largest file. It is unlikely that the search process for each line will add much to that time given that I/O to read the data (even on an SSD) will take much longer than the search operation's CPU time.
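
For example, a rough read-only timing pass (big.txt is again a hypothetical large file) might look like:

import time

start = time.perf_counter()
with open("big.txt") as f:
    for line in f:
        pass                 # read only, no processing
print(f"Read-only pass took {time.perf_counter() - start:.3f} seconds")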

Alain T.
3

The tool you need is called regular expressions (regex).

You can use it as follows:

import re

# re.search looks for the phrase anywhere in the string
# (re.match would only match at the very start):
if re.search(myphrase, myfile.read()):
    do_something()
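
Where regex actually earns its keep is pattern matching rather than exact strings. As a sketch, here is how you might extract whatever sits between the two marker phrases mentioned in the question's comments (the markers here are hypothetical):

import re

text = myfile.read()
# Hypothetical marker phrases taken from the question's comments:
match = re.search(r"below me(.*?)above me", text, re.DOTALL)
if match:
    between = match.group(1)   # everything between the two phrases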
Simon Crane
  • I have previously read people saying that regex is slow on large files; is this true, and if so, what is considered a large file? – Lamma Feb 28 '20 at 14:16
3

Let's say you have the file:

Hello World!
I am a file.

Then:

file = open("file.txt", "r")
x = file.read()
# x is now: "Hello World!\nI am a file."
# Just one string means that you can search it faster.
# Remember to close the file:
file.close()
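
As a side note, the standard with idiom (not used in the answer above, but plain Python) closes the file for you:

with open("file.txt", "r") as file:
    x = file.read()   # the file is closed automatically when the block exits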

Edit:

To actually test how long it takes:

import time
start_time = time.time()
# Read File here
end_time = time.time()
print("This meathod took " + str( end_time - start_time ) + " seconds to run!")

Another Edit:

I read some other articles and ran the tests, and the fastest checking method, if you're just trying to get True or False, is:

x = file.read() # "Hello World!\nI am a file."
tofind = "Hello"
tofind_in_x = tofind in x
# True

This method was faster than regex in my tests by quite a bit.
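
A rough way to reproduce that comparison yourself, sketched with timeit (the repeated text and the absent needle are illustration values; results will vary by machine):

import re
import timeit

x = "Hello World!\nI am a file.\n" * 100_000
needle = "needle"   # absent from x, so both methods must scan the whole string

print("in operator:", timeit.timeit(lambda: needle in x, number=100))
print("re.search:  ", timeit.timeit(lambda: re.search(needle, x), number=100))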

  • Hmm, this is a very interesting way of looking at it. But I fear it would become very slow with large files? – Lamma Feb 28 '20 at 14:19
  • Depending on the file size, it will probably be slower no matter what you do. It may be best to test by creating a large file and timing the different methods. You could test the time they take with the edit to my answer. –  Feb 28 '20 at 14:23