For my master's thesis I downloaded a ton of finance-related files. My objective is to find a specific phrase ("chapter 11") to flag all companies that have gone through the debt-restructuring process. The problem is that I have more than 1.2 million small files, which makes the search really inefficient. So far I've written some very basic code and reached a speed of about 1000 documents every 40-50 seconds. I was wondering if there are specific libraries or methods (or even other programming languages) that would let me search faster. This is the function I'm using so far:
def get_items(m):
    word = "chapter 11"
    # context manager closes the file even if read() raises
    with open(m, encoding='utf8') as f:
        return word in f.read().lower()
# apply the function to the list of names:
l_v1 = list(map(get_items, filenames))
The size of the files varies between 5 and 4000 KB.
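For context, one variant I've been considering is fanning the per-file check out over several processes. This is just a sketch (the worker count and chunksize are guesses, not tuned), built on the standard-library multiprocessing module:

```python
import os
from multiprocessing import Pool

WORD = "chapter 11"

def contains_word(path):
    # read the whole file once; ignore undecodable bytes rather than crash
    with open(path, encoding="utf8", errors="ignore") as f:
        return WORD in f.read().lower()

def flag_files(filenames, workers=8):
    # the task is mostly I/O plus lowercasing, so a process pool
    # simply overlaps many reads; order of results matches the input list
    with Pool(workers) as pool:
        return pool.map(contains_word, filenames, chunksize=256)
```

I don't know whether this scales well to 1.2 million files on my disk, or whether something like a compiled grep-style tool would beat it.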