
How do I search for a string/pattern in all the sheets of a workbook and return all matching sheet numbers of the workbook?

I can traverse all the sheets in an Excel workbook, one by one, and search for the string in each sheet (like a linear search), but this is inefficient and takes a long time, and I have to process hundreds of workbooks or even more.

Update 1: Sample code

import os
import glob
from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool

def searchSheets(fname):
    # Search logic here:
    # loop over each sheet,
    # search for the string 'Balance' in each sheet,
    # and return the matching sheet numbers.
    pass

if __name__ == '__main__':
    __spec__ = None

    folder = "C://AB//"
    files = []
    if os.path.exists(folder):
        files = glob.glob(folder + "*.xlsx")

    # Multithreading
    pool = Pool()
    pool = ThreadPool(processes=10)
    # Suggested by @Dan D
    pool.map(searchSheets, files)  # It did not work
    pool.close()
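
For illustration, the elided search logic could look roughly like the sketch below. It is only a sketch: it assumes xlrd (which the traceback in Update 2 shows in use) and the hard-coded search string 'Balance' from the comments above, and it returns sheet numbers as zero-based indices.

import xlrd  # assumption: xlrd is the reader in use, per the traceback in Update 2

def searchSheets(fname, pattern='Balance'):
    """Return the (zero-based) indices of all sheets in fname that contain pattern."""
    matches = []
    wb = xlrd.open_workbook(fname)
    for idx, sheet in enumerate(wb.sheets()):
        # Scan every cell; record the sheet as soon as the pattern is found in it.
        found = any(pattern in str(sheet.cell_value(row, col))
                    for row in range(sheet.nrows)
                    for col in range(sheet.ncols))
        if found:
            matches.append(idx)
    return matches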

Update 2: Error

multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\multiprocessing\pool.py", line 119, in work
er
    result = (True, func(*args, **kwds))
  File "C:\ProgramData\Anaconda3\lib\multiprocessing\pool.py", line 44, in mapst
ar
    return list(map(*args))
  File "C:\temp3.py", line 36, in searchSheet
    wb = xl_wb(f)
  File "C:\ProgramData\Anaconda3\lib\site-packages\xlrd\__init__.py", line 116,
in open_workbook
    with open(filename, "rb") as f:
FileNotFoundError: [Errno 2] No such file or directory: 'C'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\temp3.py", line 167, in <module>
    pool.map(searchSheet,files)
  File "C:\ProgramData\Anaconda3\lib\multiprocessing\pool.py", line 266, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "C:\ProgramData\Anaconda3\lib\multiprocessing\pool.py", line 644, in get
    raise self._value
FileNotFoundError: [Errno 2] No such file or directory: 'C'
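The trailing 'C' in that FileNotFoundError is typical of map being handed a single path string instead of a list of paths: map then iterates over the string character by character, so the worker receives 'C', ':', '/', and so on. A small demonstration (the Book1.xlsx path is made up):

from multiprocessing.dummy import Pool as ThreadPool

def show(arg):
    return repr(arg)

pool = ThreadPool(processes=2)
print(pool.map(show, "C://AB//Book1.xlsx"))    # one call per character: 'C', ':', '/', ...
print(pool.map(show, ["C://AB//Book1.xlsx"]))  # one call per whole path
pool.close()
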
  • Probably you should split this problem into pieces. Do you know how to code in Python? Do you know the Excel document structure? Is it inefficient? Please specify which phase you are in and edit your question. – DonPaulie Apr 25 '18 at 08:42
  • Splitting sounds great, but I am not sure how to do that. I am new to Python, but I know how to code in it. The current implementation is definitely inefficient. I have implemented a simple algorithm (using pandas & xlrd) which traverses each sheet of the workbook. The next phase is to build an algorithm which is time efficient, but I am looking for ideas/code to do just that. – raj sharma Apr 25 '18 at 10:04

2 Answers


A search in one sheet does not depend on previous searches, and a search in one workbook does not depend on previous searches either. This is the typical case where you can use multithreading.

This post describes how to do it in Python: How to use threading in Python?

So, in pseudocode:

  • run the search in parallel on each sheet of each workbook
  • aggregate and present the results, as sketched below
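
A minimal sketch of that pseudocode, parallelizing per workbook with the same thread pool the question uses (search_all, search_one_workbook and workers are illustrative names; the function you pass in would be something like the question's searchSheets):

from multiprocessing.dummy import Pool as ThreadPool  # thread pool, as in the question

def search_all(files, search_one_workbook, workers=10):
    """Run search_one_workbook over every workbook path in parallel."""
    pool = ThreadPool(processes=workers)
    try:
        # One result (e.g. a list of matching sheet numbers) per workbook, in order.
        results = pool.map(search_one_workbook, files)
    finally:
        pool.close()
        pool.join()
    # Aggregate: map each workbook path to its result.
    return dict(zip(files, results))

# Usage, with the searchSheets function from the question:
# matches = search_all(glob.glob("C://AB//*.xlsx"), searchSheets)
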
sandwood
  • Thanks for the suggestion. Let me take a look at it. – raj sharma Apr 25 '18 at 10:12
  • multiprocessing with Pool and map does most of the work, as you simply have to define a function that opens and searches a single workbook and then map that function over a list of the workbooks. – Dan D. Apr 25 '18 at 14:59
  • The sample code is updated in the question. @Dan D, that is how the solution is being designed; however, I do not see multiple threads getting triggered when I execute it. It executes the way it did earlier, with no performance improvement. Any suggestions? – raj sharma Apr 26 '18 at 08:07
  • Your code is probably incorrect: `pool=ThreadPool(processes=10)` followed by `pool.map(searchSheets(files),range(1,len(files)))` is not what should be done. You must pass a searchSheets function taking ONLY one file, and the second argument is not a table of integers but a table of files. Your code right now is doing the same search on every file, 1..number of files times. – sandwood Apr 26 '18 at 12:08
  • Simply `pool.map(searchSheets, files)`. – Dan D. Apr 26 '18 at 16:10
  • @DanD, I updated my code as suggested. It gives the error shown in the question above. It should pick the first entry from the files variable and then control should move to the target function, searchSheets, but it looks like it is taking the first character of the first file path, say 'C:\sample.text', which is not actually a path. This is weird! – raj sharma Apr 27 '18 at 10:27
  • Thanks @DanD and sandwood for the suggestions/solutions. The code is updated with the solution. – raj sharma May 02 '18 at 08:20

Solution

import os
import glob
import multiprocessing
from multiprocessing.dummy import Pool as ThreadPool

def searchSheets(fname):
    # Search logic here:
    # loop over each sheet,
    # search for the string 'Balance' in each sheet,
    # and return the matching sheet numbers.
    pass

if __name__ == '__main__':
    multiprocessing.freeze_support()  # needed on Windows only; found it in many other posts
    __spec__ = None

    folder = "C://AB//"
    files = []
    if os.path.exists(folder):
        files = glob.glob(folder + "*.xlsx")

    # Multithreading
    pool = ThreadPool(processes=10)
    # Suggested by @Dan D
    #pool.map(searchSheets, files)  # It did not work
    pool.map(searchSheets, [workbook for workbook in files])
    pool.close()
    #pool.join()  # removed from the code: it blocks until all the workers have finished
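
One follow-up note: pool.map returns a list with one entry per workbook (whatever searchSheets returns), so the matching sheet numbers can be aggregated rather than discarded. A small sketch of how the tail of the main block could collect them, assuming searchSheets returns a list of matching sheet indices (results_by_file is an illustrative name, not part of the code above):

    results = pool.map(searchSheets, [workbook for workbook in files])  # one result per workbook
    pool.close()
    pool.join()  # wait for all the workers to finish before reading the results
    results_by_file = dict(zip(files, results))
    for path, sheet_numbers in results_by_file.items():
        print(path, '->', sheet_numbers)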