0

How can I search an xlsx file for a specific string variable? Example:

ORIGINAL TEXT FILE:

one - a b c                                         
two - a b c                                      
three - a b c                                     
four - a b c                                           
five - a b c                                           

XLSX FILE:

two                                                    
three                                                  
five

OUTPUT CSV FILE:

The following in two columns:

two a 
two b 
two c 
three a 
three b 
three c 
five a 
five b
five c 

In short, I am going line-by-line in my original text file, selecting a specific string such as 'one', 'two', or 'three' and I am looking to see if that string exists in the xlsx file (I realize that it might be more ideal to search the other way around, but I'm trying to keep things very simple as this is only a section of my code). Ultimately I then identify the a,b,c paramaters and send it all out to a csv file.

-- I already imported os, csv, re, openpyxl, load_workbook (from openpyxl), sys, and set default encoding to utf-8

with open("OriginalTextFile.txt", 'r') as SearchList:
    for line in SearchList:
        line_text = str(line)
        try:
            test_query = re.search('start (.+?) end', line_text).group(1)
            print test_query # confirms that I AM getting the correct test_query
            if str(test_query) in load_workbook('Subset.xlsx', read_only=True):
                print 'FOUND IT IN SUBSET!'

                 ----- Continues -----

My question lies within line 7 specifically; how can I identify the existence of a specific string (eg. "one") in my subset.xlsx file? Is this line correct and I'm just missing something simple? Any additional suggestions, links, tutorials, documentations are welcome!

Thank you very much, in advance!

Tomasz Jakub Rup
  • 10,502
  • 7
  • 48
  • 49
mjfred
  • 77
  • 1
  • 11
  • 1
    Possible duplicate of [Openpyxl optimizing cells search speed](http://stackoverflow.com/questions/34054272/openpyxl-optimizing-cells-search-speed) – Charlie Clark Dec 11 '15 at 09:06
  • ^ This might do the trick with a little tweaking. Thanks Charlie! – mjfred Dec 11 '15 at 13:56
  • As asked, this question is definitely not a duplicate of the linked question, though answers to the linked question could serve as answers to this question. However, the text of the linked question *itself* contains **an** answer to this question. If you look at the code in this question, it's clear the asker has not even gotten to the point where he searches each of the cells for the desired text. He's thinking in much more abstract terms, like doing a query on a database or at worst a grep on a text file. Potential answers to this question might involve higher-level tools than OpenPyXL. – John Y Dec 21 '15 at 15:29
  • Although not a solution to my question, I was able to get around the issue by utilizing the PANDAS library. Via Pandas, I absorbed the data into lists and was able to compare the lists much more easily than searching for data sets overall. This solution will not work well for unstructured "original" text documents. It did work for me however because the original text document was well-organized and repetitive in its organization. I still believe that a more elegant solution to my problem exists. – mjfred Dec 21 '15 at 18:59

0 Answers0