Scan columns for a keyword and extract all values in that column, if the keyword is found in that column

Question

Thank you for spending your time here and helping out, I really appreciate it.

Currently, I have multiple excel files to loop through.

I am only looping through columns C:D. If the column has the keyword "Abbreviation", I want to extract all the values in that column. This is because my keyword could exist in either column C or D.

My columns will look something like this:

OR

After importing my excel files, here is how I loop through the columns that I want to scan through:

wb1 = load_workbook(join(dict_folder, file), data_only = True)
ws = wb1.active

for rowofcellobj in ws["C":"D"]:
   for cellobj in rowofcellobj:
      if cellobj.value == "Abbreviation":
        # extract all words in that column but Idk how to execute this step or if my above steps are correct
        if cellobj.value is not None:
        data = re.findall(r"\b\w+_.*?\w+|[A-Z]*$\b", str(cellobj.value))
        #filtering out blank rows here:
         if data != [ ]:
            if data != [' ']:
                 #extracting words from square brackets in list:
                      fields = data[0]
                      print(fields)

I am stuck at the area which I had commented above, saying that I'm not sure how to execute that step..

have you tried to use pandas read excel method, and than use pandas methods (looking for abbreviation column etc.)? — Ezer K, Sep 25 '19 at 06:48
@EzerK no I have not, I have not tried using pandas at all since I do not have much knowledge on it and I had already started out with openpyxl. Could this be feasible with pandas? Cause I was worried that pandas is not able to read the columns unless it is the first row of the worksheet — , Sep 25 '19 at 06:55
This is a classic task for pandas (not yo say it's the only way) take a look at the pandas read excel docs — Ezer K, Sep 25 '19 at 07:31

stovfl · Answer 1 · 2019-09-25T11:49:02.997

Question: Scan columns for a keyword and extract all values in that column

Define start row, here 1:
```
min_row = 1
min_col = None
```

Loop all rows, starting at min_row and incrementing:

for row in ws.iter_rows(min_row=min_row, values_only=True):
    min_row += 1

try to find the keyword in this row, break if found.
As the index is 0-based, +1 to get the columns index 1-based.

    try:
        min_col = row.index('Abbreviation') + 1
        break
    except:
        continue

If found, loop all folloup rows, up to the end.

Note: You have not defined a end condition!

if min_col is not None:
    for value in map(lambda x: x[0], 
                     sheet1.iter_rows(min_row=min_row, 
                                      min_col=min_col,                                  
                                      max_col=min_col,
                                      values_only=True)):
        print(value)

Hi! This seems to be working but it still prints out rows with None.... — , Sep 26 '19 at 06:35

score 0 · Answer 2 · answered Sep 25 '19 at 07:57

0

Pandas solution, inspired by (link)

Example file:

import pandas as pd 
import numpy as np

df = pd.read_excel('tst.xlsx', usecols="C:D")
df = df.fillna('') 

for row in range(df.shape[0]): 
       for col in range(df.shape[1]):
           if df.iat[row,col] == 'Abbreviation':
             row_start = row
             col_required = col
             break

df = df.loc[row_start+1:, df.columns[col_required]]
df.replace(['','\s+'], np.nan, inplace=True, regex=True)
df.dropna(inplace=True)

print(df)

Result:

9      sfsdfd
10    fgfg_ff
12        dfs
13        ddd
15      dd_hh

answered Sep 25 '19 at 07:57

Vladislav Gladkikh

377
4
17

Hi Vladislav, thank you for your response! I have been trying to implement your code. But I cant even get to that step because when i run: `dict_folder = "C:/Users/xxx/Documents/Docs did/Requests"` `for file in os.listdir(dict_folder):` `if re.search(r'\.xlsx$', file):` `df = pd.read_excel((file), usecols = "C:D")` <<<<-------- these throws up an error in my Python saying "FileNotFoundError: [Errno 2] No such file or directory: 'xxxxx.xlsx' – Sep 26 '19 at 03:46

Scan columns for a keyword and extract all values in that column, if the keyword is found in that column

2 Answers2