0

Thank you for spending your time here and helping out, I really appreciate it.

Currently, I have multiple excel files to loop through.

I am only looping through columns C:D. If the column has the keyword "Abbreviation", I want to extract all the values in that column. This is because my keyword could exist in either column C or D.

My columns will look something like this: enter image description here

OR

enter image description here

After importing my excel files, here is how I loop through the columns that I want to scan through:

wb1 = load_workbook(join(dict_folder, file), data_only = True)
ws = wb1.active

for rowofcellobj in ws["C":"D"]:
   for cellobj in rowofcellobj:
      if cellobj.value == "Abbreviation":
        # extract all words in that column but Idk how to execute this step or if my above steps are correct
        if cellobj.value is not None:
        data = re.findall(r"\b\w+_.*?\w+|[A-Z]*$\b", str(cellobj.value))
        #filtering out blank rows here:
         if data != [ ]:
            if data != [' ']:
                 #extracting words from square brackets in list:
                      fields = data[0]
                      print(fields)

I am stuck at the area which I had commented above, saying that I'm not sure how to execute that step..

  • have you tried to use pandas read excel method, and than use pandas methods (looking for abbreviation column etc.)? – Ezer K Sep 25 '19 at 06:48
  • @EzerK no I have not, I have not tried using pandas at all since I do not have much knowledge on it and I had already started out with openpyxl. Could this be feasible with pandas? Cause I was worried that pandas is not able to read the columns unless it is the first row of the worksheet –  Sep 25 '19 at 06:55
  • This is a classic task for pandas (not yo say it's the only way) take a look at the pandas read excel docs – Ezer K Sep 25 '19 at 07:31

2 Answers2

0

Question: Scan columns for a keyword and extract all values in that column

  • Define start row, here 1:

    min_row = 1
    min_col = None
    
  • Loop all rows, starting at min_row and incrementing:

    for row in ws.iter_rows(min_row=min_row, values_only=True):
        min_row += 1
    
  • try to find the keyword in this row, break if found.
    As the index is 0-based, +1 to get the columns index 1-based.

        try:
            min_col = row.index('Abbreviation') + 1
            break
        except:
            continue
    
  • If found, loop all folloup rows, up to the end.

    Note: You have not defined a end condition!

    if min_col is not None:
        for value in map(lambda x: x[0], 
                         sheet1.iter_rows(min_row=min_row, 
                                          min_col=min_col,                                  
                                          max_col=min_col,
                                          values_only=True)):
            print(value)
    
stovfl
  • 14,998
  • 7
  • 24
  • 51
  • Hi! This seems to be working but it still prints out rows with None.... –  Sep 26 '19 at 06:35
0

Pandas solution, inspired by (link)

Example file:

enter image description here

import pandas as pd 
import numpy as np

df = pd.read_excel('tst.xlsx', usecols="C:D")
df = df.fillna('') 

for row in range(df.shape[0]): 
       for col in range(df.shape[1]):
           if df.iat[row,col] == 'Abbreviation':
             row_start = row
             col_required = col
             break

df = df.loc[row_start+1:, df.columns[col_required]]
df.replace(['','\s+'], np.nan, inplace=True, regex=True)
df.dropna(inplace=True)

print(df)

Result:

9      sfsdfd
10    fgfg_ff
12        dfs
13        ddd
15      dd_hh
  • Hi Vladislav, thank you for your response! I have been trying to implement your code. But I cant even get to that step because when i run: `dict_folder = "C:/Users/xxx/Documents/Docs did/Requests"` `for file in os.listdir(dict_folder):` `if re.search(r'\.xlsx$', file):` `df = pd.read_excel((file), usecols = "C:D")` <<<<-------- these throws up an error in my Python saying "FileNotFoundError: [Errno 2] No such file or directory: 'xxxxx.xlsx' –  Sep 26 '19 at 03:46