2

This is my current code:

folder_path1 = os.chdir("C:/Users/xx/Documents/xxx/Test python dict")

words= set()
extracted = set()
for file in os.listdir(folder_path1):
   if file.endswith(".xlsx"):
      wb = load_workbook(file, data_only=True)
      ws = wb.active
      words.add(str(ws['A1'].value))

      wordsextract = re.match(r"(.*)\((.*)\)", str(words))

      extracted.add(str(wordsextract))
      print(extracted)

I'm not sure how to extract only the words inside the parentheses. Hence, I thought it was possible to do a re.match parentheses to extract the words within the parentheses. But it's not working. Does anyone here have knowledge on this? Thanks a bunch in advance

  • 3
    Possible duplicate of [RegEx to match stuff between parentheses](https://stackoverflow.com/questions/6208367/regex-to-match-stuff-between-parentheses) – clubby789 Aug 06 '19 at 15:17
  • 1
    else you would need lazy evaluation: `r'\((.+?)\)'` to get the shortest possible word between parentheses and probably findall with it... – Patrick Artner Aug 06 '19 at 15:21
  • hey @PatrickArtner thank you for your response!! :) I tried your code: `cellvalue.split("(")[-1].split(")")[0]` but there seems to be a syntax error.. –  Aug 07 '19 at 01:02
  • 1
    `set(re.findall("\((.+?)\)", words)` doesn't work? – palvarez Aug 07 '19 at 07:56
  • 1
    @PatrickArtner thank you so much, your first answer worked and I have incorporated it as such: wordsextract = re.findall(r"\((.+?)\)", str(cellvalues)) –  Aug 10 '19 at 17:12
  • 1
    @palvarez thank you so much, yes it worked after i cleaned up my codes as well! :) –  Aug 10 '19 at 17:12

1 Answers1

0

Read the whole column into a set, extract words from each cell value:

Excel-Source:

excel file content image

Program:

from openpyxl import load_workbook
import re
import os

folder_path1 = os.chdir("C:/temp/")

words= set()
extracted = set()
for file in os.listdir(folder_path1):
    if file.endswith("m1.xlsx"):
        wb = load_workbook(file, data_only=True)
        ws = wb.active
        # this is A1 though to A5 - yours is only one cell though, you can change the 
        # min/max to include more columns or rows

        # a set makes no sense here - you read only one cell anyhow, so anything in
        # it is your single possible value string
        # wb.iter_cols(min_col, max_col, min_row, max_row, values_only)
        content = set( *ws.iter_cols(1,1,1,5,True)) - {None} # remove empty cells

        # non-greedy capturing of things in parenthesis
        words = re.findall(r"\((.+?)\)", ' '.join(content), re.DOTALL)
        print(words)

Output:

['problem', 'span \nlines', 'some'] # out of order due to set usage

To do the same with splitting:

# same content as above
for cellvalue in content:
    if set("()").intersection(cellvalue) == {"(",")"}:
        print(cellvalue.split("(")[-1].split(")")[0])

HTH

Documentation:

Patrick Artner
  • 50,409
  • 9
  • 43
  • 69