I'm having trouble matching strings in Python. I'm looking for lines in documents like this and trying to match each line against specific phrases. I read in all the lines, parse them with Beautiful Soup into stripped strings, and then iterate through a list of all the lines in the document. From there, I use the following code to match the specific strings:

if row.upper() == ("AUDIT COMMITTEE REPORT" or "REPORT OF THE AUDIT COMMITTEE"):
    print("Found it!")
if "REPORT" in row.upper():
    print ("******"+row.upper()+"******")

When the code runs, I get the following output:

******COMPENSATION COMMITTEE REPORT******
******REPORT OF THE AUDIT COMMITTEE******
******REPORTING COMPLIANE******
******COMPENSATION COMMITTEE REPORT******
******REPORT OF THE AUDIT COMMITTEE******

The program never finds it when the string is checked for equality, but when asked whether a portion of it is in the string, it finds it without trouble. How does string matching work in Python such that this happens, and how can I fix it so that it matches those exact phrases?

EDIT: Another note that should be made is that these documents are quite large, some exceeding 50 pages easily, and checking if the string is just in the row is not enough. It needs to be an exact match.

Retroflux
  • "AUDIT COMMITTEE REPORT" is not equal to "REPORT OF THE AUDIT COMMITTEE" – Copy and Paste Jun 03 '16 at 14:41
  • I think I might be forgetting my Python syntax but shouldn't it be `if row.upper() == "AUDIT COMMITTEE REPORT" or row.upper() == "REPORT OF THE AUDIT COMMITTEE":` – turnip Jun 03 '16 at 14:41
  • @CopyandPaste That's why I have the or clause, so should it not accept either case as a truth value? (the or shouldn't be capitalized, I'll edit that now) – Retroflux Jun 03 '16 at 14:43
  • @PPG yes that's the usual format, but the list of elements was going to be quite long, so I was hoping to conserve line space. The comments and answers so far suggest that making a list instead would be the best option. – Retroflux Jun 03 '16 at 14:46
  • what is your input data? – SparkAndShine Jun 03 '16 at 14:48
  • The program is given a list of CIK numbers (the governing unique value for the companies used on the site), then it scrapes all the DEF 14A documents (the doc linked above is an example). Then the document is scraped using Beautiful Soup's stripped strings function, and the result is put into a list, which is iterated through to find the lines. – Retroflux Jun 03 '16 at 14:55

2 Answers

How about this:

if row.upper() in ("AUDIT COMMITTEE REPORT", "REPORT OF THE AUDIT COMMITTEE"):
    print("Found it!")
if "REPORT" in row.upper():
    print ("******"+row.upper()+"******")

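Note that ("str1" or "str2") evaluates to the first string, i.e., 'str1', because `or` returns its first truthy operand and any non-empty string is truthy. Your original condition therefore only ever compares the row against "AUDIT COMMITTEE REPORT".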

>>> ("AUDIT COMMITTEE REPORT" or "REPORT OF THE AUDIT COMMITTEE")
'AUDIT COMMITTEE REPORT'
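
To make the difference concrete, here is a minimal sketch contrasting the two conditions. The `row` value is a made-up sample line, not one taken from the actual document.

```python
# Hypothetical sample line; real rows come from the parsed document.
row = "Report of the Audit Committee"

# `or` returns its first truthy operand, so this is just the first phrase.
phrases_with_or = "AUDIT COMMITTEE REPORT" or "REPORT OF THE AUDIT COMMITTEE"
print(phrases_with_or)  # AUDIT COMMITTEE REPORT

# The original condition therefore compares against the first phrase only.
original_check = row.upper() == phrases_with_or
print(original_check)  # False

# Membership in a tuple compares the row against every phrase with ==.
tuple_check = row.upper() in ("AUDIT COMMITTEE REPORT", "REPORT OF THE AUDIT COMMITTEE")
print(tuple_check)  # True
```

Because `in` on a tuple uses equality for each element, this is still an exact match, not a substring search.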
SparkAndShine
  • So by changing the 'or' to a comma, we're setting it to check a list of things instead of a... I don't know what the original would be called, a pseudo-list of cases? – Retroflux Jun 03 '16 at 14:45
  • @Retroflux technically it's a tuple of "things" (in this case strings) – Copy and Paste Jun 03 '16 at 14:46
  • True, as it would be immutable if it's a fixed set of items. So why does listing them with the or clause between each one not work but a tuple does? – Retroflux Jun 03 '16 at 14:49
  • Another note that should be made is that these documents are quite large, some exceeding 50 pages easily, and checking if the string is just in the row is not enough. It needs to be an exact match. Does this still work for that? I'll edit this into the original post. – Retroflux Jun 03 '16 at 14:52

You could do something like this using a list comprehension.

row = '******AUDIT COMMITTEE REPORT******'
match = ["AUDIT COMMITTEE REPORT", "REPORT OF THE AUDIT COMMITTEE"]
is_match = sum([m in row.upper() for m in match])

if is_match:
    print("Found it!")
if "REPORT" in row.upper():
    print ("******"+row.upper()+"******")

First we create a list of all possible matches. These could be loaded from a file or declared statically in the Python code.

match = ["AUDIT COMMITTEE REPORT", "REPORT OF THE AUDIT COMMITTEE"]

Next we loop through all the possible matches and check whether each one occurs in the string row. Each check adds a True or False boolean to the list. Since True counts as 1 and False as 0, sum() gives the number of matches, and we can use that to determine whether there was a match.

is_match = sum([m in row.upper() for m in match])

If you remove sum() you can see that the output of the list comprehension is simply a list of booleans.

print([m in row.upper() for m in match])
[True, False]
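
Since only the existence of a match matters, the built-in `any()` could be used in place of `sum()`. It stops at the first True and returns a boolean rather than a count. A small sketch, using the same `row` and `match` values as above:

```python
row = '******AUDIT COMMITTEE REPORT******'
match = ["AUDIT COMMITTEE REPORT", "REPORT OF THE AUDIT COMMITTEE"]

# sum() counts the True values, because True == 1 and False == 0.
match_count = sum([m in row.upper() for m in match])
print(match_count)  # 1

# any() short-circuits on the first True and yields a bool.
is_match = any(m in row.upper() for m in match)
print(is_match)  # True
```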

If you want something a little simpler and more efficient, you could implement a function with a for loop.

matches = ["AUDIT COMMITTEE REPORT", "REPORT OF THE AUDIT COMMITTEE"]
def is_match(row):
    for match in matches:
        if match in row.upper():
            return True
    return False

This loop goes through all possible matches. If it finds a match, it immediately returns True; otherwise it finishes the loop and returns False.
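
Note that `in` on a string tests for a substring, so the function above also accepts rows that merely contain one of the phrases. If an exact match is required, as the question's edit states, one possible variant compares the whole row against a set of phrases. The names `EXACT_PHRASES` and `is_exact_match`, and the whitespace stripping, are assumptions made for illustration.

```python
# Phrases to match exactly; a set gives fast membership tests.
EXACT_PHRASES = {"AUDIT COMMITTEE REPORT", "REPORT OF THE AUDIT COMMITTEE"}


def is_exact_match(row):
    """Return True only if the whole row equals one of the phrases.

    Stripping surrounding whitespace is an assumption about the input.
    """
    return row.strip().upper() in EXACT_PHRASES


print(is_exact_match("Report of the Audit Committee"))       # True
print(is_exact_match("COMPENSATION COMMITTEE REPORT"))       # False
print(is_exact_match("******AUDIT COMMITTEE REPORT******"))  # False
```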

eandersson
  • Would you mind explaining the third line in your program? I've never seen syntax like that before. Does it check all the items in match, then see if the line has that in it? – Retroflux Jun 03 '16 at 14:51
  • @Retroflux: I added a link to a description of list comprehension and added a much simpler implementation. – eandersson Jun 03 '16 at 14:56