
I have a working routine that determines which categories a news item belongs to. It works when the title, category, subcategory, and the search words (as regular expressions) are assigned directly in Python.

But when I retrieve these values from PostgreSQL as strings, the same routine produces neither errors nor results.

I checked the data types: in both cases the values are Python strings.

What can be done to fix this?

import re

# set the text to be analyzed
title = "next week there will be a presentation. The location will be aat"

# these could be the categories
category = "presentation"
subcategory = "scientific"

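# these are the regular expressions; re.X ignores unescaped whitespace in a
# pattern, so the space in "asm microbe" is escaped to keep it significant
main_category_search_words = r'\bpresentation\b'
sub_category_search_words = r'\basm\ microbe\b | \basco\b | \baat\b'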

category_final = ''
subcategory_final = ''

# identify main category
r = re.compile(main_category_search_words, flags=re.I | re.X)
result = r.findall(title)

if len(result) == 1:
    category_final = category

    # identify sub category
    r2 = re.compile(sub_category_search_words, flags=re.I | re.X)
    result2 = r2.findall(title)
    if len(result2) > 0:
        subcategory_final = subcategory

print("analysis result:", category_final, subcategory_final)
mgo

1 Answer


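I'm pretty sure that what you get back from PostgreSQL is not a raw string literal, so the pattern that reaches re.compile is not the regex you intended. You will have to escape the backslashes in your pattern explicitly in the DB. Compare how Python treats the same pattern written as a raw literal, as a plain literal, and with escaped backslashes: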

print(r"\basm\b")
print("\basm\b")
print("\\basm\\b")

# output
\basm\b

as       # yes, including the line break above here
\basm\b
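
To make the difference visible in terms of matching rather than printing, here is a minimal sketch that compiles each of the three spellings and searches the title from the question. The `candidates` dict and the `matches` helper are names made up for this illustration; the flags are the ones used in the question.

```python
import re

title = "next week there will be a presentation. The location will be aat"

# Three ways of writing "the same" pattern in Python source code
candidates = {
    "raw literal": r"\bpresentation\b",        # backslash + b: word boundary
    "plain literal": "\bpresentation\b",       # \b becomes a backspace character
    "escaped literal": "\\bpresentation\\b",   # same string as the raw literal
}


def matches(pattern, text):
    """Return True if the pattern is found anywhere in the text."""
    return re.search(pattern, text, flags=re.I | re.X) is not None


results = {}
for label, pattern in candidates.items():
    results[label] = matches(pattern, title)
    # repr() shows the characters the string really contains
    print(label, repr(pattern), "matches:", results[label])
```

Only the plain literal fails to match, because the regex engine then looks for literal backspace characters around the word instead of word boundaries.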
shmee
  • Thanks, this definitely sheds some light on what should be corrected! As a test, I changed the main category entry in PostgreSQL to `\\bpresentation\\b` and then used `r = re.compile(r"'"+main_category_search_words+"'", flags=re.I | re.X)`, but got no result. I think I'm close, but I'm not sure how to proceed from here. Advice is very welcome! :) – mgo Jun 11 '18 at 11:21
  • You can print your compiled expression to verify that it is what you are looking for. To me it looks like you now end up with `'\bams\b'` (including the single quotes) due to your string concatenation. I don't think that concatenation is necessary at all, since you already changed the DB value. – shmee Jun 11 '18 at 12:14
  • Shmee, thanks, you pushed me in the right direction, and now it works! – mgo Jun 11 '18 at 13:07
  • Just a note for anyone who runs into this in the future: to get the raw string into Python from PostgreSQL I used `r""+search_words`, because without it the string was not treated as raw: `r = re.compile(r""+main_category_search_words, flags=re.I | re.X)` – mgo Jun 11 '18 at 13:15
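
One detail worth checking when reproducing this: in Python the `r` prefix only changes how a literal in the source code is parsed, so `r"" + s` evaluates to a string equal to `s`. The sketch below follows the earlier suggestion to print what you actually have. The two values are hypothetical stand-ins for what a query might return (no database is involved), and `inspect_pattern` is a helper invented for this example.

```python
def inspect_pattern(value):
    """Return the repr of a fetched pattern and whether it contains a backspace."""
    return repr(value), "\x08" in value


# Hypothetical values standing in for a fetched column (assumption, not real data)
stored_correctly = r"\bpresentation\b"   # backslash + b survived storage
stored_mangled = "\bpresentation\b"      # \b was turned into a backspace earlier

# Prepending an empty raw literal leaves either string unchanged
assert r"" + stored_correctly == stored_correctly
assert r"" + stored_mangled == stored_mangled

for value in (stored_correctly, stored_mangled):
    text, has_backspace = inspect_pattern(value)
    print(text, "contains backspace:", has_backspace)
```

If the repr of the fetched value shows `\x08`, the backslashes were most likely lost before the value reached the table, and the stored value itself is what needs correcting.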