
I have a JSON file containing the terms that a profanity filter should check against:

["bad", "word", "plug"]

I am using this function (found in another article) to parse the JSON and search any data object for the listed words:

def word_filter(self, *field_names):

    import json
    from pprint import pprint

    with open('/var/www/groupclique/website/swearWords.json') as data_file:    
        data = json.load(data_file)

    for field_name in field_names:
        for term in data:
            if term in field_name:
                self.add_validation_error(
                    field_name,
                    "%s has profanity" % field_name)


class JobListing(BaseProtectedModel):
    id = db.Column(db.Integer, primary_key=True)
    category = db.Column(db.String(255))
    job_title = db.Column(db.String(255))

    @before_flush
    def clean(self):
        self.word_filter('job_title')  

The issue is that if I use the string "plumber", it fails the check because of the word "plug" in the JSON file, presumably because "plu" is in both terms. Is there any way to force the entire word in the JSON file to be matched instead of a partial? The output when run:

({ "validation_errors": { "job_title": " job_title has profanity" } })

HTTP PAYLOAD:
{
    "job_title":"plumber",    
}
univerio
draxous
  • Could you add the `field_names` definition to your code segment (or a subset of it if it is very large)? Also add the output you get when you run the code. – Dr K Jul 28 '16 at 17:49
  • I believe I added what you asked for. – draxous Jul 28 '16 at 18:08
  • I think that perhaps the data you are passing around is not of the type that you think it is. If you call `word_filter('job_title')`, then inside the `word_filter` method `field_names` has the contents `('job_title',)`, i.e. a tuple with one string in it. Your `if term in field_name` test is then equivalent to `'plug' in 'job_title'`, which basically means "Does 'plug' exist as a substring in the string 'job_title'?". I don't think that is what you are trying to do... or am I wrong? – Dr K Jul 28 '16 at 18:26
  • You're checking if the bad word is in each `field_name`, not the data in the fields specified by `field_name`. Try `if term in getattr(self, field_name)` instead. – univerio Jul 28 '16 at 21:50

1 Answer


You can use `str.split()` to isolate the whole words in `field_name`. Splitting returns a list of the pieces of the string, separated at the specified delimiter. With that list, you can check whether the profane term is one of its items:

import json

with open('terms.json') as data_file:    
    data = json.load(data_file)

for field_name in field_names:
    for term in data:
        if term in field_name.split(" "):
            self.add_validation_error(
                field_name,
                "%s has profanity" % field_name)
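
To see the difference between the two checks in isolation, here is a small sketch. The helper names `substring_hits` and `whole_word_hits` and the sample inputs are made up for illustration; the term list is the one from the question.

```python
terms = ["bad", "word", "plug"]  # sample list from the question

def substring_hits(text, terms):
    """Terms that occur anywhere in the text, even inside longer words."""
    return [t for t in terms if t in text]

def whole_word_hits(text, terms):
    """Terms that equal one of the space-separated words in the text."""
    words = text.split(" ")
    return [t for t in terms if t in words]

print(substring_hits("unplugged", terms))    # ['plug']
print(whole_word_hits("unplugged", terms))   # []
print(whole_word_hits("a bad plug", terms))  # ['bad', 'plug']
```

`in` on a string is a substring test, while `in` on a list only matches complete items, which is why the split version stops flagging partial matches.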

Where splitting on spaces gets dicey is with punctuation and capitalization. For example, in the sentence "Here comes the sun." the bad word "sun" will not match, because the split produces "sun." with the trailing period. Nor will "here" match, because the split produces "Here" with a capital H. To solve the capitalization problem, you'll want to change the entire input to lowercase:

if term in field_name.lower().split(" "):

Removing punctuation is a bit more involved, but this should help you implement that.
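
One way to handle both case and punctuation at once, as a rough sketch rather than a complete filter, is to pull the words out with a regular expression instead of splitting on spaces. The function name `find_profanity` is hypothetical; `re.findall` and the `\w+` pattern are standard Python.

```python
import re

def find_profanity(text, terms):
    """Return the terms that appear in text as whole words, ignoring case and punctuation."""
    # \w+ matches runs of letters, digits and underscores, so "sun." yields "sun".
    words = set(re.findall(r"\w+", text.lower()))
    return [term for term in terms if term.lower() in words]

print(find_profanity("Here comes the sun.", ["sun", "here"]))  # ['sun', 'here']
print(find_profanity("plumber", ["bad", "word", "plug"]))      # []
```

Note that this treats apostrophes and hyphens as separators ("don't" becomes "don" and "t"), and it cannot match terms made of several words, so it may need adjusting for a real word list.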

There may well be more edge cases you'll need to consider; these are just the two quick ones I thought of.

Matt Yaple
  • When I use this suggestion, it works for the single word 'plumber', so it's not checking just 'plu' anymore. But if I use a listed word such as 'bad' or 'word', it passes validation with no error where there should be one. PS: this could be any type of text field (varchar, longtext, etc.), so there could be paragraphs of words to check. I just want the filter to catch any use of the exact words in the filter file, not partials. – draxous Jul 28 '16 at 18:07
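
As univerio's comment on the question points out, the loop tests the field's name (`'job_title'`) rather than the text stored in that field, so the submitted value is never examined. Below is a minimal sketch of reading the value with `getattr` before doing a whole-word check. `FakeListing` and its stubbed `add_validation_error` are hypothetical stand-ins for the asker's model, and passing `terms` in as an argument (instead of loading the JSON file) is an assumption made to keep the example self-contained.

```python
import re

class FakeListing:
    """Hypothetical stand-in for the model; only what the demo needs."""

    def __init__(self, **fields):
        self.errors = {}
        for name, value in fields.items():
            setattr(self, name, value)

    def add_validation_error(self, field_name, message):
        # Stub: the real method belongs to the asker's base model.
        self.errors[field_name] = message

    def word_filter(self, terms, *field_names):
        for field_name in field_names:
            # Look up the stored text, not the column name.
            value = getattr(self, field_name) or ""
            words = set(re.findall(r"\w+", value.lower()))
            if any(term in words for term in terms):
                self.add_validation_error(
                    field_name, "%s has profanity" % field_name)

terms = ["bad", "word", "plug"]
listing = FakeListing(job_title="plumber", category="a bad word")
listing.word_filter(terms, "job_title", "category")
print(listing.errors)  # {'category': 'category has profanity'}
```

Because the words are extracted from the whole value, the same check works for a short title or several paragraphs of text.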