
I have this piece of code, which checks whether a data row is important:

def is_important(data_row):
    if data_row.get('important', None):
        return True
    add_data = get_additional_data(data_row)
    for word in NEGATIVE_WORDS:
        if word in add_data.lower():
            return False
    for word in POSITIVE_WORDS:
        if word in add_data.lower():
            return True
    return False

This is quite hard to follow (in my opinion), so I was wondering if anyone can suggest something more pythonic with shorter lines? Could I, for example, merge the two for loops? If I merge the two for loops, would it consume more time?

CodeNoob
  • You can replace each of those loops with [a call to `any` and a generator expression](https://stackoverflow.com/questions/1342601/pythonic-way-of-checking-if-a-condition-holds-for-any-element-of-a-list). – Aran-Fey Aug 30 '17 at 09:09
  • I think if you find *that* hard to follow, you have a problem. I think that's already the simplest way. – Stefan Pochmann Aug 30 '17 at 09:10
  • @StefanPochmann Indeed... I'll go get some coffee. – Ma0 Aug 30 '17 at 09:15

3 Answers


This is more compact and, thanks to any, still short-circuits much like your explicit loops did:

def is_important(data_row):
    if data_row.get('important', None):
        return True
    add_data = get_additional_data(data_row)
    if any(word in add_data.lower() for word in NEGATIVE_WORDS):  # negative check takes precedence.
        return False
    if any(word in add_data.lower() for word in POSITIVE_WORDS):
        return True
    return False

A couple of things on that:

  • Why do you call .lower() when searching for NEGATIVE_WORDS but not for POSITIVE_WORDS?
  • If add_data contains both negative and positive words, the order of the last two ifs will affect the outcome. This is not good practice (see the short demo below).
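
For illustration, here is a minimal demo of the second point (the word lists and the sample text are made up for this example): when the text contains both kinds of words, whichever check runs first decides the result.

NEGATIVE_WORDS = ['broken', 'bad']    # hypothetical lists, only for this demo
POSITIVE_WORDS = ['urgent', 'great']
add_data = 'Urgent but broken'

has_negative = any(word in add_data.lower() for word in NEGATIVE_WORDS)
has_positive = any(word in add_data.lower() for word in POSITIVE_WORDS)
print(has_negative, has_positive)  # True True -> with the ifs above, the row is NOT important
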
Ma0
  • Why exactly wouldn't the second point be good practice? I can imagine those conditions as filters, where the negative filter is stronger and rejects wrong sequences in the first place, and *only after that* the positive filter can be applied. – erhesto Aug 30 '17 at 09:26
  • Didn't you just trashtalk about `if ...: return False`? :-P Here you can *actually* rewrite that. To `not any(...) and any(...)`. Or at least replace the second `if` with `return any(...)`. – Stefan Pochmann Aug 30 '17 at 09:27
  • Your first question: I forgot the .lower() when I simplified the code a little for this forum. Your second question is answered by @erhesto: if it contains negative words I don't want it in the first place, so I don't waste time checking for positive words. Further: nice solution! – CodeNoob Aug 30 '17 at 09:29
  • @erhesto because it looks fishy to other people. If this code was to be seen by others too, I might add a comment there. – Ma0 Aug 30 '17 at 09:33
  • oups, then please don't rewrite it to `return not any(...) and any(...)` because it's simply not as readable (in this case) as explicit `if` and `return`s. – MSeifert Aug 30 '17 at 09:36
  • @MSeifert Well, like I commented under the question... they said they already find their own original "quite hard to follow". Do you think the solutions using `any` will be easier for them? Clearly they're a beginner and in my experience, beginners find stuff like `any` and generators *harder* than for loops... – Stefan Pochmann Aug 30 '17 at 09:41
  • @Ev.Kounis Also, what happened to checking against NEGATIVE_WORDS (in your last few solutions)? Maybe you should go get *more* coffee :-) – Stefan Pochmann Aug 30 '17 at 09:45
  • @StefanPochmann I felt like presenting the process of condensing the code step by step is what would help the OP more. The check for negative words can be merged with finding neither negative nor positive words, as it says. – Ma0 Aug 30 '17 at 09:47

This is quite hard to follow (in my opinion), so I was wondering if anyone can suggest something more pythonic with shorter lines?

Generally, pythonic doesn't mean shorter lines. Pythonic code should be easy to read and follow (at least with a bit of background). So if you find that hard to read, you could factor it into a different function:

# I'm not sure if the function name is a good fit, it's just a suggestion.
def contains_at_least_one(data, words):  
    for word in words:
        if word in data:
            return True
    return False

def is_important(data_row):
    if data_row.get('important', None):
        return True
    add_data = get_additional_data(data_row).lower()
    if contains_at_least_one(add_data, NEGATIVE_WORDS):
        return False
    if contains_at_least_one(add_data, POSITIVE_WORDS):
        return True
    return False

Could I, for example, merge the two for loops?

Not really, because the NEGATIVE_WORDS loop should take precedence (at least in your code) over the POSITIVE_WORDS loop. Unless you meant factoring it out into a function; in that case, see the first suggestion above.
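
To illustrate why a merge is problematic, here is one (purely hypothetical) way it can go wrong: if both lists are combined into a single unordered collection, whichever matching word happens to be visited first decides the result, so the negative words no longer reliably take precedence. The word lists and the function name below are made up for this sketch.

NEGATIVE_WORDS = ['broken', 'bad']    # hypothetical values, just for this sketch
POSITIVE_WORDS = ['urgent', 'great']
ALL_WORDS = set(NEGATIVE_WORDS) | set(POSITIVE_WORDS)  # the original list order is lost

def is_important_merged(add_data):
    for word in ALL_WORDS:                 # arbitrary iteration order
        if word in add_data.lower():
            return word in POSITIVE_WORDS  # whichever match is visited first decides
    return False

# With text containing both kinds of words the result depends on set iteration order:
print(is_important_merged('Urgent but broken'))  # may be True or False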

If I merge the two for loops, would it consume more time?

I'm not sure what you meant by "merging" the loops, but if you want it shorter you could use any in the approach above. It's equivalent to the for loop and shorter, but, according to my and Stefan Pochmann's benchmarks, slower:

def contains_at_least_one(data, words):
    return any(word in data for word in words)

def is_important(data_row):
    if data_row.get('important', None):
        return True
    add_data = get_additional_data(data_row).lower()
    if contains_at_least_one(add_data, NEGATIVE_WORDS):
        return False
    if contains_at_least_one(add_data, POSITIVE_WORDS):
        return True
    return False

You could even reduce the number of lines by using and in the return. I would not recommend it, because such constructs don't improve readability, but that's your decision and it's one way to "shorten" code:

def is_important(data_row):
    if data_row.get('important', None):
        return True
    add_data = get_additional_data(data_row).lower()
    return (not contains_at_least_one(add_data, NEGATIVE_WORDS) and
            contains_at_least_one(add_data, POSITIVE_WORDS))

A bit far-fetched, but maybe you could even use sets to speed this up. This would require that you only look for whole-word matches (not partial matches, not multi-word matches):

def contains_at_least_one(data, words):
    return data.intersection(words)

def is_important(data_row):
    if data_row.get('important', None):
        return True
    add_data = set(get_additional_data(data_row).lower().split())  # set and split!
    return not contains_at_least_one(add_data, NEGATIVE_WORDS) and contains_at_least_one(add_data, POSITIVE_WORDS)

See also the regex suggestion in tobias_k's answer if you don't want punctuation to break your matching. However, the set approach is only meant as a small suggestion; I doubt it can be applied in your case, but you need to judge that.

MSeifert
  • Nice extra function! But I believe you're wrong about `any` being faster than the `for`-loop. Why would it be faster? And I tested them, three times each, alternating between them. The `for`-loop was slightly **faster**. Its times were 12.46 to 12.51 seconds while the `any` times were 12.61 to 12.63 seconds. My test: `timeit(lambda: contains_at_least_one(data, words), number=1000)` with `data = list(map(str, range(1000)))` and `words = list(map(str, range(1000, 2000)))`. – Stefan Pochmann Aug 30 '17 at 10:58
  • @StefanPochmann Oups, I walked right into it. Normally I always benchmark before I make such statements. I'll correct it. :) – MSeifert Aug 30 '17 at 11:03
  • Oups as well, I just remembered that my `data` should be a string, not a list. But I don't think that it affects my point. – Stefan Pochmann Aug 30 '17 at 11:05
  • Yes, it's the generator overhead that an explicit loop can avoid. That generator overhead is dominating even if `any` performs at C speed. For long `*_WORDS` it's almost 2 times faster to do the loop explicitly. – MSeifert Aug 30 '17 at 11:06
  • @StefanPochmann It affects the point in a way because `in` is `O(n)` which makes the helper function `O(n*m)` so the constant factors (like generator overhead) are less important. – MSeifert Aug 30 '17 at 11:10
  • Hmm... I also just tried `any(map(data.__contains__, words))`. **That** I **did** expect to be faster. But it was *even slower*, taking 12.60, 12.84 and 12.90 seconds. What do you think about this? (I'm of course using Python 3, btw) – Stefan Pochmann Aug 30 '17 at 11:10
  • @StefanPochmann For short iterables `map` creates an iterator (increasing the constant factor) for long iterables it could be faster (because the constant factor of the function call overhead is reduced for bound methods implemented in C), however I've seen a regression in `map` for bound methods in 3.6 (see also https://stackoverflow.com/questions/45601663/python-3-5-vs-3-6-what-made-map-slower-compared-to-comprehensions). All of that could factor in here. – MSeifert Aug 30 '17 at 11:13

Besides using any, you could also combine the different conditions into a single return statement, though whether that's clearer might be a matter of opinion.

def is_important(data_row):
    add_data = get_additional_data(data_row)
    return (data_row.get('important', None)
        or (not any(word in add_data.lower() for word in NEGATIVE_WORDS)
            and any(word in add_data.lower() for word in POSITIVE_WORDS)))

Though if get_additional_data is expensive, you might want to keep that first if separate, so that it is only called when the 'important' flag is not set.

Also, you can probably speed up the check by converting add_data to a set of (lowercase) words first, but this changes the logic slightly, as this will e.g. not match word-fragments.

def is_important(data_row):
    add_data = {word.lower() for word in get_additional_data(data_row).split()}
    return (data_row.get('important', None)
        or (not any(word in add_data for word in NEGATIVE_WORDS)
            and any(word in add_data for word in POSITIVE_WORDS)))

Or, instead of .split(), use e.g. re.findall(r"\w+", ...) to find the words without surrounding punctuation.
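
A minimal sketch of that variant; the helper name and the sample string are made up for this example:

import re

def word_set(text):
    # \w+ matches runs of letters, digits and underscores, so punctuation is dropped
    return set(re.findall(r"\w+", text.lower()))

print(word_set("Urgent: broken, please fix!"))  # {'urgent', 'broken', 'please', 'fix'} (order may vary)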

Depending on the size of the positive and negative word lists it might also pay off to invert the check, e.g. any(word in POSITIVE_WORDS for word in add_data.split()), particularly if those are already set structures with fast lookup.
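
A small sketch of that inverted check, assuming the word lists are kept as sets (the set contents and the function name are made up for this example):

NEGATIVE_WORDS = {'broken', 'bad'}    # hypothetical sets, just for this sketch
POSITIVE_WORDS = {'urgent', 'great'}

def is_important_text(add_data):
    words = add_data.lower().split()
    # Each membership test against a set is O(1) on average,
    # so the cost scales with the number of words in the text.
    return (not any(word in NEGATIVE_WORDS for word in words)
            and any(word in POSITIVE_WORDS for word in words))

print(is_important_text("This is great"))     # True
print(is_important_text("Great but broken"))  # False: a negative word is present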

tobias_k
  • Not really readable code (but you already mentioned that), but nice explanations! Just one additional note: when you convert to sets it won't match words containing any kind of whitespace, nor word-fragments. :) – MSeifert Aug 30 '17 at 09:39
  • @MSeifert Of course, that entirely depends on the words in those lists, whether they can contain whitespace or punctuation, and of course whether word-fragments should also be matched. – tobias_k Aug 30 '17 at 09:45