3

I am trying to write a simple record classifier. I want to add a column whose value classifies a record. I want to codify my classification rules in a yaml, or similar file for maintenance purposes.

I am using Pandas as that seems to be the best way to do this with csv records in python. I am open to other suggestions. I am new to pandas and my python skills are politely described as "why does this look like perl?"

I've gotten a dataframe (trans) and I want to apply my rules as follows:

trans['class'][(trans['foo'] > 5) & (trans['bar'].str.contains(re.compile('baz|one|two', re.I))] = 'Record Type 1'

This works interactively. I would like to be able to generate the classifying index, "(trans['foo'] > 5) & (trans['bar'].str.contains(re.compile('baz|one|two', re.I))" dynamically from each rule in my yaml file. I have successfully built strings such that I have things like:

slice = "(trans['foo'] > 5) & (trans['bar'].str.contains(re.compile('baz|one|two', re.I))" trans['class'][slice] = 'Record Type 1'

This doesn't work. What should I be doing instead?

jpp
  • 159,742
  • 34
  • 281
  • 339
bex
  • 103
  • 1
  • 5

2 Answers2

2

Some points to note:

  1. Quotation marks denote strings in Python. Don't use them to surround calculation of Boolean masks.
  2. Don't use chained indexing. It's explicitly discouraged in the docs and can lead to unexpected side-effects, or ambiguity as to whether you are modifying a view or a copy. You can use pd.DataFrame.loc instead.
  3. pd.Series.str.contains already supports regex and defaults to regex=True, you don't need to use the re module.

For readability, you can split and combine masks. Here's an example:

m1 = trans['foo'] > 5
m2 = trans['bar'].str.contains('baz|one|two', case=False)

trans.loc[m1 & m2, 'class'] = 'Record Type 1'

The usually expensive part, calculation of m2, can be optimized by resorting to specialist algorithms, see this answer for details.

jpp
  • 159,742
  • 34
  • 281
  • 339
  • How would you handle a situation where there are unknown number of conditions until run-time? i.e. there could be a need for m1, m1&m2 or m1&m2&m3 and so on. Can you iterate on this like m = m & new_condition? – bex Oct 09 '18 at 20:31
  • Sure, you can link an arbitrary number of Boolean series in this way, e.g. `m = m1 & m2 & (m3 | m4) & (m5 | m6)`, then calculate `m = m & m7`, etc.. – jpp Oct 09 '18 at 22:32
  • 1
    Thank you - I clicked up and forgot the check - doh! This looks super helpful. Thank you. – bex Oct 10 '18 at 07:16
1

I think you shouldn't put the condition inside the quote. So it should be

slice = (trans[`foo`] > 5) & (trans['bar'].str.contains(re.compile('baz|one|two', re.I)))
trans['class'][slice] = "Record Type 1"
ipramusinto
  • 2,310
  • 2
  • 14
  • 24
  • If the expression came as a string `s`, you can do `slice = eval(s)`. And notes from @jpp are all great and worth to consider, especially number 2nd. – ipramusinto Oct 08 '18 at 09:39
  • I have built my own string in code, the eval is what was missing. I misunderstood what the slice was representing in the chained logic. I read the chained conversation in brief in the docs and decided it didn't apply to my situation as I wasn't concerned about overwriting the dataframe. That said, I am planning to avoid it now that I know how. – bex Oct 10 '18 at 07:49