7

Tl;dr is bold-faced text.

I'm working with an image dataset that comes with boolean "one-hot" image annotations (CelebA, to be specific). The annotations encode facial features like bald, male, young. Now I want to make a custom one-hot list (to test my GAN model). I want to provide a literate interface. I.e., rather than specifying features[12]=True knowing that 12 - counting from zero - corresponds to the male feature, I want something like features[male]=True or features.male=True.

Suppose the header of my .txt file is

Arched_Eyebrows Attractive Bags_Under_Eyes Bald Bangs Chubby Male Wearing_Necktie Young

and I want to codify Young, Bald, and Chubby. The expected output is

[ 0.  0.  0.  1.  0.  1.  0.  0.  1.]

since Bald is the fourth entry of the header, Chubby is the sixth, and so on. What is the clearest way to do this without expecting a user to know Bald is the fourth entry, etc.?

I'm looking for a Pythonic way, not necessarily the fastest way.

Ideal Features

In rough order of importance:

  1. A way to accomplish my stated goal that is already standard in the Python community will take precedence.
  2. A user/programmer should not need to count to an attribute in the .txt header. This is the point of what I'm trying to design.
  3. A user should not be expected to have non-standard libraries like aenum.
  4. A user/programmer should not need to reference the .txt header for attribute names/available attributes. One example: if a user wants to specify the gender attribute but does not know whether to use male or female, it should be easy to find out.
  5. A user/programmer should be able to find out the available attributes via documentation (ideally generated by Sphinx api-doc). That is, the point 4 should be possible reading as little code as possible. Attribute exposure with dir() sufficiently satisfies this point.
  6. The programmer should find the indexing tool natural. Specifically, zero-indexing should be preferred over subtracting from one-indexing.
  7. Between two otherwise completely identical solutions, one with better performance would win.

Examples:

I'm going to compare and contrast the ways that immediately came to my mind. All examples use:

import numpy as np
header = ("Arched_Eyebrows Attractive Bags_Under_Eyes "
          "Bald Bangs Chubby Male Wearing_Necktie Young")
NUM_CLASSES = len(header.split())  # 9

1: Dict Comprehension

Obviously we could use a dictionary to accomplish this:

binary_label = np.zeros([NUM_CLASSES])
classes = {head: idx for (idx, head) in enumerate(header.split())}
binary_label[[classes["Young"], classes["Bald"], classes["Chubby"]]] = True
print(binary_label)

For what it's worth, this has the fewest lines of code and is the only one that relies on builtins alone rather than the standard library. As for negatives, it isn't exactly self-documenting. To see the available options, you must print(classes.keys()) - it's not exposed with dir(). This borders on not satisfying feature 5 because it requires a user to know classes is a dict to expose the features, AFAIK.

2: Enum:

Since I'm learning C++ right now, Enum is the first thing that came to mind:

import enum
binary_label = np.zeros([NUM_CLASSES])
Classes = enum.IntEnum("Classes", header)
features = [Classes.Young, Classes.Bald, Classes.Chubby]
zero_idx_feats = [feat-1 for feat in features]
binary_label[zero_idx_feats] = True
print(binary_label)

This gives dot notation and the image options are exposed with dir(Classes). However, enum uses one-indexing by default (the reason is documented). The work-around makes me feel like enum is not the Pythonic way to do this, and entirely fails to satisfy feature 6.

3: Named Tuple

Here's another one out of the standard Python library:

import collections
binary_label = np.zeros([NUM_CLASSES])
clss = collections.namedtuple(
    "Classes", header)._make(range(NUM_CLASSES))
binary_label[[clss.Young, clss.Bald, clss.Chubby]] = True
print(binary_label)

Using namedtuple, we again get dot notation and self-documentation with dir(clss). But, the namedtuple class is heavier than enum. By this I mean, namedtuple has functionality I do not need. This solution appears to be a leader among my examples, but I do not know if it satisfies feature 1 or if an alternative could "win" via feature 7.

4: Custom Enum

I could really break my back:

import enum
binary_label = np.zeros([NUM_CLASSES])
class Classes(enum.IntEnum):
    Arched_Eyebrows = 0
    Attractive = 1
    Bags_Under_Eyes = 2
    Bald = 3
    Bangs = 4
    Chubby = 5
    Male = 6
    Wearing_Necktie = 7
    Young = 8
binary_label[
    [Classes.Young, Classes.Bald, Classes.Chubby]] = True
print(binary_label)

This has all the advantages of Ex. 2, but it comes with obvious drawbacks. I have to write out all the features (there are 40 in the real dataset) just to zero-index! Sure, this is how to make an enum in C++ (AFAIK), but it shouldn't be necessary in Python. This is a slight failure on feature 6.

Summary

There are many ways to accomplish literate zero-indexing in Python. Would you provide a code snippet of how you would accomplish what I'm after and tell me why your way is right?

(edit:) Or explain why one of my examples is the right tool for the job?


Status Update:

I'm not ready to accept an answer yet in case anyone wants to address the following feedback/update, or any new solution appears. Maybe another 24 hours? All the responses have been helpful, so I upvoted everyone's so far. You may want to look over this repo I'm using to test solutions. Feel free to tell me if my following remarks are (in)accurate or unfair:

zero-enum:

Oddly, Sphinx documents this incorrectly (one-indexed in docs), but it does document it! I suppose that "issue" doesn't fail any ideal feature.

dotdict:

I feel that Map is overkill, but dotdict is acceptable. Thanks to both answerers that got this solution working with dir(). However, it doesn't appear that it "works seamlessly" with Sphinx.

Numpy record:

As written, this solution takes significantly longer than the other solutions. It comes in at 10x slower than a namedtuple (fastest behind pure dict) and 7x slower than standard IntEnum (slowest behind numpy record). That's not drastic at current scale, nor a priority, but a quick Google search indicates np.in1d is in fact slow. Let's stick with

_label = np.zeros([NUM_CLASSES])
_label[[header_rec[key].item() for key in ["Young", "Bald", "Chubby"]]] = True

unless I've implemented something wrong in the linked repo. This brings the execution speed into a range that compares with the other solutions. Again, no Sphinx.
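For anyone who wants to reproduce this kind of comparison, here is a rough sketch of how the lookups could be timed with timeit. It is not the code from the linked repo; the names as_dict, as_tuple, as_enum and candidates are made up for illustration, and it only times a single attribute lookup, so the ratios will not necessarily match the ones quoted above.

```python
import collections
import enum
import timeit

header = ("Arched_Eyebrows Attractive Bags_Under_Eyes "
          "Bald Bangs Chubby Male Wearing_Necktie Young")
names = header.split()

# Three of the candidate containers, each mapping a name to its zero-based index.
as_dict = {name: idx for idx, name in enumerate(names)}
as_tuple = collections.namedtuple("Classes", names)._make(range(len(names)))
as_enum = enum.IntEnum("Classes", names, start=0)

# Hypothetical benchmark: one lookup of the same attribute per container.
candidates = {
    "dict": lambda: as_dict["Young"],
    "namedtuple": lambda: as_tuple.Young,
    "IntEnum": lambda: as_enum.Young,
}

timings = {label: timeit.timeit(func, number=100000)
           for label, func in candidates.items()}

# Print the fastest candidate first.
for label, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print("{:<12}{:.4f} s".format(label, seconds))
```

Absolute numbers depend on the machine and Python version, so only the relative ordering is worth reading.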

namedtuple (and rassar's critiques)

I'm not convinced of your enum critique. It seems to me that you believe I'm approaching the problem wrong. It's fine to call me out on that, but I don't see how using the namedtuple is fundamentally different from "Enum [which] will provide separate values for each constant." Have I misunderstood you?

Regardless, namedtuple appears in Sphinx (correctly numbered, for what it's worth). On the Ideal Features list, this chalks up identically to zero-enum and profiles ahead of zero-enum.

Accepted Rationale

I accepted the zero-enum answer because the answer gave me the best challenger for namedtuple. By my standards, namedtuple is marginally the best solution. But salparadise wrote the answer that helped me feel confident in that assessment. Thanks to all who answered.

Dylan F
  • 2
    The only drawback you gave for `namedtuple` is that it's "too heavy". You didn't make it sound like it was missing features you need, nor that the extra features would *actively get in the way*. So to me, this is the clear and obvious winner. I don't think there is anything more Pythonic than taking a well-known, well-understood standard library feature which meets all your requirements and just using it. – John Y Dec 27 '17 at 22:41
  • Of your examples, 3 is the most pythonic - 1 is not explicit, 4 is too explicit, and generally enums are not pythonic for things like this, especially with the indexing difference. 3 is more readable, uses well-known stdlib, and has no real drawbacks. – rassar Dec 27 '17 at 22:50
  • @JohnY See the edit. Your comment would serve as an answer to how I intended the question. – Dylan F Dec 27 '17 at 22:58
  • @rassar See my comment to JohnY above. – Dylan F Dec 27 '17 at 22:58
  • 2
    This seems to be an incomplete/too subjective question. In each of the cases you present disparate and sometimes conflicting arguments. For the dictionary example you mention preference for built-ins over standard library. Then you mention need to use `dir()`. For namedtuple you mention being too heavy. If you can clearly state what the goal of the program is from the programmer (you) and the user point of view, i.e. how the program is meant to be used, why the user has to use `dir()` and not `classes.keys()` then it will be easier to answer. Maybe a list of required features too? – SigmaPiEpsilon Dec 27 '17 at 23:31
  • @SigmaPiEpsilon Valid criticism. My initial thought was to leave the question open to general comparison of advantages and disadvantages. But, a somewhat-subjective-by-nature question needs more rigor. I tried to address your concerns with an edit. – Dylan F Dec 28 '17 at 00:16

4 Answers

3

How about a factory function to create a zero-indexed IntEnum, since that is the object that suits your needs and Enum provides flexibility in construction:

from enum import IntEnum

def zero_indexed_enum(name, items):
    # splits on space, so it won't take any iterable. Easy to change depending on need.
    return IntEnum(name, ((item, value) for value, item in enumerate(items.split())))

Then:

In [43]: header = ("Arched_Eyebrows Attractive Bags_Under_Eyes "
    ...:           "Bald Bangs Chubby Male Wearing_Necktie Young")
In [44]: Classes = zero_indexed_enum('Classes', header)

In [45]: list(Classes)
Out[45]:
[<Classes.Arched_Eyebrows: 0>,
 <Classes.Attractive: 1>,
 <Classes.Bags_Under_Eyes: 2>,
 <Classes.Bald: 3>,
 <Classes.Bangs: 4>,
 <Classes.Chubby: 5>,
 <Classes.Male: 6>,
 <Classes.Wearing_Necktie: 7>,
 <Classes.Young: 8>]
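As a side note, on Python 3.5 and later the functional API itself accepts a start keyword, so the same zero-indexed enum can probably be built without the helper. A minimal sketch, reusing the header from the question and assuming numpy is available as in the question's examples:

```python
import enum

import numpy as np

header = ("Arched_Eyebrows Attractive Bags_Under_Eyes "
          "Bald Bangs Chubby Male Wearing_Necktie Young")

# start=0 makes the auto-assigned values begin at 0 instead of the default 1.
Classes = enum.IntEnum("Classes", header, start=0)

binary_label = np.zeros(len(Classes))
# .value gives the plain integer index of each member.
wanted = [Classes.Young, Classes.Bald, Classes.Chubby]
binary_label[[member.value for member in wanted]] = 1
print(binary_label)
```

The factory function above still has its place if the names come from something other than a whitespace-separated string, or if the code has to run on a Python older than 3.5.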
salparadise
2

You can use a custom class which I like to call DotMap or, as mentioned in this SO discussion, Map:

About Map:

  • It has the features of a dictionary since the input to a Map/DotMap is a dict. You can access attributes using features['male'].
  • Additionally you can access the attributes using dot i.e. features.male and the attributes will be exposed when you do dir(features).
  • It is only as heavy as it needs to be in order to enable the dot functionality.
  • Unlike namedtuple you don't need to pre-define it and you can add and remove keys willy nilly.
  • The Map function described in the SO question is not Python3 compatible because it uses iteritems(). Just replace it with items() instead.

About dotdict:

  • dotdict provides the same advantages of Map with the exception that it does not override the dir() method therefore you will not be able to obtain the attributes for documentation. @SigmaPiEpsilon has provided a fix for this here.
  • It uses the dict.get method instead of dict.__getitem__, so it will return None instead of raising KeyError when you access attributes that don't exist.
  • It does not recursively apply dotdict-iness to nested dicts therefore you won't be able to use features.foo.bar.

Here's the updated version of dotdict which solves the first two issues:

class dotdict(dict):
    __getattr__ = dict.__getitem__  # __getitem__ instead of get
    __setattr__ = dict.__setitem__
    __delattr__ = dict.__delitem__
    def __dir__(self):              # by @SigmaPiEpsilon for documentation
        return self.keys()
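A usage sketch applied to the header from the question. This is a variation rather than the exact class above: it assumes Python 3, and it swaps __getattr__ for a small method that raises AttributeError instead of KeyError, so that hasattr() keeps working. The variable name classes is only for illustration.

```python
class dotdict(dict):
    """dict whose keys are also readable as attributes (illustrative sketch)."""

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        try:
            return self[name]
        except KeyError:
            # AttributeError keeps hasattr() and getattr(obj, name, default) working.
            raise AttributeError(name) from None

    __setattr__ = dict.__setitem__
    __delattr__ = dict.__delitem__

    def __dir__(self):
        # Expose only the keys, as a plain list.
        return list(self.keys())


header = ("Arched_Eyebrows Attractive Bags_Under_Eyes "
          "Bald Bangs Chubby Male Wearing_Necktie Young")
classes = dotdict((name, idx) for idx, name in enumerate(header.split()))

print(classes.Bald)                # same value as classes["Bald"]
print("Male" in dir(classes))      # the header names show up in dir()
print(hasattr(classes, "Female"))  # a missing name is reported, not returned as None
```

With the version above that binds dict.__getitem__ directly, the last line would raise KeyError instead, which may or may not be what you want.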

Update

Map and dotdict don't have the same behavior as pointed out by @SigmaPiEpsilon so I added separate descriptions for both.

nitred
  • Defining an entire new class might be overkill for this question. – rassar Dec 28 '17 at 00:21
  • 1
    @rassar OP seemed ok with it since method-4 in the question is a custom class. And the second link i.e. `shorter lighter version` is pretty small anyway. – nitred Dec 28 '17 at 00:26
  • +1 for a creative response and discussing advantages! I agree that the "shorter lighter version" could be appropriate. – Dylan F Dec 28 '17 at 00:29
  • These methods are not sufficient on their own as none of these classes override `__dir__()` which is what is called by `dir()`. Using `dir()` on these instances will only give the attributes of python `dict`. The primary need for dot access as I understood was to facilitate easy documentation through sphinx by `dir()`. See my answer below. – SigmaPiEpsilon Dec 28 '17 at 04:23
1

Of your examples, 3 is the most pythonic answer to your question.

1, as you said, does not even answer your question, since the names are not explicit.

2 uses enums, which though being in the standard library are not pythonic and generally not used in these scenarios in Python. (Edit): In this case you only really need two different constants - the target values and the other ones. An Enum will provide separate values for each constant, which is not what the goal of your program is and seems to be a roundabout way of approaching the problem.

4 is just not maintainable if a client wants to add options, and even as it is it's painstaking work.

3 uses well-known classes from the standard library in a readable and succinct way. Also, it does not have any drawbacks, as it is perfectly explicit. Being too "heavy" doesn't matter if you don't care about performance, and anyway the lag will be unnoticeable with your input size.
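To make the case for 3 a bit more concrete, here is a small sketch of how it could be wrapped up for a user. The helper name one_hot is hypothetical, not something from the question; _fields and _make are standard namedtuple features.

```python
import collections

import numpy as np

header = ("Arched_Eyebrows Attractive Bags_Under_Eyes "
          "Bald Bangs Chubby Male Wearing_Necktie Young")
names = header.split()

# Each field holds its own zero-based position in the header.
Classes = collections.namedtuple("Classes", names)
clss = Classes._make(range(len(names)))


def one_hot(*attributes):
    """Hypothetical helper: build a label vector from attribute names."""
    label = np.zeros(len(clss))
    label[[getattr(clss, name) for name in attributes]] = 1
    return label


print(clss._fields)  # available attribute names, in header order
print(one_hot("Young", "Bald", "Chubby"))
```

A misspelled attribute name raises AttributeError right away, which is arguably friendlier than a silently wrong index.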

rassar
  • I agree with your assessment and I appreciate additional feedback on my examples. I'm curious: Why do you say ``enums`` are not Pythonic? Could you maybe include a short, good use for enums if they're not for enumerating lists? I understand this is an additional question, but it would support your argument for ``namedtuple``. – Dylan F Dec 28 '17 at 00:36
1

Your requirements, if I understand correctly, can be divided into two parts:

  1. Access the position of header elements in the .txt by name in the most pythonic way possible and with minimum external dependencies

  2. Enable dot access to the data structure containing the names of the headers to be able to call dir() and setup easy interface with Sphinx

Pure Python Way (no external dependencies)

The most pythonic way to solve the problem is of course the method using dictionaries (dictionaries are at the heart of Python). Looking up a key in a dictionary is also much faster than the other methods. The only problem is that this prevents dot access. Another answer mentions Map and dotdict as alternatives. dotdict is simpler, but it only enables dot access; it will not help with the documentation aspect, since dir() calls the __dir__() method, which is not overridden in these cases. Hence it will only return the attributes of a Python dict and not the header names. See below:

>>> class dotdict(dict):
...     __getattr__ = dict.get
...     __setattr__ = dict.__setitem__
...     __delattr__ = dict.__delitem__
... 
>>> somedict = {'a' : 1, 'b': 2, 'c' : 3}                                                                                                          
>>> somedotdict = dotdict(somedict)
>>> somedotdict.a
1
>>> 'a' in dir(somedotdict)
False

There are two options to get around this problem.

Option 1: Override the __dir__() method like below. But this will only work when you call dir() on the instances of the class. To make the changes apply for the class itself you have to create a metaclass for the class. See here

#add this to dotdict
def __dir__(self):
    return self.keys()

>>> somedotdictdir = dotdictdir(somedict)
>>> somedotdictdir.a
1
>>> dir(somedotdictdir)
['a', 'b', 'c']

Option 2: A second option, which makes it much closer to a user-defined object with attributes, is to update the __dict__ attribute of the created object. This is what Map also uses. A normal Python dict does not have this attribute. If you add this, then you can call dir() to get the attributes/keys and also all the additional methods/attributes of a Python dict. If you just want the stored attributes and values, you can use vars(somedotdictdir), which is also useful for documentation.

class dotdictdir(dict):

    def __init__(self, *args, **kwargs):
        dict.__init__(self, *args, **kwargs)
        self.__dict__.update({k : v for k,v in self.items()})
    def __setitem__(self, key, value):
        dict.__setitem__(self, key, value)
        self.__dict__.update({key : value})
    __getattr__ = dict.get #replace with dict.__getitem__ if want raise error on missing key access
    __setattr__ = __setitem__
    __delattr__ = dict.__delitem__

>>> somedotdictdir = dotdictdir(somedict)
>>> somedotdictdir
{'a': 1, 'b': 2, 'c': 3}
>>> vars(somedotdictdir)
{'a': 1, 'b': 2, 'c': 3}
>>> 'a' in dir(somedotdictdir)
True

Numpy way

Another option is to use a numpy record array, which allows dot access. I noticed that you are already using numpy in your code. In this case too, __dir__() has to be overridden to get the attributes. This may result in faster operations (not tested) for data with lots of other numeric values.

>>> headers = "Arched_Eyebrows Attractive Bags_Under_Eyes Bald Bangs Chubby Male Wearing_Necktie Young".split()
>>> header_rec = np.array([tuple(range(len(headers)))], dtype = zip(headers, [int]*len(headers)))
>>> header_rec.dtype.names                                                                                                                           
('Arched_Eyebrows', 'Attractive', 'Bags_Under_Eyes', 'Bald', 'Bangs', 'Chubby', 'Male', 'Wearing_Necktie', 'Young')
>>> np.in1d(header_rec.item(), [header_rec[key].item() for key in ["Young", "Bald", "Chubby"]]).astype(int)
array([0, 0, 0, 1, 0, 1, 0, 0, 1])

In Python 3, you will need to use dtype=list(zip(headers, [int]*len(headers))) since zip returns an iterator there instead of a list.
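Put together, a Python 3 version of the above might look like the sketch below. It uses plain index assignment instead of np.in1d, and the recarray view at the end is my assumption about the simplest way to get attribute-style access to the fields; the variable names wanted, label and rec are only for illustration.

```python
import numpy as np

headers = ("Arched_Eyebrows Attractive Bags_Under_Eyes "
           "Bald Bangs Chubby Male Wearing_Necktie Young").split()

# One record whose fields are named after the headers and hold their positions.
dtype = list(zip(headers, [int] * len(headers)))
header_rec = np.array([tuple(range(len(headers)))], dtype=dtype)

# Look up the positions by field name and switch those entries on.
wanted = [header_rec[key].item() for key in ["Young", "Bald", "Chubby"]]
label = np.zeros(len(headers))
label[wanted] = 1
print(label)

# Viewing the structured array as a recarray adds attribute-style field access.
rec = header_rec.view(np.recarray)
print(rec.Bald)
```

Note that rec.Bald is a one-element array rather than a bare integer, so it needs .item() or [0] before being used as a single index.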

SigmaPiEpsilon