8

I am fairly new to pandas, come from a statistics background, and am struggling with a conceptual problem: pandas has columns that contain values. But sometimes values have a special meaning, which statistical programs like SPSS or R call "value labels".

Imagine a column rain with two values: 0 (meaning: no rain) and 1 (meaning: raining). Is there a way to assign these labels to those values?

Is there a way to do this in pandas, too? Mainly for plotting and visualisation purposes.

buhtz
Christian Sauer
  • Do you want to store the values as strings or assign some special meaning later? i.e. use a lookup or add a new column that maps the values to human friendly values? Or do you just want this information in the legend of your plot? – EdChum Mar 19 '14 at 08:31
  • 1
    @EdChum Ideally, I want no new column at all. E.g. in SPSS the label is frequently used for displaying data in tables, plots etc., but you can use the numeric value for conditionals. At my work, I often have variables with 30+ different "labels" per column, so having the associated strings visible would be a huge help (e.g. avoiding the "what was the meaning of 21?" question) – Christian Sauer Mar 19 '14 at 08:38
  • You could add it as an attribute, which is general to Python and not specific to pandas, and access it for your plots. See related: http://stackoverflow.com/questions/14688306/adding-meta-information-metadata-to-pandas-dataframe – EdChum Mar 19 '14 at 08:42
  • 1
    That would probably not be used by any normal procedure, but thanks for the suggestion! – Christian Sauer Mar 19 '14 at 09:36

3 Answers

6

There's no need to use a map anymore. Since version 0.15, pandas allows a categorical data type for its columns. The stored data takes less space, operations on it are faster, and you can use labels.

I'm taking an example from the pandas docs:

import pandas as pd

df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})
#Recast grade as a categorical variable
df["grade"] = df["raw_grade"].astype("category")

df["grade"]

#Gives this:
Out[124]: 
0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): [a, b, e]

You can also rename categories and add missing categories.
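Applied to the rain example from the question, renaming and adding categories might look like the sketch below. The column name `rain_label` and the extra `"unknown"` category are made up for illustration; `rename_categories`, `add_categories` and `cat.codes` are part of the pandas `.cat` accessor.

```python
import pandas as pd

df = pd.DataFrame({"rain": [0, 1, 1, 0, 1]})

# Convert the numeric column to a categorical and rename its
# categories (0 and 1) to human-readable labels.
# "rain_label" is a hypothetical column name used for illustration.
df["rain_label"] = (
    df["rain"]
    .astype("category")
    .cat.rename_categories({0: "no rain", 1: "raining"})
)

# The integer codes stay available for conditionals. Note that codes
# are positions within the category list; they match the original
# 0/1 values here only because the categories were sorted as 0, 1.
codes = df["rain_label"].cat.codes

# Add a category that does not (yet) occur in the data.
df["rain_label"] = df["rain_label"].cat.add_categories(["unknown"])

# Labels can be used directly for filtering and counting.
rainy_days = df[df["rain_label"] == "raining"]
```

Plots and tables built from `rain_label` should then show the labels, while `codes` keeps a numeric representation around.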

cd98
4

You could have a separate dictionary which maps values to labels:

d={0:"no rain",1:"raining"}

and then you could access the labelled data by doing

df.rain_column.apply(lambda x:d[x])
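One practical difference between the dictionary lookup via `apply` and the `map` variant mentioned in the comments shows up when a value has no label. A minimal sketch, assuming a made-up series in which the value 2 is missing from the dictionary:

```python
import pandas as pd

labels = {0: "no rain", 1: "raining"}
# Hypothetical data: the value 2 has no entry in `labels`.
rain = pd.Series([0, 1, 1, 2], name="rain")

# Series.map with a dict turns unmapped values into NaN.
mapped = rain.map(labels)

# apply with a plain dict lookup raises KeyError instead.
try:
    rain.apply(lambda x: labels[x])
    apply_failed = False
except KeyError:
    apply_failed = True

# One way to give unlabelled values a fallback label.
with_fallback = rain.map(labels).fillna("unknown")
```

Which behaviour is preferable depends on whether an unlabelled value should be treated as an error or as missing data.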
xxx
grasshopper
  • 2
    `map` might be better for this simple case – EdChum Mar 19 '14 at 09:30
  • What is the difference in this case? – grasshopper Mar 19 '14 at 09:34
  • 3
    Only better in terms of simpler syntax: `df.rain_column.map(d)`, and perhaps faster performance-wise; it depends on data size and type. For a dataframe with 100 rows, `apply` is marginally faster (apply 228 us vs map 287 us); for one with 10000 rows, `map` is 26 times faster (map 512 us vs apply 13 ms) – EdChum Mar 19 '14 at 10:10
  • Alright, this makes a lot of sense, since apply is more general purpose than map. – grasshopper Mar 19 '14 at 10:12
  • I will accept cd98's answer, which is better for newer versions of pandas, if that's ok with you. – Christian Sauer Sep 24 '15 at 08:30
0

Map is nice if you do not have the categories baked into the dataframe.

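import pandas as pd

rainCategories = {1: "raining", 0: "no rain"}
dfRain = pd.DataFrame({"RainFall":[0,1,1,1,0],"day":["M","T","W","R","F"]})
# The column is named "RainFall" (capital R); "rainFall" would raise a KeyError
dfRain["RainFall"].map(rainCategories).value_counts()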

outputs:

RainFall
raining    3
no rain    2
Name: count, dtype: int64
w. Patrick Gale