6

I have a large dataset (a data frame), and I want to find the number and the names of the categories in one of its columns.

For example, suppose my df looks like this:

 A    B
 1    car
 2    car
 3    bus
 4    car
 5    plane
 6    plane
 7    plane
 8    plane
 9    plane
 10   train

I would want to get:

  car
  bus
  plane
  train
  4

How would I do that?

user3443063
  • 1
    What do you mean by `number and names`? What number? For instance, where does the 4 come from? If you mean frequencies, you may want to use something like `table(df$B)`. – coffeinjunky Sep 02 '17 at 20:25

6 Answers

25
categories <- unique(yourDataFrame$yourColumn) 
numberOfCategories <- length(categories)

Pretty painless.
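
As a sketch, here is the same approach applied to the example data from the question. The construction of `df` is an assumption made only so the snippet runs on its own; `stringsAsFactors = FALSE` is set explicitly so the column is a character vector regardless of R version.

```r
# Rebuild the question's example data (illustrative assumption)
df <- data.frame(
  A = 1:10,
  B = c("car", "car", "bus", "car", rep("plane", 5), "train"),
  stringsAsFactors = FALSE
)

categories <- unique(df$B)                # distinct names, in order of first appearance
numberOfCategories <- length(categories)  # how many distinct names there are

print(categories)
print(numberOfCategories)
```

Note that `unique()` keeps the order in which values first appear, so the result is not sorted alphabetically.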

CCD
11

table() gives the unique values together with their frequencies, and the length of that table is the number of unique values:

table(df$B)
  bus   car plane train
    1     3     5     1

length(table(df$B))
[1] 4
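
If the names, the count, and the frequencies are all needed, they can be read off a single table object. This is a minimal sketch: `df` is rebuilt from the question's example and `counts` is a name introduced here for illustration.

```r
# Rebuild the question's example data (illustrative assumption)
df <- data.frame(
  A = 1:10,
  B = c("car", "car", "bus", "car", rep("plane", 5), "train"),
  stringsAsFactors = FALSE
)

counts <- table(df$B)           # one entry per distinct value of B

category_names <- names(counts)   # the category names
n_categories   <- length(counts)  # the number of categories
n_plane        <- counts["plane"] # frequency of a single category

print(category_names)
print(n_categories)
print(n_plane)
```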
8

You can simply use unique:

x <- unique(df$B)

This extracts the unique values in the column. You can combine it with apply (or lapply) to get the unique values of every column too!
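
One way to do this for every column is sketched below. It uses `lapply` rather than `apply`, because `apply` first converts the data frame to a matrix, which would turn the numeric column into text. The data frame is rebuilt from the question's example, and `unique_per_column` and `n_unique` are names made up for this sketch.

```r
# Rebuild the question's example data (illustrative assumption)
df <- data.frame(
  A = 1:10,
  B = c("car", "car", "bus", "car", rep("plane", 5), "train"),
  stringsAsFactors = FALSE
)

# A list with one element per column, each holding that column's unique values
unique_per_column <- lapply(df, unique)

# Number of unique values in each column, as a named integer vector
n_unique <- lengths(unique_per_column)

print(unique_per_column$B)
print(n_unique)
```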

sconfluentus
2

I would recommend you use factors here, if you are not already. It's straightforward and simple.

levels() gives the unique categories and nlevels() gives how many there are. Running droplevels() on the data first removes any unused levels, that is, levels that are still defined on the factor but no longer occur in the data.

with(droplevels(df), list(levels = levels(B), nlevels = nlevels(B)))
# $levels
# [1] "bus"   "car"   "plane" "train"
#
# $nlevels
# [1] 4
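
To see why droplevels() matters, here is a small sketch under the assumption that B is stored as a factor. The subset `no_train` is a hypothetical name introduced for illustration; it removes the only "train" row, yet the factor still remembers that level until it is dropped.

```r
# Rebuild the question's example data with B as a factor (illustrative assumption)
df <- data.frame(
  A = 1:10,
  B = factor(c("car", "car", "bus", "car", rep("plane", 5), "train"))
)

# Subsetting removes the rows but keeps the now-unused "train" level
no_train <- df[df$B != "train", ]

levels_before <- nlevels(no_train$B)              # unused level still counted
levels_after  <- nlevels(droplevels(no_train)$B)  # unused level removed

print(levels_before)
print(levels_after)
print(levels(droplevels(no_train)$B))
```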
Rich Scriven
1

Additionally, to see the counts sorted by frequency, you can use the following:

sort(table(df$B), decreasing = TRUE)

This lists the categories from the most frequent to the least frequent.

V C
0

First, make sure your column has the correct data type. Most probably R has read it in as 'chr', which you can check with 'str(df)'. For the example data you have provided, you will want to convert it to a 'factor':

df$column <- as.factor(df$column)

Once the data is in the correct format, you can use 'levels(df$column)' to list the levels present in the dataset.
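
Put together, the conversion might look like the sketch below. It assumes the question's example data, with the column called `B` instead of the placeholder `column`, and creates it as character to mimic data that was read in as 'chr'.

```r
# Rebuild the question's example data with B as character (illustrative assumption)
df <- data.frame(
  A = 1:10,
  B = c("car", "car", "bus", "car", rep("plane", 5), "train"),
  stringsAsFactors = FALSE
)

str(df)                                   # shows B as chr
had_levels <- !is.null(levels(df$B))      # a character vector has no levels

df$B <- as.factor(df$B)                   # convert to factor

print(levels(df$B))                       # the category names
print(nlevels(df$B))                      # the number of categories
```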