Working with Dataframes within R, what is level and factor

Question

Can someone help me with factors and levels within a dataframe please? I am very confused about how this works.

Here is what I am trying to do: add two rows to df.empty that have the RIGHT type of data:

df.empty <- data.frame(column1 = numeric(), column2 = character(), column3 = factor())
df.empty$column3<-factor(df.empty$column3,levels=c("A","B","C"))

I tried two things:

newRow <- c(-2,"MyString","B")
incorrectRow <- c(-2,"MyString","C")

The first one worked and the second one did not, and I can't figure out why. They are in the same format. I tried changing the "C" to "B" or "A", and it still doesn't work.

I think this has something to do with the levels=c("A","B","C") code above, but I'm not sure how.

You can add the data in and set the column classes afterwards, if that's what you mean. Do you need to have the factor levels set before you actually have those values in? — nycrefugee, Mar 16 '19 at 17:09
useful post explaining the basic concept: https://www.datamentor.io/r-programming/factor/ — RK1, Mar 16 '19 at 18:25

Omar113 · Answer 1 · 2019-03-16T17:27:19.780

If you are coming from a statistical background, you can think of a factor as a categorical variable. In R, a factor is a categorical variable that can have many levels. The levels are the set of distinct values the variable can take.
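To make the factor/levels distinction concrete, here is a minimal sketch. The vector `group` is made up for illustration and only borrows the level names from PlantGrowth.

```r
# A small factor built from a character vector (illustrative data only)
group <- factor(c("ctrl", "trt1", "ctrl", "trt2"))

levels(group)      # the distinct values: "ctrl" "trt1" "trt2"
nlevels(group)     # how many levels there are: 3
as.integer(group)  # the integer codes stored internally: 1 2 1 3
```

The values you see are labels; internally R stores an integer code that points into the levels.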

Let's load a data frame to examine that.

data("PlantGrowth")
head(PlantGrowth)
# the output shows a categorical column called 'group'
#
str(PlantGrowth)
# str() will tell you that this column is a factor with 3 levels ("ctrl", "trt1", "trt2")
#

Output

head(PlantGrowth)

  weight group
1   4.17  ctrl
2   5.58  ctrl
3   5.18  ctrl
4   6.11  ctrl
5   4.50  ctrl
6   4.61  ctrl

str(PlantGrowth)

'data.frame':   30 obs. of  2 variables:
 $ weight: num  4.17 5.58 5.18 6.11 4.5 4.61 5.17 4.53 5.33 5.14 ...
 $ group : Factor w/ 3 levels "ctrl","trt1",..: 1 1 1 1 1 1 1 1 1 1 ...

Your attempt does not add any rows, because all it does is define the distinct values (levels) the variable can take. So if you try str(df.empty), you will see the levels displayed, but still 0 observations:

> str(df.empty)
'data.frame':   0 obs. of  3 variables:
 $ column1: num 
 $ column2: Factor w/ 0 levels: 
 $ column3: Factor w/ 3 levels "A","B","C"

Lastly, if you want to append a row to a data frame, you can use rbind():

 newRow <- c(-2,"MyString","B") 
 incorrectRow <- c(-2,"MyString","C")

rbind(df.empty, newRow)
  X..2. X.MyString. X.B.
1    -2    MyString    B

rbind(df.empty, incorrectRow)
  X..2. X.MyString. X.C.
1    -2    MyString    C

Both of them should work for you!
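One thing worth checking about the two vectors above: c() builds an atomic vector, so mixing a number with strings coerces everything to character. The sketch below shows this, and uses a hypothetical `rowAsList` (not in the original code) to show that a list keeps each element's type.

```r
# c() coerces mixed input to a single type, here character
newRow <- c(-2, "MyString", "B")
class(newRow)   # "character"
newRow[1]       # "-2" (a string, no longer a number)

# a list keeps each element's own type (rowAsList is an illustrative name)
rowAsList <- list(-2, "MyString", "B")
sapply(rowAsList, class)  # "numeric" "character" "character"
```

This is probably why the column types do not survive when the row is built with c().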

Thank you very much! I tried your code as you listed above, I don't see how it is different from the original code (sorry, I am new at this). — independent1019, Mar 16 '19 at 18:01
No changes. I just clarified everything to show you what works! — Omar113, Mar 16 '19 at 18:31

Santiago Capobianco · Answer 2 · 2019-03-16T18:19:18.157

In order to preserve the classes of the defined variables you must do two things:

1) Set stringsAsFactors = FALSE, so the character variable doesn't become a factor.

2) The new row must be a list.

Like in this example:

> df.empty <- data.frame(column1 = numeric(), column2 = character(),
+                        column3 = factor(levels=c("A","B","C")), stringsAsFactors = FALSE)
> 
> newRow <- list(-2, "MyString","B")
> incorrectRow <- list(-2, "MyString", "C")
> 
> # Assign by row index so the column names are not mangled
> 
> df.empty[nrow(df.empty) + 1,] <- newRow
> df.empty[nrow(df.empty) + 1,] <- incorrectRow
> 
> df.empty
  column1  column2 column3
1      -2 MyString       B
2      -2 MyString       C
> summary(df.empty)
    column1     column2          column3
 Min.   :-2   Length:2           A:0    
 1st Qu.:-2   Class :character   B:1    
 Median :-2   Mode  :character   C:1    
 Mean   :-2                             
 3rd Qu.:-2                             
 Max.   :-2

For preserving the column names, the credit goes to this answer: https://stackoverflow.com/a/15718454/8382633

My first attempt was also with rbind, but it has some drawbacks. It doesn't preserve the column names, and it also converts all strings to factors or, if you set stringsAsFactors = FALSE, all factors to strings!

> df.empty <- rbind.data.frame(df.empty, newRow, incorrectRow)
> 
> summary(df.empty)
   c..2...2.  c..MyString....MyString.. c..B....C..
 Min.   :-2   MyString:2                B:1        
 1st Qu.:-2                             C:1        
 Median :-2                                        
 Mean   :-2                                        
 3rd Qu.:-2                                        
 Max.   :-2                                        
> class(df.empty$c..MyString....MyString..)
[1] "factor"

or with stringsAsFactors = FALSE:

> df.empty <- rbind.data.frame(df.empty, newRow, incorrectRow, stringsAsFactors = FALSE)
> 
> summary(df.empty)
   c..2...2.  c..MyString....MyString.. c..B....C..       
 Min.   :-2   Length:2                  Length:2          
 1st Qu.:-2   Class :character          Class :character  
 Median :-2   Mode  :character          Mode  :character  
 Mean   :-2                                               
 3rd Qu.:-2                                               
 Max.   :-2                                               
> 
> class(df.empty$c..B....C..)
[1] "character"
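If you would still rather use rbind, one possible workaround is to build each new row as a one-row data frame with the same column names and the same factor levels. This is only a sketch under that assumption; `lv`, `rowB`, `rowC` and `df.filled` are names I made up for the example.

```r
lv <- c("A", "B", "C")  # shared factor levels (illustrative helper)

df.empty <- data.frame(column1 = numeric(), column2 = character(),
                       column3 = factor(levels = lv), stringsAsFactors = FALSE)

# each new row is a one-row data frame with matching names and levels
rowB <- data.frame(column1 = -2, column2 = "MyString",
                   column3 = factor("B", levels = lv), stringsAsFactors = FALSE)
rowC <- data.frame(column1 = -2, column2 = "MyString",
                   column3 = factor("C", levels = lv), stringsAsFactors = FALSE)

df.filled <- rbind(df.empty, rowB, rowC)

sapply(df.filled, class)   # numeric, character, factor
levels(df.filled$column3)  # "A" "B" "C"
```

Because the rows already carry the right names and classes, rbind has nothing to guess.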

I thought this was close to a duplicate, but in the end this question raised more questions for me.

Hope it helps.

Yes this does! So the (levels=c("A","B","C")) means that column 3 will only take values "A","B", or "C", right? — independent1019, Mar 16 '19 at 18:02
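A quick way to check this yourself is the sketch below; the factor `f` is made up for illustration. Assigning a value that is not among the levels gives NA with a warning, and a new level has to be added before it can be used.

```r
f <- factor(c("A", "B"), levels = c("A", "B", "C"))

f[2] <- "C"   # fine: "C" is one of the declared levels
f[1] <- "D"   # warning (invalid factor level), and the element becomes NA
is.na(f[1])   # TRUE

# to allow "D", extend the levels first
levels(f) <- c(levels(f), "D")
f[1] <- "D"
levels(f)     # "A" "B" "C" "D"
```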
