
I have a dataset with 1000 observations and 10 columns. Data can be found here. Every column represents a categorical variable, each with a different number of factor levels, and all columns are defined as factors. The data is imported from a CSV file like this:

# 'directory' is a placeholder for the folder that holds name.csv
raw_data <- read.csv2(file.path(directory, "name.csv"), colClasses = rep("factor", 10), na.strings = "")

R imports the data correctly, and str(raw_data) shows how many factor levels each variable has. Perfect so far.


I then take a random sample of 100 observations from this data and save them to a new data frame.

comp_data <- raw_data[sample(nrow(raw_data), 100), ]

The problem is that, while sampling, it may happen that not a single observation with a specific factor level is drawn. Let's say Education has 5 factor levels in the complete dataset (raw_data), and only 76 of the 1000 individuals have "Partial High School" as their Education. It is then reasonable to assume that, in some samples, none of those 76 observations will be drawn. In that case only 4 levels are actually present in the sampled dataset (comp_data), but str(comp_data) still reports 5 factor levels. The number of levels has not been updated, even though in reality there are now only 4 levels and not 5.
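A minimal sketch of this behaviour, using a small made-up data frame (toy and its values are assumptions for illustration, not the real dataset):

```r
# Toy data (made up): one rare level among 100 rows
toy <- data.frame(
  Education = factor(c(rep("High School", 50), rep("Bachelors", 49),
                       "Partial High School"))
)

# Keep only the rows that do not contain the rare level
sub <- toy[toy$Education != "Partial High School", , drop = FALSE]

nlevels(sub$Education)         # 3: levels still declared on the factor
length(unique(sub$Education))  # 2: levels actually present in the rows
table(sub$Education)           # the unused level appears with a count of 0
```

Subsetting rows never changes the levels attribute of a factor, which is why the declared and the observed number of levels can differ.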

I need a way to automatically update the number of factor levels after sampling.

This matters later on, because I calculate contingency coefficients and Cramér's V between pairs of variables. Those functions return NA whenever a sampled variable still carries a level that no longer occurs in the data.
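One plausible mechanism, assuming the statistics are built on chisq.test(): an unused level produces an all-zero row in the contingency table, so the expected counts in that row are 0 and the chi-squared statistic becomes NaN. The helper cramers_v below is a hypothetical sketch, not the function actually used here, and the data is made up:

```r
# Hypothetical helper: Cramér's V from the Pearson chi-squared statistic
cramers_v <- function(x, y) {
  tab  <- table(x, y)
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  unname(sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1))))
}

# Made-up data: level "c" is declared on x but never observed
x <- factor(c("a", "a", "b", "b", "a", "b"), levels = c("a", "b", "c"))
y <- factor(c("u", "v", "u", "v", "v", "u"))

# Same values as x, but without the unused level
x_used <- factor(as.character(x))

cramers_v(x, y)       # NaN: the empty "c" row has expected counts of 0
cramers_v(x_used, y)  # a finite value
```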

Thanks for the help. Kevin

Kevin

2 Answers


The problem is that R can't know which levels the factors are supposed to have unless you provide that information. To solve this, create a list of levels whose element names correspond to the variable names of the data. You can then apply the same list to every version of the data you load.

lev.lst <- list(Marital.Status=c("Married", "Single", "Divorced", "Widowed"),
                Gender=c("Female", "Male"), 
                Children=as.character(1:10), 
                Education=c("Partial High School", "High School", "Partial College", "Bachelors", "Masters", "Graduate Degree"), 
                Occupation=c("Skilled Manual", "Clerical", "Professional", "Manual", "Management"), 
                Home.Owner=c("Yes", "No"), 
                Cars=as.character(1:5), 
                Commute.Distance=c("0-1 Miles", "1-2 Miles", "2-5 Miles", "5-10 Miles", "10+ Miles"), 
                Region=c("Europe", "Pacific", "North America", "Asia", "Africa"), 
                Purchased.Bike=c("No", "Yes"))

Next, use read.csv with colClasses='character', so that the factor columns are first read as plain character vectors.

dat <- read.csv('https://pastebin.com/raw/ut447XdE', sep=';', colClasses='character')

Now, to avoid a mess, create a vector with the names of the factor variables.

facs <- names(lev.lst)

Finally, use Map to apply the factor function to the facs columns of your data frame. Map pairs each column with the matching element of lev.lst, which is passed as the levels= argument.

dat[facs] <- Map(factor, dat[facs], lev.lst)

This gives:

str(dat)
# List of 10
#  $ Marital.Status  : Factor w/ 4 levels "Married","Single",..: 1 1 1 2 2 1 2 1 1 1 ...
#  $ Gender          : Factor w/ 2 levels "Female","Male": 1 2 2 2 2 1 2 2 2 2 ...
#  $ Children        : Factor w/ 10 levels "1","2","3","4",..: 1 3 5 NA NA 2 2 1 2 2 ...
#  $ Education       : Factor w/ 6 levels "Partial High School",..: 4 3 3 4 4 3 2 4 1 3 ...
#  $ Occupation      : Factor w/ 5 levels "Skilled Manual",..: 1 2 3 3 2 4 5 1 2 4 ...
#  $ Home.Owner      : Factor w/ 2 levels "Yes","No": 1 1 2 1 2 1 1 1 1 1 ...
#  $ Cars            : Factor w/ 5 levels "1","2","3","4",..: NA 1 2 1 NA NA 4 NA 2 1 ...
#  $ Commute.Distance: Factor w/ 5 levels "0-1 Miles","1-2 Miles",..: 1 1 3 4 1 2 1 1 4 1 ...
#  $ Region          : Factor w/ 5 levels "Europe","Pacific",..: 1 1 1 2 1 1 2 1 2 1 ...
#  $ Purchased.Bike  : Factor w/ 2 levels "No","Yes": 1 1 1 2 2 1 2 2 1 2 ...
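The same Map pattern can be tried on a small made-up data frame. The names toy and toy_lev and their contents are assumptions for illustration only:

```r
# Made-up miniature data; column names and levels are illustrative only
toy <- data.frame(Gender     = c("Male", "Female", "Male"),
                  Home.Owner = c("Yes", "Yes", "Yes"),
                  stringsAsFactors = FALSE)
toy_lev <- list(Gender = c("Female", "Male"), Home.Owner = c("Yes", "No"))

cols <- names(toy_lev)
# Map calls factor(toy[[col]], toy_lev[[col]]) column by column, so the
# second argument is matched to the 'levels' parameter of factor()
toy[cols] <- Map(factor, toy[cols], toy_lev)

levels(toy$Home.Owner)  # "Yes" "No", although "No" never occurs in the data
```

Note that factor() turns any value missing from the supplied levels into NA, so the levels list has to cover every value that should survive.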
jay.sf

I've found an answer myself that is much shorter and less cumbersome. The droplevels() function drops all unused levels from every factor in a data frame. Usage:

df <- droplevels(df)
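A short sketch of the effect on a made-up data frame (toy is an assumption, not the real data). It also shows the except argument of droplevels(), which leaves the listed columns untouched:

```r
# Made-up data: each factor declares one level that never occurs
toy <- data.frame(
  Education = factor(c("High School", "Bachelors", "High School"),
                     levels = c("Partial High School", "High School", "Bachelors")),
  Gender    = factor(c("Male", "Male", "Male"), levels = c("Female", "Male"))
)

sapply(toy, nlevels)                          # Education 3, Gender 2
sapply(droplevels(toy), nlevels)              # Education 2, Gender 1
# 'except' takes column indices whose levels should not be dropped
sapply(droplevels(toy, except = 2), nlevels)  # Education 2, Gender 2
```

A factor can end up with a single level after dropping, as Gender does here, so it may be worth checking nlevels() before computing association measures on the sample.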
Kevin
  • Welcome to Stack Overflow. The need to drop factor levels is a regularly asked R question on SO, as the duplicate answers linked above show. Please [research](https://meta.stackoverflow.com/questions/261592/how-much-research-effort-is-expected-of-stack-overflow-users) before asking a question! Cheers! – Parfait May 22 '23 at 00:38