I have a dataset with 1000 observations and 10 columns. Data can be found here. Every column represents a categorical variable with a differing number of factor levels and is defined as such. The data is imported from a CSV file like this:
raw_data <- read.csv2(file.path(directory, "name.csv"), colClasses = rep("factor", 10), na.strings = "")
R imports the data correctly, and str(raw_data) shows how many factor levels each variable has. Perfect so far.
I then take a random sample of 100 observations from this data and save it to a new data frame.
comp_data <- raw_data[sample(1:nrow(raw_data), 100), ]
The problem is that, during sampling, it can happen that not a single observation with a specific factor level is drawn.
Let's say Education has 5 factor levels in the complete dataset (raw_data). Only 76 of the 1000 individuals have "Partial Highschool" as Education, so it's reasonable to assume that sometimes none of those 76 observations will be drawn. In that case only 4 levels are actually present in the sampled dataset (comp_data), but str(comp_data) still reports 5 factor levels. The number of levels has not been updated, even though in reality there are now only 4 levels and not 5.
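A minimal sketch of what I mean, using made-up data rather than my real dataset. The name toy_data and every level name other than "Partial Highschool" are invented for illustration, and the rare level is filtered out on purpose so that the situation is reproduced every time instead of only occasionally:

```r
# Synthetic stand-in for raw_data; all level names except
# "Partial Highschool" are hypothetical.
edu_levels <- c("Partial Highschool", "Highschool", "Bachelor", "Master", "PhD")
toy_data <- data.frame(
  Education = factor(rep(edu_levels, times = c(2, 250, 250, 250, 248)),
                     levels = edu_levels)
)

# Drop the rare level deliberately to mimic a sample that missed it.
toy_sample <- toy_data[toy_data$Education != "Partial Highschool", , drop = FALSE]

nlevels(toy_sample$Education)         # levels stored in the factor
length(unique(toy_sample$Education))  # levels actually present in the rows
table(toy_sample$Education)           # the missing level is listed with count 0
```

The stored level count and the count of levels actually present disagree, which is exactly what str(comp_data) shows me.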
I need a way to automatically update the number of factor levels after sampling.
The issue later on is that I calculate contingency coefficients and Cramér's V between pairs of variables. Those functions return NA when the situation described above occurs.
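To illustrate the downstream effect, here is a small sketch. The helper cramers_v is a hypothetical stand-in, not the function I actually use, and the vectors x and y are made up. It computes Cramér's V from the Pearson chi-squared statistic. An unused level produces an all-zero row in the contingency table, so the expected counts in that row are zero and the statistic becomes 0/0. For comparison, the last line applies base R's droplevels(), which seems like one candidate for removing unused levels:

```r
# Hypothetical helper: Cramér's V from the Pearson chi-squared statistic.
cramers_v <- function(x, y) {
  tab <- table(x, y)
  # Warnings about small expected counts are suppressed for this toy example.
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  unname(sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1))))
}

# Level "c" is declared but never observed, like a level missed by sampling.
x <- factor(c("a", "a", "b", "b", "a", "b"), levels = c("a", "b", "c"))
y <- factor(c("u", "u", "u", "v", "v", "v"))

cramers_v(x, y)                           # NaN, which is.na() treats as missing
cramers_v(droplevels(x), droplevels(y))   # finite once the empty level is gone
```

droplevels() also accepts a whole data frame, so droplevels(comp_data) would be the corresponding call on my sampled data, assuming that all unused levels in every column should be removed.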
Thanks for the help. Kevin