Welch's T-Test / ANOVA / Pearson Chi-Squared Test in R with unequal sample sizes

Question

Please see 'Addendum 3'.

I am trying to perform an ANOVA test in R to see whether there are differences among the voters of the 5 main political parties in the Spanish 2019 General Elections according to the variable 'age' (P20_range stands for different age intervals in my code).

My code is as follows:

CIS_data_5 <- data.frame(
  CIS$RECUERDO,
  CIS$P20
)

CIS_data_5$CIS.RECUERDO <- sub("\\(NO LEER\\) ", "", CIS_data_5$CIS.RECUERDO)
RecuerdoDeVoto1 <- subset(CIS_data_5, CIS.RECUERDO %in% c("Unidas Podemos"))
RecuerdoDeVoto2 <- subset(CIS_data_5, CIS.RECUERDO %in% c("PSOE"))
RecuerdoDeVoto3 <- subset(CIS_data_5, CIS.RECUERDO %in% c("Ciudadanos"))
RecuerdoDeVoto4 <- subset(CIS_data_5, CIS.RECUERDO %in% c("PP"))
RecuerdoDeVoto5 <- subset(CIS_data_5, CIS.RECUERDO %in% c("VOX"))


P20 <- as.integer(as.character(CIS_data_5$CIS.P20))

P20labs <- c("16-29", "30-44", "45-64", ">65", "N.C.")
cut_points <- c(16, 30, 45, 65, Inf)

i <- findInterval(P20, cut_points)
P20_fac <- P20labs[i]
P20_fac[is.na(P20)] <- P20labs[length(P20labs)]
P20_fac <- factor(P20_fac, levels = P20labs)

CIS_data_5$CIS.P20 <- P20
CIS_data_5$P20_range <- P20_fac

P20_range <-as.vector(CIS_data_5$P20_range)

# Computing the Analysis of Variance
CIS_data_6 <- list(RecuerdoDeVoto1=RecuerdoDeVoto1,RecuerdoDeVoto2=RecuerdoDeVoto2,RecuerdoDeVoto3=RecuerdoDeVoto3, RecuerdoDeVoto4=RecuerdoDeVoto4,RecuerdoDeVoto5=RecuerdoDeVoto5)
 data.frame(RecuerdoDeVoto=unlist(CIS_data_6),
            P20_range=factor(rep(names(CIS_data_6),sapply(CIS_data_6,length))))
 
res.aov <- aov(RecuerdoDeVoto~P20_range, data = CIS_data_6)

# Summary of the Analysis
summary(res.aov)

However, I am not sure what I am doing wrong. I looked up the question "Attempting to create anova table with unequal sizes R" and reproduced its code exactly (with the modifications needed to fit my data), but I keep getting the following error:

Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE,  : 
  arguments imply differing number of rows: 191, 623, 115, 387, 114

which of course corresponds to the differing numbers of voters for each of the 5 main Spanish political parties (Unidas Podemos, PSOE, Ciudadanos, PP, and VOX).

I am not sure how to get around this problem within my code.

Any help would be greatly appreciated!

Many thanks in advance!

Addendum 1

It has been suggested to me that I should try a Pearson Chi-Squared Test for the problem I am trying to analyse, but I am really not sure whether I should opt for an ANOVA or for a Pearson Chi-Squared Test in this case. Again, any comment on this is very welcome!

Addendum 2

I have tried to perform a Pearson Chi-Squared Test by running the following code:

CIS_data_5 <- data.frame(
  CIS$RECUERDO,
  CIS$P20
)

CIS_data_5$CIS.RECUERDO <- sub("\\(NO LEER\\) ", "", CIS_data_5$CIS.RECUERDO)
RecuerdoDeVoto1 <- subset(CIS_data_5, CIS.RECUERDO %in% c("Unidas Podemos"))
RecuerdoDeVoto2 <- subset(CIS_data_5, CIS.RECUERDO %in% c("PSOE"))
RecuerdoDeVoto3 <- subset(CIS_data_5, CIS.RECUERDO %in% c("Ciudadanos"))
RecuerdoDeVoto4 <- subset(CIS_data_5, CIS.RECUERDO %in% c("PP"))
RecuerdoDeVoto5 <- subset(CIS_data_5, CIS.RECUERDO %in% c("VOX"))


P20 <- as.integer(as.character(CIS_data_5$CIS.P20))

P20labs <- c("16-29", "30-44", "45-64", ">65", "N.C.")
cut_points <- c(16, 30, 45, 65, Inf)

i <- findInterval(P20, cut_points)
P20_fac <- P20labs[i]
P20_fac[is.na(P20)] <- P20labs[length(P20labs)]
P20_fac <- factor(P20_fac, levels = P20labs)

CIS_data_5$CIS.P20 <- P20
CIS_data_5$P20_range <- P20_fac

P20_range <-as.vector(CIS_data_5$P20_range)

RecuerdoDeVoto <- c(RecuerdoDeVoto1, RecuerdoDeVoto2, RecuerdoDeVoto3, RecuerdoDeVoto4, RecuerdoDeVoto5)
IntervalosDeEdad <- rep(P20_range, length(RecuerdoDeVoto1), length(RecuerdoDeVoto2), length(RecuerdoDeVoto3), length(RecuerdoDeVoto4), length(RecuerdoDeVoto5))
chisq.test(RecuerdoDeVoto, IntervalosDeEdad)

And I get the following error:

Error in chisq.test(RecuerdoDeVoto, IntervalosDeEdad) : 
  'x' and 'y' must have the same length

Addendum 3

After much research, I've found that the best way to go is to perform a Welch's T-Test, since I am dealing with 2 samples of different size, hence different variances. However, I am not sure how to perform it in R.

Any help is much welcome!

You can look at this link; maybe it solves your problem: https://stackoverflow.com/questions/26147558/what-does-the-error-arguments-imply-differing-number-of-rows-x-y-mean — rezgar, Oct 09 '22 at 20:02
Thank you, @rezgar! I've checked the answers, and I saw that one of them says "This just in case somebody else needs the output to be a data frame." However, if you look at my code, the data is already a `data.frame`, so the problem must lie somewhere else. Thank you in any case! — ArtUr693, Oct 09 '22 at 20:09

Hannah · Answer 1 · 2023-03-06T16:10:12.050

When the groups have unequal variances, an ANOVA can still be performed by applying a correction to the F-statistic and its degrees of freedom. The test with this correction is referred to as Welch's ANOVA.

For your question, having equal sample sizes is not an assumption of the classic ANOVA with three or more samples, so unequal group sizes alone do not require Welch's correction. You can apply it anyway, but you only need it when the assumption of equal variances is not met.
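To illustrate the point about unequal sample sizes, here is a minimal sketch on simulated data. The `voters` data frame, its `party` and `age` columns, and the simulated ages are all assumptions made for illustration; only the group sizes are taken from the error message in the question. It assumes the data are kept in long format (one row per respondent), so nothing has to be lined up column by column and the groups are free to differ in size.

```r
# Simulated stand-in for the survey data; the ages are made up.
set.seed(1)

# Group sizes as reported in the error message of the question.
sizes <- c("Unidas Podemos" = 191, "PSOE" = 623, "Ciudadanos" = 115,
           "PP" = 387, "VOX" = 114)

# Long format: one row per respondent, one column per variable.
voters <- data.frame(
  party = factor(rep(names(sizes), times = sizes), levels = names(sizes)),
  age   = round(runif(sum(sizes), min = 18, max = 90))
)

# Classic one-way ANOVA: numeric response on the left of the formula,
# grouping factor on the right. Unequal group sizes are not a problem.
res_aov <- aov(age ~ party, data = voters)
summary(res_aov)
```

With real data, the same call would use the numeric age column as the response and the party column as the grouping factor.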

That being said, it is common practice to apply Welch's correction to a t-test with two samples if the sample sizes are not equal.

https://www.statisticshowto.com/welchs-anova/

For three or more samples of unequal variances, use oneway.test() for a Welch's ANOVA in R:

oneway.test(RecuerdoDeVoto~P20_range, data = CIS_data_6)
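As a runnable sketch of that call, the example below uses simulated data in place of the survey. The `voters` data frame, the column names `party` and `age`, and the means and standard deviations are hypothetical; only the group sizes come from the question. It assumes a numeric response and a grouping factor stored in a single data frame.

```r
# Simulated groups with deliberately different spreads; values are made up.
set.seed(2)

sizes <- c("Unidas Podemos" = 191, "PSOE" = 623, "Ciudadanos" = 115,
           "PP" = 387, "VOX" = 114)
sds   <- c(10, 15, 12, 18, 11)   # hypothetical per-party standard deviations

voters <- data.frame(
  party = factor(rep(names(sizes), times = sizes), levels = names(sizes)),
  age   = rnorm(sum(sizes), mean = 50, sd = rep(sds, times = sizes))
)

# Welch's ANOVA: var.equal = FALSE is the default of oneway.test().
welch <- oneway.test(age ~ party, data = voters)

# Classic one-way ANOVA for comparison (assumes equal variances).
classic <- oneway.test(age ~ party, data = voters, var.equal = TRUE)

welch$parameter    # numerator df and corrected denominator df
classic$parameter  # numerator df and ordinary denominator df
```

Comparing the two `parameter` vectors shows where the correction acts: the numerator degrees of freedom are the same, while the denominator degrees of freedom differ.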

For two samples of unequal variances or unequal sizes, use t.test() with the var.equal argument set to FALSE (which is also its default) for a Welch's t-test in R:

t.test(RecuerdoDeVoto ~ P20_range, data = CIS_data_6, var.equal = FALSE)
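The formula interface of t.test() needs a grouping factor with exactly two levels, so with five parties two of them have to be picked first. A minimal sketch under that assumption, again on simulated data (the `voters` and `two` data frames, the column names, and the choice of PSOE and PP are purely illustrative):

```r
# Simulated ages for five parties; values are made up.
set.seed(3)

sizes <- c("Unidas Podemos" = 191, "PSOE" = 623, "Ciudadanos" = 115,
           "PP" = 387, "VOX" = 114)

voters <- data.frame(
  party = factor(rep(names(sizes), times = sizes), levels = names(sizes)),
  age   = rnorm(sum(sizes), mean = 50, sd = 15)
)

# Keep two parties only and drop the unused factor levels,
# so that the grouping factor has exactly two levels.
two <- droplevels(subset(voters, party %in% c("PSOE", "PP")))

# Welch's two-sample t-test (unequal variances, unequal sizes allowed).
tt <- t.test(age ~ party, data = two, var.equal = FALSE)
tt
```

Without droplevels(), the factor would still carry all five levels and the formula interface would refuse to run.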

score 0 · Answer 2 · answered Oct 09 '22 at 22:01

I solved the question by using the following code:

CIS_data_5 <- data.frame(
  CIS$RECUERDO,
  CIS$P20
)

CIS_data_5$CIS.RECUERDO <- sub("\\(NO LEER\\) ", "", CIS_data_5$CIS.RECUERDO)
RecuerdoDeVoto1 <- subset(CIS_data_5, CIS.RECUERDO %in% c("Unidas Podemos"))
RecuerdoDeVoto2 <- subset(CIS_data_5, CIS.RECUERDO %in% c("PSOE"))
RecuerdoDeVoto3 <- subset(CIS_data_5, CIS.RECUERDO %in% c("Ciudadanos"))
RecuerdoDeVoto4 <- subset(CIS_data_5, CIS.RECUERDO %in% c("PP"))
RecuerdoDeVoto5 <- subset(CIS_data_5, CIS.RECUERDO %in% c("VOX"))


P20 <- as.integer(as.character(CIS_data_5$CIS.P20))

P20labs <- c("16-29", "30-44", "45-64", ">65", "N.C.")
cut_points <- c(16, 30, 45, 65, Inf)

i <- findInterval(P20, cut_points)
P20_fac <- P20labs[i]
P20_fac[is.na(P20)] <- P20labs[length(P20labs)]
P20_fac <- factor(P20_fac, levels = P20labs)

CIS_data_5$CIS.P20 <- P20
CIS_data_5$P20_range <- P20_fac

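# Integer codes (1 to 5) of the age-interval factor, for all respondents
IntervalosDeEdad <- as.numeric(CIS_data_5$P20_range)

# Values of CIS.P20 for the voters of the five parties, concatenated into one vector
RecuerdoDeVoto <- as.numeric(c(RecuerdoDeVoto1$CIS.P20, RecuerdoDeVoto2$CIS.P20, RecuerdoDeVoto3$CIS.P20, RecuerdoDeVoto4$CIS.P20, RecuerdoDeVoto5$CIS.P20))

# Welch's two-sample t-test (unequal variances) between the two vectors
t.test(RecuerdoDeVoto, IntervalosDeEdad, var.equal = FALSE)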

The resulting p-value is < 2.2e-16; the conclusions are self-evident.
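If age is treated as categorical intervals rather than as a number, the Pearson chi-squared test raised in Addendum 1 of the question is another option. The following is only a sketch on simulated data: the `voters` data frame, its column names, the "65+" label and the sampled ages are assumptions, while the group sizes and cut points follow the question. chisq.test() is given a contingency table built from two columns of the same data frame, so both variables necessarily have the same length.

```r
# Simulated ages for five parties; values are made up.
set.seed(4)

sizes <- c("Unidas Podemos" = 191, "PSOE" = 623, "Ciudadanos" = 115,
           "PP" = 387, "VOX" = 114)

voters <- data.frame(
  party = factor(rep(names(sizes), times = sizes), levels = names(sizes)),
  age   = sample(18:90, sum(sizes), replace = TRUE)
)

# Bin ages into left-closed intervals: [16,30), [30,45), [45,65), [65,Inf).
voters$age_range <- cut(voters$age,
                        breaks = c(16, 30, 45, 65, Inf),
                        labels = c("16-29", "30-44", "45-64", "65+"),
                        right  = FALSE)

# Contingency table of party by age interval, then the Pearson test.
tab <- table(voters$party, voters$age_range)
chi <- chisq.test(tab)
chi
```

The test has (5 - 1) * (4 - 1) = 12 degrees of freedom, one fewer than the number of parties times one fewer than the number of age intervals.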
