I have a fairly large data frame (about 235K rows) and I want to fit a multiple regression:

model <- lm(var~., data=data)

but I get an error:

Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : 
  NA/NaN/Inf in 'y'
In addition: Warning message:
In storage.mode(v) <- "double" : NAs introduced by coercion

Neither na.omit nor other methods of getting rid of NAs helped.

So I tried to find the NAs myself. I split the data frame into two parts:

Second UPD

data1 <- data[1:(dim(data)[1]/2), ]
data2 <- data[(dim(data)[1]/2):(dim(data)[1]), ]

and again I get a result from both lm calls, with none of the errors from the previous UPD section! NB: I've restarted RStudio.

First UPD

data1 <- data[1:(dim(data)[1]/2),]

and when I call lm, instead of the previous error I get the following:

Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : 
  contrasts can be applied only to factors with 2 or more levels

To reproduce this error I reduced the data from 235K to 14.5K rows. So what is the problem now? Some of the discarded slices don't throw any errors.

Original version

data1 <- data[1:(dim(data)[1]/2)]
data2 <- data[(dim(data)[1]/2):(dim(data)[1])]

and call lm for each of them:

model1 <- lm(var~., data=data1)
model2 <- lm(var~., data=data2)

and I receive no errors! So I suppose the problem is the large size of the data frame. Is there any way to fix it?

  • You are subsetting by columns, so you are splitting the data "vertically". The number of rows is the same in `data1` and `data2`. How many columns does the data frame have? – Sandwichnick Feb 10 '22 at 08:43
  • Try `lm(var~., data=data[complete.cases(data), ])`. – jay.sf Feb 10 '22 at 09:01
  • what does `str(data)` give? – rw2 Feb 10 '22 at 09:16
  • @Sandwichnick, there are 14 columns, and yep, I forgot a comma at the end. So now I have another error; I've updated the question. – dragondangun Feb 10 '22 at 09:16
  • @jay.sf, nope, that doesn't work, the same error about NA's. – dragondangun Feb 10 '22 at 09:38
  • @rw2, `'data.frame': 235143 obs. of 14 variables: $ var0: chr "0" "0" "0" "1" ... $ var1: chr "8" "11" "18" "10" ...` etc. – dragondangun Feb 10 '22 at 09:40
  • Follow this: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – jay.sf Feb 10 '22 at 09:44
  • @dragondangun It would help to see the full outputs in your question. It looks like your predictors are coded as "characters". Try converting them to numeric, using `as.numeric()` and then running the model. – rw2 Feb 10 '22 at 09:54
  • @rw2, thanks! I've used `data<-lapply(data, as.numeric)` and that works. You can write it as an answer and I'll mark it as the solution. – dragondangun Feb 10 '22 at 10:01

1 Answer

From the output of str(data), it looks like some of your numeric predictors are coded as characters.

Re-code them to numeric using as.numeric and see if that fixes the issue.
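As a minimal sketch of that conversion: the data frame `toy` below is made up for illustration and only mimics the character columns shown in the `str(data)` output. Assigning into `toy[]` keeps the object a data frame, whereas a plain `lapply()` returns a list (which `lm` also accepts as `data`).

```r
# Made-up stand-in for the real data: every column is stored as character.
toy <- data.frame(
  var  = c("0", "1", "0", "1", "1", "0"),
  var1 = c("8", "11", "18", "10", "7", "15"),
  stringsAsFactors = FALSE
)

# Convert every column to numeric; `toy[] <-` preserves the data.frame structure.
toy[] <- lapply(toy, as.numeric)

str(toy)                          # both columns should now be num
model <- lm(var ~ ., data = toy)  # fits without the coercion warning
coef(model)
```

If only some columns are meant to be numeric, the same assignment can be restricted to those columns instead of the whole data frame.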

If it does you might want to check why they're coded as characters. Are there rogue punctuation or spaces in your data?
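One way to look for such values, sketched on a hypothetical character vector `x` (the entries are invented examples, not taken from the question's data): anything that is present in the raw column but becomes `NA` after `as.numeric()` is a candidate for rogue punctuation or spacing.

```r
# Hypothetical column with a few typical problems: an embedded space,
# a decimal comma, a textual missing-value marker, and a real NA.
x <- c("8", "1 000", "1,5", "n/a", NA, "10")

# Parse quietly, then flag entries that were present but failed to parse.
parsed <- suppressWarnings(as.numeric(x))
bad <- !is.na(x) & is.na(parsed)

unique(x[bad])  # the raw strings that need cleaning
sum(bad)        # how many entries are affected
```

Running the same check on each column with `sapply()` shows which variables introduce the NAs during coercion.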

rw2