I have a large dataset of forest fires, and I want to predict when a fire ignites. Ignition is very rare: it occurs in only 290 of 620 000 observations.
# A tibble: 62,905 x 13
  amplitude polarity DEM_avg    DC   DMC   DSR  FFMC    Pd    RH  TEMP    WS tree_cover fire
      <dbl>    <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>      <dbl> <fct>
1     -37.8        0    165.  269.  21.9 0.607  84.0     0  65.1  290.  4.36          8 0
2     -68.1        0    303.  168.  44.5  1.41  89.9     0  46.6  296. 0.692       34.7 0
3     -54.3        0    332.  168.  44.5  1.41  89.9     0  46.6  296. 0.692       35.8 1
4     -108.        0    338.  168.  44.5  1.41  89.9     0  46.6  296. 0.692       30.3 0
5     -60.3        0    374.  171.  35.7  2.30  88.9   0.3  51.7  295.  4.01       29.6 1
6     -82.8        0    48.2  133.  18.4 0.210  84.9     0  65.1  289.  1.35       18.7 0
7     -99.6        0    299.  219.  42.6  2.09  90.8     0  34.2  297.  1.42          7 1
8     -98.1        0    116.  153.  44.7 0.988  89.0     0  41.3  298. 0.235       32.6 0
I tried to balance this highly imbalanced dataset with SMOTE, incorporating the changes suggested by StupidWolf. This is my code:
library(readr)
library(tidyverse)
library(caret)
library(DMwR)

data <- read_csv("data/fire2018.csv",
                 col_types = cols(fire = col_factor(levels = c("0", "1"))))

training.samples <- data$fire %>% createDataPartition(p = 0.8, list = FALSE)
train.data <- data[training.samples, ]
test.data <- data[-training.samples, ]

SMOTE(fire ~ amplitude + polarity_dummy + DEM_avg + DC + DMC + DSR + FFMC +
        Pd + RH + T + VPD + WS + tree_cover,
      data = data.frame(train.data), perc.over = 600, perc.under = 100)
However, when I use SMOTE from the DMwR package I now get the following error:
Error in factor(newCases[, a], levels = 1:nlevels(data[, a]), labels = levels(data[, :
  invalid 'labels'; length 0 should be 1 or 2
In addition: Warning messages:
1: In if (class(data[, col]) %in% c("factor", "character")) { :
  the condition has length > 1 and only the first element will be used
2: In smote.exs(data[minExs, ], ncol(data), perc.over, k) :
  NAs introduced by coercion
3: In smote.exs(data[minExs, ], ncol(data), perc.over, k) :
  NAs introduced by coercion
I have looked at several suggested solutions. One was to convert the variables to numeric and factor, but mine already have the right types: the dependent variable is a factor with 2 levels, the independent variables are numeric, and none of the variables contains NA values. So that suggestion did not help in my case, and I still get a similar error.
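The type and NA checks mentioned above can be done along these lines. This is only a minimal sketch in base R: `toy` is a made-up data frame standing in for `train.data`, with invented values and just two of the predictors.

```r
# Toy stand-in for train.data (hypothetical; values invented for illustration)
toy <- data.frame(
  amplitude  = c(-37.8, -68.1, -54.3, -108),
  tree_cover = c(8, 34.7, 35.8, 30.3),
  fire       = factor(c("0", "0", "1", "0"), levels = c("0", "1"))
)

# Class of every column: predictors should be "numeric", the response "factor"
col_classes <- sapply(toy, class)
print(col_classes)

# Number of levels of the response: should be 2
n_levels <- nlevels(toy$fire)
print(n_levels)

# Number of missing values per column: all should be 0
na_counts <- colSums(is.na(toy))
print(na_counts)
```

Running the same three checks on `data.frame(train.data)` instead of `toy` shows what `SMOTE` actually receives.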