I'm trying to implement an oversampling function in R: replicate samples (the rows of my data frame) and then add some noise only to the newly created samples. The answers to these related questions

(Add two data frames together based on matching column names, How to merge and sum two data frames)

do not work for me because I also have factor columns, which obviously can't be "noised". My idea is to remove the non-numeric columns, add the noise, and then add them back, but I don't know how. Any other idea to achieve this goal is welcome, too.

# generate the initial dataframe
library(tidyverse)

train_set <- data.frame(
  Numeric1 = runif(20, 0 , 1),
  Numeric2 = runif(20, 0 , 1),
  Numeric3 = runif(20, 0 , 1),
  Numeric4 = runif(20, 0 , 1),
  Numeric5 = runif(20, 0 , 1),
  Numeric6 = runif(20, 0 , 1),
  Numeric7 = runif(20, 0 , 1),
  Numeric8 = runif(20, 0 , 1),
  Numeric9 = runif(20, 0 , 1),
  Numeric10 = runif(20, 0 , 1),
  # wrap in factor(): since R 4.0, data.frame() no longer converts strings to factors
  Factor_column = factor(rep("Factor1", 20))
)
row.names(train_set) <- c(paste("Sample", c(1:nrow(train_set)), sep = "_"))

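# replicate each row three times (the original plus two copies);
# R makes the duplicated row names unique by appending ".1", ".2"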
train_oversampled <- train_set[rep(seq_len(nrow(train_set)), each = 3), ]

# identify the added rows
Rownames <- row.names(subset(train_oversampled, grepl("\\.", row.names(train_oversampled))))
nRows <- nrow(subset(train_oversampled, grepl("\\.", row.names(train_oversampled))))

# which columns are numeric?
Headers <- train_oversampled %>% select_if(is.numeric) %>% names(.)
nHeaders <- train_oversampled %>% select_if(is.numeric) %>% ncol(.)

# create a new dataframe with the values to add and the dimensions of the selection above
noise <- data.frame(
  replicate(nHeaders, rnorm(nRows, mean = 0, sd = 0.0005))
)
row.names(noise) <- Rownames
names(noise) <- Headers

# add noise to the oversampled data frame
# does not work due to factor column
bind_rows(
  train_oversampled %>% rownames_to_column(), 
  noise %>% rownames_to_column()
  ) %>%
  group_by(rowname) %>% 
  summarise_all(sum, na.rm = TRUE)

Any ideas on how to add the values from noise to the corresponding rows and columns in train_oversampled?

crazysantaclaus

1 Answer

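I think this gives you what you're after. I keep only the numeric columns when binding the rows, group by row name and summarise, and then join the non-numeric columns back in by row name. Joining rather than binding columns matters because summarise returns the rows sorted by rowname, which is not the order of train_oversampled.

bind_rows(
    train_oversampled %>% select_if(is.numeric) %>% rownames_to_column(), 
    noise %>% rownames_to_column()
) %>%
    group_by(rowname) %>% 
    summarise_all(sum, na.rm = TRUE) %>% 
    # match on rowname so each factor value stays with its own sample
    left_join(
        train_oversampled %>% select_if(Negate(is.numeric)) %>% rownames_to_column(),
        by = "rowname"
    )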
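A more direct alternative, sketched here in base R only, is to add the noise in place by indexing the new rows and the numeric columns, so the factor columns never have to be removed and re-attached. The toy data frame `toy`, the seed, and the helper names `is_new` and `is_num` are assumptions made for illustration; the sketch also assumes the original row names contain no ".", since that character is used to spot the replicated rows.

```r
set.seed(42)  # hypothetical seed, only to make the sketch reproducible

# small stand-in for train_set (hypothetical toy data)
toy <- data.frame(
  Numeric1 = runif(4),
  Numeric2 = runif(4),
  Factor_column = factor(rep("Factor1", 4))
)
row.names(toy) <- paste("Sample", seq_len(nrow(toy)), sep = "_")

# replicate each row three times; duplicated row names get ".1", ".2" appended
oversampled <- toy[rep(seq_len(nrow(toy)), each = 3), ]

# logical indices: which rows were added, which columns are numeric
is_new <- grepl(".", row.names(oversampled), fixed = TRUE)
is_num <- vapply(oversampled, is.numeric, logical(1))

# one noise value per (added row, numeric column) cell
noise <- matrix(
  rnorm(sum(is_new) * sum(is_num), mean = 0, sd = 0.0005),
  nrow = sum(is_new)
)

# add the noise in place; original rows and factor columns stay untouched
oversampled[is_new, is_num] <- as.matrix(oversampled[is_new, is_num]) + noise
```

Because nothing is grouped or summarised, the row order and row names of the oversampled data frame are preserved. If the real row names may contain a ".", tracking the added rows by position instead of by name would be safer.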
meenaparam