I have written a function that "cleans up" taxonomic data from NGS taxonomic files. The problem is that I am unable to replace NA cells with a string like "Undefined". I know it has something to do with variables being converted to factors rather than characters (Warning message: In `...` : invalid factor level, NA generated); however, even when I import the data with stringsAsFactors = FALSE, I still get this warning for some cells.
Here is how I import the data:
raw_data_1 <- taxon_import(read.delim("taxonomy_site_1/*/*/*/taxonomy.tsv", stringsAsFactors = FALSE))
The taxon_import function is used to split the taxa and assign variable names:
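library(stringr)  # provides str_split_fixed()

taxon_import <- function(data) {
  # Split the Taxon column on ";" into exactly seven rank columns
  data <- as.data.frame(str_split_fixed(data$Taxon, ";", 7))
  colnames(data) <- c("Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species")
  return(data)
}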
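To check whether the columns are already factors at this point, something like the following sketch can be used. It is only an illustration: the matrix `m` is a made-up stand-in for the output of str_split_fixed(), and the two calls differ only in the stringsAsFactors argument of as.data.frame(), whose default depends on the R version.

```r
# Hypothetical stand-in for the character matrix returned by str_split_fixed()
m <- matrix(c("D_0__Bacteria", "D_1__Firmicutes",
              "D_0__Archaea",  ""),
            nrow = 2, byrow = TRUE)
colnames(m) <- c("Domain", "Phylum")

# Convert the same matrix with and without factor conversion
as_factor <- as.data.frame(m, stringsAsFactors = TRUE)
as_char   <- as.data.frame(m, stringsAsFactors = FALSE)

# Inspect the class of every column in each result
print(sapply(as_factor, class))  # both columns are "factor"
print(sapply(as_char, class))    # both columns are "character"
```

Running sapply(raw_data_1, class) on the real data in the same way would show which of the two cases applies.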
The following function is used to "clean" the data, and this is where I would like to replace certain strings with "Undefined". However, I keep getting this warning:
In `[<-.factor`(`*tmp*`, thisvar, value = "Undefined") : invalid factor level, NA generated
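As far as I can tell, the same warning can be reproduced on a toy data frame. This is a minimal sketch, not my real data: the `Genus` column and its values are made up, and the column is deliberately created as a factor.

```r
# Toy data frame with a factor column (values are invented for illustration)
df <- data.frame(Genus = c("Bacillus", ""), stringsAsFactors = TRUE)

# Record whether the assignment raises a warning, without printing it
warned <- FALSE
withCallingHandlers(
  {
    # "Undefined" is not an existing factor level, so the cell becomes NA
    df[df == ""] <- "Undefined"
  },
  warning = function(w) {
    warned <<- TRUE
    invokeRestart("muffleWarning")
  }
)

print(warned)  # TRUE: the assignment raised a warning
print(df)      # the second row of Genus is now NA
```

The empty cell ends up as NA instead of "Undefined", which matches what I see in my cleaned data.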
Here follows the data_cleanup function:
data_cleanup <- function(data) {
  # Remove the rank prefixes (D_0__ ... D_6__) from every column
  strip_1 <- list("D_0__", "D_1__", "D_2__", "D_3__", "D_4__", "D_5__", "D_6__")
  for (i in strip_1) {
    data <- as.data.frame(sapply(data, gsub, pattern = i, replacement = ""))
  }
  # Replace empty cells; this is the line that produces the warning
  data[data==""] <- "Undefined"
  # Replace uninformative labels with "Undefined"
  strip_2 <- list("__", "unidentified", "Ambiguous_taxa", "uncultured", "Unknown", "uncultured .*", "Unassigned .*", "wastewater Unassigned", "metagenome")
  for (j in strip_2) {
    data <- as.data.frame(sapply(data, gsub, pattern = j, replacement = "Undefined"))
  }
  return(data)
}
The function is simply applied like: test <- data_cleanup(raw_data_1)
I am linking to the data in cloud storage because the file is very long. Here is the link to a data file: https://drive.google.com/open?id=1GBkV_sp3A0M6uvrx4gm9Woaan7QinNCn
I hope you will forgive my ignorance; I tried many solutions before posting here.