I have written a function that "cleans up" taxonomic data from NGS taxonomic files. The problem is that I am unable to replace NA cells with a string like "undefined". I know that it has something to do with variables being made into factors and not characters (Warning message: In `...` : invalid factor level, NA generated), however even when importing data with stringsAsFactors = FALSE I still get this error in some cells.

Here is how I import the data:

raw_data_1 <- taxon_import(read.delim("taxonomy_site_1/*/*/*/taxonomy.tsv", stringsAsFactors = FALSE))

The taxon_import function is used to split the taxa and assign variable names:

taxon_import <- function(data) {
  data <- as.data.frame(str_split_fixed(data$Taxon, ";", 7))
  colnames(data) <- c("Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species")
  return(data)
}

The following function is used to "clean" the data, and this is where I would like to replace certain strings with "Undefined". However, I keep getting the warning: In `[<-.factor`(`*tmp*`, thisvar, value = "Undefined") : invalid factor level, NA generated

Here follows the data_cleanup function:

data_cleanup <- function(data) {
  strip_1 = list("D_0__", "D_1__", "D_2__", "D_3__", "D_4__", "D_5__", "D_6__")
  for (i in strip_1) {
    data <- as.data.frame(sapply(data, gsub, pattern = i, replacement = ""))
  }
  data[data==""] <- "Undefined"
  strip_2 = list("__", "unidentified", "Ambiguous_taxa", "uncultured", "Unknown", "uncultured .*", "Unassigned .*", "wastewater Unassigned", "metagenome")
  for (j in strip_2) {
    data <- as.data.frame(sapply(data, gsub, pattern = j, replacement = "Undefined"))
  }
  return(data)
}

The function is simply applied like: test <- data_cleanup(raw_data_1)

I am linking to the data in cloud storage because the file is very long. Here is the link to a data file: https://drive.google.com/open?id=1GBkV_sp3A0M6uvrx4gm9Woaan7QinNCn

I hope you will forgive my ignorance, however I tried many solutions before posting here.

Roelof Coertze

1 Answer

This answer uses the tidyverse library. Let me give your question a twist: it asks about replacing NAs, but I think this code avoids creating them in the first place.

As I read your code, you remove the prefixes "D_0__", "D_1__", ... from each value. Then you replace the strings "Ambiguous_taxa", "unidentified", ... with the string "Undefined".

Based on your data, I replaced the loops with regular expressions, which makes it a little easier to clean your data:

library(tidyverse)  # loads stringr, which provides str_split_fixed()
# Split the Taxon column into seven rank columns
taxon_import <- function(data) {
  data <- as.data.frame(str_split_fixed(data$Taxon, ";", 7))
  colnames(data) <- c("Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species")
  return(data)
}
raw_data_1 <- taxon_import(read.delim("taxonomy.tsv", stringsAsFactors = FALSE))
# Make sure every column is character, not factor
raw_data_1 <- data.frame(lapply(raw_data_1, as.character), stringsAsFactors = FALSE)
# Remove the rank prefixes D_0__ ... D_6__
depured <- as.data.frame(sapply(raw_data_1, function(x) sub("^D_[0-6]__", "", x)), stringsAsFactors = FALSE)
# Replace placeholder labels with "Undefined"
depured <- as.data.frame(sapply(depured, function(x) sub("__|unidentified|Ambiguous_taxa|uncultured", "Undefined", x)), stringsAsFactors = FALSE)
depured <- as.data.frame(sapply(depured, function(x) sub("Unknown|uncultured\\s.\\*|Unassigned\\s.\\*", "Undefined", x)), stringsAsFactors = FALSE)
depured <- as.data.frame(sapply(depured, function(x) sub("wastewater\\sUnassigned|metagenome", "Undefined", x)), stringsAsFactors = FALSE)
# Replace empty cells
depured[depured == ""] <- "Undefined"

Let me explain my code. First, I have read on many websites that it's better to avoid explicit loops such as "for" in R. So how do you replace text that starts with "D_0__"?

The answer is regex (regular expression). It seems complicated at first but with practice it'll be helpful. See this expression:

"^D_[0-6]__"

It means: match "D_" at the start of the string, followed by a single digit from 0 to 6, followed by "__".

So you can use the function sub:

sub("^D_[0-6]__","",string)

which reads: replace the first match of the regular expression in the string with the empty string "" (that is, delete it).
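
As a small illustration of that call, here is a sketch on a made-up vector (the values in `x` are invented for the example and are not taken from your file). It also shows that `sub` changes only the first match in each string, while `gsub` changes every match.

```r
# Made-up example values, not taken from the real taxonomy file
x <- c("D_0__Bacteria", "D_5__Bacillus", "D_6__uncultured bacterium", "")

# Strip the rank prefix, anchored at the start of each string
stripped <- sub("^D_[0-6]__", "", x)
print(stripped)

# sub() replaces the first match only; gsub() replaces all matches
first_only  <- sub("_", "-", "a_b_c")
all_matches <- gsub("_", "-", "a_b_c")
print(c(first_only, all_matches))
```

Because the prefix pattern is anchored with `^`, it can match at most once per string, so `sub` is enough for that step.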

Now you see another regex:

"__|unidentified|Ambiguous_taxa|uncultured"

It means: select the string "__" or "unidentified" or "Ambiguous_taxa" ...

Be careful with this regex

"Unknown|uncultured\\s.\\*|Unassigned\\s.\\*"    

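It means: match "Unknown", or "uncultured" followed by one whitespace character, any single character, and a literal asterisk, or the same thing after "Unassigned".

In an R string, whitespace is written as \\s and a literal asterisk as \\*. An unescaped dot matches any single character. Note that if your original pattern "uncultured .*" was meant as "uncultured" plus a space plus any text, the asterisk should stay unescaped.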
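
To make the difference concrete, here is a minimal sketch comparing the escaped pattern above with the unescaped form from the question. The values in `x` are made up for the illustration.

```r
# Made-up example values
x <- c("uncultured bacterium", "uncultured .*", "Bacillus")

# Escaped form: whitespace, any one character, then a literal "*"
literal <- grepl("uncultured\\s.\\*", x)
print(literal)

# Unescaped form: "uncultured", a space, then any text to the end of the string
wildcard <- sub("uncultured .*", "Undefined", x)
print(wildcard)
```

Only the second element contains a literal asterisk, so only it matches the escaped form, whereas the unescaped form also rewrites "uncultured bacterium" as a whole.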

Now what about the as.data.frame function? Every time I use it I pass stringsAsFactors = FALSE, because otherwise the function converts character columns to factors (this was the default before R 4.0.0).
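
The following sketch, on a made-up vector, shows why factors lead to the warning from the question: a factor only accepts values that are already among its levels, so assigning "Undefined" produces NA. Converting to character first avoids this.

```r
# Made-up example values
f <- factor(c("Bacteria", "", "Archaea"))

# "Undefined" is not an existing level, so this emits the warning
# "invalid factor level, NA generated" and stores NA instead
f_bad <- f
f_bad[f_bad == ""] <- "Undefined"
print(f_bad)

# Converting to character first makes the assignment work as intended
ch <- as.character(f)
ch[ch == ""] <- "Undefined"
print(ch)
```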

With this code, no NAs are created.
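
If you prefer to keep the cleaning inside a function, as in your question, one possible sketch is shown here. The names `clean_taxa` and `demo` are hypothetical, the sample values are invented, and only some of the placeholder labels are included. It uses `data[] <- lapply(...)`, which keeps the data frame structure while replacing each column with a character vector.

```r
# Hypothetical helper: cleans every column of a taxonomy data frame
clean_taxa <- function(data) {
  data[] <- lapply(data, function(x) {
    x <- as.character(x)            # factors become plain strings
    x <- sub("^D_[0-6]__", "", x)   # drop the rank prefix
    x <- sub("__|unidentified|Ambiguous_taxa|uncultured|Unknown|metagenome",
             "Undefined", x)        # replace placeholder labels
    x[x == ""] <- "Undefined"       # fill empty cells
    x
  })
  data
}

# Invented sample data, deliberately stored as factors
demo <- data.frame(
  Domain = c("D_0__Bacteria", "D_0__Archaea"),
  Genus  = c("D_5__uncultured", ""),
  stringsAsFactors = TRUE
)
cleaned <- clean_taxa(demo)
print(cleaned)
```

Since each column is converted to character before any replacement, the result does not depend on how the data were imported.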

Hope it helps, please don't hesitate to ask if needed.

Regards,

Alexis
