I know this question has been asked before, but I cannot get it to work for me, and I have tried many approaches, from for loops to lapply. I have tables in which I want to replace the headers of columns 2 to 8, which are currently in the format "X1","X2","X3","X4","X5","X6","X7", with "Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species".

The tables do not all have the same number of rows or columns.

My 31 tables are listed as this:

step4 <- list.files(pattern="*.coldrop.tsv")

Also, as a "sub-problem": I am starting from the 2nd column because RAM keeps adding row numbers (1,2,3,4,5,6....n) as the first column. If anyone can help me with that too, that would be great. I need to do the renaming on every table in this "step4" list. Here are some samples of what I have tried.

When I first tried, I went with the for loop option:

colnames <- c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species")

The following works on a single file

names(Omlo_run11_table.tsv.step1.tsv.step2.tsv.step3.tsv.coldrop.tsv)[2:8] <- c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species")

i = 1
for(i in 1:length(step4)){
  names(step4[i])[2:8] <- c("Kingdom","Phylum","Class","Order","Family","Genus","Species") 

}

I get this: Error in names(step4[i])[2:8] <- c("Kingdom", "Phylum", "Class", "Order", : 'names' attribute [8] must be the same length as the vector [1]

names(get(step4[i]))[names(get(step4[i])) == "X1","X2","X3","X4","X5","X6","X7"] <- c("Kingdom","Phylum","Class","Order","Family","Genus","Species")

I get this: Error in names(get(step4[i]))[names(get(step4[i])) == "X1", "X2", "X3", : incorrect number of subscripts

for(i in 1:length(step4)){
  nm <- paste0("step4[i]")
  tmp <- get(nm)
  colnames(tmp)[2:8] <- c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species")
  assign(nm, tmp)
}

I get this: Error in get(nm) : object 'step4[i]' not found

lapply (step4, function(df) { colnames(df)[2:length(step4)] <-colnames[1:length(step4)]-1)}

And so on. I am more of a for loop type of person, but I am open to lapply options. I came across solutions using setnames but could not figure those out either. Could someone please help me?

Gregor Thomas
  • It looks like `step4` is a character vector of file names that have not been read into R. (Unless you omit code that reads the files in and assigns the list of files to the same object.) Character vectors don't have column names - you have to read them in as data frames first. – Gregor Thomas Aug 09 '16 at 19:32
  • Also, please don't use the `rstudio` tag unless your question is about the code editor RStudio (if you had a grammar question for an email you are writing, you wouldn't use a `gmail` tag). – Gregor Thomas Aug 09 '16 at 19:33
  • Hi, I used this: `step4 = list.files(pattern="*.coldrop.tsv")` and then `for (i in 1:length(step4)) assign(step4[i], read.csv(step4[i], sep="\t", quote="", header=TRUE, as.is=FALSE))`. Sorry about Rstudio! – Émilie Tremblay Aug 09 '16 at 19:59
  • You shouldn't be using `assign`, it makes things messy and difficult. Instead [use a list of data frames](http://stackoverflow.com/q/17499013/903061). – Gregor Thomas Aug 09 '16 at 20:03
  • Oh, good to know. I am a newbie to the R language, so any advice may help. The reason why I avoided data frames is that I do not know the number of rows and columns for each table, and it changes among them. I do know that the first columns (1-8) are always the same, though. To me it seems like an issue, as you seem to have to give the "size" of the table in the data frame command, or am I completely misunderstanding it? – Émilie Tremblay Aug 09 '16 at 20:09
  • You are misunderstanding. See the link in the above comment. Please read the whole thing - it's very relevant to your question. You are creating data frames without specifying size using `read.csv`, you're just making it overly complicated by using `assign`. Just do `data_list = lapply(step4, read.table, quote = "", header = T, as.is = FALSE)`. – Gregor Thomas Aug 09 '16 at 20:16
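
A minimal sketch of the list-of-data-frames approach suggested in the comments above. The two small files and the folder name `list_demo` are made up for illustration; the point is only that `read.table()` needs no table size, and that the file names and the data frames are two different objects.

```r
# Hypothetical demo data: two small tab-separated files in a temporary folder.
demo_dir <- file.path(tempdir(), "list_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("id\tX1\tX2", "a\tBacteria\tFirmicutes"),
           file.path(demo_dir, "run1.coldrop.tsv"))
writeLines(c("id\tX1\tX2\tsample1", "b\tArchaea\tEuryarchaeota\t12"),
           file.path(demo_dir, "run2.coldrop.tsv"))

# Character vector of file paths; nothing has been read yet,
# so this object has no column names to change.
files <- list.files(demo_dir, pattern = "\\.coldrop\\.tsv$", full.names = TRUE)

# One data frame per file, all held in a single list. No sizes are declared:
# read.table() works out the rows and columns of each file on its own.
data_list <- lapply(files, read.table, sep = "\t", quote = "", header = TRUE)

class(files)              # "character"
sapply(data_list, ncol)   # column counts may differ between files
```

Everything that follows (renaming, filtering, writing out) can then be done on `data_list` with `lapply()` instead of on separately assigned objects.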

1 Answer

Simply create a list of dataframes from your step4 character vector, as @Gregor comments. Then rename the columns of each df in turn; both steps can be handled in a single lapply() call with an anonymous function. Also, since you are working with tab-separated files, you want the general read.table() function (read.csv() is a wrapper around it with defaults for comma-separated files):

step4 <- list.files(path = tsvfilepath, pattern=".*tsv$", full.names = TRUE)

dfList <- lapply(step4, function(i) {
        df <- read.table(i, sep="\t", quote="", header=TRUE, as.is=FALSE)
        names(df)[2:8] <- c("Kingdom","Phylum","Class","Order","Family","Genus","Species") 
        return(df)
})
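
For anyone who wants to try this without the original data, here is a self-contained sketch of the same pattern. The file contents, the folder name `rename_demo`, and the `OTU` and `s1`/`s2` column names are invented for the demo; only the read-then-rename logic mirrors the code above.

```r
# Hypothetical demo files: an ID column, seven taxonomy columns X1..X7,
# and a varying number of sample columns.
demo_dir <- file.path(tempdir(), "rename_demo")
dir.create(demo_dir, showWarnings = FALSE)
header <- c("OTU", paste0("X", 1:7))
taxa <- c("Bacteria", "Firmicutes", "Bacilli", "Lactobacillales",
          "Lactobacillaceae", "Lactobacillus", "casei")
writeLines(c(paste(c(header, "s1"), collapse = "\t"),
             paste(c("otu1", taxa, "5"), collapse = "\t")),
           file.path(demo_dir, "a.coldrop.tsv"))
writeLines(c(paste(c(header, "s1", "s2"), collapse = "\t"),
             paste(c("otu9", taxa, "3", "8"), collapse = "\t")),
           file.path(demo_dir, "b.coldrop.tsv"))

ranks <- c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species")
files <- list.files(demo_dir, pattern = "\\.coldrop\\.tsv$", full.names = TRUE)

dfList <- lapply(files, function(path) {
  df <- read.table(path, sep = "\t", quote = "", header = TRUE)
  names(df)[2:8] <- ranks   # columns 2 to 8 hold the seven ranks
  df                        # the modified copy is what lapply() collects
})

lapply(dfList, names)       # inspect the new headers of every table
```

Note that the files on disk are not changed; the renamed tables exist only as elements of `dfList`.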

[Screenshot: TSV Files Import with Colnames]

This list becomes useful for various needs such as for individual dataframes or one master dataframe.

For individual dfs, consider setNames() to name each individually and list2env() to create separate environment objects. Below gives each df the same name as their corresponding file name:

dfList <- setNames(dfList, step4)

list2env(dfList, envir=.GlobalEnv)
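
A small sketch of what these two calls do, using made-up stand-ins for `dfList` and `step4`. It writes into a fresh environment (`demo_env`, a name chosen for the demo) rather than `.GlobalEnv`, only so that running it leaves the workspace untouched.

```r
# Hypothetical stand-ins for the data frames and file names used above.
dfList <- list(data.frame(OTU = "otu1", Kingdom = "Bacteria"),
               data.frame(OTU = "otu9", Kingdom = "Archaea", s1 = 3))
step4  <- c("run1.coldrop.tsv", "run2.coldrop.tsv")

# Attach the file names to the list elements.
dfList <- setNames(dfList, step4)
dfList[["run2.coldrop.tsv"]]$Kingdom   # access a table by its name

# Copy each element into an environment as a separate object.
# With envir = .GlobalEnv the objects would appear in the workspace instead.
demo_env <- new.env()
list2env(dfList, envir = demo_env)
ls(demo_env)
```

Because the names contain dots and the `.tsv` suffix, objects created this way have to be referred to with backticks or `get()`; keeping them in the named list is usually easier.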

For one large master df, where you append all dataframes together, you have the challenge of the incomplete columns. Hence, consider third-party packages to fill in for missing columns across dfs:

library(plyr)
rbind.fill(dfList)

library(dplyr)
bind_rows(dfList)

library(data.table)    
rbindlist(dfList, fill=TRUE)
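
If none of those packages is available, the same idea can be sketched in base R: pad every table with the columns it lacks, then `rbind()` the padded tables. This is an illustrative alternative rather than what the package functions do internally, and the two tiny tables are invented for the demo.

```r
# Hypothetical tables sharing some, but not all, columns.
dfList <- list(data.frame(Kingdom = "Bacteria", s1 = 5),
               data.frame(Kingdom = "Archaea", s1 = 3, s2 = 8))

# Union of all column names, in order of first appearance.
all_cols <- unique(unlist(lapply(dfList, names)))

# Add each missing column as NA, then put the columns in a common order.
padded <- lapply(dfList, function(df) {
  for (col in setdiff(all_cols, names(df))) df[[col]] <- NA
  df[all_cols]
})

# Stack the padded tables into one master data frame.
master <- do.call(rbind, padded)
master
```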
Parfait
  • Hi, I ran the commands as you gave them and there was no warning or error message. However, the column headers remain the same (X1, X2, ...), so they were not renamed. Why? – Émilie Tremblay Aug 10 '16 at 14:50
  • What command line? Did you run the `lapply()` function? If it was not clear, the lines at end requires the lapply which produces dfList. – Parfait Aug 10 '16 at 16:46
  • This whole thing: `step4 <- list.files(pattern="*.coldrop.tsv")` `dfList <- lapply(step4, function(i) { df <- read.table(i, sep="\t", quote="", header=TRUE, as.is=FALSE); names(df)[1:7] <- c("Kingdom","Phylum","Class","Order","Family","Genus","Species"); return(df) })` – Émilie Tremblay Aug 10 '16 at 18:21
  • Interesting as it works perfectly on my test end. I think you are not pulling in any files. Is `step4` empty? What is your current working directory? Check with `getwd()`. This is the default path in `list.files()`. You can set it with `setwd()` or use list.files' `path` arg. If not, is dfList empty? You might not have true tab delimiters, try changing `sep="\t"` to just whitespace, `sep=""`. – Parfait Aug 10 '16 at 20:34
  • Ok. I am more confused now; I ran this temp = list.files(path = workingPath, pattern="*_table.tsv") tempp <- lapply(temp, read.table(i, sep="\t", quote="", header=TRUE, as.is=FALSE)) and I get this:Error in read.table(i, sep = "", quote = "", header = TRUE, as.is = FALSE) : 'file' must be a character string or connection. – Émilie Tremblay Aug 11 '16 at 14:14
  • You forgot to pass the function call which defines **i**: `lapply(temp, function(i) read.table(i, ...)`. Please follow the answer I posted. I assure you it works. – Parfait Aug 11 '16 at 17:21
  • Still having issues? I really want to help and reach resolution here. Check every input object: `temp`, actual .tsv files, etc. Try reading in one file to help debug. – Parfait Aug 12 '16 at 17:07
  • I have not figured it out yet. I really appreciate your help. I did check everything (well, I think so). – Émilie Tremblay Aug 15 '16 at 14:44
  • Oh, and the command works for one file at a time; it is only with the list of files that it causes problems. – Émilie Tremblay Aug 15 '16 at 14:45
  • whether I use a for file loop or lapply it is not working. cln = list.files(pattern="*.cln.tsv") for (i in 1:length(cln)) assign(cln[i], read.OTU)) Error in read.table(temp[i], sep = "\t") : 'file' must be a character string or connection table <- grep("table", value=TRUE, ls(all.names=TRUE)) lapply(table, tax.fill(table, downstream=TRUE)) Error in valid.OTU : the given object for otu1 is not a data frame. – Émilie Tremblay Aug 15 '16 at 14:51
  • Can you post one of the .tsv files in your question? Or better, Dropbox/Google Drive/OneDrive a link? I will then copy it several times to test my `lapply()` solution. – Parfait Aug 15 '16 at 15:59
  • Hi, sorry, I was sick for 2 days. Here is the link: https://drive.google.com/open?id=0B8zzlTlIlR2BTUZnRzM5TFpkdjg – Émilie Tremblay Aug 17 '16 at 12:35
  • Code here works perfectly on those files. See screenshots both of which are retained in `dfList`. I simply changed the `step4` list to be defined on path and *.tsv* pattern. And the colnames did need range 2:8 instead of 1:7. – Parfait Aug 17 '16 at 17:38
  • Ok, I ran the code. Does your output table become your input table? I still see no change in mine. – Émilie Tremblay Aug 18 '16 at 13:48
  • I am not understanding your question. The result is a list, `dfList`, of two dataframes which correspond to the two .tsv files from your Google Drive. Are you getting any errors? If using RStudio, please check all environment objects (step4, dfList, ...): are they coming up empty? The screenshots are the views in RStudio: `View(dfList[[1]])` and `View(dfList[[2]])`. – Parfait Aug 18 '16 at 13:54
  • Ok they are lists! I kept expecting Rstudio to modify my tsv file (the input one). But then, if it is a list, can I keep applying functions on that list through lapply? Also, how will I get these dfList back into separate tables after? I was expecting an output file for each input and/or that the input became the modified file. – Émilie Tremblay Aug 18 '16 at 14:21
  • I thought you were confused about the **list** of dataframes! Code does not touch external files as we use no `write.table()` command. Everything is handled in R memory. Absolutely you can add operations. Remember `lapply()` is a loop, just use its temp `df` object, for instance `write.table(df, i, sep="\t")` outputs each df to file (overwriting original unless you change `i` name). As for multiple separate dfs in your R environ (not external files) see my latter note on using `list2env()`. – Parfait Aug 18 '16 at 14:47
  • Ok, and one last thing here: the "i" confuses me, as it seems to be an anonymous function and/or a value? – Émilie Tremblay Aug 18 '16 at 15:51
  • `i` is the loop variable, so corresponds to each element of step4 (string literal name of each .tsv file). It is similar to the i in: `for i in step4`. – Parfait Aug 18 '16 at 15:53
  • Dear Parfait, if I want to run lapply on a list of files that are in my environment, how shall I list them properly? Thanks – Émilie Tremblay Aug 18 '16 at 19:00
  • Not understanding. This entire answer runs lapply on list of files (step4). – Parfait Aug 18 '16 at 19:11
  • Sorry, I should have been clearer. Or maybe I should even start a new question. I want to run a similar command but on different things: I want to use lapply to apply the tax.fill function to items that are in my environment (not tsv files): `dfList <- setNames(dfList, table)` `list2env(dfList, envir=.GlobalEnv)` `dfList <- lapply(dfList, function(i) { df <- tax.fill(i, downstream=TRUE); return(df) })`. So I was following the same idea but, guess what, it does not work for me again: Error in setNames(dfList, table) : object 'dfList' not found – Émilie Tremblay Aug 19 '16 at 12:00
  • dfList <- lapply(table, function(i) { df <- tax.fill(i, downstream=TRUE) return(df) }) Error in valid.OTU(data) : the given object for otu1 is not a data frame. – Émilie Tremblay Aug 19 '16 at 12:05
  • You might need to ask a new question as I am unfamiliar with the function or package. As for `list2env`, that should be used after (not in place of or before) building `dfList`. Also, it is unclear what `table` is as it should be character vector of .tsv file names if following this answer. – Parfait Aug 19 '16 at 17:32
  • the tax.fill function is very simple: output <- tax.fill(input, downstream=TRUE) – Émilie Tremblay Aug 19 '16 at 18:26
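
Two points from this comment thread, sketched with invented data. First, writing each table in the list back out to its own file with `write.table()`; the output folder `renamed` and the `.renamed.tsv` suffix are assumptions made for the demo, and `row.names = FALSE` stops R from writing a row-number column. Second, turning a character vector of object names into the objects themselves with `mget()`, since a function that expects a data frame cannot work on a name string. That is a likely reading of the "not a data frame" error above, not a confirmed diagnosis.

```r
# Hypothetical list of renamed tables, named by their source files.
dfList <- list("run1.coldrop.tsv" = data.frame(OTU = "otu1", Kingdom = "Bacteria"),
               "run2.coldrop.tsv" = data.frame(OTU = "otu9", Kingdom = "Archaea"))

out_dir <- file.path(tempdir(), "renamed")
dir.create(out_dir, showWarnings = FALSE)

# Loop over the names so each iteration knows both the table and its file name.
out_files <- vapply(names(dfList), function(nm) {
  out_path <- file.path(out_dir, sub("\\.tsv$", ".renamed.tsv", nm))
  write.table(dfList[[nm]], out_path, sep = "\t",
              quote = FALSE, row.names = FALSE)
  out_path
}, character(1))

# Objects living in an environment: ls() returns only their names (strings).
env <- new.env()
list2env(dfList, envir = env)
obj_names <- ls(env)

# mget() looks the names up and returns the data frames as a named list,
# which is what a function expecting a data frame needs to be given.
tables <- mget(obj_names, envir = env)
row_counts <- lapply(tables, function(df) nrow(df))
```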