
I would like to make my R script run automatically for other sets of parameters. For example:

             gene_name  start_x  end_y
   file1 ->  gene1      100      200
   file2 ->  gene2      150      270

My script does a trivial job, just for learning purposes. It should take the information for gene1, compute a sum, and write it to a file; then it should take the information for the next gene, gene2, compute the sum, and write it to a new file, and so on. Let's say I would like to name the files after the genes:

   file_gene1.txt     # this file holds sum of start_x +end_y for gene1
   file_gene2.txt     # this file holds sum of start_x +end_y for gene2

and so on for the remaining 700 genes (obviously it is too much work to manually take file1, write the file name, and plug the start and end values into the existing script).

I hope the idea is clear. I have never done this kind of thing before, and I guess it is very trivial, but I would appreciate it if anyone could tell me the proper name for this process so I can search and learn online how to do it.

P.S.: I think in Python I would just make a list of genes and their x/y values, then loop and select the required info, but I still don't know how I would use the gene names as file names automatically.


EDIT:

I have to supply the location of a gene, i.e. its start and end, which are X and Y respectively.

x=100    # assign x to a value of a related gene
y=150    # assign y to a value of a related gene


a=tbl[which(tbl[,'middle']>=x & tbl[,'middle']<y),]   # for each new gene these values change accordingly

write.table( a, file= 'gene1.txt' )     # here the file name would need to change

My thoughts:

  1. Maybe I need to generate a file that contains all 700 gene names and their X and Y values.
  2. Then I read the first line of this file and feed it into my script (as the variables a, x and y).
  3. When the computation is over, I write the results to a file named after the gene that was used to generate them.

Is it more clear?

P.S.: I Googled it, but probably because I don't know what this topic is called, I can't find anything relevant. Just give me an idea of what to search for; I would like to learn this programming step anyway.

K.Ivi
  • Are you writing a single row as a file? – akrun May 12 '16 at 06:51
  • Please provide a reproducible example of your starting data for the R script to run on. Also provide a clear example of what the intended output is supposed to be. – Adam Quek May 12 '16 at 07:01
  • The example given above is just a simple one; my real script does a lot of computation, and the output file for a single gene may contain up to 100 000 rows and always 7 columns. The number of rows varies, but the number of columns is fixed for all 700 genes. – K.Ivi May 12 '16 at 07:03
  • Adam Quek, my real script is 300 lines, and it does the job perfectly. I would just like to be able to apply this script to 700 different cases, meaning I run it and get 700 files. I have never done this before, so it is difficult for me even to formulate it correctly (sorry for that). – K.Ivi May 12 '16 at 07:15

2 Answers


I guess you are looking to read all the files present in a folder (assuming all your gene files were written to a single folder by your older script). In that case you can use something like:

directory <- "C:/User/Downloads/R/data"   # single forward slashes are enough in R paths
file <- list.files(directory, full.names = TRUE)

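Then access each file name with file[i] and do whatever is needed, for example building an output name with paste("gene", basename(file[i]), sep = "_") or reading the file with read.csv(file[i]). Because full.names = TRUE returns full paths, basename() is needed to get just the file name.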
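As a rough sketch of how this could be combined with the filter from the question: wrap the per-gene work in a function and call it once per file. The parameter table `params`, the function `process_gene`, the folder names and the example values are all made up for illustration, and I am assuming one input file per gene, named after the gene, with a `middle` column.

```r
# Hypothetical parameter table: one row per gene (names and values are made up)
params <- data.frame(gene    = c("gene1", "gene2"),
                     start_x = c(100, 150),
                     end_y   = c(200, 270),
                     stringsAsFactors = FALSE)

# Temporary folders with one small example input file per gene
in_dir  <- file.path(tempdir(), "gene_input")
out_dir <- file.path(tempdir(), "gene_output")
dir.create(in_dir,  showWarnings = FALSE)
dir.create(out_dir, showWarnings = FALSE)
for (g in params$gene) {
  write.table(data.frame(middle = seq(50, 300, by = 50)),
              file = file.path(in_dir, paste0(g, ".txt")),
              row.names = FALSE)
}

# Per-gene work: look up start/end, filter, write a file named after the gene
process_gene <- function(path, params, out_dir) {
  gene <- tools::file_path_sans_ext(basename(path))   # "gene1" from ".../gene1.txt"
  p    <- params[params$gene == gene, ]
  tbl  <- read.table(path, header = TRUE)
  a    <- tbl[tbl$middle >= p$start_x & tbl$middle < p$end_y, , drop = FALSE]
  out_file <- file.path(out_dir, paste0("file_", gene, ".txt"))
  write.table(a, file = out_file, row.names = FALSE)
  out_file                                            # return the path that was written
}

# Apply the function to every input file in the folder
files   <- list.files(in_dir, pattern = "\\.txt$", full.names = TRUE)
written <- vapply(files, process_gene, character(1),
                  params = params, out_dir = out_dir)
```

The body of `process_gene` is where a longer existing script would go; the loop around it stays the same whether there are 2 genes or 700.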

abhiieor

I would divide your problem into two parts. (Sample data for a reproducible example is provided below.)

library(data.table) # v1.9.7 (devel version)
# go here for install instructions
# https://github.com/Rdatatable/data.table/wiki/Installation

1st: Apply your functions to your data by gene

  output <- dt[ , .( f1 = sum(start_x, end_y),
                     f2 =  start_x - end_y ,
                     f3 =  start_x * end_y ,
                     f7 = start_x / end_y),
                by=.(gene)]

2nd: Split your data frame by gene and save it in separate files

  output[,fwrite(.SD,file=sprintf("%s.csv", unique(gene))),
         by=.(gene)]
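If installing the development version is not an option, the same split-and-save step can probably be done in base R with split() and write.csv(). This is only a sketch: the small `output` table below is made up so that the example runs on its own, and `out_dir` is a hypothetical folder name.

```r
# Small stand-in for `output` (values are made up), so the sketch is self-contained
output <- data.frame(gene = c("id1", "id1", "id2"),
                     f1   = c(21, 23, 25),
                     stringsAsFactors = FALSE)

out_dir <- file.path(tempdir(), "by_gene")
dir.create(out_dir, showWarnings = FALSE)

# split() returns a named list with one data frame per gene
pieces <- split(output, output$gene)

# Write each piece to a file named after its gene
for (g in names(pieces)) {
  write.csv(pieces[[g]],
            file = file.path(out_dir, sprintf("%s.csv", g)),
            row.names = FALSE)
}
```

Unlike the `.SD` version, each file written this way still contains the gene column.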

Later on, you can bind the multiple files into one single data frame if you like:

# Get a list of all `.csv` files in your folder
# (pattern is a regular expression, not a glob, so match the extension explicitly)
  filenames <- list.files("C:/your/folder", pattern="\\.csv$", full.names=TRUE)

# Load and bind all data sets
  data <- rbindlist(lapply(filenames,fread))
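One thing to watch: `gene` is the grouping column, so it is not part of `.SD` and the per-gene files do not contain it. A possible way to recover it when binding is to name the list after the file names and use the idcol argument of rbindlist(). The sketch below assumes the files are named `<gene>.csv`; the folder `in_dir` and the example values are made up, and the demo files are written with write.csv() only so that the example runs on its own.

```r
library(data.table)

# Two small demo files named after their genes (values are made up)
in_dir <- file.path(tempdir(), "bind_demo")
dir.create(in_dir, showWarnings = FALSE)
write.csv(data.frame(f1 = c(21, 23), f2 = c(-19, -21)),
          file.path(in_dir, "id1.csv"), row.names = FALSE)
write.csv(data.frame(f1 = 25, f2 = -23),
          file.path(in_dir, "id2.csv"), row.names = FALSE)

filenames <- list.files(in_dir, pattern = "\\.csv$", full.names = TRUE)

# Read every file and name each list element after its gene
tables <- lapply(filenames, fread)
names(tables) <- tools::file_path_sans_ext(basename(filenames))

# idcol adds a first column holding the list names, i.e. the gene
data <- rbindlist(tables, idcol = "gene")
```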

P.S. Note that fwrite is still only in the development version of data.table as of today (12 May 2016).

Data for the reproducible example:

dt <- data.table( gene = c('id1','id2','id3','id4','id5','id6','id7','id8','id9','id10'),
                  start_x  = c(1:10),
                  end_y = c(20:29) )
rafa.pereira
  • I think you really need to mention you're using the devel version when you use functions from the devel version. Same [here](http://stackoverflow.com/a/37174683/559784). – Arun May 12 '16 at 08:24