1

I have tab-delimited text files. Each file has three columns: ProbeID, AvgSignalIntensity, and Pvalue. Before further analysis, I want to ensure that the data in the ProbeID column are correct. The ProbeID column in each file contains approximately 47,315 values, so I'm concerned about performance. I've included a screenshot of a single file opened in Excel. Valid files should have only 47,234 ProbeIDs.

If you need more information, I can provide it immediately.

I have given minimal information in R code. I have 4 files, in which file1 has length 10 while the others have length 7. I want to pass all these files together into a function and check whether all of them are the same length or not; if not, it should return a message saying that a particular file (i.e. file1) is not of equal length.

file1=list(ProbeID=c(360450,1690139,5420594,3060411,450341,5420324,730162,4200739,1090156,7050341),X1234Avgintensity=c(110.3703,469.5097,407.557,123.9965,2234.529,190.7429,110.072,314.7892,153.486,160.4385),X1234Pvalue=c(0.8424522,0.01054713,0.01450231,0.5800923,0,0.1437047,0.8477257,0.02900461,0.286091,0.2406065))

file2=list(ProbeID=c(360450,1690139,5420594,3060411,450341,5420324,730162),X3456Avgintensity=c(110.3703,469.5097,407.557,123.9965,2234.529,190.7429,110.072),X3456Pvalue=c(0.8424522,0.01054713,0.01450231,0.5800923,0,0.1437047,0.8477257))

file3=list(ProbeID=c(360450,1690139,5420594,3060411,450341,5420324,730162),X678Avgintensity=c(66.78696,160.4022,207.996,80.48443,1187.988,91.58123,85.80681),X678Pvalue=c(0.9538563,0.02768622,0.01450231,0.6031641,0,0.313118,0.444298))

file4=list(ProbeID=c(360450,1690139,5420594,3060411,450341,5420324,730162),X8701Avgintensity=c(83.57081,141.5529,238.9153,98.10896,1060.654,97.65002,83.88175),X8701Pvalue=c(0.814766,0.03493738,0.005273566,0.3651945,0,0.3750824,0.808174))
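
To make the goal concrete, here is a rough sketch of the kind of check I am after (the function name and messages are only illustrative):

check_lengths = function(...) {
    files = list(...)
    lens = sapply(files, function(f) length(f$ProbeID))
    # treat the most common length as the expected one
    expected = as.numeric(names(which.max(table(lens))))
    bad = which(lens != expected)
    if (length(bad) == 0) {
        message("all files have the same number of ProbeIDs")
    } else {
        message("file(s) ", paste(bad, collapse = ", "),
                " do not have ", expected, " ProbeIDs")
    }
    invisible(lens)
}

check_lengths(file1, file2, file3, file4)
# should report that file 1 does not match the others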
joran
Dinesh

3 Answers

2

I don't think that 47,315 rows is particularly large. So here is how I would do it:

  1. Find a file that you are happy with that contains the correct number of rows. Read in this file and call it f1.
  2. Now loop through the remaining files and compare the ProbeID column with the corresponding column in f1. Make a note of the files that are valid (see the sketch after this list). When looping through the files, here are a few tips:
    • Keep overwriting the comparison file, i.e. don't have data sets f3, f4, f5. At any one time you should just have f1 and a single comparison data set. This will save memory.
    • In the read.csv function, look at the colClasses argument. Looking at your example data set, something like colClasses=c("numeric", "numeric", "numeric") should work. This will make reading in the data quicker.
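
A rough sketch of that loop (the file names, the .txt pattern, and using the first file as the reference are assumptions on my part):

fnames = list.files(pattern = "\\.txt$")
# read the reference file; colClasses speeds up parsing
f1 = read.delim(fnames[1], colClasses = c("numeric", "numeric", "numeric"))

valid = logical(length(fnames))
valid[1] = TRUE
for(i in seq_along(fnames)[-1]) {
    # keep overwriting d so only one comparison file is in memory at a time
    d = read.delim(fnames[i], colClasses = c("numeric", "numeric", "numeric"))
    valid[i] = length(d$ProbeID) == length(f1$ProbeID) && all(d$ProbeID == f1$ProbeID)
}
fnames[!valid]   # files whose ProbeID column does not match f1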

Update

Following the edit to your question, you seem to be interested in the number of lines a particular file has, so here is some pseudo-code to help you:

fnames = list.files()
no_of_lines = numeric(length(fnames))
for(i in seq_along(fnames)) {
    d = read.delim(fnames[i])
    no_of_lines[i] = dim(d)[1]   # number of rows in file i
}

You can then use plot or table on no_of_lines.
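
For example (purely illustrative), tabulating the counts makes a mismatching file easy to spot:

table(no_of_lines)                                     # row counts per file; an outlier stands out
expected = as.numeric(names(which.max(table(no_of_lines))))
fnames[no_of_lines != expected]                        # files whose row count differs from the rest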

csgillespie
  • Actually there is no definite correct number of rows. For example, if I select 5 files, all 5 files should contain 47,294 rows; if one of them contains 47,314, then it should print that this particular file's row count does not match the others. – Dinesh Sep 15 '11 at 14:21
  • I have re-edited the question with minimal R code for better understanding. Please do help me. – Dinesh Sep 15 '11 at 15:08
  • @csgillespie: your code works fine and gives the number of lines in each file. I want to compare those values within that function so that if they match it prints "matches", else "mismatch between file1 and file2". I tried writing a function but it is only partially correct: when a file with an unequal number of lines is in the middle of the selected files (e.g. the 3rd of 7 files selected), it prints that the files have the same number of lines (which is not true), but when that same file is selected first, it shows that the files mismatch. – Dinesh Sep 15 '11 at 23:04
1

Like Colin said, it doesn't sound like your data files are very big. Use system.time or one of the profiling packages to see how long it takes to read in each file with read.delim. If it really does take too long, then see this question for how to go faster.

Quickly reading very large tables as dataframes in R
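
For instance, a quick check might look like this (the file name is just a placeholder):

system.time(d <- read.delim("file1.txt"))   # elapsed time to read one file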

Richie Cotton
0

You read tab-delimited files with read.delim(); it is identical to read.table() and read.csv() except for the defaults, which are set up to use \t as the separator.

For example,

my.data <- read.delim('c:/path/to/my/file.txt')

Once you have the data in, you can count the number of rows using

nrow(my.data)

If checking validity is simply checking that the number of rows is 47,234 then you could do something like this

if(nrow(my.data) == 47234L) {
  do.something()
} else {
  do.something.else()
}

You may, however, want to check distinct ProbeIDs, so you could do this instead

length(unique(my.data$ProbeID)) == 47234L

But, if you need to check that a certain list of 47,234 ProbeIDs is present, you will have to have that list somewhere loaded or defined already to check against it. See @csgillespie's answer because I think that is where he was going.
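
If you do have such a reference vector of ProbeIDs (called valid.ids here, purely an assumption), the comparison could look like this:

missing.ids <- setdiff(valid.ids, my.data$ProbeID)     # expected but absent from the file
extra.ids <- setdiff(my.data$ProbeID, valid.ids)       # present in the file but not expected
length(missing.ids) == 0 && length(extra.ids) == 0     # TRUE when the ProbeIDs match exactly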

As for performance, if you can load it in Excel, you can load it in R faster.

adamleerich
  • I have re-edited my post with minimal reproducible code so that it will be much easier to understand my problem. Please do help me. – Dinesh Sep 15 '11 at 15:06