
I am reading a text file in R, shown below, with 1354896 rows and 5 columns.

I tried read.table() and read.delim() to load the file, but the format changes after reading: everything is collapsed into a single column.

OffsetY=0
GridCornerUL=258 182
GridCornerUR=8450 210
GridCornerLR=8419 8443
GridCornerLL=228 8414
Axis-invertX=0
AxisInvertY=0
swapXY=0
DatHeader=[19..65528]  PA-D 102 Full:CLS=8652 RWS=8652 XIN=1  YIN=1  VE=30        2.0 11/04/03 12:49:30 50205710  M10      HG-U133_Plus_2.1sq                  6
Algorithm=Percentile
AlgorithmParameters=Percentile:75;CellMargin:2;OutlierHigh:1.500;OutlierLow:1.004;AlgVersion:6.0;FixedCellSize:TRUE;FullFeatureWidth:7;FullFeatureHeight:7;IgnoreOutliersInShiftRows:FALSE;FeatureExtraction:TRUE;PoolWidthExtenstion:2;PoolHeightExtension:2;UseSubgrids:FALSE;RandomizePixels:FALSE;ErrorBasis:StdvMean;StdMult:1.000000

[INTENSITY]
NumberCells=1354896
CellHeader=X    Y   MEAN    STDV    NPIXELS
  0   0 147.0   23.5     25
  1   0 10015.0 1276.7   25
  2   0 160.0   24.7     25
  3   0 9710.0  1159.8   25
  4   0 85.0    14.0     25
  5   0 171.0   21.0     25
  6   0 11648.0 1678.4   25
  7   0 163.0   30.7     25
  8   0 12044.0 1430.1   25
  9   0 169.0   25.7     25
 10   0 11646.0 1925.6   25
 11   0 176.0   30.7     25

After reading, the format is changed, as shown below:

[screenshot: the data after reading, collapsed into a single column]

  1. I want to retain the format of rows and columns.
  2. I want to remove all the content before [INTENSITY] (OffsetY, GridCornerUL, and so on) shown in the first file.
Hashim
  • Are you looking for the `skip` argument to `read.table`? It will let you skip over the rows you don't want; just tell it how many lines to skip (a minimal sketch follows this comment thread). – Jota Apr 20 '15 at 08:06
  • @Frank What if I don't know the number of lines to skip? It depends on the data. – Hashim Apr 20 '15 at 08:08
  • If it is an Affymetrix CEL Data File Format, you have the `affy` package from Bioconductor. –  Apr 20 '15 at 08:30
  • @Pascal You are right, that is an option, but before reading it with `affy` I wanted to do some experimentation. – Hashim Apr 20 '15 at 09:15
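
For reference, a minimal sketch of the `skip` idea, under the assumption that you have counted the metadata lines by hand. In the sample above, 14 lines precede the CellHeader row, but the exact count depends on the file; the answers below compute it automatically:

# skip the metadata block so the CellHeader line is used as the header row;
# the count of 14 is illustrative for the sample shown, not a fixed value
dat <- read.table("file.txt", header = TRUE, skip = 14, check.names = FALSE)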

3 Answers


You could try:

txt <- readLines("file.txt")
# drop everything up to and including the NumberCells=... line,
# then re-read the remaining lines as a table
df <- read.csv(text = txt[-(1:grep("NumberCells=\\d+", txt))], check.names = FALSE)
# round-trip through a temporary CSV file
write.csv(df, tf <- tempfile(fileext = ".csv"), row.names = FALSE)

read.csv(tf, check.names = FALSE) # just to verify...
#    CellHeader=X    Y   MEAN    STDV    NPIXELS
# 1                    0   0 147.0   23.5     25
# 2                    1   0 10015.0 1276.7   25
# 3                    2   0 160.0   24.7     25
# 4                    3   0 9710.0  1159.8   25
# 5                    4   0 85.0    14.0     25
# 6                    5   0 171.0   21.0     25
# 7                    6   0 11648.0 1678.4   25
# 8                    7   0 163.0   30.7     25
# 9                    8   0 12044.0 1430.1   25
# 10                   9   0 169.0   25.7     25
# 11                  10   0 11646.0 1925.6   25
# 12                  11   0 176.0   30.7     25

This omits everything up to and including NumberCells=1354896.
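
Note that the [INTENSITY] block is whitespace-separated, not comma-separated, so `read.csv` keeps each line as a single field. If you want the values split into separate columns, a minimal variant (same filtering, `read.table` in place of `read.csv`) would be:

# read.table splits on whitespace, giving one column per field
df <- read.table(text = txt[-(1:grep("NumberCells=\\d+", txt))],
                 header = TRUE, check.names = FALSE)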

lukeA

If NumberCells= always appears immediately before the header row, then you can exploit this to tell you the number of lines to skip:

dat <- readLines("file.txt")
# grep returns the line number of the NumberCells line; skipping that many
# lines leaves the CellHeader row to be used as the header
read.table(textConnection(dat), header = TRUE, skip = grep("NumberCells", dat))
#   CellHeader.X Y  MEAN   STDV NPIXELS
#1             0 0   147   23.5      25
#2             1 0 10015 1276.7      25
#3             2 0   160   24.7      25
#4             3 0  9710 1159.8      25
#5             4 0    85   14.0      25
#6             5 0   171   21.0      25
#7             6 0 11648 1678.4      25
#8             7 0   163   30.7      25
#9             8 0 12044 1430.1      25
#10            9 0   169   25.7      25
#11           10 0 11646 1925.6      25
#12           11 0   176   30.7      25

Edit

Because your files have a lot of rows, you may want to limit the number of lines that readLines reads in. To do this, you need to know the maximum number of lines before your header row. For instance, if you know your header row will always come within the first 200 lines of the file, you can do:

dat <- readLines("file.txt", n = 200)  # only scan the first 200 lines
read.table("file.txt", header = TRUE, skip = grep("NumberCells", dat))
Jota
  • Thanks for your answer. I think it's feasible for small files; I waited some 7 minutes for my file and the process was still running. – Hashim Apr 20 '15 at 09:37
  • @Hashim You can add an argument to `readLines` to speed it up, e.g. `dat <- readLines("file.txt", n=100)`. You could use `n=100` if you know you will always need to skip fewer than 100 lines before you get to your header row. It's up to you to figure out what a reasonable value for `n` is. – Jota Apr 20 '15 at 09:41
  • Thanks for providing valuable information. The first line is fast enough, but `read.table`, the second line of code, may not be as speedy. – Hashim Apr 20 '15 at 09:44
  • @Hashim You may want to have a look at some of the other options for reading in lots of data. In particular [this answer](http://stackoverflow.com/questions/1727772/quickly-reading-very-large-tables-as-dataframes-in-r/1820610#1820610) may be useful. The `fread` function in the `data.table` package is probably going to offer speed improvements as well (a sketch follows this thread). – Jota Apr 20 '15 at 09:52
  • Thanks for your suggestion. It works well for keeping the format of rows and columns, as per my question; I just need to bump up the speed. – Hashim Apr 20 '15 at 10:04
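
As a sketch of the `fread` suggestion above (assuming, as in the sample, that the header line contains the string `CellHeader`): `fread` accepts a character value for `skip`, which makes it start reading at the first line containing that string.

library(data.table)
# a character skip makes fread jump to the first line containing the string,
# so the metadata block above [INTENSITY] is never parsed
DT <- fread("file.txt", skip = "CellHeader")
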

As you are using Linux, another option would be to pipe the output of `awk` into `read.table` or `fread`:

# awk drops lines 1 through the NumberCells line and prints the rest
read.table(pipe("awk 'NR==1, /NumberCells/ {next}{print}' Hashim.txt"),
           header = TRUE, check.names = FALSE)
#    CellHeader=X Y  MEAN   STDV NPIXELS
#1             0 0   147   23.5      25
#2             1 0 10015 1276.7      25
#3             2 0   160   24.7      25
#4             3 0  9710 1159.8      25
#5             4 0    85   14.0      25
#6             5 0   171   21.0      25
#7             6 0 11648 1678.4      25
#8             7 0   163   30.7      25
#9             8 0 12044 1430.1      25
#10            9 0   169   25.7      25
#11           10 0 11646 1925.6      25
#12           11 0   176   30.7      25
akrun
  • Thank you, but I get `Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : line 1354898 did not have 5 elements` – Hashim Apr 20 '15 at 10:26
  • You may have to use `fill=TRUE` in `read.table`, i.e. `read.table(pipe("awk 'NR==1, /NumberCells/ {next}{print}' Hashim.txt"), header=TRUE, check.names=FALSE, fill=TRUE)` – akrun Apr 20 '15 at 10:27
  • It goes like a rocket :) Thank you. – Hashim Apr 20 '15 at 10:31
  • @Hashim It should be fast, as we are using `awk` to read the lines. – akrun Apr 20 '15 at 10:32