0

I have two files. Each file contains a certain # of rows and columns. The second column in each file contains a unique GENE ID that are identical in both files (they are just in different rows). So for example

File 1 could have the gene "LINC00273" in row 3 (in the second column) and File 2 could have "LINC00273" in row 4002 (in the second column).

What I want to do is I want to sort File 2 so that if the first GENE in row 1 of File 1 is "VAGAB" then the first GENE of row 1 of File 2 is "VAGAB" with its entire row intact (the row has coverage information so I need the numbers to be moved accordingly).

Example: File 1 [#] corresponds to the row number

[1] GENE100X VAGAB VAGAB 1.0 1.0 1.0 5.0 11198.0 
...
[5000] GENE900X ZZZZ ZZZZ 1.0 0 4.2 1.0 78.0

File 2

[1] GENE44X AAAA AAAA 1.0 1.0 1.0 1.0 1.0
...
[5000] GENE106X VAGAB VAGAB 5.0 5.0 5.0 1.0 55.0

The expected output for file 2 would be:

[1] GENE106X VAGAB VAGAB 5.0 5.0 5.0 1.0 55.0
...
[5000] GENE300X ZZZZ ZZZZ 88.0 88.0 1.0 1.0 1.0

All files are tab delimited. All the numbers in examples (row numbers and column numbers) are made up. The only column that never changes is columns 2 and 3 these are the identical ones in both files so I want to sort using these.

I dont want to create a third file if it can be helped. I simply want the rows in file 2 to be identical to the rows in file 1 and the way to do this is to match them by their "Gene" name found in column 2 of both files.

R solutions preferred as that is what i'm currently learning, but I will take ANY solutions in any language and then I can use that as a template to come up with my own R solution as practice later.

Jonathan Hall
  • 75,165
  • 16
  • 143
  • 189
System
  • 295
  • 1
  • 4
  • 13

2 Answers2

2

Let's set up some example data first:

file1 <- data.frame(id=1:3,name=letters[c(3,1,2)])
#  id name
#1  1    c
#2  2    a
#3  3    b

file2 <- file1[c(2,3,1),]
file2$id <- paste("two",file2$id,sep="")
#    id name
#2 two2    a
#3 two3    b
#1 two1    c

Then match the rows using, well match:

file2[match(file1$name,file2$name),]
#    id name
#1 two1    c
#2 two2    a
#3 two3    b

Which is essentially the same as merging all the file2 data against the matching rows of the file1 data.

merge(file1["name"], file2, by="name", sort=FALSE)
#  name   id
#1    c two1
#2    a two2
#3    b two3
thelatemail
  • 91,185
  • 12
  • 128
  • 188
  • I guess I was using the merge function incorrectly, I should have spent more time reading the documentation. This seemed to have worked just by looking at the first few lines of data. If I want to write out this data to a file, would I simply use a write.table command? – System Jan 21 '16 at 00:38
  • @System - there is some nuance to it - the `sort=FALSE` is a bit funny for certain cases so `match` might be ideal for you. You can write it out if you like using `write.table/csv` yep. – thelatemail Jan 21 '16 at 00:41
  • Everything looks good so far. Some of the column ID's got reordered but that's nothing that a quick awk fix can't take care of. I've accepted this as the answer! – System Jan 21 '16 at 00:45
1

Load both files into data frames (sample data used below).

file1 <- data.frame(id=1:3,name=letters[c(3,1,2)])
file2 <- file1[c(2,3,1),]

Then you can define a sort sequence with factors:

file1$name <- factor(file1$name, levels = file1$name)
file2$name <- factor(file2$name, levels = file1$name)

Then sort file2:

file2 <- file2[with(file2, order(name)) , ]
Mist
  • 1,888
  • 1
  • 14
  • 21
  • This doesn't seem to retain all the information in the other columns. Is this expected? – System Jan 21 '16 at 00:04
  • So I re-tried this and I managed to retain all the info. But it isn't sorted correctly. It seems to sort based on A -> Z column 2 names instead of sorting them to the names in File 1 – System Jan 21 '16 at 00:18
  • OK, I've edited it above and it works now. Is that what you are after? – Mist Jan 21 '16 at 00:47
  • This seems to work as well. Let me test it on some of my real data instead my test data and i'll let you know! – System Jan 21 '16 at 00:49