Sort a data frame column based on another txt file

Question

I have a txt file of transcripts ids (transcripts) and a dataframe (t2g) with 3 columns (transcripts ids, gene ids, gene symbols), they have the same length.

> head(transcripts)
                     V1
1  ENSMUST00000000266.8
2 ENSMUST00000000514.10
3  ENSMUST00000000834.3
4  ENSMUST00000001027.6
5 ENSMUST00000001166.13
6 ENSMUST00000001171.12

> head(t2g)
                    V1                   V2            V3
1 ENSMUST00000001171.12 ENSMUSG00000102693.1 4933401J01Rik
2 ENSMUST00000001027.6 ENSMUSG00000064842.1       Gm26206
3 ENSMUST00000000834.3 ENSMUSG00000051951.5          Xkr4
4 ENSMUST00000000266.8 ENSMUSG00000051951.5          Xkr4
5 ENSMUST00000000514.10 ENSMUSG00000051951.5          Xkr4
6 ENSMUST00000001166.13 ENSMUSG00000102851.1       Gm18956

I'd need the data frame ordered as the txt file, that cannot be sorted.

I started in R by renaming dataframe rownames as the transcripts in the txt file

> vector <- transcripts[,1]
> rownames(t2g) <- vector 
> head(t2g)
                                        V1                   V2            V3
ENSMUST00000000266.8  ENSMUST00000001171.12 ENSMUSG00000102693.1 4933401J01Rik
ENSMUST00000000514.10 ENSMUST00000001027.6 ENSMUSG00000064842.1       Gm26206
ENSMUST00000000834.3  ENSMUST00000000834.3 ENSMUSG00000051951.5          Xkr4
ENSMUST00000001027.6  ENSMUST00000000266.8 ENSMUSG00000051951.5          Xkr4
ENSMUST00000001166.13 ENSMUST00000000514.10 ENSMUSG00000051951.5          Xkr4
ENSMUST00000001171.12 ENSMUST00000001166.13 ENSMUSG00000102851.1       Gm18956

But I don't know how to move forward from here. Any idea? Thank you

Your output can be mimicked with `cbind(transcripts, t2g)`. You could also just do `t2g$transcript <- transcripts$V1`. — r2evans, Jan 07 '21 at 18:19
Hi, thank you for your answer. These two methods only add an extra column to the data frame but they don't change the order of the other columns. I need all the columns to be re-ordered based on the transcripts.txt file — bman, Jan 07 '21 at 18:47
I admit that I have *absolutely no idea* what you need. You said *"ordered as the txt file"*, but since there are not matching elements, then discussing order in this fashion seems meaningless. Your suggested output has done nothing to the sort order either. Please update your sample data to provide clarity. — r2evans, Jan 07 '21 at 19:22
You're right, I uploaded only the head of the files so it's not clear at all. Transcript names are the same in the 2 files. Updated now — bman, Jan 07 '21 at 19:58

score 1 · Answer 1 · answered Jan 07 '21 at 20:10

Assuming the IDs are unique, we can use match:

t2g[match(transcripts[[1]], t2g[[1]]), ]
#                      V1                   V2            V3
# 4  ENSMUST00000000266.8 ENSMUSG00000051951.5          Xkr4
# 5 ENSMUST00000000514.10 ENSMUSG00000051951.5          Xkr4
# 3  ENSMUST00000000834.3 ENSMUSG00000051951.5          Xkr4
# 2  ENSMUST00000001027.6 ENSMUSG00000064842.1       Gm26206
# 6 ENSMUST00000001166.13 ENSMUSG00000102851.1       Gm18956
# 1 ENSMUST00000001171.12 ENSMUSG00000102693.1 4933401J01Rik

Using this data:

transcripts = read.table(text = '                     V1
1  ENSMUST00000000266.8
2 ENSMUST00000000514.10
3  ENSMUST00000000834.3
4  ENSMUST00000001027.6
5 ENSMUST00000001166.13
6 ENSMUST00000001171.12', header = T)

t2g = read.table(text = '                    V1                   V2            V3
1 ENSMUST00000001171.12 ENSMUSG00000102693.1 4933401J01Rik
2 ENSMUST00000001027.6 ENSMUSG00000064842.1       Gm26206
3 ENSMUST00000000834.3 ENSMUSG00000051951.5          Xkr4
4 ENSMUST00000000266.8 ENSMUSG00000051951.5          Xkr4
5 ENSMUST00000000514.10 ENSMUSG00000051951.5          Xkr4
6 ENSMUST00000001166.13 ENSMUSG00000102851.1       Gm18956', header = T)

r2evans · Answer 2 · 2021-01-07T21:15:18.347

Another option is to merge the data

EDIT: as @GregorThomas corrected me, base::merge sorts the data by default. Adding sort=FALSE should correct that.

merge(transcripts, t2g, by = "V1", all = TRUE, sort = FALSE)
#                      V1                   V2            V3
# 1  ENSMUST00000000266.8 ENSMUSG00000051951.5          Xkr4
# 2 ENSMUST00000000514.10 ENSMUSG00000051951.5          Xkr4
# 3  ENSMUST00000000834.3 ENSMUSG00000051951.5          Xkr4
# 4  ENSMUST00000001027.6 ENSMUSG00000064842.1       Gm26206
# 5 ENSMUST00000001166.13 ENSMUSG00000102851.1       Gm18956
# 6 ENSMUST00000001171.12 ENSMUSG00000102693.1 4933401J01Rik

Something to look out for is NA in columns (other than V1).

if a V1 exists in transcripts but not in t2g, then all of the other (non-V1) t2g columns will be NA;
if a V1 exists in t2g but not in transcripts, then all of the other (non-V1) transcripts columns will be NA. Granted, in this case there are no other columns, so this will not be an issue.

The number of rows returned will be no less than the number of rows in either of the frames. It will be greater if there are any rows in one and not the other.

This is what is known as a "full join" between two frames/tables. There are two good links for learning about the different types of joins:

How to join (merge) data frames (inner, outer, left, right) gives a good description of each;
and https://stackoverflow.com/a/6188334/3358272 gives good venn-diagram pictures

In short:

"left join" preserves all rows from the left-hand side (LHS), which in this case is transcripts (the first of the two frames merged); rows in the RHS frame that do not match any of the by elements in the LHS will be dropped; if the matches are at most 1-to-1 (meaning there are no duplicate ids in the second frame), then this will produce the same number of rows in the output as the LHS frame; if there are duplicate ids in the RHS, then the number of rows will increase;
"right join" similar, but reverse LHS/RHS;
"full join" preserves all rows from both sides; this should never produce fewer rows than the longest frame
(there are several more types)

Demonstration of GregorThomas's point about merge(..., sort).

First, change the order, where I swap rows 3-4:

transcripts <- read.table(header = TRUE, text = "
                     V1 V2
1  ENSMUST00000000266.8  1
2 ENSMUST00000000514.10  2
4  ENSMUST00000001027.6  4
3  ENSMUST00000000834.3  3
5 ENSMUST00000001166.13 5
6 ENSMUST00000001171.12  6", stringsAsFactors = TRUE)

Now the two merges, one with the default of sort=TRUE (implied), one not sorted.

merge(transcripts, t2g, by = "V1", all = TRUE)
#                      V1 V2.x                 V2.y            V3
# 1  ENSMUST00000000266.8    1 ENSMUSG00000051951.5          Xkr4
# 2 ENSMUST00000000514.10    2 ENSMUSG00000051951.5          Xkr4
# 3  ENSMUST00000000834.3    3 ENSMUSG00000051951.5          Xkr4
# 4  ENSMUST00000001027.6    4 ENSMUSG00000064842.1       Gm26206
# 5 ENSMUST00000001166.13    5 ENSMUSG00000102851.1       Gm18956
# 6 ENSMUST00000001171.12    6 ENSMUSG00000102693.1 4933401J01Rik
merge(transcripts, t2g, by = "V1", all = TRUE, sort = FALSE)
#                      V1 V2.x                 V2.y            V3
# 1  ENSMUST00000000266.8    1 ENSMUSG00000051951.5          Xkr4
# 2 ENSMUST00000000514.10    2 ENSMUSG00000051951.5          Xkr4
# 3  ENSMUST00000001027.6    4 ENSMUSG00000064842.1       Gm26206
# 4  ENSMUST00000000834.3    3 ENSMUSG00000051951.5          Xkr4
# 5 ENSMUST00000001166.13    5 ENSMUSG00000102851.1       Gm18956
# 6 ENSMUST00000001171.12    6 ENSMUSG00000102693.1 4933401J01Rik

The second shows my expected output. Thanks again.

I think this will actually `sort` the columns, lexicograhically, not preserve the order in `transcripts`? So I think (assuming all IDs have matches) this is equivalent to `t2g[order(t2g$V1), ]`? — Gregor Thomas, Jan 07 '21 at 20:17
This use of `merge` produces a left-join, which should preserve the order of the LHS. Unless I'm mistaken, doesn't left-join ensure the order and row-presence of the LHS? I've always assumed it does, and haven't *seen* warnings/errors that would result from this assumption being wrong. — r2evans, Jan 07 '21 at 20:17
@GregorThomas, I meant "full join", but it should hold true for left joins too. — r2evans, Jan 07 '21 at 20:26
Yes the ids are unique. Something is wrong with `merge` because my output is simply sorted numerically. `match` works just fine! Thank you!! — bman, Jan 07 '21 at 20:30
I think the `dplyr` join functions are good about preserving order, but `?base::merge` says *"The rows are by default lexicographically sorted on the common columns, but for `sort = FALSE` are in an unspecified order."* I remember this as a big annoyance with `merge` years ago when I first started using `ggmap` - merging geo shape data with other data with `merge` would mess up the plots unless you added an ID column first to sort by at the end. In the early releases of `dplyr` their fidelity to the original data order was touted as an advantage over `merge`. — Gregor Thomas, Jan 07 '21 at 20:53
Oh man, great point @GregorThomas, I've been using `dplyr::*_join` and otherwise explicitly sorting elsewhere that I had forgotten that point. But I recall it now, and you are absolutely right. I need to practice what I preach more (rtfm). Thanks for the correction! — r2evans, Jan 07 '21 at 21:12

Sort a data frame column based on another txt file

2 Answers2