how to sort and align 2 dataframes of unequal size based on row contents

Question

I have the results of two differential protein expression experiments which are UniProtKB IDs. I used Biomart to match the these IDs with gene names but not all IDs got matched with a gene name. I want to combine these and the original UniprotKB list and align by UniprotKB ID in order to produce a single dataset. It's my first time using R to do this type of analysis so all suggestions/code examples gratefully received.

dataframe 1

Gene_1  ID_1
Gdi2    G3GR73
Pitrm1  G3GR85
    G3GRA0
Tmem43  G3GS14
Tmf1    G3GS63
Ddx3x   G3GSH5
Bdh1    G3GSJ7
Pak2    G3GSK4
Tfrc    G3GSM5
Umps    G3GSP0
Gart    G3GT56
Pgm3    G3GTC9
Cpt2    G3GTN3
Vps26b  G3GTV9
Mthfd1l G3GU10
Rbm19   G3GU41
    G3GU60
Prkab1  G3GU67
Tigar   G3GUK0

dataframe 2

Gene_2  ID_2
Bak1    A1E3K4
Pitrm1  G3GR85
Gtpbp4  G3GR93
Lbr G3GRA0
Tmem43  G3GS14
Tmf1    G3GS63
Ddx3x   G3GSH5
Bdh1    G3GSJ7
Tfrc    G3GSM5
Umps    G3GSP0
Gart    G3GT56
Pgm3    G3GTC9
Grb2    G3GTE4
Cpt2    G3GTN3
Vps26b  G3GTV9
Mthfd1l G3GU10
Rbm19   G3GU41
    G3GU60
Prkab1  G3GU67

original data

ID_3
A1E3K4
G3GR73
G3GR85
G3GR93
G3GRA0
G3GRB1
G3GRB9
G3GRD8
G3GRE1
G3GRM2
G3GRT0
G3GRW3
G3GRX2
G3GS14
G3GS63
G3GS70
G3GS82
G3GSH2
G3GSH5

I have tried cbind and match_order functions but they dont do exactly what I want. Also tried creating a dataframe from 2 of the data sets but can't as they are different sizes.

joint <- data.frame(ori$ID_3, df_1$ID_1, df_1$Gene_1)

Error in data.frame(ori$ID_3, df_1$ID_1, df_1$Gene_1) : arguments imply differing number of rows: 1255, 544

the aim is to end up with something like this for all approx 1300 entries in dataset 3

Gene_1  ID_1    Gene_2  ID_2    ID_3
        Bak1    A1E3K4  A1E3K4
Gdi2    G3GR73          G3GR73
Pitrm1  G3GR85  Pitrm1  G3GR85  G3GR85
        Gtpbp4  G3GR93  G3GR93
    G3GRA0  Lbr G3GRA0  G3GRA0
                G3GRB1
                G3GRB9
                G3GRD8
                G3GRE1
                G3GRM2
                G3GRT0
                G3GRW3
                G3GRX2
Tmem43  G3GS14  Tmem43  G3GS14  G3GS14
Tmf1    G3GS63  Tmf1    G3GS63  G3GS63
                G3GS70
                G3GS82
                G3GSH2
Ddx3x   G3GSH5  Ddx3x   G3GSH5  G3GSH5
                G3GSJ5
Bdh1    G3GSJ7  Bdh1    G3GSJ7  G3GSJ7
Pak2    G3GSK4          G3GSK4
                G3GSL6
Tfrc    G3GSM5  Tfrc    G3GSM5  G3GSM5

It's my first time using R to do this type of analysis so all suggestions/code examples gratefully received.

Update 1

Based on Stewarts code and suggestions I ended up with the following. file 1 has 2 columns and 544 observations, file 2 has 2 columns and 419 observations, file 3 has 1254 observations. The files have joined correctly but the final file has only 33 observations rather than 1254.


getwd()

file1 <- read.csv("cigr_db.csv", sep=",", header=T)

file2 <- read.csv("picr_db.csv", sep=",", header=T)

file3 <- read.csv("progen_data.csv", sep=",", header=T)

# Change the ID column name to be the same in each dataframe, so we can match on it
colnames(file1)[2] <- 'ID'
colnames(file2)[2] <- 'ID'
colnames(file3)[1] <- 'ID'
v <- plyr::join(df1, df2, type='full')
v <- plyr::join(v, df3, type='full')

v

write.csv(v, file = "all_condt.csv")````

could you repost your data with `dput()` it will make it easier to to help you. Right now I'm just guessing what the data looks like — infominer, Jul 11 '19 at 00:15
This type of operation is called a *join*. I'd strongly suggest reading the FAQ [How to join (merge) data frames](https://stackoverflow.com/q/1299871/903061). — Gregor Thomas, Jul 11 '19 at 00:24
Can you also add the results of `dput(head(file1))` (and the same for `file2` and `file3`) so we have some sample data to test with? Your code looks like it should work, so I don't know what's going on. — Stewart Macdonald, Jul 11 '19 at 12:34
@Stewart, I found the mistake. I read my files in as file1, file 2, etc but joined df1, df2 etc. It just so happens that df1 and df2 head already existed and held similar data. Thanks for all your help. I think this chunk of code will save me weeks sweating with excel. — Peadar, Jul 11 '19 at 12:50
Glad to hear you got it working. If you're happy that you've solved this issue, mark it as answered so that others know it's resolved. — Stewart Macdonald, Jul 11 '19 at 13:07

score 0 · Accepted Answer · answered Jul 11 '19 at 00:42

Do you need the ID_X column duplicated, given that all three of those columns will contain the same value? If not, you could use the join function from the plyr package:

library(plyr)

file1 <- 'Gene_1,ID_1
Gdi2,G3GR73
Pitrm1,G3GR85
,G3GRA0
Tmem43,G3GS14
Tmf1,G3GS63
Ddx3x,G3GSH5
Bdh1,G3GSJ7
Pak2,G3GSK4
Tfrc,G3GSM5
Umps,G3GSP0
Gart,G3GT56
Pgm3,G3GTC9
Cpt2,G3GTN3
Vps26b,G3GTV9
Mthfd1l,G3GU10
Rbm19,G3GU41
,G3GU60
Prkab1,G3GU67
Tigar,G3GUK0'

df1 <- read.table(textConnection(file1), sep=",", header=T)

file2 <- 'Gene_2,ID_2
Bak1,A1E3K4
Pitrm1,G3GR85
Gtpbp4,G3GR93
Lbr,G3GRA0
Tmem43,G3GS14
Tmf1,G3GS63
Ddx3x,G3GSH5
Bdh1,G3GSJ7
Tfrc,G3GSM5
Umps,G3GSP0
Gart,G3GT56
Pgm3,G3GTC9
Grb2,G3GTE4
Cpt2,G3GTN3
Vps26b,G3GTV9
Mthfd1l,G3GU10
Rbm19,G3GU41
,G3GU60
Prkab1,G3GU67'

df2 <- read.table(textConnection(file2), sep=",", header=T)

file3 <- 'ID_3
A1E3K4
G3GR73
G3GR85
G3GR93
G3GRA0
G3GRB1
G3GRB9
G3GRD8
G3GRE1
G3GRM2
G3GRT0
G3GRW3
G3GRX2
G3GS14
G3GS63
G3GS70
G3GS82
G3GSH2
G3GSH5'

df3 <- read.table(textConnection(file3), sep=",", header=T)

# Change the ID column name to be the same in each dataframe, so we can match on it
colnames(df1)[2] <- 'ID'
colnames(df2)[2] <- 'ID'
colnames(df3)[1] <- 'ID'
v <- plyr::join(df1, df2, type='full')
v <- plyr::join(v, df3, type='full')

Which gives:

> v
    Gene_1     ID  Gene_2
1     Gdi2 G3GR73    <NA>
2   Pitrm1 G3GR85  Pitrm1
3          G3GRA0     Lbr
4   Tmem43 G3GS14  Tmem43
5     Tmf1 G3GS63    Tmf1
6    Ddx3x G3GSH5   Ddx3x
7     Bdh1 G3GSJ7    Bdh1
8     Pak2 G3GSK4    <NA>
9     Tfrc G3GSM5    Tfrc
10    Umps G3GSP0    Umps
11    Gart G3GT56    Gart
12    Pgm3 G3GTC9    Pgm3
13    Cpt2 G3GTN3    Cpt2
14  Vps26b G3GTV9  Vps26b
15 Mthfd1l G3GU10 Mthfd1l
16   Rbm19 G3GU41   Rbm19
17         G3GU60        
18  Prkab1 G3GU67  Prkab1
19   Tigar G3GUK0    <NA>
20    <NA> A1E3K4    Bak1
21    <NA> G3GR93  Gtpbp4
22    <NA> G3GTE4    Grb2
23    <NA> G3GRB1    <NA>
24    <NA> G3GRB9    <NA>
25    <NA> G3GRD8    <NA>
26    <NA> G3GRE1    <NA>
27    <NA> G3GRM2    <NA>
28    <NA> G3GRT0    <NA>
29    <NA> G3GRW3    <NA>
30    <NA> G3GRX2    <NA>
31    <NA> G3GS70    <NA>
32    <NA> G3GS82    <NA>
33    <NA> G3GSH2    <NA>

Hi all, thanks for those suggestions. There is some helpful advice there. — Peadar, Jul 11 '19 at 11:27
@Steward You sir, are a godsend. However, I do now have one further query. The genes and id's i posted were taken from much larger files. I read the files into R rather than the raw data and tried to use your code to join them. file1 <- read.csv("cigr_db.csv") df1 <- read.table(textConnection(file1), sep=",", header=T) and get the following Error in textConnection(file1) : invalid 'text' argument. — Peadar, Jul 11 '19 at 11:32
Yep. Get rid of the `textConnection()` bit. That's used to read the text directly from a variable. You just want to read them directly from a file, using `read.csv` in the usual manner. — Stewart Macdonald, Jul 11 '19 at 11:38
I am assuming that this is to do with tables as opposed to .csv files. But Iam unsure how to change that. While using tables is not an issue it would be easier to just read in and join the text file which is outputted by Biomart. — Peadar, Jul 11 '19 at 11:38
that worked with one final snag. file three has 1254 observations. lots of these will not have matches in files 1 and 2 but that's ok. However the joined file has only 33 observations. Is there a way to get all 1254 observations in the joined file. Am a complete novice. I read the file in as follows file1 <- read.csv("cigr_db.csv", sep=",", header=T) and kept everything else the same. Am happy to share the files if necessary and you advise how. Sorry for all the questions. — Peadar, Jul 11 '19 at 11:52
Can you post an updated version of your complete code? Just edit the question, leaving the original text and adding the new code at the bottom after an "Update 1" heading. — Stewart Macdonald, Jul 11 '19 at 11:58
Also, make sure you read the link that @Gregor posted so that you have a good understanding of what's happening. Getting your head around joins will be really useful if you continue working in R. The type of join you want is a `full` or `outer` join, which should return all rows, even if they don't have matches in the other table. — Stewart Macdonald, Jul 11 '19 at 12:04

how to sort and align 2 dataframes of unequal size based on row contents

1 Answers1