Splitting and pairing the values of two columns

Question

I have a data frame like this:

Entry name  Gene names
A1BG_HUMAN      A1BG
M0R009_HUMAN    A1BG
F8W9F8_HUMAN    A1CF
Q5T0W7_HUMAN    A1CF
A1CF_HUMAN      A1CF ACF ASP
H0YFH1_HUMAN    A2M
A2MG_HUMAN      A2M CPAMD5 FWP007

In the 1st column I have protein names and in the 2nd column I have related genes. In front of some of the proteins multiple gene names can be seen which are basically aliases of the first gene in that cell (separated by 1 space).

I want to convert this data set to a form which every protein name is paired with different gene names, so that I have sth like this:

Entry name  Gene names
A1BG_HUMAN      A1BG
M0R009_HUMAN    A1BG
F8W9F8_HUMAN    A1CF
F8W9F8_HUMAN    ACF
F8W9F8_HUMAN    ASP
Q5T0W7_HUMAN    A1CF
Q5T0W7_HUMAN    ACF
Q5T0W7_HUMAN    ASP
A1CF_HUMAN      A1CF
A1CF_HUMAN      ACF
A1CF_HUMAN      ASP
H0YFH1_HUMAN    A2M
H0YFH1_HUMAN    CPAMD5
H0YFH1_HUMAN    FWP007
A2MG_HUMAN      A2M
A2MG_HUMAN      CPAMD5
A2MG_HUMAN      FWP007

I know how to split the cells with multiple entries into different rows but I am not sure how to pair proteins in the first column with different aliases of a gene.

Does any one have any idea on how to do it?

EDIT: I don't want to only split the data into different rows. So cSplit will not actually help me here. Let me bring an example:

In front of A1CF_HUMAN, different aliases of A1CF gene are brought (ACF & ASP). I want to not only pair A1CF_HUMAN with ACF & ASP, but also pair other proteins which are associated with A1CF gene (F8W9F8_HUMAN & Q5T0W7_HUMAN) with ACF & ASP as well. Please take a look at what I want above to better understand what I am exactly looking for. I don't think that it can be done by a single command.

`I want to convert this data set to a form which every protein name is paired with different gene names, so that I have sth like this:` Does this `sth like` the exact output you expected or you just included some combinations? — akrun, Jul 19 '16 at 07:26
@akrun I did not reply directly to you. My post was flagged and I wanted to specify that I am not looking for what `cSplit` does. I replied to Martin's post as well. Yes, this is exactly what I want, just the order of rows is not important. Martin's answer does the job so thank you very much for your help. — Ali, Jul 19 '16 at 14:04
@Learner My data has ~ 70,000 rows. the `dput` of the above portion is: `structure(list(Entry.name = structure(c(1L, 8L, 6L, 9L, 2L, 7L, 4L, 5L, 3L), .Label = c("A1BG_HUMAN", "A1CF_HUMAN", "A2MG_HUMAN", "F5H1E8_HUMAN", "F8W7L3_HUMAN", "F8W9F8_HUMAN", "H0YFH1_HUMAN", "M0R009_HUMAN", "Q5T0W7_HUMAN"), class = "factor"), Gene.names = structure(c(1L, 1L, 2L, 2L, 3L, 4L, 4L, 4L, 5L), .Label = c("A1BG", "A1CF", "A1CF ACF ASP", "A2M", "A2M CPAMD5 FWP007"), class = "factor")), .Names = c("Entry.name", "Gene.names"), class = "data.frame", row.names = c(NA, -9L))` — Ali, Jul 19 '16 at 14:06

Martin Morgan · Accepted Answer · 2016-07-19T09:27:47.730

2

Assuming the first element is always the 'key' and the rest are aliases, split the gene names, identify the key, then group all aliases by key, and standardize each element to include aliases

elts = strsplit(df$Gene_names, " ")
keys = sapply(elts, "[[", 1)
values = split(unlist(elts), rep(keys, lengths(elts)))
df$Gene_names = lapply(values, unique)[keys]

Use the length of each standardized gene name to replicate the entry names, and match these to the unlisted, split gene names

data.frame(
    Entry_name = rep(df$Entry_name, lengths(df$Gene_names)),
    Gene_name = unlist(df$Gene_names))

edited Jul 19 '16 at 09:27

answered Jul 19 '16 at 04:58

Martin Morgan

45,935
7
84
112

Thanks Martin for your reply but as I explained more, this is not exactly what I am interested in. – Ali Jul 19 '16 at 06:10

akrun · Answer 2 · 2016-07-19T18:35:20.880

We can use cSplit with data.table. We convert the 'data.frame' to 'data.table' (setDT(df)), extract the first word (word(Gene_names,1) from stringr should also work) from 'Gene_names' with sub, get a logical index using duplicated, we use the cumulative sum of the logical vector as grouping variable, and assign "Gene_names" to the one with the longest characters. Then, use cSplit to convert it to 'long' format.

library(splitstackshape)
library(data.table)
setDT(df)[, Gene_names := Gene_names[which.max(nchar(Gene_names))] , 
           cumsum(!duplicated(sub("\\s+.*", "", Gene_names)))][]
cSplit(df, "Gene_names", " ", "long")
#      Entry_name Gene_names
# 1:   A1BG_HUMAN       A1BG
# 2: M0R009_HUMAN       A1BG
# 3: F8W9F8_HUMAN       A1CF
# 4: F8W9F8_HUMAN        ACF
# 5: F8W9F8_HUMAN        ASP
# 6: Q5T0W7_HUMAN       A1CF
# 7: Q5T0W7_HUMAN        ACF
# 8: Q5T0W7_HUMAN        ASP
# 9:   A1CF_HUMAN       A1CF
#10:   A1CF_HUMAN        ACF
#11:   A1CF_HUMAN        ASP
#12: H0YFH1_HUMAN        A2M
#13: H0YFH1_HUMAN     CPAMD5
#14: H0YFH1_HUMAN     FWP007
#15:   A2MG_HUMAN        A2M
#16:   A2MG_HUMAN     CPAMD5
#17:   A2MG_HUMAN     FWP007

data

df <- structure(list(Entry_name = c("A1BG_HUMAN", "M0R009_HUMAN", 
"F8W9F8_HUMAN", 
"Q5T0W7_HUMAN", "A1CF_HUMAN", "H0YFH1_HUMAN", "A2MG_HUMAN"), 
Gene_names = c("A1BG", "A1BG", "A1CF ACF ASP", "A1CF ACF ASP", 
"A1CF ACF ASP", "A2M CPAMD5 FWP007", "A2M CPAMD5 FWP007")),
.Names = c("Entry_name", 
"Gene_names"), class = "data.frame", row.names = c(NA, -7L))

Thank you very much for your help! but it does not give me the result I want. — Ali, Jul 19 '16 at 14:10
@Ali Now, it does give the expected output you showed in the post. However, you are free to say that it doesn't work. — akrun, Jul 19 '16 at 18:28
haha Thanks so much! It works as I expected this time! :) Unfortunately I don't have enough reputation to upvote your answer... — Ali, Jul 20 '16 at 15:50

Splitting and pairing the values of two columns

2 Answers2

data