
I have a dataset I downloaded from The Human Protein Atlas which has annotations for the subcellular localization of 12,004 proteins. I've subset this file to include only "Gene name" and four columns indicating how reliable each location is (based on immunofluorescently stained cells). These are, in decreasing order of reliability, "Validated" > "Supported" > "Approved" > "Uncertain".

I've come up with a scoring system I would like to apply to an LC-MS spectral count dataset I have by 1) weighting the quality of annotation and 2) penalizing how many locations the protein is found in. (Image: proposed scoring system.)

The TL;DR is that I need to count how many terms there are in each column of the following data set and get a data frame of this information.

df <- read.csv("proteinAtlas.csv")
dput(df)
structure(list(Gene_symbol = structure(1:49, .Label = c("AAAS", 
"AAMP", "AAR2", "AARD", "AARS", "AARS2", "AARSD1", "ABCA13", 
"ABCB6", "ABCB7", "ABCB8", "ABCC1", "ABCC4", "ABCD3", "ABCE1", 
"ABCF1", "ABCF2", "ABCF3", "ABHD10", "ABHD14B", "ABHD6", "ABI1", 
"ABI2", "ABL2", "ACAA1", "ACAA2", "ACACA", "ACAD9", "ACADM", 
"ACADS", "ACADVL", "ACAP1", "ACAP2", "ACAT1", "ACAT2", "ACBD3", 
"ACBD5", "ACIN1", "ACLY", "ACO2", "ACOT1", "ACOT13", "ACOT2", 
"ACOT7", "ACOT8", "ACOT9", "ACOX1", "ACP1", "ACP5"), class = "factor"), 
    Validated = structure(c(1L, 2L, 1L, 1L, 2L, 4L, 1L, 1L, 3L, 
    1L, 1L, 1L, 1L, 5L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    5L, 1L, 1L, 4L, 4L, 1L, 1L, 1L, 1L, 4L, 1L, 1L, 5L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 6L, 1L, 1L), .Label = c("", "Cytosol", 
    "Golgi apparatus", "Mitochondria", "Peroxisomes", "Vesicles"
    ), class = "factor"), Supported = structure(c(1L, 9L, 1L, 
    1L, 1L, 1L, 1L, 1L, 5L, 10L, 10L, 12L, 1L, 1L, 1L, 1L, 4L, 
    1L, 1L, 6L, 1L, 3L, 1L, 11L, 1L, 10L, 2L, 1L, 1L, 10L, 10L, 
    1L, 1L, 1L, 4L, 8L, 1L, 11L, 7L, 10L, 1L, 1L, 1L, 4L, 13L, 
    1L, 1L, 1L, 1L), .Label = c("", "Actin filaments;Cytosol", 
    "Cell Junctions;Plasma membrane", "Cytosol", "Cytosol;Mitochondria;Nucleoplasm;Plasma membrane", 
    "Cytosol;Nucleoli;Nucleus", "Cytosol;Nucleoplasm;Plasma membrane", 
    "Golgi apparatus", "Microtubules", "Mitochondria", "Nucleoplasm", 
    "Plasma membrane", "Vesicles"), class = "factor"), Approved = structure(c(3L, 
    1L, 5L, 12L, 1L, 1L, 6L, 4L, 1L, 1L, 17L, 1L, 8L, 1L, 1L, 
    1L, 1L, 7L, 13L, 1L, 16L, 1L, 15L, 1L, 1L, 1L, 14L, 1L, 1L, 
    15L, 17L, 18L, 11L, 1L, 17L, 1L, 1L, 1L, 1L, 1L, 13L, 2L, 
    13L, 15L, 13L, 9L, 17L, 10L, 5L), .Label = c("", "Cell Junctions", 
    "Centrosome;Cytosol;Nuclear membrane", "Centrosome;Cytosol;Vesicles", 
    "Cytosol", "Cytosol;Nuclear membrane", "Cytosol;Nucleoli", 
    "Cytosol;Nucleoli;Plasma membrane", "Cytosol;Nucleoplasm;Plasma membrane", 
    "Cytosol;Nucleus", "Endosomes", "Lipid droplets", "Mitochondria", 
    "Nucleoli fibrillar center", "Nucleoplasm", "Nucleoplasm;Vesicles", 
    "Nucleus", "Vesicles"), class = "factor"), Uncertain = structure(c(1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 
    1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L), .Label = c("", "Cytosol;Plasma membrane", "Nucleoli"
    ), class = "factor")), .Names = c("Gene_symbol", "Validated", 
"Supported", "Approved", "Uncertain"), class = "data.frame", row.names = c(NA, 
-49L))

So the ideal output would look like this figure or, if you prefer, dput():

structure(list(Gene_symbol = structure(1:29, .Label = c("AAAS", 
"AAMP", "AAR2", "AARD", "AARS", "AARS2", "AARSD1", "ABCA13", 
"ABCB6", "ABCB7", "ABCB8", "ABCC1", "ABCC4", "ABCD3", "ABCE1", 
"ABCF1", "ABCF2", "ABCF3", "ABHD10", "ABHD14B", "ABHD6", "ABI1", 
"ABI2", "ABL2", "ACAA1", "ACAA2", "ACACA", "ACAD9", "ACADM"), class = "factor"), 
    Validated = c(NA, 1L, NA, NA, 1L, 1L, NA, NA, 1L, NA, NA, 
    NA, NA, 1L, 1L, 1L, NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA, 
    NA, 1L, 1L), Supported = c(NA, 1L, NA, NA, NA, NA, NA, NA, 
    4L, 1L, 1L, 1L, NA, NA, NA, NA, 1L, NA, NA, 3L, NA, 2L, NA, 
    1L, NA, 1L, 2L, NA, NA), Approved = c(3L, NA, 1L, 1L, NA, 
    NA, 2L, 3L, NA, NA, 1L, NA, 3L, NA, NA, NA, NA, 2L, 1L, NA, 
    2L, NA, 1L, NA, NA, NA, 1L, NA, NA), Uncertain = c(NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), .Names = c("Gene_symbol", 
"Validated", "Supported", "Approved", "Uncertain"), class = "data.frame", row.names = c(NA, 
-29L))

For the most part, each column holds a string of terms separated by ";". However, in some cases there are terms like "Nucleoli fibrillar center" or "Lipid droplets", which contain spaces and should be counted as one term.

I've found examples of counting the number of words in a string in R where:

d <- "foo,bar,fun"
length(strsplit(d,",")[[1]])
class(d)

But this only works on the "character" class and not "data.frame".

Can anyone suggest how to do this in R? Many thanks!

  • In the example provided, the separation is by `;`. Also, you mentioned space separation, but there are words such as `Lipid droplets` which are counted as 1. It is not clear. – akrun Nov 19 '17 at 17:43
  • Thank you for noticing the typo. I edited my post to clear things up. I would like to count terms not words. `Lipid droplets` is a term which contains 2 _words_. I would like to count the terms separated by a semi-colon `;` – Matthew J. Oldach Nov 20 '17 at 09:39

2 Answers


We can use `str_count`. Loop over the columns except the first one (`lapply(df[-1], ...)`), get the count of `;` and add 1 to it, then check for cases where there is an empty string and replace those elements with `NA`.

library(stringr)
df[-1] <- lapply(df[-1], function(x) (str_count(x, ";") + 1) * NA^(as.character(x) == ""))
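
The `NA^(...)` factor may look odd at first. In R, `NA^0` is 1 and `NA^1` is `NA`, so multiplying by `NA^(x == "")` leaves non-empty cells unchanged and turns empty cells into `NA`. A minimal sketch of that step, assuming a toy vector `x` built from a few values in the question's data, and using base R to count the separators instead of `str_count`:

```r
# Toy vector for illustration; the values appear in the question's data
x <- c("", "Cytosol", "Cytosol;Nucleoli;Nucleus", "Lipid droplets")

# NA^FALSE is NA^0, which is 1; NA^TRUE is NA^1, which is NA
mask <- NA^(x == "")

# Count the ";" separators, then add 1 to get the number of terms
n_sep <- lengths(regmatches(x, gregexpr(";", x, fixed = TRUE)))
counts <- (n_sep + 1) * mask
counts
```

Because only `;` is counted, a multi-word term such as "Lipid droplets" still counts as a single term.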
akrun
  • I up-voted both answers but it doesn't show up because I don't have 15 reputation yet, and I didn't know yet about accepting answers, but I have chosen yours since it is the easiest to understand. Thanks again! – Matthew J. Oldach Nov 20 '17 at 15:26
  • @MatthewJ.Oldach Yes, you need 15 points. Thanks for posting the question to allow us to answer. – akrun Nov 20 '17 at 15:27

A solution using base:

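# apply() coerces each row to character, so the counts come back as strings,
# and empty cells are counted as 0 rather than NA.
result_df <- data.frame(t(apply(df, 1, function(x) {
    c(x[1], sapply(strsplit(as.character(x[-1]), ";"), length))
})), stringsAsFactors = FALSE)
names(result_df) <- c("Gene_symbol", "Validated", "Supported", "Approved", "Uncertain")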
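
The `apply()` approach above works on a character matrix, so it stores the counts as strings and gives 0 rather than `NA` for empty cells. If integer counts with `NA` are wanted, as in the question's desired output, a column-wise variant could look like the sketch below. It assumes a three-row stand-in for `df` copied from the question, and `count_terms` is a hypothetical helper name, not part of either answer.

```r
# Small stand-in for the question's data frame (three rows copied from it)
df <- data.frame(
  Gene_symbol = c("AAAS", "AAMP", "ABCB6"),
  Validated   = c("", "Cytosol", "Golgi apparatus"),
  Supported   = c("", "Microtubules",
                  "Cytosol;Mitochondria;Nucleoplasm;Plasma membrane"),
  Approved    = c("Centrosome;Cytosol;Nuclear membrane", "", ""),
  Uncertain   = c("", "", ""),
  stringsAsFactors = TRUE
)

# Hypothetical helper: number of ";"-separated terms, NA for empty cells
count_terms <- function(x) {
  x <- as.character(x)                          # factor -> character
  n <- lengths(strsplit(x, ";", fixed = TRUE))  # integer count per cell
  n[x == ""] <- NA_integer_                     # empty cell -> NA
  n
}

result_df <- df
result_df[-1] <- lapply(df[-1], count_terms)    # skip the Gene_symbol column
result_df
```

Working column by column with `lapply()` avoids the character coercion that `apply()` over rows introduces, so `Gene_symbol` keeps its type and the counts stay integers.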
tobiasegli_te