
I am new to text mining and R coding.

I have 200 gene names made up of mixed strings. I want to split each name and put the plain words (e.g., cadherins, orphan receptors) in one column, and the numbers (e.g., 2/3) and number+letter tokens (e.g., 7D, 7TM) in another column. I used strsplit to split the words. Any suggestions on how to parse them would be helpful.

Example:
Genes <- c("7D cadherins", "7TM orphan receptors", "7TM orphan receptors RNA18S", "28S ribosomal RNAs  RNA28S", "45S pre-ribosomal RNAs  RNA45S", "5.8S ribosomal RNAs", "Actin related protein 2/3 complex")

Expected result (2nd and 3rd columns):

7D cadherins                       cadherins                      7D
7TM orphan receptors               orphan receptors               7TM
18S ribosomal RNAs RNA18S          ribosomal RNAs RNA18S          18S RNA18S
28S ribosomal RNAs RNA28S          ribosomal RNAs RNA28S          28S RNA28S
45S pre-ribosomal RNAs RNA45S      pre-ribosomal RNAs             45S RNA45S
5.8S ribosomal RNAs                ribosomal RNAs                 5.8S
Actin related protein 2/3 complex  Actin related protein complex  2/3

1 Answer


You can use strsplit to split the names, grep to pick out the words with or without digits, and paste to collapse the words back into one string. Put everything in a function to avoid repetition:

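# invert = TRUE keeps the words without digits; invert = FALSE keeps the words with digits
wordS <- function(x, invert = TRUE) {
  clean <- gsub( '[[:space:]]+', ' ', x )  # collapse runs of whitespace into one space
  split <- strsplit( clean, ' ' )          # list with one character vector of words per name
  detec <- lapply( split, function(y) grep('[0-9]', y, invert = invert, value = TRUE) )
  words <- sapply( detec, paste, collapse = ' ' )  # glue the selected words back together
  return( words )
}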

data.frame(
  Gene = Genes,
  column2 = wordS(Genes),
  column3 = wordS(Genes, invert = FALSE)
)

                               Gene                       column2    column3
1                      7D cadherins                     cadherins         7D
2              7TM orphan receptors              orphan receptors        7TM
3       7TM orphan receptors RNA18S              orphan receptors 7TM RNA18S
4         28S ribosomal RNAs RNA28S                ribosomal RNAs 28S RNA28S
5     45S pre-ribosomal RNAs RNA45S            pre-ribosomal RNAs 45S RNA45S
6               5.8S ribosomal RNAs                ribosomal RNAs       5.8S
7 Actin related protein 2/3 complex Actin related protein complex        2/3
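
To see what each step of the function does, here is a minimal sketch that traces a single gene name through the same calls. The variable names (gene, tokens, with_digit, without_digit) are only illustrative and are not part of the function above.

```r
# Trace one gene name through the steps used in wordS().
gene <- "28S ribosomal RNAs  RNA28S"

# 1. Collapse runs of whitespace so strsplit() yields no empty tokens.
clean <- gsub("[[:space:]]+", " ", gene)

# 2. Split into words; strsplit() returns a list, so take its first element.
tokens <- strsplit(clean, " ")[[1]]
# tokens: "28S" "ribosomal" "RNAs" "RNA28S"

# 3. grep() with value = TRUE returns the matching tokens themselves;
#    invert = TRUE returns the tokens that do NOT contain a digit.
with_digit    <- grep("[0-9]", tokens, value = TRUE)
without_digit <- grep("[0-9]", tokens, value = TRUE, invert = TRUE)

# 4. Glue each group back into a single string.
col2 <- paste(without_digit, collapse = " ")  # "ribosomal RNAs"
col3 <- paste(with_digit, collapse = " ")     # "28S RNA28S"
```

Note that a name with no digit-containing word ends up as an empty string in column3, because grep returns character(0) and paste with collapse turns that into "".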