Split string into single letters and remember position

Question

I have a dataset like this:

# test data
test.table <- data.frame(
  id = seq(1,3),
  sequence = c('HELLOTHISISASTRING','STRING|IS||18|LONG','SOMEOTHERSTRING!!!')
)

Each sequence has the same length (18). Now I want to create a table like this:

#id  position letter
#1   1        H
#1   2        E
#1   3        L
#.....etc

Although I know I can split the strings using strsplit, like so:

splitted <- strsplit(as.character(test.table$sequence), '')

I can't figure out how to convert the result to my preferred format.

Sure! Obviously my real data doesn't look like these meaningless strings. I basically have two datasets: one like this, containing ids and strings, and another in the format id, position, positional characteristics. I need to join both datasets to obtain a table like: id, position, letter, positional characteristic. Hopefully this makes sense without going into my dataset too much @TimBiegeleisen — CodeNoob, Sep 07 '18 at 08:34
`stack(setNames(strsplit(as.character(test.table$sequence), ""), test.table$id))`, [see also this Q&A](https://stackoverflow.com/q/13773770/2204410) — Jaap, Sep 07 '18 at 08:38
Do you need `l1 <- unlist(strsplit(as.character(test.table$sequence), '')); data.frame(position = seq_along(l1), letter = l1)` ? — Ronak Shah, Sep 07 '18 at 08:39
Just add rowid, e.g.: `res <- stack(setNames(strsplit(as.character(test.table$sequence), ""), test.table$id)); res$rowID <- 1:18` — zx8754, Sep 07 '18 at 08:46
@CodeNoob It was indeed a starter, posted a complete answer below. — Jaap, Sep 07 '18 at 08:47
Based on your older questions, this is a bio data, right? Could you explain in bio context what we are trying to achieve? — zx8754, Sep 07 '18 at 09:05

score 1 · Answer 1 · edited Sep 07 '18 at 17:10

You can use tidyverse tools:

test.table <- data.frame(
  id = seq(1,3),
  sequence = c('HELLOTHISISASTRING','STRING|IS||18|LONG','SOMEOTHERSTRING!!!')
)
library(tidyverse)

test.table %>%
  mutate(letters = str_split(sequence, "")) %>%
  unnest(letters) %>%
  group_by(id, sequence) %>%
  mutate(position = row_number())
#> # A tibble: 54 x 4
#> # Groups:   id, sequence [3]
#>       id sequence           letters position
#>    <int> <fct>              <chr>      <int>
#>  1     1 HELLOTHISISASTRING H              1
#>  2     1 HELLOTHISISASTRING E              2
#>  3     1 HELLOTHISISASTRING L              3
#>  4     1 HELLOTHISISASTRING L              4
#>  5     1 HELLOTHISISASTRING O              5
#>  6     1 HELLOTHISISASTRING T              6
#>  7     1 HELLOTHISISASTRING H              7
#>  8     1 HELLOTHISISASTRING I              8
#>  9     1 HELLOTHISISASTRING S              9
#> 10     1 HELLOTHISISASTRING I             10
#> # ... with 44 more rows

Created on 2018-09-07 by the reprex package (v0.2.0).

Jaap · Answer 2 · 2018-09-07T08:53:51.823

A base R solution:

df <- stack(setNames(strsplit(as.character(test.table$sequence), ""), test.table$id))[2:1]
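# Use an integer vector as the first argument so that pos stays numeric
# (passing the character column values would coerce the positions to character)
df$pos <- with(df, ave(seq_along(ind), ind, FUN = seq_along))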

which gives:

> df
   ind values pos
1    1      H   1
2    1      E   2
3    1      L   3
4    1      L   4
5    1      O   5
6    1      T   6
7    1      H   7
8    1      I   8
....

Or using data.table:

library(data.table)
setDT(test.table)

test.table[, .(letter = unlist(tstrsplit(sequence, "", fixed=TRUE))), id
           ][, pos := rowid(id)][]

which gives the same result:

    id letter pos
 1:  1      H   1
 2:  1      E   2
 3:  1      L   3
 4:  1      L   4
 5:  1      O   5
 6:  1      T   6
 7:  1      H   7
 8:  1      I   8
....
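
The asker's comment describes a second table in the format id, position, positional characteristics. A minimal base R sketch of that join could look like the following. The `features` table and its `characteristic` column are made up for illustration and do not appear in the question.

```r
test.table <- data.frame(
  id = 1:3,
  sequence = c("HELLOTHISISASTRING", "STRING|IS||18|LONG", "SOMEOTHERSTRING!!!"),
  stringsAsFactors = FALSE
)

# Long table: one row per id and position
chars <- strsplit(test.table$sequence, "")
long <- data.frame(
  id = rep(test.table$id, times = lengths(chars)),
  position = sequence(lengths(chars)),
  letter = unlist(chars),
  stringsAsFactors = FALSE
)

# Hypothetical second dataset (assumed structure, invented values)
features <- data.frame(
  id = c(1L, 1L, 2L),
  position = c(1L, 5L, 7L),
  characteristic = c("start", "hinge", "site"),
  stringsAsFactors = FALSE
)

# Left join: every letter is kept; positions without a match get NA
joined <- merge(long, features, by = c("id", "position"), all.x = TRUE)
joined <- joined[order(joined$id, joined$position), ]
```

With `all.x = TRUE` the result has one row per letter, so positions that have no entry in the second table are not dropped.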

Sotos · Accepted Answer · 2018-09-07T08:54:56.107

The splitstackshape package is handy for operations like this.

library(splitstackshape)

dt1 <- cSplit(test.table, 'sequence', sep = '', direction = 'long', stripWhite = FALSE)
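# Number the letters within each id (avoids relying on recycling a length-18 vector)
dt1$pos <- ave(seq_len(nrow(dt1)), dt1$id, FUN = seq_along)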

which gives,

    id sequence pos
 1:  1        H   1
 2:  1        E   2
 3:  1        L   3
 4:  1        L   4
 5:  1        O   5
 6:  1        T   6
 7:  1        H   7
 8:  1        I   8
 9:  1        S   9
10:  1        I  10
...
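
If the sequences do not all have the same length, the positions have to be derived per id. Here is a minimal base R sketch, assuming made-up example data (the `seqs` table is not from the question):

```r
# Assumed example data with sequences of different lengths
seqs <- data.frame(
  id = 1:3,
  sequence = c("HELLO", "AB", "XYZ!"),
  stringsAsFactors = FALSE
)

chars <- strsplit(seqs$sequence, "")  # one character vector per row
n <- lengths(chars)                   # number of characters per row

long <- data.frame(
  id = rep(seqs$id, times = n),       # repeat each id once per character
  position = sequence(n),             # restarts at 1 for every id
  letter = unlist(chars),
  stringsAsFactors = FALSE
)
```

`sequence(n)` concatenates `1:n[1]`, `1:n[2]`, and so on, so no string length needs to be hard-coded.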

Saurabh Chauhan · Answer 4 · 2018-09-07T08:45:06.290

Try this using the stringi package:

library(stringi)
data <- data.frame()
for (i in seq_len(nrow(test.table))) { # For each id
  # Split the sequence into single characters and store the intermediate
  # result with the columns id, position and letter
  s <- as.character(test.table$sequence[i])
  n <- stri_length(s)
  df <- data.frame(id = test.table$id[i],
                   position = seq_len(n),
                   letter = stri_sub(s, seq_len(n), length = 1),
                   stringsAsFactors = FALSE)
  data <- rbind(data, df) # Append each id's result to data
}

Output:

  id position letter
1  1        1      H
2  1        2      E
3  1        3      L
4  1        4      L
5  1        5      O
6  1        6      T

Vlad C. · Answer 5 · 2018-09-07T08:54:24.597

There are some good answers here already, but here is another way to do it using tidyverse.

test.table <- data.frame(
  id = seq(1,3),
  sequence = c('HELLOTHISISASTRING','STRING|IS||18|LONG','SOMEOTHERSTRING!!!')
)

library(tidyverse)
library(reshape2)

test.table %>% 
  separate(col=sequence, into=as.character(1:18), sep=1:17) %>% 
  melt('id', value.name = 'letter', variable.name='position') %>% 
  arrange(id, position)

In the above code, the separate function from tidyr separates the sequence column into 18 separate columns (naming them 1 to 18) and then those are melted into the letter and position columns.

edited Sep 07 '18 at 08:54

answered Sep 07 '18 at 08:51

It's from `reshape2`, which apparently is not loaded with `tidyverse` - I added `library(reshape2)`. Thank you! – Vlad C. Sep 07 '18 at 08:57
No need for reshape, tidyverse (tidyr package) has the `gather` function, use that instead. – zx8754 Sep 07 '18 at 08:58
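
Following that comment, the reshape2 dependency can be dropped. Here is a sketch using tidyr's `pivot_longer`, which superseded `gather` in later tidyr releases, so this assumes tidyr 1.0.0 or newer. Converting `position` to integer is an addition for convenience and was not part of the original answer.

```r
library(dplyr)
library(tidyr)

test.table <- data.frame(
  id = 1:3,
  sequence = c("HELLOTHISISASTRING", "STRING|IS||18|LONG", "SOMEOTHERSTRING!!!"),
  stringsAsFactors = FALSE
)

res <- test.table %>%
  # Cut each string after characters 1..17, giving 18 one-letter columns
  separate(col = sequence, into = as.character(1:18), sep = 1:17) %>%
  # Stack the 18 columns into position/letter pairs
  pivot_longer(cols = -id, names_to = "position", values_to = "letter") %>%
  # Column names are character, so convert them for numeric sorting
  mutate(position = as.integer(position)) %>%
  arrange(id, position)
```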

score 0 · Answer 6 · answered Sep 07 '18 at 09:01

This does not answer the question as asked, but judging by your comment, you might need this instead:

chartr("HES", "ZXY", test.table$sequence)
# [1] "ZXLLOTZIYIYAYTRING" "YTRING|IY||18|LONG" "YOMXOTZXRYTRING!!!"

Here every H is replaced with Z, every E with X, and every S with Y.

AndS. · Answer 7 · 2018-09-07T12:29:44.533

Here is another variation on a theme.

library(tidyverse)

test.table %>% 
  nest(-id) %>% 
  mutate(letters = map(data, ~str_split(.x$sequence,'') %>% unlist()),
         numbers = map(letters, ~seq_along(.x))) %>%
  unnest(letters, numbers)
#> # A tibble: 54 x 3
#>       id letters numbers
#>    <int> <chr>     <int>
#>  1     1 H             1
#>  2     1 E             2
#>  3     1 L             3
#>  4     1 L             4
#>  5     1 O             5
#>  6     1 T             6
#>  7     1 H             7
#>  8     1 I             8
#>  9     1 S             9
#> 10     1 I            10
#> # ... with 44 more rows

Or, slightly differently, to avoid two calls to map:

test.table %>% 
  nest(-id) %>% 
  mutate(newdata = map(data, ~tibble(
    letters = str_split(.x$sequence, "") %>% unlist(),
    numbers = 1:str_count(.x$sequence)))) %>%
  unnest(newdata)
#> # A tibble: 54 x 3
#>       id letters numbers
#>    <int> <chr>     <int>
#>  1     1 H             1
#>  2     1 E             2
#>  3     1 L             3
#>  4     1 L             4
#>  5     1 O             5
#>  6     1 T             6
#>  7     1 H             7
#>  8     1 I             8
#>  9     1 S             9
#> 10     1 I            10
#> # ... with 44 more rows

Created on 2018-09-07 by the reprex package (v0.2.0).
