I need to copy numbers between two text strings in column, "transpose" it to columns next to the first text string

Question

I need to copy numbers between two text strings in column, "transpose" it to columns next to the first text string and repeat through the rest of the column. Dataset has hundreds of rows.

The "alignment" (like putting NAs for correcting positions of numbers) is not necessary, transposing would be more than enough.

input:

df1 <- structure(list(column1 = c("STOUT", "18", "9341", "4", "0,2005", 
"STOUT", "1", "9341", "25", "0,2004", "STIN", "7", "9341", "0,2003", 
"OFF", "7", "L(1)", "9342", "0,2005")), class = "data.frame", row.names = c(NA, 
-19L))

> print(df)
       column1
    1    STOUT
    2       18
    3     9341
    4        4
    5   0,2005
    6    STOUT
    7        1
    8     9341
    9       25
    10  0,2004
    11    STIN
    12       7
    13    9341
    14  0,2003
    15     OFF
    16       7
    17    L(1)
    18    9342
    19  0,2005

desired output:

df2 <- structure(list(column1 = c("STOUT", "STOUT", "STIN", "OFF", "L(1)"
), column2 = c(18L, 1L, 7L, 7L, NA), column3 = c(9341L, 9341L, 
9341L, NA, 9342L), column4 = c(4L, 25L, NA, NA, NA), column5 = c(0.2005, 
0.2004, 0.2003, NA, 0.2005)), class = "data.frame", row.names = c(NA, 
-5L))

> print(df2)
  column1 column2 column3 column4 column5
1   STOUT      18    9341       4  0.2005
2   STOUT       1    9341      25  0.2004
3    STIN       7    9341      NA  0.2003
4     OFF       7      NA      NA      NA
5    L(1)      NA    9342      NA  0.2005

I was thinking along Extracting a string between other two strings in R

But did not make much progress :-/

Thanks in advance.

the problem I see is that the NAs of the result set are required to be available in the start data set. One could think of distinguishing by number size though column2 and column4 are too similar for that (plus we do not know the whole dataset). It is possible to parse by sequence into columns using a field with at least one character as starting point of a line but the result in terms of column content will differ from your desired output. — DPH, Jun 16 '22 at 12:13

Skaqqs · Accepted Answer · 2022-06-16T12:55:15.973

Split on a regex-based logical:

(I defined the break points based on the presence of capital letters (i.e. [A-Z]); you may want to modify the pattern based on your expected break points in df1$column1.)

a <- split(df1$column1, cumsum(grepl('[A-Z]', df1$column1)))
a
$`1`
[1] "STOUT"  "18"     "9341"   "4"      "0,2005"

$`2`
[1] "STOUT"  "1"      "9341"   "25"     "0,2004"

$`3`
[1] "STIN"   "7"      "9341"   "0,2003"

$`4`
[1] "OFF" "7"  

$`5`
[1] "L(1)"   "9342"   "0,2005"

Then rbind() and fill with NA:

(plyr::rbind.fill() expects a dataframe, so I'm using lapply() to call as.data.frame() to each list element.)

library(plyr)

plyr::rbind.fill(lapply(a,function(y){as.data.frame(t(y),stringsAsFactors=FALSE)}))
     V1   V2     V3     V4     V5
1 STOUT   18   9341      4 0,2005
2 STOUT    1   9341     25 0,2004
3  STIN    7   9341 0,2003   <NA>
4   OFF    7   <NA>   <NA>   <NA>
5  L(1) 9342 0,2005   <NA>   <NA>

good idea of using `cumsum` + `grepl`, +1! – ThomasIsCoding Jun 16 '22 at 12:51 — ThomasIsCoding, Jun 16 '22 at 12:51
Again, works for my full dataset. Thanks a lot! – vladimir mazoch Jun 16 '22 at 13:31 — vladimir mazoch, Jun 16 '22 at 13:31

score 2 · Answer 2 · answered Jun 16 '22 at 12:49

Here is a base R option using read.csv + by + grepl to make it

df2 <- with(
    df1,
    read.csv(
        text = paste0(
            by(column1, cumsum(grepl("^\\D", column1)), toString),
            collapse = "\n"
        ),
        header = FALSE
    )
)

which gives

> df2
     V1   V2   V3   V4   V5   V6
1 STOUT   18 9341    4    0 2005
2 STOUT    1 9341   25    0 2004
3  STIN    7 9341    0 2003   NA
4   OFF    7   NA   NA   NA   NA
5  L(1) 9342    0 2005   NA   NA

> str(df2)
'data.frame':   5 obs. of  6 variables:
 $ V1: chr  "STOUT" "STOUT" "STIN" "OFF" ...
 $ V2: int  18 1 7 7 9342
 $ V3: int  9341 9341 9341 NA 0
 $ V4: int  4 25 0 NA 2005
 $ V5: int  0 0 2003 NA NA
 $ V6: int  2005 2004 NA NA NA

Works for my larger dataset. Thanks a lot. – vladimir mazoch Jun 16 '22 at 13:31 — vladimir mazoch, Jun 16 '22 at 13:31

score 1 · Answer 3 · answered Jun 16 '22 at 15:02

Another possible solution, based on tidyverse:

library(tidyverse)

df1 %>% 
  group_by(aux = cumsum(str_detect(column1, "[^\\d|,]"))) %>% 
  mutate(x = rep(first(column1), n())) %>%
  filter(!str_detect(column1, "[^\\d|,]")) %>%
  mutate(name = paste0("column", 2:(n()+1))) %>% 
  ungroup %>% 
  pivot_wider(c(aux, x), values_from = column1) %>% 
  select(-aux, column1 = x)

#> # A tibble: 5 × 5
#>   column1 column2 column3 column4 column5
#>   <chr>   <chr>   <chr>   <chr>   <chr>  
#> 1 STOUT   18      9341    4       0,2005 
#> 2 STOUT   1       9341    25      0,2004 
#> 3 STIN    7       9341    0,2003  <NA>   
#> 4 OFF     7       <NA>    <NA>    <NA>   
#> 5 L(1)    9342    0,2005  <NA>    <NA>

I need to copy numbers between two text strings in column, "transpose" it to columns next to the first text string

3 Answers3