detecting common characters in a list of strings, at specific positions

Question

I have a very large dataset with many columns, exported from an application. The issue is that the file is "empty character" separated. Reading in the file with readLines yields a list of strings, each string with the same number of characters.

A possible way of determining the positions of the columns is checking whether each string, let s say at position 5, has an empty character. Therefore, can keep on searching starting from vector 1 until a non-empty character is found.

d <- data.frame("V1" = c(" f ggh", "aa hh", "a  qq" ), stringsAsFactors = 
F)


first.char <- function(col){
current <- 0
j <- 1
while(j <= length(d)){
tmp <- substr(d[j], col, col)
if(!grepl("^\\s*$", tmp)){
  current <- 1
  break}
j <- j+1
}
return(current)
}

row_dummies <- lapply( c(1:6), first.char) %>% unlist

This approach works but is very slow when scaling up (have a list of 1 million strings, each 1500 characters long). I also tried to transform each vector into a data.table and then use str split (Split text string in a data.table columns), but that seems to be even more inefficient, given that in most of the cases it is not necessary to check all rows.

Any suggestions or advice?

UPDATE: The example above is too trivial. This one is a bit better:

text <- c("df ggh a a h h a  qq",
          "       aa  hh  ab qt",
          " fggh   aa hh  a    ")

The desired output is

 list( c("df ggh", "a a", "h h", "a",   "qq"),
       c(NA,       "aa",  "hh",  "ab",  "qq"),
       c(" fggh",  "aa",  "hh",  "a",   NA)

 )

str_locate_all works well, as it indicates where to split the strings:

cuts_in <- sapply(text, function(x) x %>%  str_locate_all(. , "\\s") ) 
cuts_in <- lapply(cuts_in,  data.table) # to data.table    
cuts_in <- rbindlist(cuts_in)
cuts_in <- cuts_in[, .N, by=start] 
cuts_in[ N==3 ,"start"]

   start
   1:     7
   2:    11
   3:    15
   4:    18

However, probably not the most efficient way (have 15 files, one million rows per file, each row has 1500 characters). For instance, there is no need to check the character at position 1 in rows 2 and 3 given that row 1 position 1 is not a space. read_table2 does not seem to be a solution either:

read_table2(text, col_names = FALSE)

  X1    X2    X3    X4    X5    X6    X7    X8   
 <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 df    ggh   a     a     h     h     a     qq   
2 aa    hh    ab    qt    NA    NA    NA    NA   
3 fggh  aa    hh    a     NA    NA    NA    NA

With the example `read.table(text = d$V1, as.is = TRUE)` works. — G. Grothendieck, Apr 18 '19 at 16:54
yes, my example turns out to be too simple. One possible difference might be that I cant split at every empty character. Using read.table I get the following: Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 2 did not have 79 elements — desval, Apr 18 '19 at 17:12
I think I'm a bit confused about the desired output; having the same number of characters per line doesn't mean you will have the same number of columns because of the different numbers of spaces. What do you want instead of what `read_table2` is giving you? — Calum You, Apr 18 '19 at 19:23
I don't know the number of columns ex-ante. I need to find that out, and also find out where they are. Because of how the data is exported, if a cell (in the database) is an NA, it is printed as " " in my txt file. This implies that I can get an idea about the columns by looking at " " column-wise. In the example above, the column split candidates are therefore at position 7, 11, 15, 18. — desval, Apr 18 '19 at 19:34

Calum You · Accepted Answer · 2019-04-19T01:13:23.617

The situation you actually find yourself in is that you need to read a fixed width file and don't know where the columns are, which I didn't understand previously. You can try using readr::read_fwf for this. fwf_empty will look at some lines, by default 100, and try to figure out where the intersecting columns are. Depending on how many columns you expect to have in 1500 characters, you might need to increase n to get the right output.

library(tidyverse)
text <- c("df ggh a a h h a  qq",
          "       aa  hh  ab qt",
          " fggh   aa hh  a    ")
read_fwf(text, fwf_empty(text, n = 100))
#> # A tibble: 3 x 5
#>   X1     X2    X3    X4    X5   
#>   <chr>  <chr> <chr> <chr> <chr>
#> 1 df ggh a a   h h   a     qq   
#> 2 <NA>   aa    hh    ab    qt   
#> 3 fggh   aa    hh    a     <NA>

Alternatively, if you are already using str_locate_all and want to look at all the lines, you can convert the resulting locations into widths to use with fwf_widths by adding the start and end points and taking the differences. Note that you do not need to use sapply with str_locate_all, it's already vectorised. This is likely to be slower because it checks every row, and if you are not getting the correct output I would try increasing n first.

locations <- text %>%
  str_locate_all("\\s") %>%
  map(~.[, 1]) %>%
  reduce(intersect)

widths <- c(1, locations, str_length(text[1])) %>% diff()

read_fwf(text, fwf_widths(widths))
#> # A tibble: 3 x 5
#>   X1     X2    X3    X4    X5   
#>   <chr>  <chr> <chr> <chr> <chr>
#> 1 df ggh a a   h h   a     q    
#> 2 <NA>   aa    hh    ab    q    
#> 3 fggh   aa    hh    a     <NA>

^{Created on 2019-04-18 by the reprex package (v0.2.1)}

Updated now that I understand the question better – Calum You Apr 19 '19 at 01:13 — Calum You, Apr 19 '19 at 01:13

Ben G · Answer 2 · 2019-04-18T18:11:02.330

1

How about str_locate_all from stringr:

library(stringr)

d <- data.frame("V1" = c(" f ggh", "aa hh", "a  qq" ), stringsAsFactors = 
F)

str_locate_all(d$V1, "\\s")


[[1]]
     start end
[1,]     1   1
[2,]     3   3

[[2]]
     start end
[1,]     3   3

[[3]]
     start end
[1,]     2   2
[2,]     3   3

But if you're trying to split it into different columns you can use a combo of dplyr and tidyr to do it all at once.

library(tidyverse)

d %>%
 mutate(V1 = str_trim(V1, side = "both")) %>%
 separate(V1, c("string_1", "string_2"), sep = "\\s+")

  string_1 string_2
1        f      ggh
2       aa       hh
3        a       qq

edited Apr 18 '19 at 18:11

answered Apr 18 '19 at 17:10

Ben G

4,148
2
22
42

yes, works, then can split based on positions retrieved from intersecting the results. thanks – desval Apr 18 '19 at 17:19
Updated the answer so that you split it as well with someone different functions. Feel free to mark as answered so the question is closed ;) – Ben G Apr 18 '19 at 18:11

detecting common characters in a list of strings, at specific positions

2 Answers2