2

today I have faced a problem that I couldn't solve on my own, despite searching for solutions - it appeared to me, that either my approach is wrong or noone before asked similar question.

I'm playing around with Markov attribution, so I've got columns with strings that look like that:

A > B > B > C > B > A > C > B > A

etc.

...it is created on base of postgresql function 'string_agg'.

What I think would be important for me is assigning a number of for which time each string appears in entire string. To make it clear, at the end of the day, it would look like this:

A1 > B1 > B2 > C1 > B3 > A2 > C2 > B4 > A3

There are three main challenges:

  • there are around 100 different types of elements to be counted that may change it time, so it makes it hard to hardcode it,
  • the dataset is around 200k rows,
  • strings may be up to few hundred characters long

The only thing that came up to my mind is to write some sort of loop, but it feels like it would take up ages until it finishes.

I also thought about solving it on postgresql level, but couldn't find efficient and easy solution to it neither.

Marcin
  • 137
  • 1
  • 10
  • Reproducible data would help us help you. – s_baldur Nov 13 '18 at 11:33
  • Are the `>` part of the string? – Rui Barradas Nov 13 '18 at 11:37
  • Unfortunately I cannot share company's data, but it basically looks like the one provided in example - if there is some public dataset that I can play around with I would be glad to provide you with some examples. And yes, " > " is part of string, however it can be changed into any character, i.e. spacebar. – Marcin Nov 13 '18 at 11:39

3 Answers3

2

Here is a rough example using data.table:

library(data.table)

# Example data:
data <- data.table(
  s = c("A > B > B > C > B > A > C > B > A",
        "A > B > B > C > B > A > C > B > C > D")
)

# Processing steps (can probably be shortened)
n <- strsplit(data[["s"]], " > ")
datal <- melt(n)
setDT(datal)
datal[, original_order := 1:.N
      ][, temp := paste0(value, 1:.N), by = .(L1, value)
        ][order(original_order), paste(temp, collapse = " > "), by = L1]


# Output:
   L1                                              V1
1:  1      A1 > B1 > B2 > C1 > B3 > A2 > C2 > B4 > A3
2:  2 A1 > B1 > B2 > C1 > B3 > A2 > C2 > B4 > C3 > D1
s_baldur
  • 29,441
  • 4
  • 36
  • 69
2

Here is a function that uses base R only.
Note that if you are using a diferent set of regex metacharacters, it should be easy to have a function argument metachar, defaulting to the one in the function body.

count_seq <- function(x, sep = ">"){
  metachar <- '. \ | ( ) [ { ^ $ * + ?'
  sep2 <- if(grepl(sep, metachar)) paste0("\\", sep) else sep
  y <- unlist(strsplit(x, sep2))
  y <- trimws(y)
  z <- ave(y, y, FUN = seq_along)
  paste(paste0(y, z), collapse = sep)
}

x <- "A > B > B > C > B > A > C > B > A"

count_seq(x)
#[1] "A1>B1>B2>C1>B3>A2>C2>B4>A3"
count_seq(x, sep = " > ")
#[1] "A1 > B1 > B2 > C1 > B3 > A2 > C2 > B4 > A3"

y <- "A | B | B | C | B | A | C | B | A"
count_seq(y, sep = "|")
#[1] "A1|B1|B2|C1|B3|A2|C2|B4|A3"
Rui Barradas
  • 70,273
  • 8
  • 34
  • 66
  • Nice. Finally you would do something like: `sapply(column, count_seq, " > ")`, right? – s_baldur Nov 13 '18 at 11:55
  • @snoram Yes, that's the idea. I thought of looping (`*apply`) in the function but I it might be better to leave it like this, more general, and let the user decide. – Rui Barradas Nov 13 '18 at 11:59
  • 2
    In a similar vein using @snoram's 's': `lapply(strsplit(s, " > "), function(x) paste0(x, ave(x, x, FUN = seq_along), collapse = " > "))` – Henrik Nov 13 '18 at 12:11
1

How to do this is described in the gsubfn vignette. Using the code there first we define a proto object pword with methods pre and fun. pre initializes the word list (which stores the current count for each word encountered) and fun updates it each time a new word is encountered and also suffixes the word with the count returning the suffixed word.

Having defined the foregoing, run gsubfn using pword. For each component of the input gsubfn will first run pre and then for each match of the regular expression \\w+ gsubfn will input the match to fun, run fun and replace the match with the output of fun.

We have assumed that the words to be suffixed with a count are matched by \w+ which is the case for the example in the question but if your actual data is different you may need to change the pattern.

library(gsubfn)
s <- rep("A > B > B > C > B > A > C > B > A", 3) # sample input

pwords <- proto(
  pre = function(this) { this$words <- list() },
  fun = function(this, x) {
    if (is.null(words[[x]])) this$words[[x]] <- 0
    this$words[[x]] <- this$words[[x]] + 1
    paste0(x, words[[x]])
  }
)

gsubfn("\\w+", pwords, s)

giving:

[1] "A1 > B1 > B2 > C1 > B3 > A2 > C2 > B4 > A3"
[2] "A1 > B1 > B2 > C1 > B3 > A2 > C2 > B4 > A3"
[3] "A1 > B1 > B2 > C1 > B3 > A2 > C2 > B4 > A3"
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
  • Thank you for your answer, among given solutions this one was best at solving my problem and provided me a tool that helped me in dealing with my task, in pair with great explanation. Now I'm only fighting with writing a regexp that would ignore punctuation marks, as sometimes my words contain dots and hyphens and then the repetition is counted in wrong manner, I didn't take that under consideration when providing you with my poor example. – Marcin Nov 13 '18 at 14:12
  • 1
    This will match anything other than space or >: `[^ >]+` . – G. Grothendieck Nov 13 '18 at 14:13