Manipulate strings in R to produce specific output

Question

I have character vectors like this

sol=c("119","911","*","ab","ba","*","*","abcd","bcda","abcd","cdab","abcd","dabc","*","*","*","*")

I want to take a vector at a time and produce an output as below.

What is the quickest way to do this? Basically, I want to start a new line wherever there is a *. If there are consecutive *, I want only one new line. Each run of consecutive non-* elements should be printed on its own line, and if an element repeats within such a run, it should be printed only once.

119 911
ab ba
abcd bcda cdab dabc

I am thinking of writing a for loop and printing elements until I encounter a *. But I am not sure how to treat consecutive * so that they produce a single new line, or how to remove repeated elements from a run of consecutive non-* elements.

score 2 · Accepted Answer · edited May 23 '17 at 11:58


Here's an attempt, based on cumsum-ing the cases that match *:

lapply(split(sol[sol!="*"],cumsum(sol=="*")[sol!="*"]),unique)
#$`0`
#[1] "119" "911"
# 
#$`1`
#[1] "ab" "ba"
#
#$`3`
#[1] "abcd" "bcda" "cdab" "dabc"

You could then write this out to a text file using: R: Print list to a text file
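
As a minimal sketch of that last step, the list from the `split`/`unique` call can be collapsed into one string per group and printed with `writeLines`. The names `keep`, `groups` and `out_lines` are only illustrative and do not appear in the answer; the commented-out file name is likewise hypothetical.

```r
sol <- c("119","911","*","ab","ba","*","*","abcd","bcda","abcd",
         "cdab","abcd","dabc","*","*","*","*")

# Same grouping as in the answer: drop the "*" entries, group the rest
# by the running count of "*" seen so far, then de-duplicate each group.
keep   <- sol != "*"
groups <- lapply(split(sol[keep], cumsum(sol == "*")[keep]), unique)

# One space-separated string per group.
out_lines <- vapply(groups, paste, character(1), collapse = " ")

# Print one group per line.
writeLines(out_lines)

# To write to a file instead (hypothetical file name):
# writeLines(out_lines, "output.txt")
```

For the `sol` vector in the question this prints the three lines shown there.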

edited May 23 '17 at 11:58

Community


answered Feb 23 '15 at 02:26

thelatemail


Thanks. I added the line `writeLines(unlist(lapply(mylist, paste, collapse=" ")))` from the link that you gave, and I get what I am looking for. – user2543622 Feb 23 '15 at 02:36
Would it be possible to explain how the above line works? I understand the lapply and split parts. I am confused about the `cumsum(sol=="*")[sol!="*"]` part. I know that `(sol=="*")` returns a TRUE/FALSE vector based on the presence of the character * in sol, that `sol!="*"` does the exact opposite, and that cumsum is the cumulative sum function. But I am confused about how these things work together. What is the order in which each of them gets resolved? – user2543622 Aug 21 '15 at 18:39
@user2543622 - when you cumsum the vector matching * you get a counter that increases by 1 each time * is hit. That means in each group of non-* you have a different constant counter value. Because split is operating on a vector that removes the * values, this same selection needs to happen for the actual group variable too. Hence the subsetting of the cumsum part. Try breaking the code down and running each section to see how it works. – thelatemail Aug 21 '15 at 19:51
`cumsum(sol=="*")` returns `[1] 0 0 1 1 1 2 3 3 3 3 3 3 3 4 5 6 7 8 9 10 11`, while `solution!="*"` returns `[1] TRUE TRUE FALSE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE`. What does `cumsum(sol=="*")[sol!="*"]` do, and how? What is the order of execution? – user2543622 Aug 21 '15 at 20:53
@user2543622 - the order of execution is the `cumsum` counter first, which will put a unique counter against a group bordered by `*` characters. Then this is subset using `[sol!="*"]` to keep only the non-`*` strings. Compare `cbind(sol,cumsum(sol=="*"))` and `cbind(sol[sol!="*"],cumsum(sol=="*")[sol!="*"])` to get an idea how this is necessary for the `split` to work appropriately. Here is another question where I explained almost the exact same issue: http://stackoverflow.com/a/27933328/496803 – thelatemail Aug 21 '15 at 22:38
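
The order of evaluation described in these comments can be sketched step by step, using the 17-element `sol` from the question. The intermediate names `is_star`, `counter`, `group_id` and `values` are only illustrative and are not part of the answer.

```r
sol <- c("119","911","*","ab","ba","*","*","abcd","bcda","abcd",
         "cdab","abcd","dabc","*","*","*","*")

# Step 1: logical vector, TRUE where the element is "*".
is_star <- sol == "*"

# Step 2: running count of "*" seen so far; it is constant within
# each run of non-"*" elements.
counter <- cumsum(is_star)
print(counter)
# [1] 0 0 1 1 1 2 3 3 3 3 3 3 3 4 5 6 7

# Step 3: keep the counter only at the non-"*" positions, so that it
# has the same length as the values being split.
group_id <- counter[!is_star]
print(group_id)
# [1] 0 0 1 1 3 3 3 3 3 3

# Step 4: the values themselves, with the "*" entries removed.
values <- sol[!is_star]

# Step 5: split the values by their group id.
split(values, group_id)
```

Without the subsetting in step 3, the grouping vector would have 17 entries while `values` has only 10, so the two would no longer line up element by element.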

score 1 · Answer 2 · answered Feb 23 '15 at 01:23


You could try the following:

> print(gsub("(?:\\s*\\*)+\\s*", "\\\n", paste(sol, collapse=" ")))
[1] "119 911\nab ba\nabcd bcda abcd cdab abcd dabc\n"

answered Feb 23 '15 at 01:23

Avinash Raj


It is close. But how could I get a new line instead of \n? As shown in the output in my question, I want each set of text on a new line. Also, how could I get rid of the second and third abcd in "abcd bcda abcd cdab abcd dabc"? – user2543622 Feb 23 '15 at 02:20

@user2543622 - `\n` is a new line. – thelatemail Feb 23 '15 at 02:27
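
A small sketch of why the `\n` shows up literally: `print` displays a string with its escape sequences, whereas `cat` writes the characters themselves, so the newline actually breaks the line. The string `out` below is a shortened, hand-written example, not the full result of the `gsub` call.

```r
out <- "119 911\nab ba\n"

# print() shows the escaped representation of the string.
print(out)
# [1] "119 911\nab ba\n"

# cat() emits the raw characters, so each "\n" starts a new line.
cat(out)
# 119 911
# ab ba
```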
