0

I want to be able to extract specific characters from a character vector in a data frame and return a new data frame. The information I want to extract is auditors remark on a specific company's income and balance sheet. My problem is that the auditors remarks are stored in vectors containing the different remarks. For instance:

vec = c("A C G H D E"). Since "A" %in% vec won't return TRUE, I have to use strsplit to break up each character vector in the data frame, hence "A" %in% unlist(strsplit(dat[i, 2], " "). This returns TRUE.

Here is a MWE:

dat <- data.frame(orgnr = c(1, 2, 3, 4), rat = as.character(c("A B C")))
dat$rat <- as.character(dat$rat)
dat[2, 2] <- as.character(c("A F H L H"))
dat[3, 2] <- as.character(c("H X L O"))
dat[4, 2] <- as.character(c("X Y Z A B C"))

Now, to extract information about every single letter in the rat coloumn, I've tried several approaches, following similar problems such as Roland's answer to a similar question (How to split a character vector into data frame?)

DF <- data.frame(do.call(rbind, strsplit(dat$rat, " ", fixed = TRUE)))
DF
   X1 X2 X3 X4 X5 X6
1  A  B  C  A  B  C
2  A  F  H  L  H  A
3  H  X  L  O  H  X
4  X  Y  Z  A  B  C

This returnsthe following error message: Warning message: In (function (..., deparse.level = 1) : number of columns of result is not a multiple of vector length (arg 2)

It would be a desirable approach since it's fast, but I can't use DF since it recycles. Is there a way to insert NA instead of the recycling because of the different length of the vectors?

So far I've found a solution to the problem by using for-loops in combination with ifelse-statements. However, with 3 mill obs. this approach takes years!

dat$A <- 0

for(i in seq(1, nrow(dat), 1)) {
  print(i)
  dat[i, 3] <- ifelse("A" %in% unlist(strsplit(dat[i, 2], " ")), 1, 0)
}

dat$B <- 0

for(i in seq(1, nrow(dat), 1)) {
  print(i)
  dat[i, 4] <- ifelse("B" %in% unlist(strsplit(dat[i, 2], " ")), 1, 0)
}

This gives the results I want:

dat
  orgnr         rat A B
1     1       A B C 1 1
2     2   A F H L H 1 0
3     3     H X L O 0 0
4     4 X Y Z A B C 1 1

I've searched through most of the relevant questions I could find here on StackOverflow. This one is really close to my problem: How to convert a list consisting of vector of different lengths to a usable data frame in R?, but I don't know how to implement strsplit with that approach.

cs95
  • 379,657
  • 97
  • 704
  • 746
user6149765
  • 15
  • 1
  • 4

2 Answers2

1

We can use for-loop with grepl to achieve this task. + 0 is to convert the column form TRUE or FALSE to 1 or 0

for (col in c("A", "B")){
  dat[[col]] <- grepl(col, dat$rat) + 0
}
dat
#   orgnr         rat A B
# 1     1       A B C 1 1
# 2     2   A F H L H 1 0
# 3     3     H X L O 0 0
# 4     4 X Y Z A B C 1 1

If performance is an issue, try this data.table approach.

library(data.table)

# Convert to data.table
setDT(dat)

# Create a helper function
dummy_fun <- function(col, vec){
  grepl(col, vec) + 0
}

# Apply the function to A and B
dat[, c("A", "B") := lapply(c("A", "B"), dummy_fun, vec = rat)] 
dat
#    orgnr         rat A B
# 1:     1       A B C 1 1
# 2:     2   A F H L H 1 0
# 3:     3     H X L O 0 0
# 4:     4 X Y Z A B C 1 1
www
  • 38,575
  • 12
  • 48
  • 84
  • Thanks for your fast respons, @WWW! When I'm trying to run your solution using the data.table approach, I get the following warning message: Warning messages: 1: In grepl(col, vec) : argument 'pattern' has length > 1 and only the first element will be used Instead of using c("A", "B") I use a vector with all the different letters there are in (19 different letters). Do you know how to overcome this problem? (I also get the error message that the size is too big, "cannot allocate vector", but that is expected. I will try on a more powerfull computer later). – user6149765 Mar 20 '18 at 15:19
  • Not sure why you got the first error message. If you apply the `dummy_fun` to each string on by one using `lapply`, you should have only one pattern at a time. – www Mar 20 '18 at 18:40
  • 1
    Somehow your approach didn't work when I applied a stored vector of letters (vec). I had to put in all the letters the way you did in your example. Also, using the data table approach was extremely fast on over 3 mill observation. Thanks a lot! – user6149765 Mar 21 '18 at 09:35
0

using Base R:

a=strsplit(dat$rat," ")
b=data.frame(x=rep(dat$orgnr,lengths(a)),y=unlist(a),z=1)
cbind(dat,as.data.frame.matrix(xtabs(z~x+y,b)))
  orgnr         rat A B C F H L O X Y Z
1     1       A B C 1 1 1 0 0 0 0 0 0 0
2     2   A F H L H 1 0 0 1 2 1 0 0 0 0
3     3     H X L O 0 0 0 0 1 1 1 1 0 0
4     4 X Y Z A B C 1 1 1 0 0 0 0 1 1 1

From here you can Just call those columns that you want:

d=as.data.frame.matrix(xtabs(z~x+y,b))
 cbind(dat,d[c("A","B")])
  orgnr         rat A B
1     1       A B C 1 1
2     2   A F H L H 1 0
3     3     H X L O 0 0
4     4 X Y Z A B C 1 1
Onyambu
  • 67,392
  • 3
  • 24
  • 53