
I've looked for answers, but have only found ones referring to C or C#. I realise that much of R is written in C, but my knowledge of C is non-existent. I am also relatively new to R. I am using the current RStudio.

I think this question is similar to what I want: "Read the data efficiently with multiple separating lines in R".

I have a CSV file, but one variable is a string with values separated by _ and -. I would like to know whether there is a package or extra code that does the following as part of the read command.

"1","Client1","Name2","*Name3_Name1_KB_MobApp_M-13-44_AU_PI Likes by KB_ANDROID","2013-08-31 13:39:55.0","2013-10-16 13:58:00.0",0,218,4,93,1377907200000
"2","Client1","Name2","*Name3_Name1_KB_MobApp_M-13-44_AU_PI Likes by KB_ANDROID","2013-08-31 13:39:55.0","2013-10-16 13:58:00.0",0,390,5,157,1377993600000
"3","Client1","Name2","*Name3_Name1_KB_MobApp_M-13-44_AU_PI Likes by KB_ANDROID","2013-08-31 13:39:55.0","2013-10-16 13:58:00.0",0,376,5,193,1.37808e+12
"4","Client1","Name2","*Name3_Name1_KB_MobApp_M-13-44_AU_PI Likes by KB_ANDROID","2013-08-31 13:39:55.0","2013-10-16 13:58:00.0",1,35,1,15,1377907200000
"5","Client1","Name2","*Name3_Name1_KB_MobApp_M-13-44_AU_PI Likes by KB_ANDROID","2013-08-31 13:39:55.0","2013-10-16 13:58:00.0",12,11258,117,2843,1377993600000
"6","Client1","Name2","*Name3_Name1_KB_MobApp_M-13-44_AU_PI Likes by KB_ANDROID","2013-08-31 13:39:55.0","2013-10-16 13:58:00.0",5,4659,56,1826,1.37808e+12
"7","Client1","Name2","*Name3_Name1_KB_MobApp_M-13-44_AU_PI Likes by KB_ANDROID","2013-08-31 13:39:55.0","2013-10-16 13:58:00.0",7,7296,136,2684,1377907200000
"8","Client1","Name2","*Name3_Name1_KB_MobApp_M-13-44_AU_PI Likes by KB_IOS_IPAD","2013-08-31 13:18:21.0","2013-10-16 13:58:00.0",0,4533,35,1632,1377907200000
"9","Client1","Name2","*Name3_Name1_KB_MobApp_M-13-44_AU_PI Likes by KB_IOS_IPAD","2013-08-31 13:18:21.0","2013-10-16 13:58:00.0",0,421,6,161,1377993600000
"10","Client1","Name2","*Name3_Name1_KB_MobApp_M-13-44_AU_PI Likes by KB_IOS_IPAD","2013-08-31 13:18:21.0","2013-10-16 13:58:00.0",0,57,2,23,1.37808e+12

Example row:

Name    Name1   *XYZ_Name3_KB_MobApp_M-18-25_AU_PI ANDROID  2013-09-32 14:39:55.0   2013-10-16 13:58:00.0   0   218 4   93  1377907200000

So it's easy enough to read in

results <- read.delim("~/results", header = FALSE)

but then I still have the string *XYZ_Name3_KB_MobApp_M-18-25_AU_PI

Desired output (split by _ and by -):

Name    Name1   *XYZ   Name3  KB   MobApp   M 18 25  AU  PI ANDROID 2013-09-32 14:39:55.0   2013-10-16 13:58:00.0   0   218 4   93  1377907200000

The time strings, however, should not be split up.
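A minimal base-R sketch of that requirement, assuming the file has already been read into a data frame: only the concatenated column is split on `_` or `-`, so the timestamp column passes through unchanged. The data frame below, its column names, and the timestamp values are hypothetical stand-ins modelled on the example row, not the real data.

```r
# Hypothetical two-row data frame modelled on the example row above
results <- data.frame(
  V1 = c("Name", "Name"),
  V2 = c("Name1", "Name2"),
  V3 = c("*XYZ_Name3_KB_MobApp_M-18-25_AU_PI",
         "*CCC_Name3_KB_MobApp_M-18-25_AU_PI"),
  V4 = c("ANDROID", "ANDROID"),
  V5 = c("2013-08-31 13:39:55.0", "2013-08-31 13:39:55.0"),
  stringsAsFactors = FALSE
)

# Split only V3 on either "_" or "-"; this works as a matrix because
# every row here yields the same number of pieces
parts <- do.call(rbind, strsplit(results$V3, "[_-]"))
colnames(parts) <- paste0("V3_", seq_len(ncol(parts)))

# Reassemble: leading columns, the split pieces, then the untouched columns
split_results <- cbind(results[c("V1", "V2")],
                       as.data.frame(parts, stringsAsFactors = FALSE),
                       results[c("V4", "V5")])
```

Because the split is applied to `V3` alone, the `-` characters inside the timestamps in `V5` are never seen by `strsplit`.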

---- Thanks @Henrik and @AnandaMahto for the code and package. ----

library(splitstackshape)

# split concatenated column by `_`
df4 <- concat.split(data = df3, split.col = "V3", sep = "_", drop = TRUE)

# split the remaining concatenated part by `-`
df5 <- concat.split(data = df4, split.col = "V3_5", sep = "-", drop = TRUE)
CArnold
  • I have the option of exporting to CSV again, opening it in Excel and using Text to Columns twice, but as I'm on Excel 2010 that is limited in the number of rows. – CArnold Nov 19 '13 at 15:16
  • Have a look at `str_split` or `stringr::str_split_fixed` and see if that helps. – TheComeOnMan Nov 19 '13 at 15:20
  • Ah, so simple. Do you think I should do it in multiple steps then, rather than on import? – CArnold Nov 19 '13 at 15:38
  • I'd do it right after import. I'll post snippet below. – hrbrmstr Nov 19 '13 at 15:48
  • You can specify more than one split character in strsplit using a regex and the | operator, e.g. strsplit("*XYZ_Name3_KB_MobApp_M-18-25_AU_PI ANDROID",split="\\_|\\-") – ndr Nov 19 '13 at 15:57
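
As a small illustration of the last comment, a bracketed character class is another way to write the same kind of pattern in base R. This is only a sketch; the string is the example from the question, and the variable names are made up for the illustration.

```r
x <- "*XYZ_Name3_KB_MobApp_M-18-25_AU_PI ANDROID"

# Alternation: split on "_" or on "-"
by_alternation <- strsplit(x, split = "_|-")[[1]]

# Character class: "-" is placed last so it is read literally, not as a range
by_class <- strsplit(x, split = "[_-]")[[1]]

# Both patterns give the same pieces
identical(by_alternation, by_class)
```

Note that the space in "PI ANDROID" is not a split character here, so that piece stays together.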

3 Answers

5

I find the functions in package splitstackshape convenient in cases like this.

library(splitstackshape)

# split concatenated column by `_`
results2 <- concat.split(data = results, split.col = "V3", sep = "_", drop = TRUE)

# split the remaining concatenated part by `-`
results3 <- concat.split(data = results2, split.col = "V3_5", sep = "-", drop = TRUE)
results3
Henrik
  • I'm getting an "Error in FUN(NA_integer_[[1L]], ...) : argument must be coercible to non-negative integer", but thanks for the package. I'll have a look into making it work. – CArnold Nov 19 '13 at 15:56
  • OK. Possibly there are some characteristics of your original data which are not represented in the small sample in your question (which works fine, for me). Cheers. – Henrik Nov 19 '13 at 16:01
  • @ChristianArnold, as the package's author, I'd be interested in seeing some actual data that creates this error and the steps to reproduce it. Feel free to do so by [creating an issue at the package's Github issue tracker](https://github.com/mrdwab/splitstackshape/issues?state=open). Thanks! – A5C1D2H2I1M1N2O1R2T1 Nov 19 '13 at 16:04
3
library(stringr)

results <- read.delim("~/results", header = FALSE)
results <- cbind(results,str_split_fixed(results$V3, "[_-]", 9))

(This assumes you're OK with keeping the original column in place.)

hrbrmstr
2

Try this:

# dummy data
df <- read.table(text="
Name    Name1   *XYZ_Name3_KB_MobApp_M-18-25_AU_PI ANDROID  2013-09-32 14:39:55.0   2013-10-16 13:58:00.0   0   218 4   93  1377907200000
Name    Name2   *CCC_Name3_KB_MobApp_M-18-25_AU_PI ANDROID  2013-09-32 14:39:55.0   2013-10-16 13:58:00.0   0   218 4   93  1377907200000
", as.is = TRUE)

# replace "_" with "-" so that a single separator remains
df_V3 <- gsub(pattern="_", replacement="-", df$V3, fixed = TRUE)

# strsplit, make dataframe
df_V3 <- do.call(rbind.data.frame, strsplit(df_V3, split = "-"))

# output, merge columns
output <- cbind(df[, c(1:2)],
                df_V3,
                df[, c(4:ncol(df))])

Building on the comments below, here is another related option, but one which uses read.table instead of strsplit.

splitCol <- "V3"
temp <- read.table(text = gsub("-", "_", df[, splitCol]), sep = "_")
names(temp) <- paste(splitCol, seq_along(temp), sep = "_")
cbind(df[setdiff(names(df), splitCol)], temp)
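
One caveat worth sketching, as an assumption drawn from the sample rows in the question rather than anything confirmed about the real file: the strings ending in `KB_ANDROID` and `KB_IOS_IPAD` yield different numbers of pieces. As far as I know, `read.table` stops with an error on such input unless `fill = TRUE` is set. The vector `x` below is a hypothetical two-element excerpt of that column.

```r
# Two values copied from the sample rows; they split into 10 and 11 pieces
x <- c("*Name3_Name1_KB_MobApp_M-13-44_AU_PI Likes by KB_ANDROID",
       "*Name3_Name1_KB_MobApp_M-13-44_AU_PI Likes by KB_IOS_IPAD")

# fill = TRUE pads the shorter row instead of raising an error
temp <- read.table(text = gsub("-", "_", x, fixed = TRUE),
                   sep = "_", fill = TRUE, stringsAsFactors = FALSE)

ncol(temp)
```

According to the `read.table` documentation, the number of columns is determined from the first five lines of input, so with real data it may be safer to check the maximum number of pieces first (for example with `count.fields`) and pass `col.names` explicitly.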
zx8754
  • @zx8754, two ideas: (1) If you're going to use the `strsplit` approach, use a regular expression and skip the `gsub` step, and maybe just use `do.call(rbind, ...)` since (I *think*) `rbind.data.frame` is slower (and it gives you funky names). (2) If you're going to use the `gsub` approach, forget about `strsplit` and use `read.table(text = df_V3, sep = "-")`. – A5C1D2H2I1M1N2O1R2T1 Nov 19 '13 at 15:57
  • But +1 for an answer that should at least point the OP in the right direction ;-) – A5C1D2H2I1M1N2O1R2T1 Nov 19 '13 at 15:59
  • I would upvote if I had enough reputation points. But sadly not yet. – CArnold Nov 19 '13 at 16:05
  • @ChristianArnold, Edit your question with some reproducible data and some examples of what you've tried, and people are sure to give you more up-votes on your question, which in turn will let you vote on answers ;-) – A5C1D2H2I1M1N2O1R2T1 Nov 19 '13 at 16:07
  • @AnandaMahto agree, code is *a bit* messy, intention was to direct the OP in the right direction, feel free to edit. – zx8754 Nov 19 '13 at 16:08
  • @AnandaMahto, I've put in 10 lines of data, but will that help? Also, I retried with your package and it works well now! :) – CArnold Nov 19 '13 at 16:57