-3

I have a dataframe with string columns. Each value in these columns has the format "xyz:x-dffh, dddd and stgL-fhgdf,".

I need to split each value at the word "and"; the rest of the string should stay as is.

The input is a dataframe with 2 such columns. In the output, each input column should become multiple output columns.

Is this doable in R? In Excel I would use Text to Columns.

  • 3
    Welcome to SO. Please provide a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) as it makes answering your question a lot easier. – geotheory Jul 26 '13 at 08:45
  • 1
    You want to use `strsplit`. More detailed answers will require you to supply `dput(head(input))` where `input` is your dataframe. – Thomas Jul 26 '13 at 08:49

4 Answers

2

If 'df' is your dataframe, you can try creating two new columns from the original column you want to split, adapting the following code to your data:

df$newColumn1 <- lapply(strsplit(as.character(df$originalColumn), "and"), "[", 1)
df$newColumn2 <- lapply(strsplit(as.character(df$originalColumn), "and"), "[", 2)
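
A possible variant, as a sketch only: using `sapply` instead of `lapply` yields plain character columns rather than list columns. The pattern `"\\s+and\\s+"` and the sample data are assumptions for illustration, not part of the original code; the pattern only matches "and" surrounded by whitespace, so a word such as "sand" is left intact.

```r
# Hypothetical example data; the column name follows the code above
df <- data.frame(originalColumn = c("dog and cat", "sand and raptors"),
                 stringsAsFactors = FALSE)

# Split on "and" only when it is surrounded by whitespace
parts <- strsplit(as.character(df$originalColumn), "\\s+and\\s+")

# sapply() simplifies the result to a character vector,
# so each new column is an ordinary atomic column
df$newColumn1 <- sapply(parts, "[", 1)
df$newColumn2 <- sapply(parts, "[", 2)
```

Rows without a second piece get `NA` in `newColumn2`, because indexing past the end of a vector returns `NA`.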
bmartinez
  • 1
    I don't think it is a good idea to assign a list to a data.frame column. – Roland Jul 26 '13 at 09:36
  • @Roland, just curious--why not? I agree it's not the most convenient data format to work with, but some of base R's functions do so in common operations (like `aggregate`, on occasion). – A5C1D2H2I1M1N2O1R2T1 Jul 26 '13 at 11:04
  • The main reason is that it leads to an uncommon data structure, which can make code confusing. – Roland Jul 26 '13 at 11:09
1

You could try the following in base R (similar to bmartinez's answer, without assigning a list to the dataframe):

df <- data.frame(originalColumn = c("dog and cat", "robots and raptors"))

do.call(rbind.data.frame, strsplit(as.character(df$originalColumn), "and"))

## > do.call(rbind.data.frame, strsplit(as.character(df$originalColumn), "and"))
##   c..dog.....robots... c...cat.....raptors..
## 1                 dog                    cat
## 2              robots                raptors

Or using the qdap package:

library(qdap)
colsplit2df(df, sep = "and")


## > colsplit2df(df, sep = "and")
##        X1       X2
## 1    dog       cat
## 2 robots   raptors
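
Both outputs above keep the spaces around each piece, and the base R version ends up with awkward column names. A minimal base R sketch that addresses both, assuming every row splits into the same number of pieces; the names `pieces`, `out`, and `part1`/`part2` are made up for illustration:

```r
df <- data.frame(originalColumn = c("dog and cat", "robots and raptors"))

# Split on the literal string "and", then strip whitespace from each piece
pieces <- strsplit(as.character(df$originalColumn), "and", fixed = TRUE)
pieces <- lapply(pieces, trimws)

# Bind the pieces row-wise into a character matrix and convert to a data frame
out <- as.data.frame(do.call(rbind, pieces), stringsAsFactors = FALSE)

# Replace the default names with readable ones
names(out) <- paste0("part", seq_len(ncol(out)))
out
```

`trimws` requires R 3.2.0 or later; on older versions a `gsub`-based helper would be needed instead.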
Tyler Rinker
0

Here is what worked for me, using inputs from the answers above and various other threads on SO. I am a complete newbie to R, and my objective is to migrate work from Excel to R.

# returns string w/o leading or trailing whitespace
trim <- function(x) gsub("^\\s+|\\s+$", "", x)

#--------------------------------------------------------------------------------
# OBJECTIVE - migrate this activity from Excel + VBA to R
#
# Split each string and find the maximum number of pieces. Each element of the
# split result is a character vector of variable length; the goal is to turn it
# into individual columns, with number of columns = maximum number of pieces.
# Rows with fewer pieces get NA in the additional columns.
#--------------------------------------------------------------------------------

# as.character() guards against PREV being a factor
temp_split <- strsplit(as.character(src.df$PREV), "and")
max_col <- max(sapply(temp_split, length))

# pad one split result with NA up to max_col entries
# (a real missing value rather than the string "NA")
add_list <- function(x, max_col) {
  u_l <- unlist(x)
  pad_col <- max_col - length(u_l)
  c(u_l, rep(NA_character_, pad_col))
}

test <- lapply(temp_split, add_list, max_col)
# one row per input row, one column per piece
test_matrix <- data.frame(matrix(unlist(test), nrow = NROW(src.df), byrow = TRUE))

t.df <- test_matrix
c.df <- cbind(src.df, t.df)
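
The padding step can also be written more compactly in base R. The sketch below is one possible way, not the only one: `` `length<-` `` extends a shorter vector with `NA`. The helper name `split_to_cols`, the `prefix` argument, the second column `NEXT`, and the sample data are all made up for illustration.

```r
# Hypothetical helper: split one character column into NA-padded columns
split_to_cols <- function(x, pattern = "and", prefix = "X") {
  # split on the literal pattern and strip whitespace from each piece
  pieces <- lapply(strsplit(as.character(x), pattern, fixed = TRUE), trimws)
  max_col <- max(lengths(pieces))
  # `length<-` pads shorter vectors with NA up to max_col entries
  padded <- lapply(pieces, `length<-`, max_col)
  out <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
  names(out) <- paste0(prefix, seq_len(max_col))
  out
}

# Made-up data with two columns to split, as described in the question
src.df <- data.frame(PREV = c("a and b and c", "d and e"),
                     NEXT = c("x and y", "z"),
                     stringsAsFactors = FALSE)

# Each input column contributes as many output columns as its longest row needs
c.df <- cbind(src.df,
              split_to_cols(src.df$PREV, prefix = "PREV_"),
              split_to_cols(src.df$NEXT, prefix = "NEXT_"))
```

Because the split uses a literal "and", a value containing "and" inside a longer word would also be split; a whitespace-delimited regular expression would be a safer choice for such data.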
0

This is a slight modification of the excellent answer provided by Tyler Rinker, to solve a nearly identical problem. What if you wanted to separate the df into columns based on a space (similar to Text to Columns in Excel)?

Try this:
df <- data.frame(originalColumn = c("dog and cat", "robots and raptors"))
dfSpace <- do.call(rbind.data.frame, strsplit(as.character(df[, 1]), " "))
dfSpace

Make sure you add a space between the quotation marks.

feldhauj
  • that is indeed nearly identical... and not an answer to the question... and not valid because you've put `dfSpace` randomly at the end of the 2nd line... and not well formatted... – Hack-R Feb 12 '16 at 01:51