1

Given an existing dataframe containing a character column such as that shown below (oldColumn1), I want to have R automatically create a new column, in the same data frame, named from the left part of the string (e.g. COLOR).

Then for each row put the right part of the string contents appearing after the ": " (e.g. RED, BLUE, ETC) into the new column named "COLOR".

There are many old columns (oldColumn1, oldColumn2, etc) that need to be split out like this so doing this manually is impractical. Thanks in advance for any help you might provide.

# Here is an example of 3 oldColumns that already exist in dataframe.
# There are thousands of these columns, need to auto create a new
# column for each one as described.
# Maybe hoping to have the oldColumn names in a vector, to then pass
# to a function that creates a new column for each oldColumn. 

oldColumn1 <- c('COLOR: RED', 'COLOR: RED', 'COLOR: BLUE', 'COLOR: GREEN', 'COLOR: BLUE')
oldColumn2 <- c('SIZE: LARGE', 'SIZE: MEDIUM','SIZE: XLARGE','SIZE: MEDIUM','SIZE: SMALL')
oldColumn3 <- c('DESIGNSTYLE: STYLED', 'DESIGNSTYLE: ORIGINAL MAKER', 'DESIGNSTYLE: COUTURE','DESIGNSTYLE: COUTURE','DESIGNSTYLE: STYLED')
COLOR <- c('RED', 'RED', 'BLUE', 'GREEN', 'BLUE')
SIZE <- c('LARGE', 'MEDIUM', 'XLARGE', 'MEDIUM', 'SMALL')
DESIGNSTYLE <- c('STYLED', 'ORIGINAL MAKER', 'COUTURE', 'COUTURE', 'STYLED')
dat <- data.frame(oldColumn1, oldColumn2, oldColumn3, COLOR, SIZE, DESIGNSTYLE)
dat
seehorse
  • 53
  • 5

2 Answers2

2

You can create a new column by using $ and then use gsub() to remove COLOR: from the targeted column.

yourdf$COLOR <- gsub("COLOR: ", "", yourdf$oldColumn1)

If you also want to delete the old column:

yourdf$oldColumn1 <- NULL

EDIT

In case you have many columns, you can apply gsub function to all targeted columns. If your targeted columns have a common name pattern, such as oldColumn as in your example, you can subset your data frame by identifying that pattern by using grep. After that, you can rename the edited columns to become COLOR1, COLOR2, etc.

Here are the complete steps:

# Remove "COLOR: " from the targeted columns
colname_pattern <- grep("oldColumn", colnames(yourdf))
yourdf[, colname_pattern] <- apply(yourdf[, colname_pattern], 2, 
                                   gsub, pattern = "COLOR: ", 
                                   replacement = "")
# Rename the edited columns
index <- seq_along(colname_pattern)
newnames <- paste0("COLOR", index)
colnames(yourdf[, colname_pattern]) <- newnames
Abdur Rohman
  • 2,691
  • 2
  • 7
  • 12
1

Starting with

quux <- structure(list(oldColumn1 = c("COLOR: RED", "COLOR: RED", "COLOR: BLUE", "COLOR: GREEN", "COLOR: BLUE")), class = "data.frame", row.names = c(NA, -5L))

The naive approach would be

data.frame(COLOR = trimws(sub("COLOR:", "", quux$oldColumn1)))
#   COLOR
# 1   RED
# 2   RED
# 3  BLUE
# 4 GREEN
# 5  BLUE

But I'm assuming you have a more generic need. Let's assume that you have some more things to parse out of that, such as

quux <- structure(list(oldColumn1 = c("COLOR: RED", "COLOR: RED", "COLOR: BLUE", "COLOR: GREEN", "COLOR: BLUE", "SIZE: 1", "SIZE: 3", "SIZE: 5")), class = "data.frame", row.names = c(NA, -8L))
quux
#     oldColumn1
# 1   COLOR: RED
# 2   COLOR: RED
# 3  COLOR: BLUE
# 4 COLOR: GREEN
# 5  COLOR: BLUE
# 6      SIZE: 1
# 7      SIZE: 3
# 8      SIZE: 5

then we can generalize it with

tmp <- strcapture("(.*)\\s*:\\s*(.*)", quux$oldColumn1, list(k="", v=""))
tmp$ign <- ave(rep(1L, nrow(tmp)), tmp$k, FUN = seq_along)
reshape2::dcast(tmp, ign ~ k, value.var = "v")[,-1,drop=FALSE]
#   COLOR SIZE
# 1   RED    1
# 2   RED    3
# 3  BLUE    5
# 4 GREEN <NA>
# 5  BLUE <NA>

--

Edit: alternative with updated data:

do.call(cbind, lapply(dat, function(X) {
  nm <- sub(":.*", "", X[1])
  out <- data.frame(trimws(sub(".*:", "", X)))
  names(out) <- nm
  out
}))
#   COLOR   SIZE    DESIGNSTYLE
# 1   RED  LARGE         STYLED
# 2   RED MEDIUM ORIGINAL MAKER
# 3  BLUE XLARGE        COUTURE
# 4 GREEN MEDIUM        COUTURE
# 5  BLUE  SMALL         STYLED
r2evans
  • 141,215
  • 6
  • 77
  • 149
  • Any ideas for developing a function that could be called, passing in an oldColumn name vector, without having to edit code for each column? – seehorse Apr 12 '22 at 21:42
  • No I do not, mostly because I don't know what you mean exactly by "each column". I suggest you [edit](https://stackoverflow.com/posts/71849180/edit) your question and provide sample data that supports what you're asking for here. (And please, in a format that is more easily used such as `data.frame(...)`, `read.table(...)`, or `dput(...)`. See https://stackoverflow.com/q/5963269, [mcve], and https://stackoverflow.com/tags/r/info.) – r2evans Apr 12 '22 at 21:44
  • I didn't see your edit (we are not automatically notified, fyi). See my edit, hope it works. – r2evans Apr 13 '22 at 23:43