Inserting delimiter before nth uppercase letter in R string

Question

I currently have a data frame of imported CSV data. It's a list of first and last names, job titles, and company names. Each entry is on a separate row. The first and last names, job title, and company name are all capitalized.

Each row is in this format:

First LastTitle, Company

I want to insert a comma delimiter before "Title", so that I can then split the data into three columns, like the second answer to this question: splitting comma separated mixed text and numeric string with strsplit in R.

Essentially, in this specific case I want to locate the 3rd uppercase letter in each string, and then insert a comma delimiter before it.

This answer shows how to split a string on uppercase letters, but seems to only find the first uppercase letter: Splitting String based on letters case.

Any suggestions are appreciated.

score 3 · Accepted Answer · answered Aug 13 '15 at 19:17


Split the string into a character vector of single characters, use grep to find the positions of the uppercase letters, and take the third position.

str <- "First LastTitle, Company"
tmp_str <- unlist(strsplit(str, ""))
ind <- grep("[A-Z]", tmp_str)[3]
paste0(c(tmp_str[1:(ind-1)], ",", tmp_str[ind:nchar(str)]), collapse="")
#[1] "First Last,Title, Company"
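The same steps can be wrapped in a small function so that the position `n` is a parameter and strings with fewer than `n` uppercase letters are left alone. This is only a sketch: the function name `insert_before_nth_upper` and its arguments are made up for illustration, and returning the input unchanged in the short case is an assumption about the desired behavior.

```r
# Hypothetical helper: insert `delim` before the nth uppercase letter of one string.
insert_before_nth_upper <- function(s, n = 3, delim = ",") {
  chars <- unlist(strsplit(s, ""))
  ind <- grep("[A-Z]", chars)[n]   # NA when there are fewer than n uppercase letters
  if (is.na(ind)) return(s)        # assumption: leave such strings unchanged
  # seq_len() keeps the prefix empty (rather than reversed) when ind is 1
  paste0(c(chars[seq_len(ind - 1)], delim, chars[ind:length(chars)]), collapse = "")
}

insert_before_nth_upper("First LastTitle, Company")
#[1] "First Last,Title, Company"
```

Without the `is.na()` guard, indexing with an `NA` position would silently produce a string containing "NA" instead of raising an error.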


mattdevlin


This works great. If I have a 3000 row data frame, what's the fastest way to run this on each row? Would I convert each row into a character vector? – ceph Aug 13 '15 at 20:00
You can use apply(df, 1, function(row) ... ). That applies a function to each row of a dataframe called df. – mattdevlin Aug 13 '15 at 20:13
I'm confused as to how to use apply() with the unlist function. `df2 = apply(df,1,unlist(strsplit(df, "")))` doesn't seem to work, nor does `df2 = apply(df,1,unlist(row(df,"")))` returning: `"Error in strsplit(row, "") : non-character argument"`. Could I incorporate both of your steps above into one apply() function? – ceph Aug 13 '15 at 21:32
@ceph The 3rd argument takes a function, so in your case you might want `df2 <- apply(df,1,function(row) { tmp_str <- unlist(strsplit(str, "")); ind <- grep("[A-Z]", tmp_str)[3]; paste0(c(tmp_str[1:(ind-1)], ",", tmp_str[ind:nchar(str)]), collapse=""); })`. Note that row doesn't have a special meaning; the 1 as the second argument means that the function is applied to each row of the dataset. I have chosen to call it 'row' but you could call it anything. It also assumes that your 'df' is just a single column. If this is not the form of your data, add an example of a few rows to your question. – mattdevlin Aug 13 '15 at 21:52
Ok, I understand now that "function" is used literally. My data is in a single column. When I ran your code though, I got: `Error in strsplit(str, "") : non-character argument`. I cleared my environment, and loaded in the csv with read.csv again, but had no luck. – ceph Aug 13 '15 at 22:09
R uses factors by default when importing data, so you can either override this behavior using `options(stringsAsFactors=FALSE)` at the top of your script or use the `as.character()` function on your df, i.e. `df[,1] <- as.character(df[,1])` – mattdevlin Aug 13 '15 at 22:25
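Putting the thread together, a per-row version might look like the sketch below. One thing worth noticing in the `apply()` snippet from the comments is that the function names its argument `row` but reads `str` in its body; in a cleared environment `str` resolves to R's built-in `str()` function, which may be what triggers the "non-character argument" error. The sketch uses the argument throughout and converts the column with `as.character()` as suggested. The toy data, the column name `raw`, the helper `add_comma`, and the output column names are all invented for illustration.

```r
# Toy data standing in for the imported CSV; the column name `raw` is made up.
df <- data.frame(raw = c("First LastTitle, Company",
                         "Jane DoeEngineer, Acme"),
                 stringsAsFactors = TRUE)  # mimic a factor column from an old read.csv

# Hypothetical helper: same logic as the accepted answer, using its own argument `s`.
add_comma <- function(s) {
  chars <- unlist(strsplit(s, ""))
  ind <- grep("[A-Z]", chars)[3]
  paste0(c(chars[1:(ind - 1)], ",", chars[ind:length(chars)]), collapse = "")
}

# Convert the factor column to character, then call the helper once per element.
fixed <- vapply(as.character(df$raw), add_comma, character(1), USE.NAMES = FALSE)

# Split on commas (plus optional whitespace) into a three-column character matrix.
parts <- do.call(rbind, strsplit(fixed, ",\\s*"))
colnames(parts) <- c("name", "title", "company")
```

Since the data is a single column, iterating over the column vector with `vapply()` avoids the data-frame-to-matrix conversion that `apply()` performs.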

score 1 · Answer 2 · answered Aug 13 '15 at 20:08


Try this:

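str <- "First LastTitle, Company"
# insert a comma after every lowercase letter that is directly followed by an uppercase letter
gsub('([a-z])(?=[A-Z])','\\1,',str,perl=TRUE)
#[1] "First Last,Title, Company"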


Shenglin Chen


How would I use this with apply()? Would I have to convert each row into a string? – ceph Aug 13 '15 at 20:51
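On the `apply()` question: `gsub()` is vectorized over its input, so it can be called on a whole character column without any row-wise loop. The sketch below assumes a data frame with a single character column; the data and the column name `raw` are invented. It also shows a caveat of this pattern: it matches every lowercase-to-uppercase boundary, not just the one before the title, so names such as "McDonald" pick up an extra comma.

```r
# Toy data; the column name `raw` and both rows are invented for illustration.
df <- data.frame(raw = c("First LastTitle, Company",
                         "Ronald McDonaldManager, Acme"),
                 stringsAsFactors = FALSE)

# gsub() processes every element of the column in a single call.
df$raw <- gsub("([a-z])(?=[A-Z])", "\\1,", df$raw, perl = TRUE)

df$raw[1]
# "First Last,Title, Company"
df$raw[2]
# "Ronald Mc,Donald,Manager, Acme"   (extra comma inside "McDonald")
```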

score 0 · Answer 3 · answered Aug 13 '15 at 19:26


You could insert a comma after two occurrences of the pattern "one uppercase letter followed by one or more non-uppercase characters":

x <- "First LastTitle, Company"

sub("(([A-Z][^A-Z]+){2})(.*)","\\1,\\3",x)
#[1] "First Last,Title, Company"
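For the general "nth uppercase letter" case, the count can be built into the pattern with `sprintf()`. The following is a sketch of a variant of the regex idea above, not the answer's exact pattern: it uses a lookahead so the comma lands directly before the nth uppercase letter, and it works on a whole vector at once. The function name `insert_before_nth_upper_re` is hypothetical.

```r
# Hypothetical helper: regex-based insertion before the nth uppercase letter.
insert_before_nth_upper_re <- function(x, n = 3, delim = ",") {
  # Consume n - 1 uppercase letters (each preceded by any run of non-uppercase
  # characters), then any further non-uppercase run, and stop right before
  # the nth uppercase letter, which the lookahead checks without consuming.
  pattern <- sprintf("^((?:[^A-Z]*[A-Z]){%d}[^A-Z]*)(?=[A-Z])", as.integer(n - 1))
  sub(pattern, paste0("\\1", delim), x, perl = TRUE)
}

insert_before_nth_upper_re("First LastTitle, Company")
#[1] "First Last,Title, Company"
```

Strings with fewer than `n` uppercase letters do not match the pattern, so `sub()` returns them unchanged.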


scoa

