31

I could solve this using loops, but I am trying to think in vectors so my code will be more R-esque.

I have a list of names. The format is firstname_lastname. I want to get out of this list a separate list with only the first names. I can't seem to get my mind around how to do this. Here's some example data:

t <- c("bob_smith","mary_jane","jose_chung","michael_marx","charlie_ivan")
tsplit <- strsplit(t,"_")

which looks like this:

> tsplit
[[1]]
[1] "bob"   "smith"

[[2]]
[1] "mary" "jane"

[[3]]
[1] "jose"  "chung"

[[4]]
[1] "michael" "marx"   

[[5]]
[1] "charlie" "ivan"   

I could get out what I want using loops like this:

for (i in 1:length(tsplit)){
    if (i==1) {t_out <- tsplit[[i]][1]} else{t_out <- append(t_out, tsplit[[i]][1])} 
}

which would give me this:

t_out
[1] "bob"     "mary"    "jose"    "michael" "charlie"

So how can I do this without loops?

JD Long
  • 2
    BTW it may be helpful if you could detail how this is different from your previous questions on the same topic: http://stackoverflow.com/questions/439526/thinking-in-vectors-with-r http://stackoverflow.com/questions/1246244/r-using-the-apply-function-on-a-data-frame-help-me-get-my-vector-victor http://stackoverflow.com/questions/445059/vectorize-my-thinking-vector-operations-in-r – Dirk Eddelbuettel Aug 31 '09 at 03:23
  • 4
    you mean my utter inability to really learn how to do apply functions in R? Yeah, same issue, different nuance. Thanks for reminding me. – JD Long Aug 31 '09 at 03:47

10 Answers

43

And one more approach:

t <- c("bob_smith","mary_jane","jose_chung","michael_marx","charlie_ivan")
pieces <- strsplit(t,"_")
sapply(pieces, "[", 1)

In words, the last line extracts the first element of each component of the list and then simplifies it into a vector.

How does this work? Well, you need to realise that an alternative way of writing x[1] is "["(x, 1), i.e. there is a function called [ that does subsetting. The sapply call calls this function once for each element of the original list, passing in two arguments: the list element and 1.

The advantage of this approach over the others is that you can extract multiple elements from the list without having to recompute the splits. For example, the last name would be sapply(pieces, "[", 2). Once you get used to this idiom, it's pretty easy to read.
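
A minimal sketch of the equivalence described above, reusing the t and pieces objects from this answer. The names x, first and last are illustrative only and are not part of the original answer.

```r
t <- c("bob_smith", "mary_jane", "jose_chung", "michael_marx", "charlie_ivan")
pieces <- strsplit(t, "_")

# x[1] and `[`(x, 1) are the same call written two ways
x <- pieces[[1]]
same <- identical(x[1], `[`(x, 1))

# sapply passes each list element as the first argument of `[`
# and the extra argument (1 or 2) as the index
first <- sapply(pieces, "[", 1)
last  <- sapply(pieces, "[", 2)
```

Because the split is computed once and stored in pieces, both extractions reuse it.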

hadley
  • Hadley, I see this works, but I haven't the slightest idea why it works. Is there an implied "]" somehow? Can you elaborate a bit? My R-foo is clearly weak. – JD Long Aug 31 '09 at 05:01
  • I was a little shocked by this, too, JD... so after a little playing, I see that: > "["(pieces,1) yields [[1]] [1] "bob" "smith" ... an interesting notation, to be sure, and very useful! – William Doane Aug 31 '09 at 15:34
  • Just as a side note, if you are going to split on fixed strings instead of regexps, you might want to consider passing `fixed=TRUE` to `strsplit`. I've found that this can have a large impact on the speed of `strsplit`. – Jonathan Chang Aug 31 '09 at 19:46
  • 6
    All operators in R are functions - infix operators can be written in prefix notation. TRUE || FALSE can be written as `||`(TRUE,FALSE), a[b] can be written as `[`(a,b), and even assignment operators a[b] <- TRUE is `[<-`(a,b,value=TRUE). R is magic. – hatmatrix Sep 01 '09 at 05:09
  • Not sure if it came out correctly there but there should be quotes (I used backtick but regular quotes should also work) around the prefix functions. – hatmatrix Sep 01 '09 at 05:10
  • thanks for posting an explanation. That makes sense to me now. The [ function was totally new to me. – JD Long Sep 02 '09 at 14:49
  • I love that this works, and I love Stephen's comment "R is magic". It's so true ! – PaulHurleyuk Apr 08 '10 at 12:00
  • what if its a long list and you want the last element? – zach Feb 28 '12 at 19:19
26

You can use apply (or sapply)

t <- c("bob_smith","mary_jane","jose_chung","michael_marx","charlie_ivan")
f <- function(s) strsplit(s, "_")[[1]][1]
sapply(t, f)

   bob_smith    mary_jane   jose_chung michael_marx charlie_ivan 
       "bob"       "mary"       "jose"    "michael"    "charlie" 

See: A brief introduction to “apply” in R
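
The output above carries the input strings as names because sapply keeps them when it is given a character vector. A small sketch of two ways to get a plain vector instead, assuming the same f as above (the variable names named, plain and also_plain are made up for illustration):

```r
tlist <- c("bob_smith", "mary_jane", "jose_chung", "michael_marx", "charlie_ivan")
f <- function(s) strsplit(s, "_")[[1]][1]

# Default: the result is named after the input strings
named <- sapply(tlist, f)

# USE.NAMES = FALSE suppresses the names at the source
plain <- sapply(tlist, f, USE.NAMES = FALSE)

# unname() strips them after the fact
also_plain <- unname(named)
```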

slhck
liebke
10

How about:

tlist <- c("bob_smith","mary_jane","jose_chung","michael_marx","charlie_ivan")
fnames <- gsub("(_.*)$", "", tlist)
# _.* matches the underscore followed by a string of characters
# the $ anchors the search at the end of the input string
# so, underscore followed by a string of characters followed by the end of the input string

for the RegEx approach?
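
As a sketch of how the same idea extends, the pattern can be mirrored to pull out the last name as well. The variable lnames and the three-part input "mary_jane_watson" are hypothetical and only there to show what the greedy .* does when there is more than one underscore.

```r
tlist <- c("bob_smith", "mary_jane", "jose_chung", "michael_marx", "charlie_ivan")

# First name: delete the first underscore and everything after it
fnames <- sub("_.*$", "", tlist)

# Last name: delete everything up to and including the first underscore
lnames <- sub("^[^_]*_", "", tlist)

# .* is greedy, so with two underscores everything after the first one goes
multi <- sub("_.*$", "", "mary_jane_watson")
```

Here sub is enough, since the anchored pattern can match at most once per string.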

William Doane
  • 1
    +1 for being the fastest. With rep(t, 1e4), my approach took 83.23 seconds (81.41 of which were spent converting to a data frame!), David's took 4.39s, and yours took 0.81. I think it has the best output, too. – Matt Parker Aug 31 '09 at 03:23
  • 1
    Thanks, Matt... I was wondering about the efficiency of each of these solutions! – William Doane Aug 31 '09 at 03:31
  • 1
    that's really informative. I had just assumed the strsplit bit was a given. Wow. Good to see another way of doing it. – JD Long Aug 31 '09 at 03:49
9

what about:

t <- c("bob_smith","mary_jane","jose_chung","michael_marx","charlie_ivan")

sub("_.*", "", t)
Karsten
7

I doubt this is the most elegant solution, but it beats looping:

t.df <- data.frame(tsplit)
t.df[1, ]

Converting lists to data frames is about the only way I can get them to do what I want. I'm looking forward to reading answers by people who actually understand how to handle lists.
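
A rough sketch of what this conversion produces, assuming tsplit is built as in the question. Note that t.df[1, ] is itself a one-row data frame, so unlist is used here to get a plain character vector; the do.call(rbind, ...) variant is an alternative added for comparison, not part of the original answer.

```r
tlist <- c("bob_smith", "mary_jane", "jose_chung", "michael_marx", "charlie_ivan")
tsplit <- strsplit(tlist, "_")

# One column per name, one row per piece; keep strings as character
t.df <- data.frame(tsplit, stringsAsFactors = FALSE)

# Flatten the first row into a character vector
first_df <- unlist(t.df[1, ], use.names = FALSE)

# Alternative: bind the pieces into a matrix with one row per name
m <- do.call(rbind, tsplit)
first_m <- m[, 1]
```

Both variants assume every name splits into the same number of pieces.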

Matt Parker
  • I like this. I 'get' the data.frame structure. And since my real data has the same number of items in each "name" then this should not be less memory efficient. Why didn't I think of this! – JD Long Aug 31 '09 at 01:37
  • Note that this approach takes a hell of a long time with larger data - see my comment on William Doane's answer. – Matt Parker Aug 31 '09 at 03:24
4

You almost had it. It really is just a matter of

  1. using one of the *apply functions to loop over your existing list (I often start with lapply and sometimes switch to sapply)
  2. adding an anonymous function that operates on one of the list elements at a time
  3. you already knew it was strsplit(string, splitterm) and that you need the odd [[1]][1] to pick off the first term of the answer
  4. just putting it all together, starting with a preferred variable name (as we stay clear of t or c and friends)

which gives

> tlist <- c("bob_smith","mary_jane","jose_chung","michael_marx","charlie_ivan") 
> fnames <- sapply(tlist, function(x) strsplit(x, "_")[[1]][1]) 
> fnames 
   bob_smith    mary_jane   jose_chung michael_marx charlie_ivan 
       "bob"       "mary"       "jose"    "michael"    "charlie" 
>
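
The "start with lapply, then switch to sapply" step can be sketched as follows. The helper name first_piece is hypothetical, and the vapply line is an addition that is not in the original answer; it behaves like sapply here but checks that each result is a single string.

```r
tlist <- c("bob_smith", "mary_jane", "jose_chung", "michael_marx", "charlie_ivan")

# The anonymous function from the answer, given a name for reuse
first_piece <- function(x) strsplit(x, "_")[[1]][1]

# lapply always returns a list, one element per input
as_list <- lapply(tlist, first_piece)

# sapply simplifies that list to a (named) character vector
as_vec <- sapply(tlist, first_piece)

# vapply does the same but fails loudly if a result is not one string
as_chr <- vapply(tlist, first_piece, character(1))
```
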
Dirk Eddelbuettel
  • I really have struggled with getting my mind around properly using the apply functions in R. Some days it feels like learning to drive on the opposite side of the road.. it's really not hard but the simple round-a-bouts result in a mental log jam. – JD Long Sep 02 '09 at 14:51
  • 1
    I do it in a leg-alike fashion. You knew strsplit. You knew you needed an 'anon function' of one parameter for the apply family. Just stick'em together.... Lastly, and not to nit-pick, I posted this before the essentially identical but less verbose answer you accepted as 'the' answer. – Dirk Eddelbuettel Sep 02 '09 at 15:53
  • Typo: 'lego-alike', not 'leg-alike' – Dirk Eddelbuettel Sep 02 '09 at 15:53
  • Dirk, one of the things I have noticed about being a novice at R is that it is very hard to see that two given problems are similar. I think with expertise comes the ability to chose meaningful analogies quickly. I'm slowly getting to where I can see patterns. I appreciate your comment above about figuring out what the lego bricks are. I'm still growing in my ability to look at a problem and see that I need an anon function, for example. – JD Long Sep 09 '09 at 15:57
3

You could use unlist():

> tsplit <- unlist(strsplit(t,"_"))
> tsplit
 [1] "bob"     "smith"   "mary"    "jane"    "jose"    "chung"   "michael"
 [8] "marx"    "charlie" "ivan"   
> t_out <- tsplit[seq(1, length(tsplit), by = 2)]
> t_out
[1] "bob"     "mary"    "jose"    "michael" "charlie"

There might be a better way to pull out only the odd-indexed entries, but in any case you won't have a loop.
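
One possible shortcut for the odd-indexed entries is logical recycling: a length-two logical index is repeated along the whole vector. This is a sketch that assumes every name has exactly two pieces; the names first and last are illustrative.

```r
tlist <- c("bob_smith", "mary_jane", "jose_chung", "michael_marx", "charlie_ivan")
tsplit <- unlist(strsplit(tlist, "_"))

# c(TRUE, FALSE) is recycled: keep positions 1, 3, 5, ...
first <- tsplit[c(TRUE, FALSE)]

# c(FALSE, TRUE) keeps positions 2, 4, 6, ...
last <- tsplit[c(FALSE, TRUE)]
```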

brentonk
2

And one other approach, based on brentonk's unlist example...

tlist <- c("bob_smith","mary_jane","jose_chung","michael_marx","charlie_ivan")
tsplit <- unlist(strsplit(tlist,"_"))
fnames <- tsplit[seq_along(tsplit) %% 2 == 1]

William Doane
1

I would use the following unlist()-based method:

> t <- c("bob_smith","mary_jane","jose_chung","michael_marx","charlie_ivan")
> tsplit <- strsplit(t,"_")
> 
> x <- matrix(unlist(tsplit), 2)
> x[1,]
[1] "bob"     "mary"    "jose"    "michael" "charlie"

The big advantage of this method is that it solves the equivalent problem for surnames at the same time:

> x[2,]
[1] "smith" "jane"  "chung" "marx"  "ivan" 

The downside is that you'll need to be certain that all of the names conform to the firstname_lastname structure; if any don't then this method will break.
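
One way to guard against that downside, sketched under the assumption that a hard stop is acceptable when a name is malformed, is to check the number of pieces with lengths before building the matrix. The input "cher" is a made-up example of a name without an underscore.

```r
tlist <- c("bob_smith", "mary_jane", "jose_chung", "michael_marx", "charlie_ivan")
tsplit <- strsplit(tlist, "_")

# lengths() reports how many pieces each name produced
stopifnot(all(lengths(tsplit) == 2))

x <- matrix(unlist(tsplit), nrow = 2)
first <- x[1, ]
last  <- x[2, ]

# A hypothetical malformed input shows up as a piece count other than 2
bad <- strsplit(c("bob_smith", "cher"), "_")
bad_counts <- lengths(bad)
```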

Sumit Singh
jmc200
0

from the original tsplit list object given at the beginning, this command will do:

unlist(lapply(tsplit,function(x) x[1]))

it extracts the first element of all list elements, then transforms the list to a vector. Unlisting first to a matrix, then extracting the first column is also ok, but then you are dependent on the fact that all list elements have the same length. Here is the output:

> tsplit

[[1]]
[1] "bob"   "smith"

[[2]]
[1] "mary" "jane"

[[3]]
[1] "jose"  "chung"

[[4]]
[1] "michael" "marx"   

[[5]]
[1] "charlie" "ivan"   

> lapply(tsplit,function(x) x[1])

[[1]]
[1] "bob"

[[2]]
[1] "mary"

[[3]]
[1] "jose"

[[4]]
[1] "michael"

[[5]]
[1] "charlie"

> unlist(lapply(tsplit,function(x) x[1]))

[1] "bob"     "mary"    "jose"    "michael" "charlie"
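
To illustrate the point about list elements of different lengths, here is a small sketch with a made-up ragged input (the names "mary_jane_watson" and "cher" are hypothetical). The same pattern also picks off the last piece of each element, whatever its length.

```r
# Hypothetical input whose elements split into 2, 3 and 1 pieces
ragged <- strsplit(c("bob_smith", "mary_jane_watson", "cher"), "_")

# First piece of each element, regardless of length
first <- unlist(lapply(ragged, function(x) x[1]))

# Last piece of each element: index by the element's own length
last <- unlist(lapply(ragged, function(x) x[length(x)]))
```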
David Arenburg