0

EDIT: Revamped reflecting comments below

I've got some data on workers across time. They may work in more than one position in any given year; I want to subset the data to get workers who have at least one position with certain characteristics.

Here's my toy data:

set.seed(1643)
dt<-data.table(id=rep(1:1000,10),
               area=sample(letters,1e4,replace=T),
               position=sample(10,1e4,replace=T),
               firm_type=sample(5,1e4,replace=T),
               year=rep(2001:2010,each=1000),key="id")

I only want workers in area d,o,w,l,e at position 7.

Unfortunately, the way that firm_type was coded changed from 2005 onwards; prior to 2005, the relevant workers were all at firms with firm_type==1. Thereafter, types 1 and 2 are acceptable.

I tried this look-up but it doesn't work:

dt[.(dt[firm_type %in% ifelse(year<2005,1,1:2)
    &area %in% c("d","o","w","l","e")
    &position==7,unique(id)])]

Specifically, the %in% operator, as mentioned in the comments below, can't operate row-by-row, so we get (intermediate) output like:

> dt[firm_type %in% ifelse(year<2005,1,1:2)
+    &area %in% c("d","o","w","l","e")
+    &position==7,table(firm_type,year)]
         year
firm_type 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
        1    4    2    5    2    3    7    1    0    4    1
        2    2    4    4    6    4    5    9    8    1    2

@Frank inspired this workaround:

dt[.(dt[ifelse(year<2005,firm_type==1,
           firm_type %in% 1:2)
    &area %in% c("d","o","w","l","e")
    &position==7,unique(id)])]

I'm content with this, but I was encouraged to see if there's a better approach to this problem because ifelse is not optimized.

MichaelChirico
  • 33,841
  • 14
  • 113
  • 198
  • 2
    `ifelse` won't return a vector from one conditional, it will just give you the first letter back – Rorschach Jun 09 '15 at 18:56
  • 1
    Yeah, the `ifelse` you've written is the same as `ifelse(yr<2005,rep(letters[1:2],1000/2),rep(letters[3:4],1000/2))`. If that is what you were going for, just use extra parens (in 1.9.4), like `dt[(grp%in%X)]` – Frank Jun 09 '15 at 19:02
  • @6pool this doesn't agree with the output we see... if that were the case `b` and `d` would be 0d out appropriately – MichaelChirico Jun 09 '15 at 19:02
  • @Frank in my actual example, the `dt` is keyed by a different variable, so I don't think that syntax will work – MichaelChirico Jun 09 '15 at 19:04
  • I'm making the same point 6pool is: you cannot have different right-hand sides of `%in%` for each row. It's not vectorized in that way. The `ifelse` evaluates to a single vector, and that is what is used as the RHS. – Frank Jun 09 '15 at 19:05
  • You shouldn't use `ifelse` in `data.table` anyway (at least until Arun and Matt optimize it). It's inefficient. – Roland Jun 09 '15 at 19:07
  • @Roland see the update at the bottom of my question; I don't think I can do the search I need to do in a binary fashion... – MichaelChirico Jun 09 '15 at 19:12
  • 1
    @Frank thanks, that's the key. This works: `dt[ifelse(yr<2005,grp %in% letters[1:2],grp %in% letters[3:4]),]` – MichaelChirico Jun 09 '15 at 19:12
  • I'm not sure what the best alternative to `ifelse` is here. Your `|` requires both vectors to be constructed, while you could alternately do `dt[yr<2005,tag:=grp%in%letters[1:2]][yr>=2005,tag:=grp%in%letters[3:4]][(tag)]` – Frank Jun 09 '15 at 19:13
  • I think the efficiency improvement is not large if there is any, though (since `[` has overhead). I'd probably consider just doing the `ifelse` from your last comment. Dunno what @Roland thinks. – Frank Jun 09 '15 at 19:14
  • 1
    anyway this is a one-time operation, so the efficiency loss is not too painful – MichaelChirico Jun 09 '15 at 19:15
  • @Frank `ifelse` means copies. That's slow. – Roland Jun 09 '15 at 19:20
  • 2
    @MichaelChirico I think this question would benefit from a full and clear problem description, which would enable people to suggest alternative approaches. (And please set the random seed!) – Roland Jun 09 '15 at 19:20
  • Are you using 1.9.4? If so it should be the auto indexing bug... that's fixed in 1.9.5. – Arun Jun 09 '15 at 21:19
  • @Roland check out the updated question. As I was writing it, @Frank's suggestion to just defined a `tag` indicator began to seem more appealing--though I always hate defining one-time-use indicators due to clutter – MichaelChirico Jun 09 '15 at 21:35
  • If it's a subsetting operation you're after, why are you doing a join `dt[.(dt[...])]`? Are you just trying to select `id`s that meet the criterion? – Frank Jun 09 '15 at 22:03
  • In essence, yes; should I ellaborate on this: "I want to subset the data to get workers who have at least one position with certain characteristics" – MichaelChirico Jun 09 '15 at 22:05
  • Yeah, that clarifies it, thanks. – Frank Jun 09 '15 at 22:08

2 Answers2

1

A faster way. You could defer the ifelse until you have a smaller subset:

dt[ position==7L & area%in%c("d","o","w","l","e") & firm_type%in%1:2
][ifelse(year<2005,firm_type==1L,firm_type %in% 1:2),
  unique(id)
]

Depending on how readable you find it, you could also do:

dt[ position==7L & area%in%c("d","o","w","l","e") & firm_type%in%1:2
][!(year < 2005 & firm_type==2L),
  unique(id)
]

Regarding ifelse. ifelse(cond,yes,no) is slow because it calculates all of yes and no if it needs either of them, as documented by @RicardoSaporta. Another idea -- (cond&yes)|((!cond)&no) -- mentioned in an earlier iteration of the OP, has the same problem.


The verbose way. If your conditions are messier, you might want to make them explicit:

my_areas     = c("d","o","w","l","e")
my_posns     = 7L
my_yearfirms = data.table(year=unique(dt$year))[,.(
  firm_type = if (year<2005) 1L else 1:2
),by=year]

merge(dt[position%in%my_posns & area%in%my_areas],my_yearfirms,by=c("year","firm_type"))[,
  unique(id)
]

The final piece of code can be

  • skipped over (guessing what is going on by context) and
  • reused elsewhere (if you change your conditions).

This is what I would do unless efficiency was very important.

Community
  • 1
  • 1
Frank
  • 66,179
  • 8
  • 96
  • 180
0

Just use an "or" in addition to "and":

> dt[((firm_type == 1 ) | (firm_type ==2 & year>=2005))
+    &area %in% c("d","o","w","l","e")
+    &position==7,]
patapouf_ai
  • 17,605
  • 13
  • 92
  • 132