Find first and last occurrence of an item in R dataframe

Question

I have a dataframe with df with two columns called item and week. This is how it looks like:

 week           item
 1179           63230
 1179           63233
 1180           63230
 1180           63233
 1181           63230
 1181           63233

I would like to find the first and last week of occurrence of each item in the dataframe, and then compute the difference between last and end week. I tried to use the solution provided in this link - How can I find the first and last occurrences of an element in a data.frame?. The code that I used is as follows:

df_start <- df[!duplicated(df$item),]#Get starting week
df_end <- df[rev(!duplicated(rev(df$ITEM))),]#Get ending week

But, this does not give me the correct start and end week. Could someone help me out?

I am also including an expected output. df has about 3 million rows, but the output for the above portion of df should be as follows:

>df_start

item     start_week
63230     1179
63233     1179

>df_end

item      end_week
63230       1181
63233       1181

You use `df$ITEM` for the `df_end` code instead of `df$item` - R is case sensitive. — thelatemail, Nov 27 '17 at 05:52

Hardik Gupta · Answer 1 · 2017-11-27T05:47:59.527

You can use data.table like this

library(data.table)

df <- data.table(read.table(text = "week           item
                            1179           63230
                            1179           63233
                            1180           63230
                            1180           63233
                            1181           63230
                            1181           63233", header = T))

df <- df[order(item), c(2,1)]

df_start <- df[,.(start_week = week[1]), by = item]
df_end <- df[,.(end_week = week[.N]), by = item]

> df_start
    item start_week
1: 63230  1179
2: 63233  1179
> df_end
    item end_week 
1: 63230  1181
2: 63233  1181

neilfws · Accepted Answer · 2017-11-27T10:46:02.757

2

Making a few assumptions about your data:

week is numeric
item is always associated with at least one week (no NA weeks)
"last" is equivalent to "largest value" for week

Then this dplyr solution should work:

library(dplyr)
df %>% 
  group_by(item) %>% 
  summarise(diff = max(week) - min(week)) %>%
  ungroup()

# A tibble: 2 x 2
   item  diff
  <int> <dbl>
1 63230     2
2 63233     2

edited Nov 27 '17 at 10:46

answered Nov 27 '17 at 05:37

neilfws

32,751
5
50
63

1

@Rnovice you asked for the difference between the first and last "week" value by group, but this gives the difference between the highest and lowest values. These are equivalent if `week` is always increasing within your data, but your question wasn't clear on this. – Spacedman Nov 27 '17 at 08:50
Yes good point. My rushed answer assumed that "last" is equivalent to "largest", based on the example data. – neilfws Nov 27 '17 at 09:40
@Spacedman and @neilfws, fortunately for me, the `week` is always increasing in my data, so the largest value will correspond to the last value. – Rnovice Nov 27 '17 at 16:05

score 1 · Answer 3 · answered Nov 27 '17 at 16:30

Base R version is as simple as:

> aggregate(df$week, list(df$item),
          function(x){
            c(first=min(x),
               last=max(x),
               diff=diff(range(x)))
             })
  Group.1 x.first x.last x.diff
1   63230    1179   1181      2
2   63233    1179   1181      2

If you really did want the first and last occurrence in row order, then:

> aggregate(df$week, list(df$item),
          function(x){
             first=x[1]
             last=x[length(x)]
             return(c(f=first,l=last,diff=last-first))
            })

score 0 · Answer 4 · answered Nov 27 '17 at 05:49

My solution assumes it is already ordered.

do.call(rbind, lapply(unique(df$week), function(x){head(df[df$week == x,], 1)}))

Breaking it down. Finds all unique week values

unique(df$week)

Gets the first row of each week using head

function(x){head(df[df$week == x, ], 1)

Apply the function to each unique week value. This returns a list of data frames

lapply(unique(df$week), function(x){head(df[df$week == x,], 1)})

Then we combind the list of data frames using

do.call(rbind, list)

This can be also modified to be end of the week if you replace head with tail

Find first and last occurrence of an item in R dataframe

4 Answers4