-2

I have a dataframe with df with two columns called item and week. This is how it looks like:

 week           item
 1179           63230
 1179           63233
 1180           63230
 1180           63233
 1181           63230
 1181           63233

I would like to find the first and last week of occurrence of each item in the dataframe, and then compute the difference between last and end week. I tried to use the solution provided in this link - How can I find the first and last occurrences of an element in a data.frame?. The code that I used is as follows:

df_start <- df[!duplicated(df$item),]#Get starting week
df_end <- df[rev(!duplicated(rev(df$ITEM))),]#Get ending week

But, this does not give me the correct start and end week. Could someone help me out?

I am also including an expected output. df has about 3 million rows, but the output for the above portion of df should be as follows:

>df_start

item     start_week
63230     1179
63233     1179

>df_end

item      end_week
63230       1181
63233       1181
Rnovice
  • 333
  • 1
  • 5
  • 18

4 Answers4

3

You can use data.table like this

library(data.table)

df <- data.table(read.table(text = "week           item
                            1179           63230
                            1179           63233
                            1180           63230
                            1180           63233
                            1181           63230
                            1181           63233", header = T))

df <- df[order(item), c(2,1)]

df_start <- df[,.(start_week = week[1]), by = item]
df_end <- df[,.(end_week = week[.N]), by = item]

> df_start
    item start_week
1: 63230  1179
2: 63233  1179
> df_end
    item end_week 
1: 63230  1181
2: 63233  1181
Hardik Gupta
  • 4,700
  • 9
  • 41
  • 83
2

Making a few assumptions about your data:

  • week is numeric
  • item is always associated with at least one week (no NA weeks)
  • "last" is equivalent to "largest value" for week

Then this dplyr solution should work:

library(dplyr)
df %>% 
  group_by(item) %>% 
  summarise(diff = max(week) - min(week)) %>%
  ungroup()

# A tibble: 2 x 2
   item  diff
  <int> <dbl>
1 63230     2
2 63233     2
neilfws
  • 32,751
  • 5
  • 50
  • 63
  • 1
    @Rnovice you asked for the difference between the first and last "week" value by group, but this gives the difference between the highest and lowest values. These are equivalent if `week` is always increasing within your data, but your question wasn't clear on this. – Spacedman Nov 27 '17 at 08:50
  • Yes good point. My rushed answer assumed that "last" is equivalent to "largest", based on the example data. – neilfws Nov 27 '17 at 09:40
  • @Spacedman and @neilfws, fortunately for me, the `week` is always increasing in my data, so the largest value will correspond to the last value. – Rnovice Nov 27 '17 at 16:05
1

Base R version is as simple as:

> aggregate(df$week, list(df$item),
          function(x){
            c(first=min(x),
               last=max(x),
               diff=diff(range(x)))
             })
  Group.1 x.first x.last x.diff
1   63230    1179   1181      2
2   63233    1179   1181      2

If you really did want the first and last occurrence in row order, then:

> aggregate(df$week, list(df$item),
          function(x){
             first=x[1]
             last=x[length(x)]
             return(c(f=first,l=last,diff=last-first))
            })
Spacedman
  • 92,590
  • 12
  • 140
  • 224
0

My solution assumes it is already ordered.

do.call(rbind, lapply(unique(df$week), function(x){head(df[df$week == x,], 1)}))

Breaking it down. Finds all unique week values

unique(df$week) 

Gets the first row of each week using head

function(x){head(df[df$week == x, ], 1)

Apply the function to each unique week value. This returns a list of data frames

lapply(unique(df$week), function(x){head(df[df$week == x,], 1)})

Then we combind the list of data frames using

do.call(rbind, list)

This can be also modified to be end of the week if you replace head with tail

Jt Miclat
  • 134
  • 7