1

I need help regarding the data manipulation in R .

My dataset looks something like this.

Name, country, age
Smith, Canada, 27
Avin, India, 25
Smith, India, 27
Robin, France, 28

Now I want to identify the number of changes that “Smith” has gone through (two) based on combination of Name and country only.

Basically, I want to compare each datapoint with other datapoints and identify the count of changes that have been there in the entire dataset for the combination of Name and Country only.

acylam
  • 18,231
  • 5
  • 36
  • 45
Jay
  • 835
  • 1
  • 6
  • 11
  • Is it two for "Smith" because you also count the original? Wouldn't it be sufficient to just group by `Name` and count the number of rows for each group? – acylam Apr 16 '18 at 15:08
  • Yes.. I am counting the original one as well.. incase I group, I will eventually count the number of rows in total for that group rather than counting the changes that were made across Name and country combined together. – Jay Apr 16 '18 at 15:12
  • Would a new row for the same name always indicate a change in one of the variables, or can it be a duplicate? – acylam Apr 16 '18 at 15:14
  • 1
    I won’t count it as a duplicate, since my objective is to track the multiple changes for which a particular data point has gone through.. so if it is from a->b->a->a->a->b->a... so my total changes will be 4, irrespective of duplicate combinations coming in the later phase of the dataset. – Jay Apr 16 '18 at 21:37

1 Answers1

0

You can do this by comparing the lag values of the combination with it's current value by group using dplyr:

library(dplyr)

df %>%
  group_by(Name) %>%
  mutate(combination = paste(country, age),
         lag_combination = lag(combination,  default = 0, order_by = Name),
         Changes = cumsum(combination != lag_combination)) %>%
  slice(n()) %>%
  select(Name, Changes)

Result:

# A tibble: 3 x 2
# Groups:   Name [3]
    Name Changes
  <fctr>   <int>
1   Avin       2
2  Robin       1
3  Smith       3

Notes:

  • dplyr:lag does not respect group_by(Name), so you need to add order_by = Name to lag by Name.

  • I'm setting a default using default = 0 so that the first entry of each group would not be NA.

Data:

df = read.table(text="Name, country, age
Smith, Canada, 27
Avin, India, 25
Smith, India, 27
Robin, France, 28
Smith, Canada, 27
Robin, France, 28
Avin, France, 26", header = TRUE, sep = ',')
acylam
  • 18,231
  • 5
  • 36
  • 45
  • Thanks... but this would give me distinct count, not the multiple changes that have happen in the datapoints for the required combination. , even when duplicate items for the required combination appears later. – Jay Apr 16 '18 at 21:39
  • @Jay I've updated my answer. This should give you what you are looking for. Note the more informative dataset I have used instead of yours. – acylam Apr 17 '18 at 14:25