1

I am using R to get some data from Google Analytics API. In this particular scenario, I get the information about the Affinity Interest of my users segmented by gender and age group. The structure of the data I get looks something similiar to:

gender ageGroup interest        sessions
male   18-24    Autos           4
male   18-24    Autos/Luxury    1
male   18-24    Autos/Vans      1
male   25-34    Autos           8
male   25-34    Autos/Luxury    2
male   25-34    Autos/Vans      2
male   25-34    Autos/Compacts  1
...
female 65+      Fashion         20

The problem with this structure is, however, Autos as the main interest contains also the sessions of the sub-categories, and if I use this data in a Pivot Table I will get the wrong information.

Therefore, I am adding the subcategory "Generalists" to every main category as its own subcategory and splitting this column in two:

for (i2 in 1:nrow(ga.genderAgeAffinityTable) ) {

# main categories <- chrFound = integer(0)            
chrFound <- grep("[/]", ga.genderAgeAffinityTable$interest[i2] )

if (length(chrFound) < 1) {
ga.genderAgeAffinityTable$interest[i2] <- 
sprintf("%s/Generalists", ga.genderAgeAffinityTable$interest[i2])
}

ga.genderAgeAffinityTable <- as.data.frame
(cSplit(ga.genderAgeAffinityTable, "interest", sep = "/"))

}

View(ga.genderAgeAffinityTable)

            gender ageGroup interest        subcategory        sessions
            male   18-24    Autos           Generalists        4
            male   18-24    Autos           Luxury             1
            male   18-24    Autos           Vans               1
            male   25-34    Autos           Generalists        8
            male   25-34    Autos           Luxury             2
            male   25-34    Autos           Vans               2
            male   25-34    Autos           Compacts           1
            ...
            female 65+      Fashion         Generalists        20

I still have to get rid of the wrong sessions calculation, as for the first group (males, 18-24 years, Auto lovers), Generalists should have only 2 sessions (sessions - sum(other subcategories)). I am doing that using an auxId (genderAgeInterestSubcategory), summing all sessions by that auxId, merging the aggregated sessions as a new column in my dataframe and recalculating sessions for subcategory "Generalists":

ga.genderAgeAffinityTable$auxId <- sprintf("%s%s%s",
ga.genderAgeAffinityTable$gender, ga.genderAgeAffinityTable$age,
ga.genderAgeAffinityTable$interest_1 )

ga.interestAggregated <- aggregate(ga.genderAgeAffinityTable[,c("sessions")],
by=list(ga.genderAgeAffinityTable$auxId), "sum")

colnames(ga.interestAggregated) <- c("auxId", "aggregated")

ga.genderAgeAffinityTable <- (merge(ga.genderAgeAffinityTable,
ga.interestAggregated, by = 'auxId'))

for (i3 in 1:nrow(ga.genderAgeAffinityTable) ) {

if (ga.genderAgeAffinityTable$interest_2[i3] == "Generalists" ) {

# Do not recalculate sessions for interests with only Generalists as subcategory          
if (ga.genderAgeAffinityTable$aggregated[i3] -
ga.genderAgeAffinityTable$sessions[i3] != 0 ) {

ga.genderAgeAffinityTable$sessions[i3] <-
ga.genderAgeAffinityTable$aggregated[i3] -
ga.genderAgeAffinityTable$sessions[i3]
}

}

}

Do you know a more straightforward way to do the same without using an auxid?

micstr
  • 5,080
  • 8
  • 48
  • 76
agustin
  • 1,311
  • 20
  • 42
  • I think many people here would love to help you, but your example is not exactly a `minimal reproducible example` http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – rmuc8 Apr 14 '15 at 14:27

1 Answers1

3

Have you looked at the data.table package? It has amazing summarizing functionality that might be able to help you.

e.g.

library(data.table)
results <- DT[ , sum(sessions), by = subcategory]
# would give you total sessions per sub interest
#  which could help you subset when you then focus on Generalists.
#  to do multiple groups you would use by = .(gender, subcategory)

and you can create columns to get to your subsets using := . data.table is very powerful in the right hands to prevent all the looping you are needing to do. You need to key the data.

I am still a beginner, so others may have more efficient code below.

Please take a look at the data.table wiki and cheatsheet. The DT gurus/legends Matt and @Arun are very active on SO and are likely to pitch in and may be able to help you if you go this route.

We might need more detail on how you want to transform the data. i.e. "Generalists should have only 2 sessions" please confirm what you would expect the output to be. Do you need this just an output for each gender/ageGroup/Interest of the net sessions per Generalists?

DATA

To help others pitch in, here is the data of first two categories using dput

library(data.table)
DT <- data.table(gender = c("male", "male", "male", "male", "male","male", "male"), 
ageGroup = c("18-24", "18-24", "18-24", "25-34","25-34", "25-34", "25-34"),
interest = c("Autos", "Autos", "Autos","Autos", "Autos", "Autos", "Autos"),
subcategory = c("Generalists","Luxury", "Vans", "Generalists", "Luxury", "Vans", "Compacts"), 
sessions = c(4L, 1L, 1L, 8L, 2L, 2L, 1L) )

SOLUTION

Building this in stages, to help explain and show you how powerful is. This will get total of everything except Generalists.

notgensum <- DT[subcategory  != "Generalists", mysum := sum(sessions),
                by = .(gender, ageGroup, interest)]

    gender ageGroup interest subcategory sessions mysum
1:   male    18-24    Autos Generalists        4    NA
2:   male    18-24    Autos      Luxury        1     2
3:   male    18-24    Autos        Vans        1     2
4:   male    25-34    Autos Generalists        8    NA
5:   male    25-34    Autos      Luxury        2     5
6:   male    25-34    Autos        Vans        2     5
7:   male    25-34    Autos    Compacts        1     5

Taking this further, we minus the non Generalist number (I used average ignoring NAs to get at this number) off sessions number for Generalists. This makes myadjsessions: 2 for first one (4 -2 ) and 3 for 25-34 male Autos, as you wanted.

genadjsum2 <- notgensum[, myadjsessions := (sessions - mean(mysum, na.rm = T)),
                        by = .(gender, ageGroup, interest)]

#   gender ageGroup interest subcategory sessions mysum myadjsessions   
#1:   male    18-24    Autos Generalists        4    NA             2
#2:   male    18-24    Autos      Luxury        1     2            -1
#3:   male    18-24    Autos        Vans        1     2            -1
#4:   male    25-34    Autos Generalists        8    NA             3
#5:   male    25-34    Autos      Luxury        2     5            -3
#6:   male    25-34    Autos        Vans        2     5            -3
#7:   male    25-34    Autos    Compacts        1     5            -4

Data.table can be chained, i.e. DT[do this][and this], so if you only want the results for Generalists.

genadjsum3 <- notgensum[, 
             myadjsessions := (sessions - mean(mysum, na.rm = T)),
             by = .(gender, ageGroup, interest)][subcategory  == "Generalists"]

#  gender ageGroup interest subcategory sessions mysum myadjsessions
#1:   male    18-24    Autos Generalists        4    NA             2
#2:   male    25-34    Autos Generalists        8    NA             3

Finally, if you want to get rid of mysum temporary column, the syntax is

genadjsum3[, mysum := NULL]

And you will love no loops!

micstr
  • 5,080
  • 8
  • 48
  • 76