1

Introduction

I have longitudinal data in wide format, measuring the total value in sales that a company had per year. From this, I want to create a new set of variables--market share--for each year in the data, for each company. The full data set is too large to do this the long, clumsy way, so I tried to do it by running a function on a subset (i.e. the columns marking the sales data for each year), using sapply.

However, the results do not seem to produce 'real' variables, as they show up in printing (head()) but not in reality (names()). Is something wrong with my code?

# SAMPLE DATA
agyrw <- structure(list(company = c(28, 128, 22, 72, 62, 65, 132, 89, 46, 105), value.1993 = c(79272, 35850, 2124, 32, 0, 0, 0, 26359, 0, 0), value.1994 = c(103974, 10219, 31432, 0, 0, 0, 3997, 469, 0, 0)), .Names = c("company", "value.1993", "value.1994"), row.names = c(9L, 42L, 1L, 30L, 22L, 28L, 51L, 34L, 20L, 40L), class = "data.frame")

agyrw2 <- agyrw     # FOR A LATER COMPARISON

agyrw
#      company value.1993 value.1994
#         28      79272     103974
#        128      35850      10219
#         22       2124      31432
#         72         32          0
#         62          0          0
#         65          0          0
#        132          0       3997
#         89      26359        469
#         46          0          0
#        105          0          0

Clumsy Long Way

# SUM TOTAL VALUE BY YEAR
total.1993 <- sum(agyrw$value.1993)
total.1994 <- sum(agyrw$value.1994)

# CALCULATE THE MARKET SHARE FOR EACH IMPORTER, BY YEAR
agyrw$share.1993 <- agyrw$value.1993 / total.1993
agyrw$share.1994 <- agyrw$value.1994 / total.1994

# FORMAT THE MARKET SHARE VARIABLE TO ONLY FOUR DECIMAL PLACES
agyrw$share.1993 <- format(round(agyrw$share.1993, 4), nsmall = 4)
agyrw$share.1994 <- format(round(agyrw$share.1994, 4), nsmall = 4)

# RECONVERT THE MARKET SHARE VARIABLE BACK INTO NUMERIC
agyrw$share.1993 <- as.numeric(agyrw$share.1993)
agyrw$share.1994 <- as.numeric(agyrw$share.1994)

# VIEW
agyrw
#       company value.1993 value.1994 share.1993 share.1994
#          28      79272     103974     0.5519     0.6927
#         128      35850      10219     0.2496     0.0681
#          22       2124      31432     0.0148     0.2094
#          72         32          0     0.0002     0.0000
#          62          0          0     0.0000     0.0000
#          65          0          0     0.0000     0.0000
#         132          0       3997     0.0000     0.0266
#          89      26359        469     0.1835     0.0031
#          46          0          0     0.0000     0.0000
#         105          0          0     0.0000     0.0000

Parsimonious Attempt

agyrw2$share <- sapply(agyrw2[,2:3], function(x) {
    total <- sum(x)
    share <- as.numeric(format(round(x/total, 4), nsmall = 4))
    return(share)
    }
       )

# VIEW
agyrw2
#      company value.1993 value.1994 share.value.1993 share.value.1994
#          28      79272     103974           0.5519           0.6927
#         128      35850      10219           0.2496           0.0681
#          22       2124      31432           0.0148           0.2094
#          72         32          0           0.0002           0.0000
#          62          0          0           0.0000           0.0000
#          65          0          0           0.0000           0.0000
#         132          0       3997           0.0000           0.0266
#          89      26359        469           0.1835           0.0031
#          46          0          0           0.0000           0.0000
#         105          0          0           0.0000           0.0000

Problem Upon initial inspection, everything looks fine. The results of agyrw2 using sapply on the function look the same as the results of agyrw created by the clumsy code (save for slightly different column names).

But when I try to call any of the newly created variables in agyrw2, they seemingly don't exist, despite showing up when printed out. For example, calling on the column names produces only one agyrw2$share column:

names(agyrw)
#[1] "company"   "value.1993" "value.1994" "share.1993" "share.1994"
names(agyrw2)
#[1] "company"   "value.1993" "value.1994" "share"  

How can I rewrite the function so that it actually produces new columns in the data frame?

coip
  • 1,312
  • 16
  • 30

2 Answers2

1

What about:

agyrw2 <- cbind(agyrw2,do.call(cbind, lapply(agyrw2[,2:3], function(x) {
    total <- sum(x)
    share <- as.numeric(format(round(x/total, 4), nsmall = 4))
    return(share)
    })))

or simply :

agyrw2$share.1993 <- as.numeric(format(round(agyrw2$value.1993 / sum(agyrw2$value.1993), 4), nsmall = 4))
agyrw2$share.1994 <- as.numeric(format(round(agyrw2$value.1994 / sum(agyrw2$value.1994), 4), nsmall = 4))
HubertL
  • 19,246
  • 3
  • 32
  • 51
  • I get the following error with that code: `Error in do.call(cbind, sapply(agyrw2[, 2:3], function(x) { : second argument must be a list` – coip Mar 10 '16 at 19:09
  • try changing `sapply()` by `lapply()` – HubertL Mar 10 '16 at 19:26
  • That works! But a new problem arises; it duplicates variable names: `names(agyrw2)[1] "company" "value.1993" "value.1994" "value.1993" "value.1994"` Is there a way to rename the second `value.1993` and second `value.1994` something else, like `share.1993` / `share.1994` in the same function? Just something to automatically assign the new columns unique names, as there will be around 50 new columns in the actual, full data set, so I'd like to automate it all in the same function. – coip Mar 10 '16 at 19:35
1

The problem is that share is actually a 2-column matrix, not 2 separate columns. The matrix columns are given names value.1993 and value.1994 but it's still a single object.

You can do this kind of thing in base R, but for data munging and transformation, it's better to use one of the packages designed specifically for it.

In dplyr:

library(dplyr)
agyrw %>%
    mutate(share93=value.1993/sum(value.1993), share94=value.1994/sum(value.1994))

And if you have multiple columns:

vars <- names(agyrw[-1])
names(vars) <- paste0(vars, ".share")
agyrw %>% mutate_each_(funs(./sum(.)), vars)

In sqldf:

library(sqldf)
names(agyrw) <- c("company", "value1993", "value1994")  # use syntactically valid SQL names
sqldf("select company, value1993, value1994,
              value1993/sum1993 as share1993,
              value1994/sum1994 as share1994
       from (agyrw join (
             select sum(value1993) as sum1993, sum(value1994) as sum1994 from agyrw))")
Community
  • 1
  • 1
Hong Ooi
  • 56,353
  • 13
  • 134
  • 187
  • The `mutate` code works but is there a way to run it on a ton of columns automatically? The full data set comprises 50 years of data, which is why I tried doing it in a function paired with `sapply` running on a subset--`agyrw2[,2:3]`, which really would be like `agyrw2[,2:51]`--so that I didn't have to clumsily tweak dozens of lines of code. – coip Mar 10 '16 at 19:20