Hierarchical data aggregation and manipulation

Question

I am a novice to r and I am trying to do deal with some inconsistencies in my data. My problem is twofold, the first part could be of general interest and it is about how to aggregate data that is classified in a vector with multiple levels of aggregation. The second problem is more closely related to my coding issues and it is about performing some specific operations for my data.

I am looking at exports data of hundreds of countries over a period of two decades. The issue is that my data on exports are classified by product and sub-product categories (hundreds), in an inconsistent manner and I am trying to deal with these discrepancies.

The data looks roughly like this:

df <- data.frame(
"Reporter" = c("USA", "USA", "USA", "USA", "USA", "USA","USA","EU", "EU","EU", "EU", "EU", "EU", "EU", "EU"),
"Partner" = c( "EU", "EU","EU","EU", "EU","EU","EU","USA", "USA", "USA","USA","USA", "USA","USA", "USA"), 
"Product cat." = c("1", "1.1", "1.2","2", "2.1", "2.2","3","1", "1.1","2", "2.1", "2.2","3","3.1", "3.2"), 
"Year" = c(1970, 1970, 1970, 1970, 1970, 1970,1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970), 
"Val" = c(200, 170, 30, 100, 50, 40,  220, 230, 180, 80, 50, 20, 170, 40, 130), 
stringsAsFactors = FALSE)

Product category 1.1 (eg. apple) and 1.2 (e.g. bananas) are sub-product categories of product category 1 (e.g. fruit). Product category 2.1 and 2.2 are subcategories of product cat. 2 and so on.

My ultimate goals are the following: First, the "value" of sub-product categories should always equate to the higher product category value. It is the case of USA export to the EU, product cat 1.1 (val=170) and 1.2 (val=30) aggregate to the level of product cat 1 (val=200). However, this is often not the case. For instance, in the case of USA export to the EU, the value of product cat 2.1 (val=50) and 2.2 (val=40) is smaller than product cat 2 (val=100). To deal with this issue I need to create a new sub-product cat. Ideally, this would (automatically) combine the beginning of the name of the product cat with a K (hence 2.K). This should be given a value of the difference between product cat 2 and its sub-product cat 2.1 and 2.2 (2.K= 100-(50+40) = 10). Also, I would like to apply the same approach to cases where I lack data on one of the subproduct categories. An example is in the exports from EU to USA where there are only values for product cat 1 and sub-product cat 1.1 and no information on cat 1.2. Ideally, I would create a new product cat (1.K) with the value of the difference between product cat 1 (val=230) and its sub-product cat 1.1(val = 180). Hence, the value of 1.k would be 230-180 = 50.

The second problem is that in some cases I do not have data on the sub-product categories, but I only have data at the aggregate level. As in the case of USA export to EU product cat 3 (that has no sub-categories). I would like to create a new sub-product cat a new that combines the beginning of the product cat with an M (hence 2.M) and incorporates the value at the product category level that is not reported in the subcategory level. Hence, for instance in the case of USA export to EU product cat 3 (220), 3.M = 220.

As mentioned, I think that there are two steps to deal with my coding issues. The first is on how to aggregate data that is hierarchical (to note that in my actual data I have three, not two, sub-product level (e.g. 1 food, 1.1fruit, 1.1.1 apples). Ideally, I would prefer avoiding creating new columns as my dataset involves hundreds of product categories. The second part is about performing the specific operations described above: 1) creating a new category with the difference between the father and child nodes, 2) creating fictitious child nodes. I would be really thankful to anyone that could help me with this as is key for the development of my paper.

I do realize it is a complex question, but also partial answers are very welcomed.

I thank you all in advance for your help

==============

Thank you a LAP lot for your help, Here is the problem I face with the real data after applying the function

split2 <- lapply(split1, function(x){
y <- rbind.data.frame(x, x[1,])
y[nrow(y), "Product.cat."] <- paste0(y[nrow(y), "Prodcat2"], "k")
y[nrow(y), "Val"] <- x[1, "Val"] - sum(x[2:nrow(x), "Val"])
return(y)
})

and the funtion split3 <- do.call(rbind, split2)

and here are the dput of the head of the two splits

>dput(Headsplit2)
list(`Algeria.United Arab Emirates.05` = structure(list(Reporter = 
c("Algeria", 
"Algeria", "Algeria", "Algeria"), Partner = c("United Arab Emirates", 
"United Arab Emirates", "United Arab Emirates", "United Arab 
Emirates"
), Year = c(2001L, 2001L, 2001L, 2001L), Product.cat. = c("05", 
"052", "054", "05k"), `Commodity Description` = c("Fruit and 
vegetables", 
"Dried fruit including artificially dehydrated", "Vegetables, roots & 
tubers, fresh or dried", 
"Fruit and vegetables"), `Trade Value` = 
structure(c(7.61814641291993e-319, 
7.4539189922423e-319, 1.64178014113046e-320, 7.61814641291993e-319
), class = "integer64"), Prodcat1 = c("0", "0", "0", "0"), Prodcat2 = 
c("05", 
"05", "05", "05")), row.names = c(NA, -4L), vars = c("Reporter", 
"Partner", "Prodcat2", "Year"), drop = TRUE, indices = list(0:2), 
group_sizes = 3L, biggest_group_size = 3L, labels = structure(list(
Reporter = "Algeria", Partner = "United Arab Emirates", Prodcat2 = 
"05", 
Year = 2001L), row.names = c(NA, -1L), class = "data.frame", vars = 
c("Reporter", 
"Partner", "Prodcat2", "Year"), drop = TRUE), class = c("grouped_df", 
"tbl_df", "tbl", "data.frame")), `Algeria.United Kingdom.05` = 
structure(list(
Reporter = c("Algeria", "Algeria", "Algeria", "Algeria"), 
Partner = c("United Kingdom", "United Kingdom", "United Kingdom", 
"United Kingdom"), Year = c(2001L, 2001L, 2001L, 2001L), 
Product.cat. = c("05", "053", "054", "05k"), `Commodity Description` 
= c("Fruit and vegetables", 
"Fruit,preserved and fruit preparations", "Vegetables, roots & 
tubers, fresh or dried", 
"Fruit and vegetables"), `Trade Value` = 
structure(c(6.99399328252869e-320, 
3.16547859290487e-320, 3.82802062397798e-320, 6.99399328252869e-320
), class = "integer64"), Prodcat1 = c("0", "0", "0", "0"), 
Prodcat2 = c("05", "05", "05", "05")), row.names = c(NA, 
-4L), vars = c("Reporter", "Partner", "Prodcat2", "Year"), drop = 
TRUE, indices = list(
0:2), group_sizes = 3L, biggest_group_size = 3L, labels = 
structure(list(
Reporter = "Algeria", Partner = "United Kingdom", Prodcat2 = "05", 
Year = 2001L), row.names = c(NA, -1L), class = "data.frame", vars = 
c("Reporter", 
"Partner", "Prodcat2", "Year"), drop = TRUE), class = c("grouped_df", 
"tbl_df", "tbl", "data.frame")), Hungary.Austria.26 = structure(list(
Reporter = c("Hungary", "Hungary", "Hungary", "Hungary", 
"Hungary", "Hungary", "Hungary", "Hungary", "Hungary"), Partner = 
c("Austria", 
"Austria", "Austria", "Austria", "Austria", "Austria", "Austria", 
"Austria", "Austria"), Year = c(2000L, 2001L, 2000L, 2000L, 
2001L, 2000L, 2000L, 2001L, 2000L), Product.cat. = c("26", 
"26", "263", "265", "265", "266", "267", "267", "26k"), `Commodity 
Description` = c("Textile fibres, not manufactured, and waste", 
"Textile fibres, not manufactured, and waste", "Cotton", 
"Vegetable fibres,except cotton and jute", "Vegetable fibres,except 
cotton and jute", 
"Synthetic and regenerated artificial fibres", "Waste materials from 
textile fabrics, incl.rags", 
"Waste materials from textile fabrics, incl.rags", "Textile fibres, 
not manufactured, and waste"
), `Trade Value` = structure(c(7.3714594359514e-318, 
9.95542276370112e-318, 
4.94065645841247e-320, 2.96439387504748e-320, 6.91691904177745e-320, 
2.32210853545386e-319, 6.33886223614319e-318, 9.60957681161225e-318, 
7.3714594359514e-318), class = "integer64"), Prodcat1 = c("2", 
"2", "2", "2", "2", "2", "2", "2", "2"), Prodcat2 = c("26", 
"26", "26", "26", "26", "26", "26", "26", "26")), row.names = c(NA, 
-9L), vars = c("Reporter", "Partner", "Prodcat2", "Year"), drop = 
TRUE, indices = list(
c(0L, 2L, 3L, 5L, 6L), c(1L, 4L, 7L)), group_sizes = c(5L, 
3L), biggest_group_size = 5L, labels = structure(list(Reporter = 
c("Hungary", 
"Hungary"), Partner = c("Austria", "Austria"), Prodcat2 = c("26", 
"26"), Year = 2000:2001), row.names = c(NA, -2L), class = 
"data.frame", vars = c("Reporter", 
"Partner", "Prodcat2", "Year"), drop = TRUE), class = c("grouped_df", 
"tbl_df", "tbl", "data.frame")), Hungary.Belgium.26 = structure(list(
Reporter = c("Hungary", "Hungary", "Hungary", "Hungary", 
"Hungary", "Hungary", "Hungary", "Hungary", "Hungary"), Partner = 
c("Belgium", 
"Belgium", "Belgium", "Belgium", "Belgium", "Belgium", "Belgium", 
"Belgium", "Belgium"), Year = c(2000L, 2001L, 2000L, 2001L, 
2000L, 2001L, 2000L, 2001L, 2000L), Product.cat. = c("26", 
"26", "265", "265", "266", "266", "267", "267", "26k"), `Commodity 
Description` = c("Textile fibres, not manufactured, and waste", 
"Textile fibres, not manufactured, and waste", "Vegetable 
fibres,except cotton and jute", 
"Vegetable fibres,except cotton and jute", "Synthetic and regenerated 
artificial fibres", 
"Synthetic and regenerated artificial fibres", "Waste materials from 
textile fabrics, incl.rags", 
"Waste materials from textile fabrics, incl.rags", "Textile fibres, 
 not manufactured, and waste"
 ), `Trade Value` = structure(c(3.41893426922143e-318, 
7.98410083679454e-318, 
3.95252516672997e-320, 9.73309322307256e-319, 1.67488253940183e-318, 
1.665001226485e-318, 8.49792910846944e-319, 7.70742407512345e-319, 
3.41893426922143e-318), class = "integer64"), Prodcat1 = c("2", 
"2", "2", "2", "2", "2", "2", "2", "2"), Prodcat2 = c("26", 
"26", "26", "26", "26", "26", "26", "26", "26")), row.names = c(NA, 
-9L), vars = c("Reporter", "Partner", "Prodcat2", "Year"), drop = 
TRUE, indices = list(
c(0L, 2L, 4L, 6L), c(1L, 3L, 5L, 7L)), group_sizes = c(4L, 
4L), biggest_group_size = 4L, labels = structure(list(Reporter = 
c("Hungary", 
"Hungary"), Partner = c("Belgium", "Belgium"), Prodcat2 = c("26", 
"26"), Year = 2000:2001), row.names = c(NA, -2L), class = 
 "data.frame", vars = c("Reporter", 
"Partner", "Prodcat2", "Year"), drop = TRUE), class = c("grouped_df", 
"tbl_df", "tbl", "data.frame")), Hungary.Bulgaria.26 = 
structure(list(
Reporter = c("Hungary", "Hungary", "Hungary", "Hungary", 
"Hungary", "Hungary"), Partner = c("Bulgaria", "Bulgaria", 
"Bulgaria", "Bulgaria", "Bulgaria", "Bulgaria"), Year = c(2000L, 
2001L, 2000L, 2001L, 2000L, 2000L), Product.cat. = c("26", 
"26", "266", "266", "267", "26k"), `Commodity Description` = 
c("Textile fibres, not manufactured, and waste", 
"Textile fibres, not manufactured, and waste", "Synthetic and 
regenerated artificial fibres", 
"Synthetic and regenerated artificial fibres", "Waste materials from 
textile fabrics, incl.rags", 
"Textile fibres, not manufactured, and waste"), `Trade Value` = 
structure(c(1.55136612794151e-318, 
1.53160350210786e-319, 4.94065645841247e-321, 4.94065645841247e-321, 
2.96439387504748e-320, 1.55136612794151e-318), class = "integer64"), 
Prodcat1 = c("2", "2", "2", "2", "2", "2"), Prodcat2 = c("26", 
"26", "26", "26", "26", "26")), row.names = c(NA, -6L), vars = 
c("Reporter", 
"Partner", "Prodcat2", "Year"), drop = TRUE, indices = list(c(0L, 
 2L, 4L), c(1L, 3L)), group_sizes = 3:2, biggest_group_size = 3L, 
labels = structure(list(
Reporter = c("Hungary", "Hungary"), Partner = c("Bulgaria", 
"Bulgaria"), Prodcat2 = c("26", "26"), Year = 2000:2001), row.names = 
c(NA, 
-2L), class = "data.frame", vars = c("Reporter", "Partner", 
"Prodcat2", 
"Year"), drop = TRUE), class = c("grouped_df", "tbl_df", "tbl", 
"data.frame")), Hungary.Canada.26 = structure(list(Reporter = 
 c("Hungary", 
 "Hungary", "Hungary"), Partner = c("Canada", "Canada", "Canada"
 ), Year = c(2001L, 2001L, 2001L), Product.cat. = c("26", "265", 
 "26k"), `Commodity Description` = c("Textile fibres, not 
 manufactured, and waste", 
 "Vegetable fibres,except cotton and jute", "Textile fibres, not 
 manufactured, and waste"
 ), `Trade Value` = structure(c(8.89318162514244e-320, 
 6.4228533959362e-320, 
 8.89318162514244e-320), class = "integer64"), Prodcat1 = c("2", 
 "2", "2"), Prodcat2 = c("26", "26", "26")), row.names = c(NA, 
 -3L), vars = c("Reporter", "Partner", "Prodcat2", "Year"), drop = 
 TRUE, indices = list(
 0:1), group_sizes = 2L, biggest_group_size = 2L, labels = 
structure(list(
Reporter = "Hungary", Partner = "Canada", Prodcat2 = "26", 
Year = 2001L), row.names = c(NA, -1L), class = "data.frame", vars = 
c("Reporter", 
"Partner", "Prodcat2", "Year"), drop = TRUE), class = c("grouped_df", 
"tbl_df", "tbl", "data.frame")))

And of split 3

dput(Headsplit3)

structure(list(Reporter = c("Algeria", "Algeria", "Algeria", 
"Algeria", "Algeria", "Algeria"), Partner = c("United Arab Emirates", 
"United Arab Emirates", "United Arab Emirates", "United Arab 
Emirates", 
"United Kingdom", "United Kingdom"), Year = c(2001L, 2001L, 2001L, 
2001L, 2001L, 2001L), Product.cat. = c("05", "052", "054", "05k", 
"05", "053"), `Commodity Description` = c("Fruit and vegetables", 
"Dried fruit including artificially dehydrated", "Vegetables, roots & 
tubers, fresh or dried", 
"Fruit and vegetables", "Fruit and vegetables", "Fruit,preserved and 
fruit preparations"
), `Trade Value` = structure(c(7.61814641291993e-319, 
7.4539189922423e-319, 
1.64178014113046e-320, 7.61814641291993e-319, 6.99399328252869e-320, 
3.16547859290487e-320), class = "integer64"), Prodcat1 = c("0", 
"0", "0", "0", "0", "0"), Prodcat2 = c("05", "05", "05", "05", 
"05", "05")), row.names = c(NA, -6L), class = c("grouped_df", 
"tbl_df", "tbl", "data.frame"), vars = c("Reporter", "Partner", 
"Prodcat2", "Year"), drop = TRUE, indices = list(0:3, 4:5), 
group_sizes = c(4L, 
2L), biggest_group_size = 4L, labels = structure(list(Reporter = 
c("Algeria", 
"Algeria"), Partner = c("United Arab Emirates", "United Kingdom"
), Prodcat2 = c("05", "05"), Year = c(2001L, 2001L)), row.names = 
c(NA, 
-2L), class = "data.frame", vars = c("Reporter", "Partner", 
"Prodcat2", 
"Year"), drop = TRUE))

As you can see the code is able to identify that Algeria exports of 052 and 054 to the United Arab Emirates do not add up the exports of 05 - (the difference is only 1) and it does correctly creates a variable of 05k, yet the traded value of 05k is 154193 (= to the traded value of the whole 05) rather than being 1. Do you know why this could be the case?

LAP · Accepted Answer · 2019-01-08T07:45:20.970

1

Edit: Alright, I think I got it!

Data:

df <- data.frame( "Reporter" = c("USA", "USA", "USA", "USA", "USA", "USA","USA", "USA", "USA","USA"), 
                  "Partner" = c( "EU", "EU","EU","EU", "EU","EU","EU", "EU","EU","EU"), 
                  "Product cat." = c("1", "11","111", "12","2", "21", "211", "212", "22", "3"), 
                  "Val" = c(200, 170, 170, 30, 100, 50, 25, 5, 40, 220), stringsAsFactors = FALSE)

We begin by creating two auxiliary variables Prodcat1 and Prodcat2:

# create new variable Prodcat1 
df1 <- df %>% group_by(Reporter, Partner) %>% mutate(Prodcat1 = str_extract(Product.cat., "^.{1}")) 

# create new variable Prodcat2 for my 2nd level product category 
df1 <- df1 %>% group_by(Reporter, Partner) %>% mutate(Prodcat2 = str_extract(Product.cat., "^.{2}"))

Now we split the data in two parts, one to be completed, and one that does not need any operation on the third level:

# to be completed
df2 <- df1 %>%
  group_by(Reporter, Partner, Prodcat2) %>%
  filter(sum(Val[2:n()]) < Val[1])

# no operation on third level
df3 <- df1 %>%
  group_by(Reporter, Partner, Prodcat2) %>%
  filter(!sum(Val[2:n()]) < Val[1] | n() == 1)

We split df2 by Prodcat2, controlling for Reporter and Partner

split1 <- split(df2, interaction(df2$Reporter, df2$Partner, df2$Prodcat2))
split1 <- split1[sapply(split1, nrow) != 0]

and add a new row where necessary:

split2 <- lapply(split1, function(x){
  y <- rbind.data.frame(x, x[1,])
  y[nrow(y), "Product.cat."] <- paste0(y[nrow(y), "Prodcat2"], "k")
  y[nrow(y), "Val"] <- x[1, "Val"] - sum(x[2:nrow(x), "Val"])
  return(y)
})

Then we put the data back together for the first time and sort it by the original Product.cat..

split3 <- do.call(rbind, split2)
newdf <- do.call(rbind, list(split3, df3))

newdf <- newdf %>%
  arrange(Product.cat.)

Data so far:

# A tibble: 11 x 6
# Groups:   Reporter, Partner, Prodcat2 [5]
   Reporter Partner Product.cat.   Val Prodcat1 Prodcat2
   <chr>    <chr>   <chr>        <dbl> <chr>    <chr>   
 1 USA      EU      1              200 1        NA      
 2 USA      EU      11             170 1        11      
 3 USA      EU      111            170 1        11      
 4 USA      EU      12              30 1        12      
 5 USA      EU      2              100 2        NA      
 6 USA      EU      21              50 2        21      
 7 USA      EU      211             25 2        21      
 8 USA      EU      212              5 2        21      
 9 USA      EU      21k             20 2        21      
10 USA      EU      22              40 2        22      
11 USA      EU      3              220 3        NA

Now we go on to the second level. First we create three parts:

# part to complete
df4 <- newdf %>%
  group_by(Reporter, Partner, Prodcat1) %>%
  filter(nchar(Product.cat.) < 3) %>%
  filter(n() == 1 | sum(Val[2:n()]) < Val[1])

# third level rows, which are not necessary here
df5 <- newdf %>%
  group_by(Reporter, Partner, Prodcat1) %>%
  filter(nchar(Product.cat.) == 3)

# second level part already complete
df6 <- newdf %>%
  group_by(Reporter, Partner, Prodcat1) %>%
  filter(nchar(Product.cat.) < 3) %>%
  filter(sum(Val[2:n()]) == Val[1])

We again split the data, now by Prodcat1, controlling for Reporter and Partner:

split3 <- split(df4, interaction(df4$Reporter, df4$Partner, df4$Prodcat1))
split3 <- split3[sapply(split3, nrow) != 0]

We create the new rows:

split4 <- lapply(split3, function(x){
  if(nrow(x) == 1){
    y <- rbind.data.frame(x, x)
    y[2, "Product.cat."] <- paste0(y[2, "Prodcat1"], "m")
  }else{
    y <- rbind.data.frame(x, x[1,])
    y[nrow(y), "Product.cat."] <- paste0(y[nrow(y), "Prodcat1"], "k")
    y[nrow(y), "Val"] <- x[1, "Val"] - sum(x[2:nrow(x), "Val"])
  }
  return(y)
})

and we stick it all back together, sort it again and drop the auxiliary variables.

split5 <- do.call(rbind, split4)
finaldf <- do.call(rbind, list(split5, df5, df6))

finaldf <- finaldf %>%
  ungroup() %>%
  arrange(Product.cat.) %>%
  select(-c("Prodcat1", "Prodcat2"))

Final data:

# A tibble: 13 x 4
   Reporter Partner Product.cat.   Val
   <chr>    <chr>   <chr>        <dbl>
 1 USA      EU      1              200
 2 USA      EU      11             170
 3 USA      EU      111            170
 4 USA      EU      12              30
 5 USA      EU      2              100
 6 USA      EU      21              50
 7 USA      EU      211             25
 8 USA      EU      212              5
 9 USA      EU      21k             20
10 USA      EU      22              40
11 USA      EU      2k              10
12 USA      EU      3              220
13 USA      EU      3m             220

At last, we clear our environment of all the temporary objects we needed

rm(df1, df2, df3, df4, df5, df6, newdf, split1, split2, split3, split4, split5)

which leaves us with the original dataset df and the final, completed dataset finaldata :)

edited Jan 08 '19 at 07:45

answered Jan 07 '19 at 14:26

LAP

6,605
2
15
28

Thanks really a lot!!!! it really looks like it works properly for the example :) !!! I try it immediately on the original data and get back to you if I see there are some issues! – Alex Jan 07 '19 at 14:34
Yes, you did indeed give me a lot of feed for though! but In the real data, I am struggling with the three levels. I am considering one level at the time, but I am still having trouble with the filtering and split function. consider this data: `df <- data.frame( "Reporter" = c("USA", "USA", "USA", "USA", "USA", "USA","USA", "USA", "USA","USA"), "Partner" = c( "EU", "EU","EU","EU", "EU","EU","EU", "EU","EU","EU"), "Product cat." = c("1", "11","111", "12","2", "21", "211", "212", "22", "3"), "Val" = c(200, 170, 170, 30, 100, 50, 25, 5, 40, 220), stringsAsFactors = FALSE)` – Alex Jan 07 '19 at 18:36
Then I try to consider differences in one level at the time. The final result should differentiate between 21k which is the difference between 21 and 211 + 212, and 2kk which is the difference between 2 and 211+212+21k. to do so I consider one level at the time `# create new variable Prodcat1 df1 <- df %>% group_by(Reporter, Partner) %>% mutate(Prodcat1 = str_extract(Product.cat., "^.{1}")) # create new variable Prodcat2 for my 2nd level product category df2 <- df1 %>% group_by(Reporter, Partner) %>% mutate(Prodcat2 = str_extract(Product.cat., "^.{2}"))`. – Alex Jan 07 '19 at 18:44
the problem emerge as I start to filter indeed the function code `df2.E <- df2 %>% group_by(Reporter, Partner, Prodcat2) %>% filter(n() == 2 | sum(Val[2:n()]) < Val[1])` gives me a strange set of rows, not the one I need to modify. moreover as I move on to split `split1 <- split(df2.E, interaction(df2$Reporter, df2$Partner, df2$Prodcat2))` I get an error message: as the lenght of my data is not a multiple of my variable of split. do you know by any chance what I am doing wrong? – Alex Jan 07 '19 at 19:08
I'll try to take a look at it today! – LAP Jan 08 '19 at 06:45
1

I think I made it work. Hope it also does on your full dataset! – LAP Jan 08 '19 at 07:45
Thanks a lot! you're a genius! It really looks as if you resolved the issue, I will work on it and let you know! – Alex Jan 08 '19 at 10:59
Hello, your code is really useful, and I was able to make clear progress. There are two things however that are still unclear (1) I am able to use the code just fine until the function. `split2 <- lapply(split1, function(x){ y <- rbind.data.frame(x, x[1,]) y[nrow(y), "Product.cat."] <- paste0(y[nrow(y), "Prodcat2"], "k") y[nrow(y), "Val"] <- x[1, "Val"] - sum(x[2:nrow(x), "Val"]) return(y) })` when I apply this to the real data, (that includes multiple countries and years - i.e. when I split df2 by Prodcat2, I control for Reporter and Partner and *Year*) I have troubles – Alex Jan 08 '19 at 14:10
What kind of trouble? – LAP Jan 08 '19 at 14:17
Basically when looking at `split3 <- do.call(rbind, split2)` I do individuate all the sub-categories that are incomplete, yet the value of the new product.cat **nk** is not the difference between value of the product.cat **nn** and *nnn* but simply reports the value of *nn*. Do you know why this may be the case? I am not sure what has gone wrong. – Alex Jan 08 '19 at 14:18
(2) the second is a minor concern regards reporting the **m** - i.e. the export reported only at higher levels - according to how many levels are not reported. Basically I have a whole set of categories, such as the one below `Comlist <- c("1", "11", "111", "112", "12","2", "21","211", "22","221", "3","31", "32", "311", "321") Comlist <- as.data.frame(Comlist)`, for which I want to control whether the data is reported only at higher levels. – Alex Jan 08 '19 at 14:32
What I would like is that, for instance in the case of the export between USA and EU, productcat 22, should become **22m** because I know a category 221 exist and it is not reported. Similarly, for category 3, that it is not reported in for **two** of its lower levels, I would like it to be **3mm**. and dealing , is that in the case of the second level – Alex Jan 08 '19 at 14:32
The main concern, however, rests in (1) hence, why the function does not report the subtracted value, but the total value of **nn**. If you have any insight on this it would really be of great use. – Alex Jan 08 '19 at 14:35
I'll take a look at it tomorrow! – LAP Jan 08 '19 at 15:24
Thanks a lot! you are really helping me out! – Alex Jan 08 '19 at 19:15
@Alessandro for the first problem: Could you extract an example from `split2` where the problem with **nk** occurs and put the data into your question (using `dput()` if possible)? – LAP Jan 09 '19 at 07:20
I have extracted the heads of split2 and split3 in `dput()` i think this is what you were asking for, please do let me know if there is anything else you need, and thanks again for your help! – Alex Jan 09 '19 at 11:50
Hey, are the values you work with always this small? I'm seeing a `Trade Value` of ~7-62e-319 for 05. In this region, you may be running into floating point errors. See here: https://stackoverflow.com/questions/9508518/why-are-these-numbers-not-equal – LAP Jan 09 '19 at 12:36
Uhm, this is a little mysterious. Because I see what you say: in the code the traded value is ~7-62e-319, but when I plug the `dput` code in r for split3 I actually see as traded value **154193** which is the actual traded value has nothing to do with the ~7-62e-319 that is reported in the code... the same holds for `dput(Headsplit2)`.. So I am not sure what could it be.. the actual values are integer – Alex Jan 09 '19 at 16:17
Do I need any specific package installed to correctly show the `integer 64` class? – LAP Jan 10 '19 at 08:07
Sorry for the late reply, I had an emergency and I was not able to access my computer, I use `library(bit64)` does this fixes the problem? – Alex Jan 14 '19 at 14:11
Dear LAP, did you had some time to see whether the use of `library(bit64)` solved your problem? I would be really full of gratitude if you could help me with this – Alex Jan 18 '19 at 16:15

Hierarchical data aggregation and manipulation

1 Answers1