
I am working with a large dataset. My goal is to get the sum of certain events (as coded in the dataset) for particular countries over time. The dataset is so large that I have to load it month by month with a function. The data comes from the GDELT dataset (available here: http://gdelt.utdallas.edu/data/backfiles/?O=D), which has 57 variables. I have converted the CSV files to RData for quicker reading and writing.

# Create empty dataframes for all countries to later store data in.
Countries <- c("MAR","DZA","TUN","LBY","EGY","ISR",
               "JOR","SYR","TUR","GEO","UKR","RUS","BLR")
loadNames <- function(CountryName) {
  a <- data.frame()
  assign(CountryName, a, pos = .GlobalEnv)
}
lapply(Countries,loadNames)

loadMonth <- function(MonthName) {
  pb <- txtProgressBar(min = 0, max = total, initial = 0, char = "=", style = 1, width = 10)

  # Load the month.
  load(paste("/Users/mennoschellekens/Dropbox/HCSS-workinprogress/GDELT/Rdata/",MonthName,".RData", sep = ""), envir=environment())
  colnames(Month) <- names(Header.57)

  # Create a subset of relevant data for faster looping.
  y <- subset(Month, ((Actor1CountryCode == "SYR" | Actor1CountryCode =="MAR" | Actor1CountryCode =="DZA" | Actor1CountryCode == "TUN" | Actor1CountryCode == "LBY" | Actor1CountryCode == "EGY" | Actor1CountryCode == "ISR" | Actor1CountryCode == "JOR" | Actor1CountryCode == "TUR" | Actor1CountryCode == "GEO" | Actor1CountryCode == "UKR" | Actor1CountryCode == "RUS" | Actor1CountryCode == "BLR") & (Actor2CountryCode == "SYR" | Actor2CountryCode == "MAR" | Actor2CountryCode == "DZA" | Actor2CountryCode == "TUN" | Actor2CountryCode == "LBY" | Actor2CountryCode == "EGY" | Actor2CountryCode == "ISR" | Actor2CountryCode == "JOR" | Actor2CountryCode == "TUR" | Actor2CountryCode == "GEO" | Actor2CountryCode == "UKR" | Actor2CountryCode == "RUS" | Actor2CountryCode == "BLR")))

  #Define the events I want. 
  QuadCat <- c(1,2,3,4)

  # Define the countries I want.
  CountryString <- c("MAR","DZA", "TUN","LBY","EGY","ISR",
                     "JOR","SYR","TUR","GEO","UKR","RUS","BLR")
  CountryData <- c(MAR,DZA,TUN,LBY,EGY,ISR,JOR,SYR,TUR,GEO,UKR,RUS,BLR)

  # I want to check the above events for each country, using the function 'Check Events' with an embedded 'for loop'.
  CheckEvents <- function(CountryData,CountryString) {
    x <- subset(y, ((Actor1CountryCode == CountryString) & (Actor2CountryCode == CountryString)))

    # This is the problem:
    for (Y in QuadCat) {
      e[[Y]] <- (sum(x$QuadClass == Y))
      e <- rbind(CountryData,c(e))
      assign(CountryString, as.data.frame(e), pos = .GlobalEnv)
    }
  }
  mapply(CheckEvents, CountryData = CountryData, CountryString = CountryString)    
} ###### END

The first run produces a vector with the four counts, which is what I want, and stores it in the global environment. However, when I try to bind that result to a new result with rbind(), cbind() or merge(), I get very strange results. Most notably, it does not treat CountryData as a vector and only takes the last value in it. I don't understand what it is doing or why it won't bind.

mhschel
  • The object `Month` of which you attempt to take a subset is not defined. – Thomas Jul 26 '13 at 11:41
  • The `loadNames` function also seems unnecessary, or at least inefficient. – Thomas Jul 26 '13 at 11:43
  • Because it is formatted in Rdata, loading it already creates the object 'Month' in my workspace. Therefore, this code works for me without defining the object. – mhschel Jul 26 '13 at 11:55
  • Could you post some data, or at least some similar data? Also, where is `e` defined? – Peyton Jul 26 '13 at 12:16
  • The dataset has 57 variables, so I won't post it here. Since it is publicly available, I put a link to the data above. I have not defined `e`, but `e[[Y]] <- (sum(x$QuadClass == Y))` produces a nice vector, so I didn't think it was necessary. – mhschel Jul 26 '13 at 12:51
  • Please read the [FAQ](http://stackoverflow.com/a/5963610/1412059) on how to provide a minimal reproducible example. – Roland Jul 26 '13 at 13:08

1 Answer

I think the code below will get you what you want, as a list. You can then reshape it however you want. Your example code, however, is incredibly confusing and involves some potentially dangerous elements, like using `assign` and `subset` inside functions. You also seem to have a lot of superfluous code. I imagine these things are what was causing your problems.

Countries <- c("MAR","DZA","TUN","LBY","EGY","ISR",
               "JOR","SYR","TUR","GEO","UKR","RUS","BLR")

loadMonth <- function(MonthName) {
  pb <- txtProgressBar(min = 0, max = total, initial = 0, char = "=", style = 1, width = 10)

  # Load the month
  load(paste("/Users/mennoschellekens/Dropbox/HCSS-workinprogress/GDELT/Rdata/",MonthName,".RData", sep = ""), envir=environment())
  colnames(Month) <- names(Header.57)

  y <- Month[(Month$Actor1CountryCode %in% Countries) &
             (Month$Actor2CountryCode %in% Countries),] # subset
  CheckEvents <- function(CountryString) {
    x <- y[(y$Actor1CountryCode == CountryString) &
           (y$Actor2CountryCode == CountryString),]
    # 1:4 below was your values of QuadCat
    sapply(1:4, function(Y) sum(x$QuadClass == Y)) # should return a vector
  }
  # build list of vectors from `CheckEvents`, one for each country
  out <- lapply(Countries, CheckEvents)
  names(out) <- Countries
  return(out)
}
loadMonth("January") # get the list for one month; not sure how you have months named
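
Because the GDELT files themselves are not included here, the counting logic can be checked on a toy data frame. This is only a sketch under assumptions: `toy`, `toyCountries`, and `countQuad` are made-up names, the values are invented, and the column names are assumed to match the ones used above.

```r
# Toy stand-in for one month of data (invented values, not GDELT data).
toy <- data.frame(
  Actor1CountryCode = c("SYR", "SYR", "EGY", "USA", "SYR", "EGY"),
  Actor2CountryCode = c("SYR", "SYR", "EGY", "SYR", "TUR", "EGY"),
  QuadClass         = c(1, 4, 2, 1, 3, 2),
  stringsAsFactors  = FALSE
)
toyCountries <- c("SYR", "EGY", "TUR")

# Keep rows where both actors are in the country list (same %in% filter as above).
y <- toy[toy$Actor1CountryCode %in% toyCountries &
         toy$Actor2CountryCode %in% toyCountries, ]

# Hypothetical helper: count events per QuadClass (1 to 4) within one country.
countQuad <- function(country) {
  x <- y[y$Actor1CountryCode == country & y$Actor2CountryCode == country, ]
  sapply(1:4, function(q) sum(x$QuadClass == q))
}

# One vector of four counts per country, returned as a named list.
toyOut <- lapply(toyCountries, countQuad)
names(toyOut) <- toyCountries
```

Each element of `toyOut` is a vector of four counts, and a country with no matching rows simply gets four zeros. Nothing is written to the global environment; the function returns its result instead.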
Thomas
  • That is very helpful, thank you! I do have a follow-up question. How can I save the vectors per country, with each vector (one per month) becoming a row? I would like to end up with an object 'SYR' with 162 rows (for my 162 months). – mhschel Jul 26 '13 at 18:09
  • Try something like `do.call(rbind, object)` where `object` is the output of your `loadMonth()` function. – Thomas Jul 26 '13 at 18:11
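
One way to get a per-country object with one row per month is to collect the monthly lists first and then bind them row-wise for each country. The sketch below assumes every monthly result is a named list of length-4 vectors, as `loadMonth()` returns above; `fakeMonth`, `monthNames`, and the `Quad1` to `Quad4` column labels are hypothetical stand-ins so that the example runs without the data.

```r
# Hypothetical stand-in for loadMonth(): a named list with one
# length-4 vector of counts per country.
fakeMonth <- function(i) {
  list(SYR = c(i, 0, 0, 1), EGY = c(0, i, 2, 0))
}

monthNames <- c("m1", "m2", "m3")   # assumed month labels
monthly <- lapply(seq_along(monthNames), fakeMonth)
names(monthly) <- monthNames

# For each country, pull its vector out of every month and stack them as rows.
countries <- names(monthly[[1]])
byCountry <- lapply(countries, function(cc) {
  m <- do.call(rbind, lapply(monthly, `[[`, cc))
  colnames(m) <- paste0("Quad", 1:4)
  m
})
names(byCountry) <- countries

byCountry$SYR   # matrix with one row per month and one column per QuadClass
```

With the real data, something like `monthly <- lapply(monthNames, loadMonth)` would presumably take the place of `fakeMonth`, provided `monthNames` holds the names of the saved monthly files. Wrapping a result in `as.data.frame()` converts the matrix to a data frame if that is preferred.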