I am working with a large dataset. My goal is to get the sum of certain events (as coded in the dataset) for particular countries over time. The dataset is so large that I have to load it month by month with a function. The data comes from the GDELT dataset (http://gdelt.utdallas.edu/data/backfiles/?O=D). I have converted the CSV files to .RData for quicker reading and writing. The dataset has 57 variables.
# Create empty dataframes for all countries to later store data in.
Countries <- c("MAR", "DZA", "TUN", "LBY", "EGY", "ISR",
               "JOR", "SYR", "TUR", "GEO", "UKR", "RUS", "BLR")
loadNames <- function(CountryName) {
  a <- data.frame()
  assign(CountryName, a, pos = .GlobalEnv)
}
lapply(Countries, loadNames)
loadMonth <- function(MonthName) {
  pb <- txtProgressBar(min = 0, max = total, initial = 0, char = "=", style = 1, width = 10)
  # Load the month.
  load(paste("/Users/mennoschellekens/Dropbox/HCSS-workinprogress/GDELT/Rdata/", MonthName, ".RData", sep = ""), envir = environment())
  colnames(Month) <- names(Header.57)
  # Create a subset of relevant data for faster looping:
  # keep rows where both actors are in the Countries vector defined at the top.
  y <- subset(Month, Actor1CountryCode %in% Countries & Actor2CountryCode %in% Countries)
  # Define the events I want.
  QuadCat <- c(1, 2, 3, 4)
  # Define the countries I want.
  CountryString <- c("MAR", "DZA", "TUN", "LBY", "EGY", "ISR",
                     "JOR", "SYR", "TUR", "GEO", "UKR", "RUS", "BLR")
  CountryData <- c(MAR, DZA, TUN, LBY, EGY, ISR, JOR, SYR, TUR, GEO, UKR, RUS, BLR)
  # I want to check the above events for each country, using the function 'CheckEvents' with an embedded 'for' loop.
  CheckEvents <- function(CountryData, CountryString) {
    x <- subset(y, ((Actor1CountryCode == CountryString) & (Actor2CountryCode == CountryString)))
    # This is the problem:
    for (Y in QuadCat) {
      e[[Y]] <- (sum(x$QuadClass == Y))
      e <- rbind(CountryData, c(e))
      assign(CountryString, as.data.frame(e), pos = .GlobalEnv)
    }
  }
  mapply(CheckEvents, CountryData = CountryData, CountryString = CountryString)
} ###### END
The first run produces a vector with the four counts, which is what I want, and it is then stored in the global environment. However, when I try to bind that result to a new result with rbind(), cbind() or merge(), I get very strange results. Most notably, it refuses to read CountryData as a vector and only takes the last value in the vector. I don't understand what it is doing or why it won't bind.
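To narrow this down, here is a minimal sketch with toy data that might be related to what I am seeing. The names df_a, df_b, flat and kept are made up for illustration and do not come from my real data. It only shows how c() and list() treat data frames differently; I am not certain this is the whole cause of the problem.

```r
# Toy stand-ins for two of the per-country data frames (hypothetical values).
df_a <- data.frame(q1 = 1, q2 = 2)
df_b <- data.frame(q1 = 3, q2 = 4)

# c() does not build a "vector of data frames": it concatenates their
# columns into one flat list, so the data frame boundaries are lost.
flat <- c(df_a, df_b)
print(length(flat))   # one element per column, not one per data frame
print(class(flat))

# list() keeps each data frame as a single element.
kept <- list(df_a, df_b)
print(length(kept))
print(class(kept[[1]]))
```

If this is indeed what happens inside loadMonth, then mapply() would be iterating over individual columns of CountryData rather than over whole country data frames, which would at least be consistent with the odd binding results.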