This is the second time I'm trying to formulate a question here - hopefully this time I'll make my self clear and in line with recommendations on this site. To the problem: I've a dataset on certain companies and their headquarters. The structure of the data comes a bit messy to me (Please see the link below) - even more problematic is that I've the data on 15 separate for the years 2003, 2007, 2011, 2015 and 2019 (three csv files for each year because of the size I guess).
For the purpose of this question I've merged three files into one (for the year 2003).
Now, what I want is this: 1) merge all the 15 files and from there 2) generate a set of variables that would indicate the total number of companies per country and year [note though that the year variable is not included as a variable].
Since I've the data on four main addresses, I'd like to create separate "sum variables" based on the order (1, 2, 3, 4) and, in addition, one variable that doesn't take into account the order of countries.
Just to give an example of how I'd like it to look like:
country year total_c1 total_c2 ...
USA 2003 100 100
USA 2007 150 120
CAN 2003 50 50
CAN 2007 100 60
I intend to merge this data with a panel data that I have (country-year data).
Please click on the link to access the data. Data sample for 2003. The first variable indicates the ID of companies. The second (country_1) means country of first address. The third (country_2) means country of second address and so on. After that, comes a bunch of variables (over 2800) indicating a single company in the dataset.
Now, what I've come up with in my attempt to do this in R (rather than doing manually). Credit to @Duck in helping me with the merging part.
myfun <- function(df)
{
#Code
new <- df %>%
pivot_longer(starts_with('country')) %>%
group_by(name) %>%
summarise_all(sum,na.rm=T)
return(new)
}
#Load files
myfiles <- list.files(pattern = '.csv')
#List of files
L <- lapply(myfiles, read.csv)
#Apply function
L <- lapply(L,myfun)
# turn to a df
df <- as.data.frame(L)
But this didn't work out for me since I couldn't figure out which year the data come from. Instead I merged the files for one year (for example 2003) and tried to create the variables I want by running this:
df2<- df %>%
mutate(Total_c1 = select(., A2654:U9340) %>% rowSums(na.rm = TRUE))
df3<–df2 %>% group_by(country_1) %>%
summarise(Total_c1=sum(Total_c1,na.rm = T)
And here I'm stuck. Any suggestion that can take me forward from here (and start from the right side) would be much appreciated!