How to simultaneously merge multiple csv files and summarize several variables per group

Question

This is the second time I'm trying to formulate a question here - hopefully this time I'll make my self clear and in line with recommendations on this site. To the problem: I've a dataset on certain companies and their headquarters. The structure of the data comes a bit messy to me (Please see the link below) - even more problematic is that I've the data on 15 separate for the years 2003, 2007, 2011, 2015 and 2019 (three csv files for each year because of the size I guess).

For the purpose of this question I've merged three files into one (for the year 2003).

Now, what I want is this: 1) merge all the 15 files and from there 2) generate a set of variables that would indicate the total number of companies per country and year [note though that the year variable is not included as a variable].

Since I've the data on four main addresses, I'd like to create separate "sum variables" based on the order (1, 2, 3, 4) and, in addition, one variable that doesn't take into account the order of countries.

Just to give an example of how I'd like it to look like:

country year    total_c1    total_c2   ...
USA     2003        100         100
USA     2007        150         120
CAN     2003        50          50
CAN     2007        100         60

I intend to merge this data with a panel data that I have (country-year data).

Please click on the link to access the data. Data sample for 2003. The first variable indicates the ID of companies. The second (country_1) means country of first address. The third (country_2) means country of second address and so on. After that, comes a bunch of variables (over 2800) indicating a single company in the dataset.

Now, what I've come up with in my attempt to do this in R (rather than doing manually). Credit to @Duck in helping me with the merging part.

myfun <- function(df)
{
  #Code
  new <- df %>%
    pivot_longer(starts_with('country')) %>%
    group_by(name) %>%
    summarise_all(sum,na.rm=T)
  return(new)
}
#Load files
myfiles <- list.files(pattern = '.csv')

#List of files
L <- lapply(myfiles, read.csv)

#Apply function
L <- lapply(L,myfun)

# turn to a df
df <- as.data.frame(L)

But this didn't work out for me since I couldn't figure out which year the data come from. Instead I merged the files for one year (for example 2003) and tried to create the variables I want by running this:

  df2<- df %>%
  mutate(Total_c1 = select(., A2654:U9340) %>% rowSums(na.rm = TRUE))

  df3<–df2 %>% group_by(country_1) %>%
  summarise(Total_c1=sum(Total_c1,na.rm = T)

And here I'm stuck. Any suggestion that can take me forward from here (and start from the right side) would be much appreciated!

Is the year in the filename? If so then you'll have to extract the year from the filename (using regular expression or the position) and then use lapply or map to add a column to each table in your list before you bind the rows together. — Michael Dewar, Jan 04 '21 at 22:21
@Ferenc: check this solution to see if it would help https://stackoverflow.com/a/48105838/786542 — Tung, Jan 05 '21 at 00:22

score 0 · Accepted Answer · answered Jan 05 '21 at 04:21

0

You can try the following part of the code assuming all your csv files that you want to combine are in the working directory itself.

library(tidyverse)

myfiles <- list.files(pattern = '.csv')

map_df(myfiles, function(x) {
  year_number <- readr::parse_number(x)
  df <- read.csv2(x)
  df %>%
    mutate(Total = rowSums(select(., -(1:5)), na.rm = TRUE)) %>%
    pivot_longer(cols = starts_with('country')) %>%
    group_by(name, value) %>%
    summarise(Total  = sum(Total)) %>%
    pivot_wider(names_from = name, values_from = Total) %>%
    mutate(year = year_number)
}) %>%
  arrange(country, year) -> result

result

answered Jan 05 '21 at 04:21

Ronak Shah

377,200
20
156
213

Thanks Ronak. I've got this A tibble: `0 x 0` when running the code. Something went wrong. The df doesn't contain the year variable either. – Ferenc Jan 05 '21 at 09:51
What does `myfiles` return? – Ronak Shah Jan 05 '21 at 09:52
`[1] "UIA data 2003 part 1.CSV" "UIA data 2003 part 2.CSV" "UIA data 2003 part 3.CSV" "UIA data 2007` etc – Ferenc Jan 05 '21 at 09:56
When I run the last line, I get this: `Error in mutate_impl(.data, dots, caller_env()) : variable names are limited to 10000 bytes` – Ferenc Jan 05 '21 at 09:58
So you have multiple files for the same year? Does reading data work correctly for 1 file? `df <- read.csv2(myfiles[1])` works without error? Can you check if `df` is as expected? After that try to work with only 2 files first. `myfiles <- myfiles[1:2]` – Ronak Shah Jan 05 '21 at 10:02
To make it easy, I've now only one file for each year. However, when running the code I get this: Error: object 'country' not found. Moreover, it also gives me these warnings:5: Factor `value` contains implicit NA, consider using `forcats::fct_explicit_na` 6: In bind_rows_(x, .id) : Unequal factor levels: coercing to character 7: In bind_rows_(x, .id) : – Ferenc Jan 05 '21 at 11:24
Then why do you have `"UIA data 2003 part 1.CSV"` and `"UIA data 2003 part 2.CSV"` ? Which year does it represent? Also I think you are on R < 4.0.0 in which case you should change `df <- read.csv2(x)` to `df <- read.csv2(x, stringsAsFactors = FALSE)` – Ronak Shah Jan 05 '21 at 14:32
Hi Ronak! Thanks for trying to help me with this. I think we're almost done. Now, the only problem is that it gives the error: "object 'country' not found". This is strange because the first column starts with "country_1", and then "country_2" etc. What am I missing here? Note: I don't have multiple files for the same year anymore, I fixed it so that it's only one file per year. – Ferenc Jan 05 '21 at 21:27
I've solved it by getting rid of other "country" variables. thanks for your help and patience! – Ferenc Jan 05 '21 at 21:44

score 0 · Answer 2 · answered Jan 05 '21 at 17:25

You have asked for help on different problems here. I will answer just one. With the data.table library, efficiently read in many CSVs in the same directory with the same or nearly same column titles. This produces one object (l1):

library(data.table)
# setDTthreads() # use some appropriate integer
# unzip all the files you want row bound .... to this directory
setwd("D:/Politics/General.2020/BallotReturnStats/11.24.2020")
l1 <- as.data.table({})
for(i in dir()) {l1 <- rbind(l1,fread(i),fill=TRUE)}

How to simultaneously merge multiple csv files and summarize several variables per group

2 Answers2