2

I have a script that loops through multiple years of data, one year at a time. Each year of data consists of multiple dataframes that are placed in a list called all_input. At the beginning of the loop (after the data is read in), I am trying to get all of the years of data in the same format before the rest of the processing.

The issue I am having is with column names, which are not uniform. There are 5 columns included in each dataframe that I want to keep, and I want them to be called total_emissions uom tribal_name st_usps_cd and description. In some dataframes they already have these names, while in others they have various names such as pollutant.desc or pollutant_desc, for example.

My current approach is this:

# Create a mapping file for the column names
    header_map <- data.frame(orignal_col = c( "pollutant_desc", "pollutant.desc", "emissions.uom", "total.emissions", "tribal.name", "state" ), 
                               new_col = c( "description", "description", "uom", "total_emissions", "tribal_name", "st_usps_cd" ), stringsAsFactors = FALSE)

    # change the column names
    lapply(all_input, function(x) {
      names(x)[match(header_map$orignal_col, names(x))] <- header_map$new_col
      x
    }) -> all_input

Which creates a header mapping file that looks like this:

original_col         new_col
pollutant_desc       description
pollutant.desc       description
emissions.uom        uom
total.emissions      total_emissions
tribal.name          tribal_name
state                st_usps_cd

The error I am getting is as follows:

Error in names(x)[match(header_map$orignal_col, names(x))] <- header_map$new_col : 
  NAs are not allowed in subscripted assignments

I understand that as I will have to manually add entries to the header file as new years of data with different column names are processed, but how can I get this to work?

Fake Sample Data. df1 and df2 represent the format of the "2017" data, where multiple columns need name changes, but the current names are consistent between dataframes. df3 represents "2011" data, where all of the column names are as they should be. df4 represents "2014" data, where the only column that needs to be changed is pollutant_desc. Note, there are extra columns in each dataframe that are not needed and can be ignored. And reminder, these dataframes are not all read at the same time. The loop is by year, so df1 and df2 (in list all_input) will be formatted and processed. Then all of the data is removed, and a new all_input list is created with the next years dataframes, which will have different column names. The code must work for all years without being changed.

> dput(df1)
structure(list(total.emissions = structure(1:2, .Label = c("100", 
"300"), class = "factor"), emissions.uom = structure(1:2, .Label = c("LB", 
"TON"), class = "factor"), international = c(TRUE, TRUE), hours = structure(2:1, .Label = c("17", 
"3"), class = "factor"), tribal.name = structure(2:1, .Label = c("FLLK", 
"SUWJG"), class = "factor"), state = structure(1:2, .Label = c("AK", 
"MN"), class = "factor"), pollutant.desc = structure(1:2, .Label = c("Methane", 
"NO2"), class = "factor"), policy = c(TRUE, FALSE)), class = "data.frame", row.names = c(NA, 
-2L))
> dput(df2)
structure(list(total.emissions = structure(2:1, .Label = c("20", 
"400"), class = "factor"), emissions.uom = structure(c(1L, 1L
), .Label = "TON", class = "factor"), international = c(FALSE, 
TRUE), hours = structure(2:1, .Label = c("1", "8"), class = "factor"), 
    tribal.name = structure(2:1, .Label = c("SOSD", "WMFJU"), class = "factor"), 
    state = structure(2:1, .Label = c("SD", "WY"), class = "factor"), 
    pollutant.desc = structure(1:2, .Label = c("CO2", "SO2"), class = "factor"), 
    policy = c(FALSE, FALSE)), class = "data.frame", row.names = c(NA, 
-2L))
> dput(df3)
structure(list(total_emissions = structure(2:1, .Label = c("200", 
"30"), class = "factor"), uom = structure(c(1L, 1L), .Label = "TON", class = "factor"), 
    boundaries = structure(2:1, .Label = c("N", "Y"), class = "factor"), 
    tribal_name = structure(2:1, .Label = c("SOSD", "WMFJU"), class = "factor"), 
    st_usps_cd = structure(2:1, .Label = c("ID", "KS"), class = "factor"), 
    description = structure(c(1L, 1L), .Label = "SO2", class = "factor"), 
    policy = c(FALSE, TRUE), time = structure(1:2, .Label = c("17", 
    "7"), class = "factor")), class = "data.frame", row.names = c(NA, 
-2L))
> dput(df4)
structure(list(total_emissions = structure(2:1, .Label = c("700", 
"75"), class = "factor"), uom = structure(c(1L, 1L), .Label = "LB", class = "factor"), 
    tribal_name = structure(1:2, .Label = c("SSJY", "WNCOPS"), class = "factor"), 
    st_usps_cd = structure(1:2, .Label = c("MO", "NY"), class = "factor"), 
    pollutant_desc = structure(2:1, .Label = c("CO2", "Methane"
    ), class = "factor"), boundaries = structure(c(1L, 1L), .Label = "N", class = "factor"), 
    policy = c(FALSE, FALSE), time = structure(1:2, .Label = c("2", 
    "3"), class = "factor")), class = "data.frame", row.names = c(NA, 
-2L))

Thank you!

Maridee Weber
  • 231
  • 1
  • 8
  • `match` returns `NA` when one of its arguments is not found. Find columns that are not found and you'll be good. – r2evans Jun 16 '20 at 16:44
  • @r2evans Certain columns are not found because they are not present in every year. For example, "pollutant_desc" is used in the 2014 data, but "pollutant.desc" is used in the 2017 data. This is the issue I am trying to overcome – Maridee Weber Jun 16 '20 at 17:11
  • It would help to see sample data to be able to test these theories. – r2evans Jun 16 '20 at 17:15
  • @r2evans I would like to post data, but each year is 3.7Gb or more – Maridee Weber Jun 16 '20 at 17:16
  • Use **fake** data. They don't have to be big, but generate at least 2-3 `data.frame`s, each with 1-2 rows and some form of *difference*, where all names are found in one and not in another. – r2evans Jun 16 '20 at 17:18
  • Refs: https://stackoverflow.com/q/5963269, [mcve], and https://stackoverflow.com/tags/r/info. – r2evans Jun 16 '20 at 17:19
  • @r2evans thank you, I made four fake dataframes to use. – Maridee Weber Jun 16 '20 at 17:40
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/216085/discussion-between-maridee-weber-and-r2evans). – Maridee Weber Jun 16 '20 at 19:07

1 Answers1

2

Try this:

list_of_frames1 <- list(df1, df2, df3, df4)
list_of_frames2 <- lapply(list_of_frames1, function(x) {
  nms <- intersect(names(x), header_map$orignal_col)
  names(x)[ match(nms, names(x)) ] <- header_map$new_col[ match(nms, header_map$orignal_col) ]
  x
})
r2evans
  • 141,215
  • 6
  • 77
  • 149