0

I am having trouble merging datasets and am going to merging many together so need to figure out a way to automate getting through the following error:

"Error: Can't combine `C:/Users/gabri/AppData/Local/Cache/R/noaa_lcd/2006_72038163885.csv$HourlyWetBulbTemperature` <double> and `C:/Users/gabri/AppData/Local/Cache/R/noaa_lcd/2009_72038163885.csv$HourlyWetBulbTemperature` <character>."

I have examined the data and see that in one of the files some of the NAs are marked by * so I know that is why the problem is there. I would like to add a command that will convert either all to character or all to numeric so that I can merge but when I try adding as.character I receive this error:

Error: Names repair functions can't return `NA` values

Here is the relevant code I am trying to run which produces the error.

library(rnoaa)
library(tidyverse)
library(fs)

super_big_df <- map_df(my_files, read_csv, col_select = c(1,2,21,32,80), col_types = "cTddd", .id = "file")

Here is the output of dput for the relevant columns of the dataset

structure(list(STATION = c(72038163885, 72038163885, 72038163885
), DATE = structure(c(1230768000, 1230769200, 1230770400), tzone = "UTC", class = c("POSIXct", 
"POSIXt")), HourlyWetBulbTemperature = c("*", "38", "37"), DailyAverageWetBulbTemperature = c(NA, 
NA, NA), MonthlyWetBulb = c(NA, NA, NA)), row.names = c(NA, -3L
), class = c("tbl_df", "tbl", "data.frame"))
structure(list(STATION = c(72038163885, 72038163885, 72038163885
), DATE = structure(c(1146459600, 1146460800, 1146462000), tzone = "UTC", class = c("POSIXct", 
"POSIXt")), HourlyWetBulbTemperature = c(NA_real_, NA_real_, 
NA_real_), DailyAverageWetBulbTemperature = c(72, NA, NA), MonthlyWetBulb = c(NA, 
NA, NA)), row.names = c(NA, -3L), class = c("tbl_df", "tbl", 
"data.frame"))

In sum, I am wondering if there is a command I can put in map_df() that will convert everything to be the same (either character or numeric) so that the rest of the command will still run.

  • 1
    Asking the same question after it's been closed, and with literally no change, is generally a bad idea and may trigger strong reactions from others. While "we" (the community) did not do you a great service by expanding on the close-reason ("needs details"), I would have hoped that you would infer that we can do nothing without knowing what is in the files. Your issue has to do with combining frames, so you are adding unnecessary cruft by asking us to go through the NOAA-query stuff; and since I don't have that package installed, it isn't happening. – r2evans Dec 01 '21 at 14:45
  • 1
    Please read about reproducible questions (https://stackoverflow.com/q/5963269, [mcve], and https://stackoverflow.com/tags/r/info), and then ask a question with *sample data* so that we can focus on the problem at hand. This is likely best done using `dput(x)` for objects (where `x` is a small and representative *sample* of a larger object), or by posting the first `n` lines of the text file in a code block (`n` is subjective, but please make sure that the sample you give us correctly "fails" in the same way). – r2evans Dec 01 '21 at 14:48
  • 1
    I suggest you delete this question yourself *temporarily*; this will prevent downvotes while you are improving the question. After deleting it, you can still [edit] it at will, make the suggested changes, and then undelete it when you are ready again. When you do so, please @ping me and I'll come take a look. Thank you! – r2evans Dec 01 '21 at 14:49
  • The only directly relevant info for the question is the two columns in question. Apparently one is `character` class and one is `numeric` (double). Instead of 60+ letter long names, let's call the data frames `df1` and `df2`. You should look through some of the `character` column `df2$HourlyWetBulbTemperature` to see **why** it's character rather than numeric. Maybe add a new column `df2$HWBT_numeric <- as.numeric(df2$HourlyWetBulbTemperature)` and look at the rows where the `numeric` version is `NA` to see what the values are in the original that can't be converted.... – Gregor Thomas Dec 01 '21 at 14:55
  • Then, if you still need help, share with us a **minimal** reproducible example, say 2 or 3 rows that do work and 2 or 3 rows that don't work. Share it in a friendly copy/pasteable way as described in r2evans's links above. – Gregor Thomas Dec 01 '21 at 14:57
  • @r2evans thank you so much for your guidance. I was confused about why the first one was deleted as it had some suggested edits that I made (hence why they looked the same) but the edits were just to the text I had used to explain the problem. So your feedback was very helpful. I am new here so I appreciate your patience. I read the resource you sent and used dput so I am hopeful that this works. Please do let me know if that is not the case and I will make edits. Thank you again! – Gabriela Nagle Dec 01 '21 at 16:03
  • @GregorThomas Thanks for your help! I took a look as you suggested and found that some NAs are being saved as *. I thought that would mean my best option would be to convert the other df into character so they could merge and then I could handle the * later but I still receive an error when I do as.character() for all of the files in map_df(). Do you have any suggestions? – Gabriela Nagle Dec 01 '21 at 16:05
  • 1
    The best time to handle them is when you import the file. Most import commands have an argument called `na.strings` or `na` or similar that lets you specify values that can be treated as `NA`. So add `"*"` to that default and that should read your data in properly, and solve these downstream problems. – Gregor Thomas Dec 01 '21 at 16:14
  • Also, from your terribly long data frame names, I assume you're reading them in programmatically from full file paths. You can use `basename()` to get just the file name and drop the full path to the file from the string, which may help you rename `C:/Users/gabri/AppData/Local/Cache/R/noaa_lcd/2006_72038163885.csv` to `2006_72038163885.csv` and make your code much more readable. – Gregor Thomas Dec 01 '21 at 16:16
  • @r2evans thank you both so much! The "na" argument worked and the files were able to merge! Now on to seeing if it will work on all 3,000 of them...thank you! – Gabriela Nagle Dec 01 '21 at 17:34

1 Answers1

1

Untested, but the best way forward as @GregorThomas suggested is to read it in properly the first time. In this case, it's likely something like:

super_big_df <- map_df(
  my_files, read_csv, na = c("", "NA", "*"),
  col_select = c(1,2,21,32,80), col_types = "cTddd",
  .id = "file")

If you need to fix it after the fact, then you'll need to read them into a list-of-frames, perhaps changing map_df to map,

super_big_df <- map(
  my_files, read_csv, na = c("", "NA", "*"),
  col_select = c(1,2,21,32,80), col_types = "cTddd",
  .id = "file")
bind_rows(super_big_df)
# Error: Can't combine `..1$HourlyWetBulbTemperature` <character> and `..2$HourlyWetBulbTemperature` <double>.

and then something like

library(dplyr) # in case you did not already have it loaded
purrr::map(super_big_df, ~ mutate(., HourlyWetBulbTemperature = suppressWarnings(as.numeric(HourlyWetBulbTemperature)))) %>%
  bind_rows()
# # A tibble: 6 x 5
#       STATION DATE                HourlyWetBulbTemperature DailyAverageWetBulbTemperature MonthlyWetBulb
#         <dbl> <dttm>                                 <dbl>                          <dbl> <lgl>         
# 1 72038163885 2009-01-01 00:00:00                       NA                             NA NA            
# 2 72038163885 2009-01-01 00:20:00                       38                             NA NA            
# 3 72038163885 2009-01-01 00:40:00                       37                             NA NA            
# 4 72038163885 2006-05-01 05:00:00                       NA                             72 NA            
# 5 72038163885 2006-05-01 05:20:00                       NA                             NA NA            
# 6 72038163885 2006-05-01 05:40:00                       NA                             NA NA            

The suppressWarnings here is because we know there is a non-number ("*") in that column somewhere. For that one frame, it will fix that column; for other frames, it should be a no-op since the column is already as.numeric.

Note that I hard-coded the name here since we know what it is ahead of time. If there are more columns that need repairing (i.e., you get more errors after fixing this one), then it might be advantageous to go with a more dynamic/programmatic approach (not yet covered here).


Data

super_big_df <- list(
  structure(list(STATION = c(72038163885, 72038163885, 72038163885), DATE = structure(c(1230768000, 1230769200, 1230770400), tzone = "UTC", class = c("POSIXct", "POSIXt")), HourlyWetBulbTemperature = c("*", "38", "37"), DailyAverageWetBulbTemperature = c(NA, NA, NA), MonthlyWetBulb = c(NA, NA, NA)), row.names = c(NA, -3L), class = c("tbl_df", "tbl", "data.frame")),
  structure(list(STATION = c(72038163885, 72038163885, 72038163885), DATE = structure(c(1146459600, 1146460800, 1146462000), tzone = "UTC", class = c("POSIXct", "POSIXt")), HourlyWetBulbTemperature = c(NA_real_, NA_real_, NA_real_), DailyAverageWetBulbTemperature = c(72, NA, NA), MonthlyWetBulb = c(NA, NA, NA)), row.names = c(NA, -3L), class = c("tbl_df", "tbl", "data.frame"))
)
r2evans
  • 141,215
  • 6
  • 77
  • 149