Need advice on using R to clean up data

Question

I have multiple CSV files in the same format that I need to combine, but before that I have a few steps to do:

1. The header is not the first row but the 4th row. Should I remove the first 3 rows with skip, or should I reassign the header?
2. I need to add a column holding the ID of the file (the same as the file name) before I combine.
3. Then I need to extract only 4 columns from a total of 7.
4. Sum up the numbers under a category.
5. Combine all the CSV files into one.

This is what I have so far, where I do steps 1, 3 and 4, then step 2 to add the column, then step 5. I am not sure whether I should add the ID column first.

files = list.files(pattern = "*.csv", full.names = TRUE)

library("tidyverse")
library("dplyr")

data = data.frame()

for (file in files){
    temp <- read.csv(file, skip=3, header = TRUE)
    colnames(temp) <- c("Volume", "Unit", "Category", "Surpass Object", "Time", "ID")
    temp <- temp [, c("Volume", "Category", "Surpass Object")]
    temp <- subset(temp, Category =="Surface")
    mutate(id = file)
    aggregate(temp$Volume, by=list(Category=temp$Category), FUN=sum)
    
}

And I got an error:

Error in is.data.frame(.data) : 
  argument ".data" is missing, with no default

The code is fine if I didn't put in the mutate line so I think the main problem comes from there but any advice will be appreciated.

I am quite new to R and really appreciate all the comments that I can get here.

Thanks in advance!

You are definitely missing calling the dataframe in `mutate`. If you are trying to do it on `temp`, then you need to add in the pipe. `temp <- subset(temp, Category =="Surface") %>% mutate(id = file)` — AndrewGB, Dec 03 '21 at 01:28
You're also doing all of this calculation and then discarding the results, never capturing into an object that persists. See https://stackoverflow.com/a/24376207/3358227 for good discussions on operating on lists of frames, i.e., doing things like reading in multiple files and working on the datasets within a list. In the case here, we don't need to keep them separate (but absolutely can if you'd prefer), but the premise and other guidance on that page still applies. — r2evans, Dec 03 '21 at 01:35
@AndrewGillreath-Brown Thanks for the comment. I tried the code and for some reason there's only one file name showed in the id column, not sure if other file name got replaced by the same one? — Meifong, Dec 03 '21 at 01:37
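
Putting the comments above together, a minimal sketch of the loop might look like the following. It passes the data frame to mutate() and keeps each file's result in a list instead of overwriting temp, which may be why only one file name showed up. The two sample files, their contents, the "Spots" category, and the folder name are made up for illustration; the column names come from the question, and read.csv() turns "Surpass Object" into Surpass.Object by default.

```r
library(dplyr)

# Hypothetical sample files (made up for this sketch): three junk lines,
# then the real header on line 4, as described in the question.
demo_dir <- file.path(tempdir(), "loop_demo")
dir.create(demo_dir, showWarnings = FALSE)
header <- "Volume,Unit,Category,Surpass Object,Time,ID"
writeLines(c("junk", "junk", "junk", header,
             "1,um3,Surface,A1,1,1", "2,um3,Surface,A1,2,2"),
           file.path(demo_dir, "a.csv"))
writeLines(c("junk", "junk", "junk", header,
             "10,um3,Surface,B1,1,1", "20,um3,Spots,B2,2,2"),
           file.path(demo_dir, "b.csv"))

files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
pieces <- vector("list", length(files))   # one slot per file

for (i in seq_along(files)) {
  temp <- read.csv(files[i], skip = 3, header = TRUE)        # line 4 becomes the header
  temp <- temp[, c("Volume", "Category", "Surpass.Object")]  # keep the needed columns
  temp <- subset(temp, Category == "Surface")
  temp <- mutate(temp, id = basename(files[i]))              # mutate() needs the data frame first
  pieces[[i]] <- temp                                        # store the result instead of discarding it
}

combined <- bind_rows(pieces)   # one data frame for all files
totals <- aggregate(Volume ~ id + Category, data = combined, FUN = sum)
```

Here combined keeps the row-level data with its id column, and totals holds one summed Volume per file and Category.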

r2evans · Answer 1 · 2021-12-03T12:55:04.627

Since you appear to be trying to use dplyr, I'll stick with that theme.

library(dplyr)
library(purrr)
files = list.files(pattern = "*.csv", full.names = TRUE)
results <- map_dfr(setNames(nm = files), ~ read.csv(.x, skip=3, header=TRUE), .id = "filename") %>%
  select(filename, Category, Volume, Surpass) %>% # no idea why you want Surpass
  group_by(filename, Category) %>%
  summarize(Volume = sum(Volume))                 # Surpass is discarded here

Walk-through:

purrr::map_dfr iterates our function (read.csv(...)) over each of the inputs (each file in files) and row-binds the results. Since we named the files with themselves (setNames(nm=files) is akin to names(files) <- files), we can use .id="filename", which adds a "filename" column recording which file each row came from.
select(...) keeps whatever four columns you said you needed. Frankly, since you're aggregating, we really only need c("filename", "Category", "Volume"); if you need anything else, you have likely left something out of your explanation.
group_by(..) gives us one row for each filename and each Category, where Volume is a sum (calculated in the next step, summarize).
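
If the subcategory has to survive the aggregation, one option is to group on it as well. The sketch below is only an illustration under assumptions: the sample files and their values are invented, and the column is assumed to be called "Surpass Object" in the file, which read.csv() renames to Surpass.Object.

```r
library(dplyr)
library(purrr)

# Hypothetical sample files, invented for this sketch; header sits on line 4.
demo_dir <- file.path(tempdir(), "map_demo")
dir.create(demo_dir, showWarnings = FALSE)
header <- "Volume,Unit,Category,Surpass Object,Time,ID"
writeLines(c("x", "y", "z", header,
             "1,um3,Surface,A1,1,1", "2,um3,Surface,A1,2,2", "4,um3,Surface,A2,1,3"),
           file.path(demo_dir, "a.csv"))
writeLines(c("x", "y", "z", header,
             "10,um3,Surface,B1,1,1", "20,um3,Surface,B1,2,2"),
           file.path(demo_dir, "b.csv"))

files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

results <- map_dfr(setNames(nm = files),
                   ~ read.csv(.x, skip = 3, header = TRUE),
                   .id = "filename") %>%           # names of the input become the filename column
  mutate(filename = basename(filename)) %>%        # drop the folder part of the path
  group_by(filename, Category, Surpass.Object) %>% # subcategory is now a grouping variable
  summarize(Volume = sum(Volume), .groups = "drop")
```

This gives one row per file, Category, and subcategory, so the existing ID column inside the files is left alone and the file name acts as the identifier.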

edited Dec 03 '21 at 12:55

answered Dec 03 '21 at 01:31


Thanks for the answer and your comment. The reason I do it this way is because of the header problem and I tried combine all csv files before and after combine the ID (or the filename) was disappeared. And yes for your point no.2, I've missed out an important piece of information. the original file has an ID column which is not the filename. I intend to use filename as real ID. For this do you think I should assign a different name for the column of real ID? I need the Surpass object as one of the columns because there are two types of data in there. – Meifong Dec 03 '21 at 02:01
If you need `"Surpass"`, then do you need to group on that as well? Summarizing cannot work on fields that are neither (a) one of the grouping variables, nor (b) calculated in the summarizing. The filename as an id is being added here in my answer. Other than your fourth column, I think this code gives you what you are asking for, is that right? – r2evans Dec 03 '21 at 02:04
I ran the code and it gave me an error. Error: Must group by variables found in `.data`. * Column `filename` is not found. Any idea? Thanks! – Meifong Dec 03 '21 at 03:22
It means your `select(..)` explicitly omitted `filename`. I'm editing this answer to include it for explicitness, but I still don't know the rest of your column names needed. I'll guess, it is really frustrating to have an incomplete problem to solve. – r2evans Dec 03 '21 at 12:54
Thanks r2evans. The four columns in Select() are the ones I needed and in the end I will sum up according to volume. Surpass is kind of subcategory of Category that's why i cannot omit it. I ran the codes again but it has an error where Surpass doesn't exist. Anyway, thanks for your comments. – Meifong Dec 06 '21 at 23:35

Kat · Accepted Answer · 2021-12-03T15:51:47.427

You can use read.csv(), but if there are many files, I suggest using fread() from the data.table package. It is significantly faster. I used fread() here, but the code will still work if you switch it out for read.csv() (note that read.csv() renames `Surpass Object` to Surpass.Object unless you pass check.names = FALSE). fread() also detects more of the file layout on its own: you will find that even arguments like skip can sometimes be left out and the file will still be read correctly.

library(tidyverse)
library(data.table)

add_filename <- function(flnm){
    fread(flnm, skip = 3) %>%   # read file
    mutate(id = basename(flnm)) # creates new col id w/ basename of the file 
}

# single data frame all CSVs; id in first col
df <- list.files(pattern = "*.csv", full.names = TRUE) %>%
    map_df(add_filename) %>%   # pass the function itself; ~add_filename would return it uncalled
    select(id, Volume, Category, `Surpass Object`)

I get the impression that you wanted to aggregate but keep the consolidated data frame, as well. If that's the case, you'll keep the aggregation separate from building the data frame.

df %>%       # not assigned to a new object, so only shown in console
    filter(Category == "Surface") %>%  # filter for the category desired
    {sum(.$Volume)}                    # sum the remaining values for volume

If you are not aware, the period in that last call is the data carried forward, so in this case, the filtered data. The simplest way (perhaps not the best way) to explain the {} is that sum() is not designed to handle data frames, so it isn't inherently friendly with dplyr piping; the braces stop the pipe from passing the data frame as the first argument of sum().
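
As a small illustration of the braces, here is a sketch on a made-up data frame (the values and the "Spots" category are hypothetical). It also shows pull() as an alternative way to hand a single column to sum().

```r
library(dplyr)

# Made-up data for illustration only
df_demo <- data.frame(
  Category = c("Surface", "Surface", "Spots"),
  Volume   = c(1.5, 2.5, 10)
)

# Braces: the filtered data is available as `.` but is not inserted as an argument
total_braces <- df_demo %>%
  filter(Category == "Surface") %>%
  {sum(.$Volume)}

# Alternative: extract the column as a plain vector, then sum it
total_pull <- df_demo %>%
  filter(Category == "Surface") %>%
  pull(Volume) %>%
  sum()
```

Both approaches return a single number rather than a data frame.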

If you wanted the sum of volume for every category instead of only "Surface" that you had coded in your question, then you would use this instead:

df %>% 
    group_by(Category) %>%
    summarise(sum(Volume))

Notice I used the British spelling of summarize here. The function summarize() is in a lot of packages. I have just found it easier to use the British spelling for this function whenever I want to make sure it's the dplyr function that I've called. (tidyverse accepts the American and British spelling for nearly all functions, I think.)
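
If the spelling trick feels fragile, another option is to name the package explicitly with dplyr::summarise(). The sketch below uses an invented data frame (ids, categories, and volumes are hypothetical) and also groups by the subcategory column, assuming it is named `Surpass Object` as in the select() call above.

```r
library(dplyr)

# Invented data for illustration; check.names = FALSE keeps the space in the column name
df_demo <- data.frame(
  id               = c("a.csv", "a.csv", "a.csv", "b.csv"),
  Category         = c("A", "A", "B", "B"),
  `Surpass Object` = c("A1", "A2", "B1", "B1"),
  Volume           = c(1, 2, 3, 4),
  check.names      = FALSE
)

by_sub <- df_demo %>%
  group_by(Category, `Surpass Object`) %>%                  # one group per subcategory
  dplyr::summarise(Volume = sum(Volume), .groups = "drop")  # explicit namespace, named result
```

Naming the result (Volume = ...) avoids a column literally called sum(Volume), and adding id to group_by() would give the totals per file as well.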

Thanks Kat! Your answer is very helpful. I'm very happy to be able to add the id column in. However, for the last part, the code tends to sum up everything in the column instead of according to category. So I have two categories (A and B) and Surpass object is kind of the subcategory (A1, A2, B1 and B2), for this part I just wanna sum up A1, A2, B1 and B2. — Meifong, Dec 03 '21 at 03:49
Can you send me a snapshot of your data–[a reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example)? It will be a lot easier for me to understand what's happening. — Kat, Dec 03 '21 at 15:50
Hi Kat, thanks for following up. I managed to get the codes work after advice from a colleague. Basically it was me missing out something but your lines worked! Also for map_df(add_filename) works instead of the one with ~. Thanks! — Meifong, Dec 06 '21 at 23:23
