1

I have a number of CSV files exported from our database, say site1_observations.csv, site2_observations.csv, site3_observations.csv etc. Each CSV looks like below (site1 for example):

Column A Column B Column C
# Team: all teams
# Observation type: xyz
Site ID Reason Quantity
a xyz 1
b abc 2
Total quantity 3

We need to skip the top 2 rows and the last 1 row from each CSV before combining them as a whole master dataset for further analysis. I know I can use the skip = argument to skip the first few lines of CSV, but read_csv() doesn't seem to have simple argument to skip the last lines and I have been using n_max = as a workaround. The data import has been done in manual way. I want to shift the manual process to programmatic manner using purrr::map(), but just couldn't work out how to efficiently skip the last few lines here.

library(tidyverse)

observations_skip_head <- 2

# Approach 1: manual ----
site1_rawdata <- read_csv("/data/site1_observations.csv", 
                          skip = observations_skip_head, 
                          n_max = nrow(read_csv("/data/site1_observations.csv", 
                                                skip = observations_skip_head))-1)

# site2_rawdata
# site3_rawdata
# [etc]

# all_sites_rawdata <- bind_rows(site1_rawdata, site2_rawdata, site3_rawdata, [etc])

I have tried to use purrr::map() and I believe I am almost there, except the n_max = part which I am not sure how/what to do this in map() (or any other effective way to get rid of the last line in each CSV). How to do this with purrr?

observations_csv_paths_chr <- paste0("data/site", 1:3,"_observations.csv")

# Approach 2: programmatically import csv files with purrr ----

all_sites_rawdata <- observations_csv_paths_chr %>% 
  
  map(~ read_csv(., skip = observations_skip_head, 
                 n_max = nrow(read_csv("/data/site1_observations.csv", 
                                       skip = observations_skip_head))-1)) %>% 
  set_names(observations_csv_paths_chr)

I know this post uses a custom function and fread. But for my education I want to understand how to achieve this goal using the purrr approach (if it's doable).

Sam Lin
  • 39
  • 4
  • 2
    Why not filter out the last row? `df %>% filter(\`Column A\` != "Total quantity")`. You could also use `nrow` to compute the number of rows and `head` to take all but the last one. – Michael Dewar Dec 16 '22 at 07:19

2 Answers2

1

You could try something like this?

library(tidyverse)

csv_files <- paste0("data/site", 1:3, "_observations.csv")

csv_files |>
  map(
    ~ .x |> 
      read_lines() |> 
      tail(-3) |>   # skip first 3
      head(-2) |>   # ..and last 2
      paste(collapse = '\n') |>
      read_csv()
  )
Peter H.
  • 1,995
  • 8
  • 26
  • Hi Peter, it seems working - until it gives error at `read_csv()`. It says `colname1, colname2 (etc) does not exist in current working directory`. I thought the subdirectory had been included in the `csv_files` i.e. "data/" but it seems somehow it's not looking into the subdirectory in this very last step. Am I missing something here? – Sam Lin Dec 19 '22 at 05:35
  • Sorry, forgot to do `paste(collapse='\n')`. I just edited my answer - should work now. – Peter H. Dec 19 '22 at 09:46
0
manual_csv<-function(x) { 
  txt<-readLines(x)
  txt<-txt[-c(2,3,length(txt))] # insert the row you want to delete
  result<-read.csv(text=paste0(txt, collapse="\n"))
}
test<-manual_csv('D:/jaechang/pool/final.csv')
jaechang
  • 29
  • 3