Apply wildcard when reading multiple sheets from an .xls file using rio::import_list

Question

I have been trying to write a code that will read selected sheets in an .xls file using import_list (rio package), then rbinding and converting them to a .csv file. So far, I have been able to make the code read either all the sheets in the xls file at the same time, sequential sheets (eg 1:4) but not selected sheets based on their names. Please help. I have attached the current iteration of my code and a part of my data.

library(rio)
d <- import_list("allTP.xls", which = (("201901,Xizhi,PM10")), setclass="data.table",range = "A2:Y33", rbind = T, col_names=T, rbind_label = "_file", rbind_fill = TRUE)

Sample data

Reading the documentation, it looks like the `which` argument allows all or one, not some. — Edward, Apr 11 '20 at 13:08
I found that I can select some sheets but it has to be sequential, and I can't use wildcards, any suggestion of an alternative way to do this? — ndichistan, Apr 11 '20 at 13:12

Edward · Answer 1 · 2020-04-12T01:49:35.517

The which argument has to be a vector of sheet names. That is, it needs to be in this format, using the c function:

which = c("201901,Xizhi,PM10", "201902,Xizhi,PM10", "201902,Wanli,PM10")

or whatever sheets you want to import.

The which argument doesn't appear to support regular expression, so importing sheets using wildcards is not possible. But you can import all sheets and then use the _file column to select the data to individually export (if you know which sheet number is which since the sheet names are not saved, only the number).

In your data, you have all the information in the first row, but you don't import this because you specify the range as "A2:Y33". If you select "A1:Y33" then you'll get "characters" for all columns, and you don't want that since the column names start in row 2.

One solution is to import all sheets, range="A2:Y33", as you did, and then repeat for range "A1:Y2", saving the result as d2. This range contains the information you need to select. You can then merge d with d2 on _file and export individual csv files based on the information in d2.

library(rio)

d1 <- import_list("sample_data.xlsx", 
                  setclass="data.table", 
                  range = "A2:Y33", rbind = T, 
                  col_names=T, 
                  rbind_label = "_file", 
                  rbind_fill = TRUE)

For d2 we import only range "A1:Y1", then subset columns 1, 7, 13 and 26, and set the names appropriately.

d2 <- setNames(
  subset(
    import_list("sample_data.xlsx", 
                setclass="data.table", 
                range = "A1:Y1", rbind = T, 
                col_names=F, 
                rbind_label = "_file"), 
    select=c(1,7,13,26)), 
  c("Site","Param","YM","_file"))

Then merge the two datasets on the _file column.

d3 <- merge(d1, d2, by="_file")

Check the data.

table(d3$Site, d3$Param, d3$YM)
, ,  = YM: 2019/01                     
                     Param: AMB_TEMP(℃) Param: PM10(μg/m3)
  Site: Wanli                        31                 31
  Site: Xizhi                        31                 31

, ,  = YM: 2019/02                     
                     Param: AMB_TEMP(℃) Param: PM10(μg/m3)
  Site: Wanli                        31                 31
  Site: Xizhi                        31                 31

So there's eight sheets each containing 31 records. But we have to clean these names because if we use these for the filenames, R will complain.

d3$Site <- sub(".*: (.+)", "\\1", d3$Site)
d3$Param <- sub(".*: (.+)\\(.+", "\\1", d3$Param)
d3$YM <- sub(".*: (\\d{4})\\/(.+)", "\\1_\\2", d3$YM)

Now split on the interaction of all three contextual variables

f <- with(d3, split(d3, list(Site, Param, YM)))

And save to individual CSV files.

lapply(names(f), function(x) write.csv(f[[x]], file=paste(x, ".csv", sep="")))

dir()
[1] "Wanli.AMB_TEMP.2019_01.csv" "Wanli.AMB_TEMP.2019_02.csv" "Wanli.PM10.2019_01.csv"    
[4] "Wanli.PM10.2019_02.csv"     "Xizhi.AMB_TEMP.2019_01.csv" "Xizhi.AMB_TEMP.2019_02.csv"
[7] "Xizhi.PM10.2019_01.csv"     "Xizhi.PM10.2019_02.csv"

You can then easily choose to export only the data you want using a subset of f or d3.

Thanks for your suggestion. Can this be modified to allow the use of wildcards? I am still not able to use wildcards — ndichistan, Apr 11 '20 at 14:26
I tried to use the expression "which" this way "which = c(glob2rx("20**,Xizhi,PM10"))" The idea was to match all the sheets with the '201901 - 202001', Xizhi,PM10. — ndichistan, Apr 12 '20 at 00:10
Please check my edited answer and let me know if that works. — Edward, Apr 12 '20 at 01:49
Your edited answer worked perfectly!. I have written another script to help me complete the task. Many thanks for your help. — ndichistan, Apr 12 '20 at 12:04
Don't forget to vote. :) https://stackoverflow.com/help/someone-answers — Edward, Apr 12 '20 at 12:05

Apply wildcard when reading multiple sheets from an .xls file using rio::import_list

1 Answers1