0

I have a data.frame created from other functions which take a list of .xlsx files and reads all workbooks and included sheets.

The result is a nice data frame as such:

df<-data.frame(
file.name <-c(“C:/R/Folder1/WB1.xlsx”,
“C:/R/Folder1/WB2.xlsx,”,”C:/R/Folder1/WB2.xlsx”, “C:/R/Folder2/WB1.xlsx”, “C:/R/Folder2/WB1.xls”),
data<-list(df1,df2,df3,df4,df5))

While I am able to retrieve (unnest) the dataframes, I lose the corresponding file location, which I need.

Is there a way to keep the corresponding row from the dataframe and paste() as a column when unnesting?

*Sorry for typos. Posted on SO app.

Updating (now that I am in front of a PC)

Reproducible Example:

data:

df1<-data.frame(V1=c(sample(900:970,6)),
                V2=c(sample(LETTERS[1:6],6)))

df2<-data.frame(V1=sample(750:780,6),
                V2=sample(LETTERS[8:16],6))

df3<-data.frame(V1=sample(200:250,6),
                V2=sample(LETTERS[10:20],6),
                V3=sample(2300:5821,6))

df4<-data.frame(V1=sample(396:480,6),
                V2=sample(LETTERS,6))

df5<-data.frame(V1=sample(50:100,6),
                V2=sample(LETTERS,6))

df6<-data.frame(V1=sample(200:250,6),
                V2=sample(LETTERS,6),
                V3=sample(letters,6))

my.list <- list(df1,df2,df3,df4,df5,df6)

mydf<-data.frame(
  files=c("C:/Folder1/Data/File1.xlsx","C:/Folder1/Data/File2.xlsx",
          "C:/Folder1/Data/File3.xlsx","C:/Folder2/Data/File1.xlsx",
          "C:/Folder2/Data/File2.xlsx","C:/Folder2/Data/File3.xlsx"))

mydf$data<-my.list

When attempting to unnest - I run in to the following issue (due to differences in observations and variables within the list of data.frames (column 2):

y<-unnest(mydf, data)
Error: Column `V3` can't be converted from integer to factor
In addition: Warning messages:
1: In bind_rows_(x, .id) : Unequal factor levels: coercing to character
2: In bind_rows_(x, .id) :
  binding character and factor vector, coercing into character vector
3: In bind_rows_(x, .id) :
  binding character and factor vector, coercing into character vector...

Results from other functions

#tidyverse
y<-mydf %>% unnest(data)
Error: Column `V3` can't be converted from integer to factor
In addition: Warning messages:
1: In bind_rows_(x, .id) : Unequal factor levels: coercing to character

y<-mydf %>%
+   unnest(data) %>%
+   group_by(files) %>%
+   mutate(
+     data = flatten_chr(data),
+     data_colname = str_c("data_", row_number())
+   ) %>% # or just as.character
+   spread(data_colname, data)
Error: Column `V3` can't be converted from integer to factor
In addition: Warning messages:
1: In bind_rows_(x, .id) : Unequal factor levels: coercing to character

Adding utilized function to pull .xlsx and all sheets in - as seen in this example:

library(tidyverse)
library(readxl)

dir_path1 <- "~/File1/Data/Qtr1"  
dir_path2 <- "~/File1/Data/Qtr2"         
dir_path3 <- "~/File1/Data/Qtr3"  
dir_path4 <- "~/File1/Data/Qtr4"

re_file <- ".xlsx"     

read_sheets <- function(dir_path1, file){
  xlsx_file <- paste0(dir_path1, file)
  xlsx_file %>%
    excel_sheets() %>%
    set_names() %>%
    map_df(read_excel, path = xlsx_file, .id = 'sheet_name') %>% 
    mutate(file_name = file) %>% 
    select(file_name, sheet_name, everything())
}

df <- list.files(dir_path, re_file) %>% 
  map_df(~ read_sheets(dir_path, .))

Returns:

# A tibble: 15 x 5
   file_name  sheet_name  col1  
   <chr>      <chr>      <dbl> 
 1 Q1_File1.xlsx Sheet1    1         
 2 Q1_File1.xlsx Sheet2    1         
 3 Q1_File2.xlsx Sheet1    1          
 ...

However, unlike the sample data (as shown in the link), the data returned (col1) is a list of data frames.

OctoCatKnows
  • 399
  • 3
  • 17
  • Maybe see the unnest function in the tidyr package (see examples https://tidyr.tidyverse.org/reference/unnest.html) – Frank Jun 10 '19 at 19:02

1 Answers1

2

The issue is related to df3 having a numeric for V3 while df6 is a character for V3. You can:

  1. Skip importing either df3$V3 or df6$V3
  2. Rename one of those variables

Also, to get rid of the warnings, you could create your data.frames with stringsAsFactors = FALSE or you could use tibble() instead of data.frame() as that's the default behavior of a tibble.

Edit: to better do option 2, you can use the code below to add a prefix to each variable.

my.list2 <- lapply(my.list, function(x) sapply(x, function(y) paste0(class(y), names(y))))
       , function(x) 
         {
         x%>%
           rename_if(is.numeric, ~paste0('num', .x))%>%
           rename_if(is.character, ~paste0('char', .x))%>%
           rename_if(is.factor, ~paste0('fact', .x))
         }
       )

This is option 2 and it works with only the factor warnings:

df1<-data.frame(V1=c(sample(900:970,6)),
                V2=c(sample(LETTERS[1:6],6)))

df2<-data.frame(V1=sample(750:780,6),
                V2=sample(LETTERS[8:16],6))

df3<-data.frame(V1=sample(200:250,6),
                V2=sample(LETTERS[10:20],6),
                V4=sample(2300:5821,6)) #used to be V3

df4<-data.frame(V1=sample(396:480,6),
                V2=sample(LETTERS,6))

df5<-data.frame(V1=sample(50:100,6),
                V2=sample(LETTERS,6))

df6<-data.frame(V1=sample(200:250,6),
                V2=sample(LETTERS,6),
                V3=sample(letters,6))

my.list <- list(df1,df2,df3,df4,df5,df6)

mydf<-data.frame(
  files=c("C:/Folder1/Data/File1.xlsx","C:/Folder1/Data/File2.xlsx",
          "C:/Folder1/Data/File3.xlsx","C:/Folder2/Data/File1.xlsx",
          "C:/Folder2/Data/File2.xlsx","C:/Folder2/Data/File3.xlsx"))

mydf$data<-my.list

unnest(mydf, data)

                        files  V1 V2   V4   V3
1  C:/Folder1/Data/File1.xlsx 951  A   NA <NA>
2  C:/Folder1/Data/File1.xlsx 932  F   NA <NA>
3  C:/Folder1/Data/File1.xlsx 908  B   NA <NA>
4  C:/Folder1/Data/File1.xlsx 953  C   NA <NA>
5  C:/Folder1/Data/File1.xlsx 929  E   NA <NA>
6  C:/Folder1/Data/File1.xlsx 928  D   NA <NA>
7  C:/Folder1/Data/File2.xlsx 778  K   NA <NA>
8  C:/Folder1/Data/File2.xlsx 771  H   NA <NA>
9  C:/Folder1/Data/File2.xlsx 757  M   NA <NA>
10 C:/Folder1/Data/File2.xlsx 773  P   NA <NA>
11 C:/Folder1/Data/File2.xlsx 759  N   NA <NA>
12 C:/Folder1/Data/File2.xlsx 765  O   NA <NA>
13 C:/Folder1/Data/File3.xlsx 236  M 3964 <NA>
14 C:/Folder1/Data/File3.xlsx 214  O 5241 <NA>
...truncated
Cole
  • 11,130
  • 1
  • 9
  • 24
  • Hmm, well the issue here is that the .xlsx files are being read in - I mean, a lot. Some have 50, some have 300. The colnames are the same for them as well. Any idea how I might overcome this? I am going to paste the example function I use to read the files in- – OctoCatKnows Jun 10 '19 at 21:59
  • That may be a different question. But can't you import the spreadsheet with headers? Otherwise, how could you really refer to the fields down the road? – Cole Jun 10 '19 at 22:04
  • See comment. You should still ask a different question if you want a better answer to the part of renaming variables based on class – Cole Jun 12 '19 at 00:21
  • hey @Cole. I got a solution - I believe. I am going to post it here just in case someone has the same problem, – OctoCatKnows Jun 12 '19 at 09:34