0

I am reading through excel file that has multiple sheets.

 file_to_read <- "./file_name.xlsx"
 
 # Get all names of sheets in the file
 sheet_names <- readxl::excel_sheets(file_to_read)
 
 # Loop through sheets
 L <- lapply(sheet_names, function(x) {
 all_cells <-
 tidyxl::xlsx_cells(file_to_read, sheets = x)
})

L here has all the sheets. Now, I need to get the data from each sheet to combine all the columns and rows into one file. To be exact, I want to sum the matching columns and rows in the data into one file.

I will put simple example to make it clear.

For example, this table in one sheet,

df1 <- data.frame(x = 1:5, y = 2:6, z = 3:7)
rownames(df1) <- LETTERS[1:5]
df1
M x y z
A 1 2 3
B 2 3 4
C 3 4 5
D 4 5 6
E 5 6 7

The second table in the next sheet,

df2 <- data.frame(x = 1:5, y = 2:6, z = 3:7, w = 8:12)
rownames(df2) <- LETTERS[3:7]
df2
M x y z  w
C 1 2 3  8
D 2 3 4  9
E 3 4 5 10
F 4 5 6 11
G 5 6 7 12  

My goal is to combine (sum) the matched records in all 100 tables from one excel file to get one big tables that has the total sum of each value.

The final table should be like this:

M x y  z   w
A 1 2  3   0
B 2 3  4   0
C 4 6  8   8
D 6 8  10  9
E 8 10 12 10
F 4 5  6  11
G 5 6  7  12

Is there a way to achieve this in R? I am not an expert in R, but I wish if I could know how to read all sheets and do the sum Then save the output to a file.

Thank you

Amal Nasir
  • 164
  • 15
  • Please don't post data as images. Take a look at how to make a [great reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) for ways of showing data. – Martin Gal May 25 '21 at 17:11
  • @AnilGoyal Thank you. It doesn't answer my question. I am struggling with reading the 100 tables from multiple sheets and aggregate then into one table. – Amal Nasir May 25 '21 at 17:47
  • @MartinGal I updated the question and added more reproducible code. Thank you! – Amal Nasir May 25 '21 at 17:56
  • @AmalNasir, the example you gave has rownames. I created those names into a column 'ID'. If you're getting an error that id column is not found it simply means your actual data has no row names. Now first try the code on the sample data only in your R console/IDE. If it runs on example, it simply means your data is different than you've shared. If it still isn't running, problem maybe somewhere else like packages not loaded, etc. In that case you yourself have to locate the problem first. Good luck. – AnilGoyal May 26 '21 at 01:32
  • See edited answer as per edited question. Your toy_data creation is not correct becuase you're first creating row names, but again editing your data to remove rownames. I have also observed that you have not upvoted any of the answers given below, despite changing your question 5-6 times. [Upvote](https://stackoverflow.com/help/why-vote) is token of appreciation for the answerers as well as it help in sorting good content/future reference. – AnilGoyal May 26 '21 at 04:07

3 Answers3

2

As you have stated that you have hundreds of sheets it is suggested that you should import all of these in one single list say my.list in R (as per this link or this readxl documentation suggested) and follow this strategy instead of binding every two dfs one by one

df1 <- read.table(text = 'M x y z
A 1 2 3
B 2 3 4
C 3 4 5
D 4 5 6
E 5 6 7', header = T)
df2 <- read.table(text = 'M x y z  w
C 1 2 3  8
D 2 3 4  9
E 3 4 5 10
F 4 5 6 11
G 5 6 7 12', header = T)

library(tibble)
library(tidyverse)

my.list <- list(df1, df2)

map_dfr(my.list, ~.x)
#>    M x y z  w
#> 1  A 1 2 3 NA
#> 2  B 2 3 4 NA
#> 3  C 3 4 5 NA
#> 4  D 4 5 6 NA
#> 5  E 5 6 7 NA
#> 6  C 1 2 3  8
#> 7  D 2 3 4  9
#> 8  E 3 4 5 10
#> 9  F 4 5 6 11
#> 10 G 5 6 7 12
map_dfr(my.list , ~ .x) %>%
  group_by(M) %>%
  summarise(across(everything(), sum, na.rm = T))
#> # A tibble: 7 x 5
#>   M         x     y     z     w
#>   <chr> <int> <int> <int> <int>
#> 1 A         1     2     3     0
#> 2 B         2     3     4     0
#> 3 C         4     6     8     8
#> 4 D         6     8    10     9
#> 5 E         8    10    12    10
#> 6 F         4     5     6    11
#> 7 G         5     6     7    12

Created on 2021-05-26 by the reprex package (v2.0.0)

AnilGoyal
  • 25,297
  • 4
  • 27
  • 45
1

One approach that will work is these steps:

  • read each sheet into a list
  • convert each sheet into a long format
  • bind into a single data frame
  • sum and group by over that long data frame
  • cast back to tabular format

That should work for N sheets with any combination of row and column headers in those sheets. E.g.

file <- "D:\\Book1.xlsx"
sheet_names <- readxl::excel_sheets(file)
sheet_data <- lapply(sheet_names, function(sheet_name) {
  readxl::read_xlsx(path = file, sheet = sheet_name)
})

# use pivot_longer on each sheet to make long data
long_sheet_data <- lapply(sheet_data, function(data) {
  long <- tidyr::pivot_longer(
    data = data,
    cols = !M,
    names_to = "col",
    values_to = "val"
  )
})

# combine into a single tibble
long_data = dplyr::bind_rows(long_sheet_data)

# sum up matching pairs of `M` and `col`
summarised <- long_data %>%
  group_by(M, col) %>%
  dplyr::summarise(agg = sum(val))
  
# convert to a tabular format
tabular <- summarised %>%
  tidyr::pivot_wider(
    names_from = col,
    values_from = agg,
    values_fill = 0
  )

tabular

I get this output with a spreadsheet using your initial inputs:

> tabular
# A tibble: 7 x 5
# Groups:   M [7]
  M         x     y     z     w
  <chr> <dbl> <dbl> <dbl> <dbl>
1 A         1     2     3     0
2 B         2     3     4     0
3 C         4     6     8     8
4 D         6     8    10     9
5 E         8    10    12    10
6 F         4     5     6    11
7 G         5     6     7    12
Robin Mackenzie
  • 18,801
  • 7
  • 38
  • 56
0

You could use dplyr and tidyr to get your desired result:

Let be

df <- data.frame(subject=c(rep("Mother", 2), rep("Child", 2)), modifier=c("chart2", "child", "tech", "unkn"), mother_chart2=1:4, mother_child=5:8, child_tech=9:12, child_unkn=13:16)
> df
  subject modifier mother_chart2 mother_child child_tech child_unkn
1  Mother   chart2             1            5          9         13
2  Mother    child             2            6         10         14
3   Child     tech             3            7         11         15
4   Child     unkn             4            8         12         16

and

df2 <- data.frame(subject=c(rep("Mother", 2), rep("Child", 2)), modifier=c("chart", "child", "tech", "unkn"), mother_chart=101:104, mother_child=105:108, child_tech=109:112, child_unkn=113:116)

> df2
  subject modifier mother_chart mother_child child_tech child_unkn
1  Mother    chart          101          105        109        113
2  Mother    child          102          106        110        114
3   Child     tech          103          107        111        115
4   Child     unkn          104          108        112        116

Then

library(dplyr)
library(tidyr)

df2_tmp <- df2 %>%
  pivot_longer(col=-c("subject", "modifier"))

df %>%
  pivot_longer(col=-c("subject", "modifier")) %>%
  full_join(df2_tmp, by=c("subject", "modifier", "name")) %>%
  mutate(across(starts_with("value"), ~ replace_na(., 0)),
         sum = value.x + value.y) %>%
  select(-value.x, -value.y) %>%
  pivot_wider(names_from=name, values_from=sum, values_fill=0)

returns

# A tibble: 5 x 7
  subject modifier mother_chart2 mother_child child_tech child_unkn mother_chart
  <chr>   <chr>            <dbl>        <dbl>      <dbl>      <dbl>        <dbl>
1 Mother  chart2               1            5          9         13            0
2 Mother  child                2          112        120        128          102
3 Child   tech                 3          114        122        130          103
4 Child   unkn                 4          116        124        132          104
5 Mother  chart                0          105        109        113          101
Martin Gal
  • 16,640
  • 5
  • 21
  • 39
  • Thank you but how to use this if I am reading the data from excel that has multiple sheets? – Amal Nasir May 25 '21 at 17:49
  • 1
    I don't use excel workbooks for storing data an multiple sheets. You could take a look at (https://stackoverflow.com/questions/12945687/read-all-worksheets-in-an-excel-workbook-into-an-r-list-with-data-frames) or (https://stackoverflow.com/questions/46685223/import-multiple-sheets-into-multiple-data-frames-in-r). Perhaps these questions and answers can help you out. I focused on the problem of combining different contigency tables. – Martin Gal May 25 '21 at 17:54