manipulating all the tables in an R group_split list

Question

Edited

I have a large table, which starts like this

Essentially, it's a table with multiple samples ("samp_id") showing the number ("least") of "taxon" present in each.

I want to transpose/pivot the table to look like this;

i.e. with "taxon" as the top row, with each of the 90 samples in "data" following as a row based on the "least" column, re-named with its "samp_id". So you see what each sample is, as well the value in "least" for each sample in the different "taxon" (which may not be identical across the 90 samples).

Previously, I have separated the data into multiple tibbles based on "samp_id", selected "taxon" and "least", re-named "least" with the "samp_id" then combined the individual tibbles based on "taxon" with full_join using something like the code below, then transposing the combined table

ACLOD_11 = data %>%
  filter(samp_id == "ACLOD_11") %>%
  select(taxon, least) %>%
  rename("ACLOD_11" = least) 

ACLOD_12 = data ... #as above, but different samp_id

data_final = list(ACLOD_11, ACLOD_12, ...) %>% 
  reduce(full_join, by = "taxon")

As I have more data tables to follow after this one with 90 samples, so I want to be able to do this without having to individually separate the data into 100s of tibbles and manually inputting the "samp_id" before joining.

I have currently split the data into 90 separate tibbles based on "samp_id" (there are 90 samples in "data")

data_split = data %>%
  group_split(samp_id)

but am unsure if this is the best way to do this, or what I should to next?

This sounds a bit like a "long to wide" problem. Have you thought about using something like `tidyr::pivot_wider()` on the columns "samp_id", "least", and "taxon"? If things are very large and you need some speed you could try `data.table::dcast()` for widening. — aosmith, Jun 07 '21 at 16:38
thanks for this, it's almost close; I've edited the question to (hopefully!) add more clarity, would appreciate your thoughs! — Maho, Jun 07 '21 at 17:40
If you select just the three columns that are relevant to the problem (samp_id, least, taxon), does `tidyr::pivot_wider(data, names_from = "taxon", values_from = "least")` get you close? If you want folks to test things, try adding some data we can copy and paste instead of screenshots. :) Some ideas on how to do that [here](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). — aosmith, Jun 07 '21 at 19:54
that worked, thanks for your help! - yes, I understand that it's better to provide an example data so that others can test the code out :) — Maho, Jun 08 '21 at 09:04

score 0 · Answer 1 · answered Jun 07 '21 at 16:15

0

We can use

library(dplyr)
library(purrr)
data %>%
     split(.$samp_id) %>%
     imap(~ .x %>%
              select(taxon, least) %>%
              rename(!!.y := least)) %>%
      reduce(full_join, by = 'taxon')

answered Jun 07 '21 at 16:15

akrun

874,273
37
540
662

Thanks - I've tried this, but got the following error; is there a way around this? "Error: cannot allocate vector of size 1.0 Gb" – Maho Jun 07 '21 at 16:31
@Maho that is unrelated to the code. It is memory you don't have. Perhaps tyr on a new session or use a cluster with memory – akrun Jun 07 '21 at 16:33

manipulating all the tables in an R group_split list

1 Answers1