Is there a way to collapse and convert a dataframe based upon multiple criteria?

Question

I have a dataset (matrix) of genotypes at different marker loci.

structure(list(Genotype = 1:5, Locus_1a = c(0L, 1L, 1L, 1L, 1L ), Locus_1b = c(1L, 1L, 0L, 1L, 0L), Locus_1c = c(1L, 0L, 1L, 1L, 0L), Locus_2a = c(0L, 1L, 1L, 0L, 0L), Locus_2b = c(1L, 1L, 0L, 1L, 1L), Locus_2c = c(1L, 0L, 1L, 1L, 0L)), class = "data.frame", row.names = c(NA, -5L))

I am trying to convert the dataset from 0/1 notation to the following:

Genotype	Locus_1_1	Locus_1_2	Locus_1_3
1	b	c
2	a	b
3	a	c
4	a	b	c
5	a	a

Most genotypes (1 through 3 in this example) are diploid (2n) and have two distinct alleles, each identified by the letter at the end of the column name.

Individual 4 is a triploid (3n) individual and has three distinct alleles.

Individual 5 is diploid (2n) but homozygous for a single allele (Locus_1a), so that allele should appear twice in the output.

Each column is named for a marker locus, followed by a letter (a, b, c, etc.) identifying the allele; a 1 means that allele was detected in that individual.

I'm not sure how to write code that performs this conversion.

The above example represents a minimal, reproducible example. – Josh Jun 05 '23 at 21:17
Yes, it is, but we cannot manipulate the data ourselves if it is not shared as code. When the OP shares data as formatted tables, we have to either use clunky functions to read the data in or hard-code the data in our own R sessions. This is why it is usually considered good practice to share the data either as `data.frame(Genotype = ....` or as the output of `dput(data)`, which usually looks like `structure(....` – GuedesBF Jun 05 '23 at 21:21
structure(list(Genotype = 1:5, locus1_a = c(0L, 1L, 1L, 1L, 1L ), locus1_b = c(1L, 0L, 0L, 0L, 0L), locus1_c = c(0L, 0L, 0L, 0L, 0L), locus1_c.1 = c(0L, 0L, 0L, 0L, 0L), locus1_d = c(1L, 0L, 0L, 0L, 1L), locus1_e = c(0L, 0L, 1L, 1L, 0L), locus1_f = c(0L, 0L, 0L, 0L, 0L), locus1_g = c(0L, 0L, 0L, 0L, 0L)), class = "data.frame", row.names = c(NA, -5L)) – Josh Jun 05 '23 at 21:26
The naming pattern in the example shared with `dput` is a bit different. There is a column called `locus1_c.1`. Is this an error? – GuedesBF Jun 05 '23 at 21:38
Also, please use a single dataset. The formatted table and the dput data are different, and have some relevant structural differences. – GuedesBF Jun 05 '23 at 21:38
structure(list(Genotype = 1:5, Locus_1a = c(0L, 1L, 1L, 1L, 1L ), Locus_1b = c(1L, 1L, 0L, 1L, 0L), Locus_1c = c(1L, 0L, 1L, 1L, 0L), Locus_2a = c(0L, 1L, 1L, 0L, 0L), Locus_2b = c(1L, 1L, 0L, 1L, 1L), Locus_2c = c(1L, 0L, 1L, 1L, 0L)), class = "data.frame", row.names = c(NA, -5L)) – Josh Jun 05 '23 at 21:58
I know I may seem a bit annoying with all these suggestions, but please take them as candid advice that will greatly increase your chances of getting valuable help. Please always include relevant data inside the question, not as comments. The question should be self-contained. – GuedesBF Jun 05 '23 at 22:10
You may consider checking this: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example This is a must for R programmers who are new to SO. – GuedesBF Jun 05 '23 at 22:10

score 1 · Answer 1 · answered Jun 05 '23 at 22:52

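library(dplyr)
library(tidyr)

df %>%
  pivot_longer(-Genotype) %>%                      # one row per Genotype/column pair
  filter(value > 0) %>%                            # keep detected alleles only
  extract(name, "value", "_\\d(.*)") %>%           # keep the allele letter; the locus number is dropped
  distinct() %>%                                   # the same letter seen at two loci collapses to one row
  mutate(name = row_number(), .by = Genotype) %>%  # number the alleles within each genotype
  pivot_wider(names_prefix = "Locus_")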

# A tibble: 5 × 4
  Genotype Locus_1 Locus_2 Locus_3
     <int> <chr>   <chr>   <chr>  
1        1 b       c       NA     
2        2 a       b       NA     
3        3 a       c       NA     
4        4 a       b       c      
5        5 a       b       NA

Onyambu


The issue here is that genotype 5 is incorrect. It should be a/a, but your code scores it as a/b. It only detects an allele at locus a, but is a diploid individual, and therefore is homozygous for allele a. – Josh Jun 05 '23 at 23:20
@Josh the data you gave has `1` in `a` and `b`. Probably you need to change that to have data on `a` and `a` instead – Onyambu Jun 05 '23 at 23:23
Here Locus_1 has three alleles that can be detected: A, B, or C. The same goes for Locus_2. It’s not three separate loci, it’s two loci with three different alleles. The issue arises when an individual is homozygous for one allele: that allele needs to be represented more than once. – Josh Jun 05 '23 at 23:29
@Josh check your data again. Note that all the rows are the same except row 5. It's not because of the code but rather because of the data provided. – Onyambu Jun 05 '23 at 23:31
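
The a/b result for genotype 5 seems to come from the locus number being dropped before `distinct()`: in the posted data, genotype 5 has a 1 in `Locus_1a` and in `Locus_2b`, so the two letters belong to different loci. Below is a minimal sketch that keeps the loci separate. It assumes column names of the form `Locus_<number><letter>`, and it assumes that a single detected allele at a locus means a diploid homozygote, so that allele is written twice. The names `locus`, `allele`, `slot` and `result` are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Data as posted in the question
df <- structure(list(
  Genotype = 1:5,
  Locus_1a = c(0L, 1L, 1L, 1L, 1L), Locus_1b = c(1L, 1L, 0L, 1L, 0L),
  Locus_1c = c(1L, 0L, 1L, 1L, 0L), Locus_2a = c(0L, 1L, 1L, 0L, 0L),
  Locus_2b = c(1L, 1L, 0L, 1L, 1L), Locus_2c = c(1L, 0L, 1L, 1L, 0L)
), class = "data.frame", row.names = c(NA, -5L))

result <- df %>%
  # Split each column name into a locus part ("Locus_1") and an allele letter ("a")
  pivot_longer(-Genotype,
               names_to = c("locus", "allele"),
               names_pattern = "^(.+\\d)([a-z]+)$") %>%
  # Keep only the alleles that were detected
  filter(value > 0) %>%
  # Assumption: one detected allele means a diploid homozygote, so repeat it
  reframe(allele = if (n() == 1L) rep(allele, 2L) else allele,
          .by = c(Genotype, locus)) %>%
  # Number the alleles within each genotype/locus pair
  mutate(slot = row_number(), .by = c(Genotype, locus)) %>%
  # One column per locus and slot, e.g. Locus_1_1, Locus_1_2, ...
  pivot_wider(id_cols = Genotype,
              names_from = c(locus, slot),
              values_from = allele,
              names_sort = TRUE)

result
```

With the posted data this should give a/a for genotype 5 at `Locus_1`, with `Locus_1_3` filled only for genotype 4. Note that 0/1 data alone cannot distinguish a triploid carrying two copies of one allele from a diploid heterozygote, so ploidy would have to be supplied separately if that case matters.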
