1

I have two dataframes (df1, df2). I would like to combine them such that if a gene is present in df1 for a specific sample ID, I would like it to be allocated "1" (present") in the appropriate matching gene column and sample ID row in df2. If the gene is not present in df1, it would autopopulate df2 with a "0" (absent), as seen in the result dataframe below. Your assistance on this would be much appreciated. Thank you in advance.

df1

+-----------+-----+
| Gene name | ID  |
+-----------+-----+
| A         | 100 |
| A         | 105 |
| B         | 100 |
| B         | 101 |
| B         | 103 |
| C         | 105 |
+-----------+-----+

df2

+-----+---+---+---+---+
| ID  | A | B | C | D |
+-----+---+---+---+---+
| 100 |   |   |   |   |
| 101 |   |   |   |   |
| 102 |   |   |   |   |
| 103 |   |   |   |   |
| 104 |   |   |   |   |
| 105 |   |   |   |   |
| 106 |   |   |   |   |
+-----+---+---+---+---+

result

+-----+---+---+---+---+
| ID  | A | B | C | D |
+-----+---+---+---+---+
| 100 | 1 | 1 | 0 | 0 |
| 101 | 0 | 1 | 0 | 0 |
| 102 | 0 | 0 | 0 | 0 |
| 103 | 0 | 1 | 0 | 0 |
| 104 | 0 | 0 | 0 | 0 |
| 105 | 0 | 0 | 1 | 0 |
| 106 | 0 | 0 | 0 | 0 |
+-----+---+---+---+---+
Edmond Kwan
  • 157
  • 4
  • No need for a merge. A simple long to wide conversion of `df1` suffices: `df1 %>% mutate(n = 1) %>% spread(Gene.name, n, fill = 0)` or `reshape2::dcast(transform(df1, n = 1), ID ~ Gene.name, fill = 0)`. See the dupe target links for many more options. – Maurits Evers Sep 19 '19 at 02:52
  • PS and on a slightly off-topic note: I frequently see sample data presented in such a way. Generally this is *not* a good way, because we can't easily copy & paste data into an R terminal (it requires us to strip off the extra and unnecessary "border characters"). Is there a particular tool that outputs data in such a way? I hope you don't spend time formatting data like this by hand? In which case, please don't:-) – Maurits Evers Sep 19 '19 at 03:22
  • Thanks Maurits, much appreciated. Apologies I wasn't clear, but the reason why I wanted to include df2 is that not all samples have a gene tied to them. For example, df1 makes no mention of Sample 106, so how do I ensure in my output that sample 106 gets a "0" for all the different genes? – Edmond Kwan Sep 19 '19 at 04:23
  • In regards to your second question, unfortunately the way I receive the raw data is such that I have to format by "hand". Essentially I get a list of "genomic variants" that have abnormal, and if a specific gene is not listed for a sample, the assumption is that no variants exists in that gene? is there a better way? – Edmond Kwan Sep 19 '19 at 04:26
  • In regards to your first question I would use `complete` to fill in missing levels. `genes <- LETTERS[1:4]; IDs <- 100:106; df1 %>% mutate(n = 1, Gene.name = factor(Gene.name, levels = lvl), ID = factor(ID, levels = IDs)) %>% complete(Gene.name, ID) %>% spread(Gene.name, n, fill = 0)`; adjust `genes` and `IDs` as necessary. – Maurits Evers Sep 19 '19 at 04:51
  • In regards to your second question, it's best to use `dput` to include reproducible sample data (see `?dput` for details). Alternatively, include a space/tab delimited text version of your sample data (without the additional border characters), which we can then read in fairly easily with `read.table(text = ...)` (provided there are no missing values). – Maurits Evers Sep 19 '19 at 04:53
  • Apologies. I'm a bit confused. You specify the string "genes", but then in the subsequent argument it doesn't get recalled again. Is it meant to read `factor(Gene.name, levels = genes)`, where Gene.name is the column name in df1? I get an error that reads `Duplicate identifiers for rows (136, 137), (1905, 1906)` – Edmond Kwan Sep 19 '19 at 06:20
  • Ah my bad & sorry for the confusion. Yes, it should've been `df1 %>% mutate(n = 1, Gene.name = factor(Gene.name, levels = genes), ID = factor(ID, levels = IDs)) %>% complete(Gene.name, ID) %>% spread(Gene.name, n, fill = 0)`. The duplicate identifiers error is another issue that stems from exactly that: Duplicate `Gene.name`s. You need to find a way to deal with them. Otherwise the columns in the wide data won't be unique. This is a very general issue that often pops up when reshaping data from long to wide. – Maurits Evers Sep 19 '19 at 06:23

0 Answers0