I have xy data for gene expression in multiple samples. I wish to subset the first column so I can order the genes alphabetically and perform some other filtering.

> setwd("C:/Users/Will/Desktop/BIOL3063/R code assignment");
> df = read.csv('R-assignments-dataset.csv', stringsAsFactors = FALSE);

Here is a simplified example of the dataset I'm working with; it has 270 columns (tissue samples) and 7065 rows (gene names).

The first column is a list of gene names (A2M, AAAS, AACS etc.) and each subsequent column is a different tissue sample, showing the gene expression in that sample.

The question being asked is "Sort the gene names alphabetically (A-Z) and print out the first 20 gene names"

My thought process would be to subset the first column (gene names) and then perform order() to sort alphabetically, after which I can use head() to print the first 20.

However when I try

> genes <- df[1]

It simply subsets the first column that has data in it (TCGA-A6-2672_TissueA) rather than the one to its left.

Also

> genes <- df[,df$col1];
> genes;
data frame with 0 columns and 7065 rows
> order(genes);
integer(0)

This appears to create a list of gene names in RStudio's viewer, but I cannot perform any manipulation on it.

I am unable to locate the first column in the data.frame, since it does not have a column header. I have the same problem with row 1 (sample names).

I'm a complete novice at R and this is part of an assignment I'm working on. It seems I'm missing something fundamental, but I cannot figure out what.

Cheers guys

Will Finch
  • So what exactly is the desired result here? Are you trying to pull out the values that start with "A2M|2"? When the first column doesn't have a header, R will read those in as rownames. Try looking at `rownames(df)`. – MrFlick Apr 09 '19 at 15:17

2 Answers

If you are asking what I think you are asking, you just need to wrap the subset in data.frame() and give the column a name, which supplies the "header", as you call it. Here it is called V1, the first variable of your new data frame. (Calling as.data.frame(df[,1]) without a name would label the column "df[, 1]" rather than V1.)

genes <- data.frame(V1 = df[,1])
genes$V1
1 A
2 C
3 A
4 B
5 C
6 D
7 A
8 B

As per the comment below, the issue could be avoided if you remove the comma from your subsetting syntax. When you select columns from a data.frame, you only need to index the column, not the rows.

genes <- df[1]
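
To make the difference between the two subsetting forms concrete, here is a minimal sketch on made-up data. The `toy` data frame, its values, and its sample name are assumptions for illustration, not the poster's dataset.

```r
# Hypothetical stand-in for the real data: gene names stored as row names,
# one numeric sample column.
toy <- data.frame(TissueA = c(1.2, 3.4, 5.6),
                  row.names = c("A2M", "AAAS", "AACS"))

one_col <- toy[1]    # list-style indexing keeps a one-column data frame
as_vec  <- toy[, 1]  # matrix-style indexing drops to a plain vector

class(one_col)  # "data.frame"
class(as_vec)   # "numeric"

# Neither form contains the gene names; those are held in the row names.
rownames(toy)
```

Both forms select the first sample column, so neither reaches the gene names when those were read in as row names.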
Dij
  • The difference is in the original poster's subsetting syntax. If you subset with `df[,1]` you return a vector (vectorized version of the first column of the `df`). If you call `df[1]`, you avoid this issue. I am simply answering the questions as it was asked. – Dij Apr 09 '19 at 15:37
  • The problem I am having is that both of those lines of code `genes <- as.data.frame(df[,1])` and `genes <- df[1]` subset the column to the right of the column I intend to subset. I intend to subset the gene names in the first column, not the numerical values in the 2nd column. I've edited my original post for clarity, thanks. – Will Finch Apr 09 '19 at 16:50
  • Ah I see. I originally misunderstood your question. You want to extract the row names using the `rownames` function – Dij Apr 09 '19 at 18:21

Please include a sample of your text file as text instead of an image.


I have created a dataset similar to yours:

    X   Y
1   a   b
2   c   d
3   d   g

Note that your tissue columns have a header but your gene names do not. Therefore the gene names will be interpreted as row names; see ?read.table:

If row.names is not specified and the header line has one less entry than the number of columns, the first column is taken to be the row names.
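
The same rule applies to read.csv, which passes its arguments on to read.table. As a rough sketch, assuming a CSV whose header line lists only the sample names (the text and values below are invented for illustration):

```r
# Hypothetical CSV: the header has two entries, each data row has three.
csv_text <- "TissueA,TissueB
A2M,1.5,2.5
AAAS,0.3,0.7
AACS,4.1,3.9"

df_demo <- read.csv(text = csv_text, stringsAsFactors = FALSE)

ncol(df_demo)      # 2: only the sample columns are real columns
names(df_demo)     # "TissueA" "TissueB"
rownames(df_demo)  # "A2M" "AAAS" "AACS"
```

Because the unnamed first field becomes the row names, `df_demo[1]` is the first sample column rather than the gene names.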

Reading it in R:

df <- read.table(text = '   X   Y
1   a   b
2   c   d
3   d   g')

So your gene names are not in df[1] but in rownames(df). To extract them, use genes <- rownames(df); to add them to the existing df as a column, use df$gene <- rownames(df).
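
Once the names are extracted, sorting them is a short step. The sketch below uses a small invented data frame (`expr` and its values are assumptions); note that the exact ordering of names containing digits can depend on the session's locale.

```r
# Hypothetical expression table with gene names as row names.
expr <- data.frame(TissueA = c(5.1, 0.2, 3.3, 1.8),
                   row.names = c("AACS", "A2M", "ZNF3", "AAAS"))

genes <- rownames(expr)

# sort() returns the sorted values; it is equivalent to genes[order(genes)].
sorted_genes <- sort(genes)

# head() returns at most 20 names (here only 4 exist).
first_20 <- head(sorted_genes, 20)
print(first_20)

# To reorder the whole table so the values stay aligned with the names:
expr_sorted <- expr[order(rownames(expr)), , drop = FALSE]
```

Using `drop = FALSE` keeps `expr_sorted` a data frame even when there is only one sample column.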

There are numerous ways to convert your row names to a column; see, for example, this question.
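
One base-R way to do that conversion, shown as a sketch on invented data (the `expr` table and the column name `gene` are assumptions):

```r
# Hypothetical expression table with gene names as row names.
expr <- data.frame(TissueA = c(5.1, 0.2),
                   TissueB = c(1.0, 2.0),
                   row.names = c("AACS", "A2M"))

# Put the row names in front as an ordinary character column.
with_gene <- data.frame(gene = rownames(expr), expr,
                        stringsAsFactors = FALSE)

# Reset the row names to the default 1, 2, ... sequence.
rownames(with_gene) <- NULL

names(with_gene)  # "gene" "TissueA" "TissueB"
```

After this, `with_gene[1]` and `with_gene$gene` both refer to the gene names, so the usual column subsetting works.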

KingBoomie
  • Thank you! The problem was I couldn't locate my gene names, so using `genes <- rownames(df)` produced the results I was looking for. – Will Finch Apr 09 '19 at 17:26
  • Glad I could help, now you have to take a look at `?order`, like so: `sorted.genes <- genes[order(genes)]` and get the top 20: `top.20 <- head(sorted.genes, 20)` @WillFinch – KingBoomie Apr 09 '19 at 17:36