how to make a data frame for lm in r

Question

I have a dataframe including gene expression with all genes for individuals and weight values for all individuals. I want to perform lm for all genes (A1--A13) with weight for all individuals (S1-----S9).

gene    S1  S2  S3  S4  S5  S6  S7  S8  S9
weight  1,34175933  NA  0,506664615 2,404181093 0,853749494 0,931450603 2,666384344 1,483623026 1,908323207
A1  0   0   0   0   0   0   0   0   0
A2  0   0   0   0   0   0   0   0   0
A3  0,047059    0   0   0   0,055744    0   0   0   0
A4  0   0   0   0   0   0   0   0   0
A5  0   0   0   0   0   0   0   0   0
A6  0   0   0   0   0   0   0   0   0
A7  0   0   0   0   0   0   0   0   0
A8  0   0   0   0   0   0   0   0   0
A9  0   0   0   0   0   0   0   0   0
A10 0   0   0   0   0   0   0   0   0
A11 0   0   0   0   0   0   0   0   0
A12 0   0   0   0   0   0   0   0   0
A13 0   0   0   0   0   0   0   0   0

The output I want is p-values for all genes for weight across all individuals. The problem I am having is this dataframe is not reading weight as a separate column. Thank you!

What gene expression values are these? Counts? Signal intensities? — Anurag N. Sharma, May 05 '20 at 16:13
This is the wrong structure for doing ... most things in R. I suspect you'll need to: (1) `t`ranspose the data so that individuals are *rows*; (2) re-convert to a `data.frame`; (3) convert most things back with `as.integer`; then (4) figure out what `1,341758933` means *numerically* for weighting (it means nothing to me), and parse it appropriately. (This goes for `A3` as well ... that isn't numerical.) — r2evans, May 05 '20 at 16:15
Are you sure what you're attempting to model has any use given that there are mostly only lowly expressed genes in each sample with a few exceptions? — Anurag N. Sharma, May 05 '20 at 16:30
@AnuragN.Sharma, I have more genes. This is just an example. I want to get the model running first. — Jessica, May 05 '20 at 16:34
@r2evans, I have more number of genes than total number of columns in excel. — Jessica, May 05 '20 at 17:24
Is there a way to drop genes that show 0 TPM across all the individuals. — Jessica, May 05 '20 at 18:05
Yes you can remove all columns (after you transpose) with `library(dplyr) all_zero <- function(x) !sum(x, na.rm = TRUE) == 0 model_me <- model_me %>% select_if(all_zero) ` — Chuck P, May 05 '20 at 20:14
@Jessica to deal with the lowly expressing genes, the package ```edgeR``` has a solution provided you have access to the raw read counts. Go through the vignette at https://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf. It is preferable to not filter them out manually. And once you have done that, you could alternatively use logCPM values or log TPM values for your regression. — Anurag N. Sharma, May 06 '20 at 02:17

score 1 · Accepted Answer · answered May 05 '20 at 20:06

1

You can try this,

library(broom)
library(dplyr)

weight = t(mat[1,-1])
y = mat[-1,-1]
rownames(y) = as.character(mat[,1])[-1]
colnames(y) = colnames(mat)[-1]
y = t(y)

Now we have it in the correct format, every column of y (genes) against x (weight). You can exclude columns that are all zeros (or colMeans > some value), in the above example, you have only 1 column that is not all zeros:

table(colMeans(y)>0)
FALSE  TRUE 
   12     1

Hopefully you sort that out. For example purpose, I will create a matrix Y the same dimensions as y and regress through everything:

Y = matrix(runif(length(y)),ncol=ncol(y),dimnames=list(rownames(y),colnames(y)))

tidy(lm(Y ~ weight)) %>% filter(term!="(Intercept)")
# A tibble: 13 x 6
   response term   estimate std.error statistic p.value
   <chr>    <chr>     <dbl>     <dbl>     <dbl>   <dbl>
 1 A1       weight -0.00297     0.159   -0.0186  0.986 
 2 A2       weight  0.184       0.144    1.28    0.248 
 3 A3       weight -0.0986      0.111   -0.888   0.409 
 4 A4       weight -0.0459      0.202   -0.227   0.828 
 5 A5       weight -0.111       0.115   -0.966   0.371 
 6 A6       weight -0.0160      0.142   -0.113   0.914 
 7 A7       weight  0.250       0.107    2.34    0.0577
 8 A8       weight  0.159       0.185    0.859   0.423 
 9 A9       weight  0.0612      0.148    0.413   0.694 
10 A10      weight  0.347       0.100    3.45    0.0136
11 A11      weight -0.186       0.182   -1.02    0.346 
12 A12      weight -0.0748      0.177   -0.422   0.688 
13 A13      weight  0.0784      0.137    0.572   0.588

The above should work quite ok for a few thousand columns.. Again makes sense to filter out the columns with too many zeros.

The data used:

mat = read.table(text="gene    S1  S2  S3  S4  S5  S6  S7  S8  S9
weight  1,34175933  NA  0,506664615 2,404181093 0,853749494 0,931450603 2,666384344 1,483623026 1,908323207
A1  0   0   0   0   0   0   0   0   0
A2  0   0   0   0   0   0   0   0   0
A3  0,047059    0   0   0   0,055744    0   0   0   0
A4  0   0   0   0   0   0   0   0   0
A5  0   0   0   0   0   0   0   0   0
A6  0   0   0   0   0   0   0   0   0
A7  0   0   0   0   0   0   0   0   0
A8  0   0   0   0   0   0   0   0   0
A9  0   0   0   0   0   0   0   0   0
A10 0   0   0   0   0   0   0   0   0
A11 0   0   0   0   0   0   0   0   0
A12 0   0   0   0   0   0   0   0   0
A13 0   0   0   0   0   0   0   0   0",header=TRUE,dec=",")

answered May 05 '20 at 20:06

StupidWolf

45,075
17
40
72

thank you! It worked and gave the p values. However, when I filtered out the columns with all zeros using, df <- function(df) !sum(df, na.rm = TRUE) == 0 df <- model_me %>% select_if(df) Then rownames(y) = as.character(mat[,1])[-1] gave an error: Error in `.rowNamesDF<-`(x, value = value) : invalid 'row.names' length In addition: Warning message: Setting row names on a tibble is deprecated. – Jessica May 05 '20 at 21:59
1

oh.. you have a tibble..how big is your data? you can just convert it to a data.frame, or read in your data using read.csv like I did. I don't think you gain much by reading it in a tibble. And remember to use the dec="," – StupidWolf May 05 '20 at 22:16
Thank you! It worked as I imported using read.csv. If I try to drop the genes with all zeros it changes data structure. Is there a way to drop without changing structure. – Jessica May 05 '20 at 22:25
1

how many are you left with? If you are left with one, from my code above, do ```y[,colSums(y)>0,drop=FALSE]``` after the t() ... – StupidWolf May 05 '20 at 22:28
Thank you so much, it worked. Can I add gene annotations from a gff file looking like: seqid source type start end score strand phase attributes 1 G1.3ch00 maker gene 21882 22007 NA - ID=G_06g024400.1;Name=G_06g024400.1;Note=Similar to 06g024400.1.1: LOW QUALITY:protein;ref_id=06g024400.1.1. I want to add a column to the results from lm with Note from gff column attributes. Thank you! – Jessica May 05 '20 at 23:48
you import the gff using rtracklayer, e.g gff = import(), then you extract the columns annotation_df = values(gff) – StupidWolf May 05 '20 at 23:56
I have a question if I have multiple response variables in addition to weight like height and yield. How can I transform the dataset to correct format as y dataset in your answer. So that I do not have to make multiple input datasets and can perform lm for each response variable one by one with the same dataset. Thank you! – Jessica May 06 '20 at 18:55
I followed your answer and got the results. But I get different results for the order of genes and p values if I repeat the same code. I checked everything a couple of times. I using same code, same dataset but getting different p values for genes every time. Thank you! – Jessica May 07 '20 at 19:28
did you run this line ```Y = matrix(runif(length(y)),ncol=ncol(y),dimnames=list(rownames(y),colnames(y)))```? This is supposed to be a simulation because in your example, there is not enough data – StupidWolf May 07 '20 at 19:31
1

what you need to do is ```y = mat[-1,-1]; rownames(y) = as.character(mat[,1])[-1]; colnames(y) = colnames(mat)[-1]; Y = t(y); Y= Y[,colSums(Y)>0] to remove columns with all 0s and run the lm – StupidWolf May 07 '20 at 19:33

Chuck P · Answer 2 · 2020-05-05T20:17:42.677

As @r2evans noted your biggest problem is your data are in the wrong orientation...

Need to fix that. I'm going to assume the dataset is too large for you to do that manually. So here is some very ugly R code but that you should be able to modify no matter how many subjects or genes.

# what you gave us
your_data <- read.table(text = "gene    S1  S2  S3  S4  S5  S6  S7  S8  S9
weight  1,34175933  NA  0,506664615 2,404181093 0,853749494 0,931450603 2,666384344 1,483623026 1,908323207
A1  0   0   0   0   0   0   0   0   0
A2  0   0   0   0   0   0   0   0   0
A3  0,047059    0   0   0   0,055744    0   0   0   0
A4  0   0   0   0   0   0   0   0   0
A5  0   0   0   0   0   0   0   0   0
A6  0   0   0   0   0   0   0   0   0
A7  0   0   0   0   0   0   0   0   0
A8  0   0   0   0   0   0   0   0   0
A9  0   0   0   0   0   0   0   0   0
A10 0   0   0   0   0   0   0   0   0
A11 0   0   0   0   0   0   0   0   0
A12 0   0   0   0   0   0   0   0   0
A13 0   0   0   0   0   0   0   0   0", header = TRUE)

# save your data from excel to a csv file
# your_data <- read.table("untitled.csv", header = TRUE )
# should show about like this
your_data
#>      gene         S1 S2          S3          S4          S5          S6
#> 1  weight 1,34175933 NA 0,506664615 2,404181093 0,853749494 0,931450603
#> 2      A1          0  0           0           0           0           0
#> 3      A2          0  0           0           0           0           0
#> 4      A3   0,047059  0           0           0    0,055744           0
#> 5      A4          0  0           0           0           0           0
#> 6      A5          0  0           0           0           0           0
#> 7      A6          0  0           0           0           0           0
#> 8      A7          0  0           0           0           0           0
#> 9      A8          0  0           0           0           0           0
#> 10     A9          0  0           0           0           0           0
#> 11    A10          0  0           0           0           0           0
#> 12    A11          0  0           0           0           0           0
#> 13    A12          0  0           0           0           0           0
#> 14    A13          0  0           0           0           0           0
#>             S7          S8          S9
#> 1  2,666384344 1,483623026 1,908323207
#> 2            0           0           0
#> 3            0           0           0
#> 4            0           0           0
#> 5            0           0           0
#> 6            0           0           0
#> 7            0           0           0
#> 8            0           0           0
#> 9            0           0           0
#> 10           0           0           0
#> 11           0           0           0
#> 12           0           0           0
#> 13           0           0           0
#> 14           0           0           0
# let's flip it in R
your_data <- as.data.frame(t(your_data))
your_data
#>               V1 V2 V3       V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14
#> gene      weight A1 A2       A3 A4 A5 A6 A7 A8  A9 A10 A11 A12 A13
#> S1    1,34175933  0  0 0,047059  0  0  0  0  0   0   0   0   0   0
#> S2          <NA>  0  0        0  0  0  0  0  0   0   0   0   0   0
#> S3   0,506664615  0  0        0  0  0  0  0  0   0   0   0   0   0
#> S4   2,404181093  0  0        0  0  0  0  0  0   0   0   0   0   0
#> S5   0,853749494  0  0 0,055744  0  0  0  0  0   0   0   0   0   0
#> S6   0,931450603  0  0        0  0  0  0  0  0   0   0   0   0   0
#> S7   2,666384344  0  0        0  0  0  0  0  0   0   0   0   0   0
#> S8   1,483623026  0  0        0  0  0  0  0  0   0   0   0   0   0
#> S9   1,908323207  0  0        0  0  0  0  0  0   0   0   0   0   0
# let's write it back out since you say you have a lot of genes
write.csv(your_data, file = "transposed.csv", na = "", row.names = TRUE)
# read it back in and get the header correct
fixed_data <- read.csv("transposed.csv", skip = 1, header = TRUE, na.strings = "")
fixed_data
#>   gene      weight A1 A2       A3 A4 A5 A6 A7 A8 A9 A10 A11 A12 A13
#> 1   S1  1,34175933  0  0 0,047059  0  0  0  0  0  0   0   0   0   0
#> 2   S2        <NA>  0  0        0  0  0  0  0  0  0   0   0   0   0
#> 3   S3 0,506664615  0  0        0  0  0  0  0  0  0   0   0   0   0
#> 4   S4 2,404181093  0  0        0  0  0  0  0  0  0   0   0   0   0
#> 5   S5 0,853749494  0  0 0,055744  0  0  0  0  0  0   0   0   0   0
#> 6   S6 0,931450603  0  0        0  0  0  0  0  0  0   0   0   0   0
#> 7   S7 2,666384344  0  0        0  0  0  0  0  0  0   0   0   0   0
#> 8   S8 1,483623026  0  0        0  0  0  0  0  0  0   0   0   0   0
#> 9   S9 1,908323207  0  0        0  0  0  0  0  0  0   0   0   0   0
# better?  it's now by subject not gene
fixed_data$subject <- fixed_data$gene
# I'm fixing the ones in your sample data you need to check all columns
fixed_data$weight <- as.numeric(stringr::str_replace(fixed_data$weight, ",", "."))
fixed_data$A3 <- as.numeric(stringr::str_replace(fixed_data$A3, ",", "."))
# much better
fixed_data
#>   gene    weight A1 A2       A3 A4 A5 A6 A7 A8 A9 A10 A11 A12 A13 subject
#> 1   S1 1.3417593  0  0 0.047059  0  0  0  0  0  0   0   0   0   0      S1
#> 2   S2        NA  0  0 0.000000  0  0  0  0  0  0   0   0   0   0      S2
#> 3   S3 0.5066646  0  0 0.000000  0  0  0  0  0  0   0   0   0   0      S3
#> 4   S4 2.4041811  0  0 0.000000  0  0  0  0  0  0   0   0   0   0      S4
#> 5   S5 0.8537495  0  0 0.055744  0  0  0  0  0  0   0   0   0   0      S5
#> 6   S6 0.9314506  0  0 0.000000  0  0  0  0  0  0   0   0   0   0      S6
#> 7   S7 2.6663843  0  0 0.000000  0  0  0  0  0  0   0   0   0   0      S7
#> 8   S8 1.4836230  0  0 0.000000  0  0  0  0  0  0   0   0   0   0      S8
#> 9   S9 1.9083232  0  0 0.000000  0  0  0  0  0  0   0   0   0   0      S9
# make a copy
model_me <- fixed_data
# remove non regressor relevants
model_me$gene <- NULL
model_me$subject <- NULL

# per your question in the comments remove all columns 
# where TPM is zero for all subjects
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
all_zero <- function(x) !sum(x, na.rm = TRUE) == 0
model_me <- model_me %>% select_if(all_zero)

lm(weight ~ ., data = model_me)
#> 
#> Call:
#> lm(formula = weight ~ ., data = model_me)
#> 
#> Coefficients:
#> (Intercept)           A3  
#>       1.656      -11.174
summary(lm(weight ~ ., data = model_me))
#> 
#> Call:
#> lm(formula = weight ~ ., data = model_me)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -1.1489 -0.3153  0.0200  0.3767  1.0108 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)   
#> (Intercept)   1.6556     0.3157   5.244  0.00193 **
#> A3          -11.1742    12.2409  -0.913  0.39651   
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.7743 on 6 degrees of freedom
#>   (1 observation deleted due to missingness)
#> Multiple R-squared:  0.1219, Adjusted R-squared:  -0.02439 
#> F-statistic: 0.8333 on 1 and 6 DF,  p-value: 0.3965

^{Created on 2020-05-05 by the reprex package (v0.3.0)}

Thank you @Chuck P! In the lm step, I am getting this error: Error: protect(): protection stack overflow — Jessica, May 05 '20 at 21:31
You're running out of memory. Take a look here for starters. https://stackoverflow.com/questions/32826906, and if you are getting it I would definitely remove useless columns with `all_zero <- function(x) !sum(x, na.rm = TRUE) == 0 model_me <- model_me %>% select_if(all_zero)` — Chuck P, May 05 '20 at 23:21

how to make a data frame for lm in r

2 Answers2