
I have a data frame with the structure below:

W01           0.750000     0.916667     0.642857      1.000000      0.619565   
W02           0.880000     0.944444     0.500000      0.991228      0.675439   
W03           0.729167     0.900000     0.444444      1.000000      0.611111   
W04           0.809524     0.869565     0.500000      1.000000      0.709091   
W05           0.625000     0.925926     0.653846      1.000000      0.589286   

Variation  1_941119_A/G  1_942335_C/G  1_942451_T/C  1_942934_G/C  \
W01            0.967391      0.965909             1      0.130435   
W02            0.929825      0.937500             1      0.184211   
W03            0.925926      0.880000             1      0.138889   
W04            0.918182      0.907407             1      0.200000   
W05            0.901786      0.858491             1      0.178571   

Variation  1_944296_G/A    ...     X_155545046_C/T  X_155774775_G/T  \
W01            0.978261    ...            0.652174         0.641304   
W02            0.938596    ...            0.728070         0.736842   
W03            0.944444    ...            0.675926         0.685185   
W04            0.927273    ...            0.800000         0.690909   
W05            0.901786    ...            0.794643         0.705357   

Variation  Y_5100327_G/T  Y_5100614_T/G  Y_12786160_G/A  Y_12914512_C/A  \
W01             0.807692       0.800000        0.730769        0.807692   
W02             0.655172       0.653846        0.551724        0.666667   
W03             0.880000       0.909091        0.833333        0.916667   
W04             0.666667       0.642857        0.580645        0.678571   
W05             0.730769       0.720000        0.692308        0.720000   

Variation  Y_13470103_G/A  Y_19705901_A/G  Y_20587967_A/C  mean_age  
W01              0.807692        0.666667        0.333333      56.3  
W02              0.678571        0.520000        0.250000      66.3  
W03              0.916667        0.764706        0.291667      69.7  
W04              0.666667        0.560000        0.322581      71.6  
W05              0.703704        0.600000        0.346154      72.5  

[5 rows x 67000 columns]

I am trying to fit a robust regression with an MM-estimator for each variant and collect summary statistics of each fit (the p-value and the slope) using the snippet below:

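library(dplyr)   # %>%, mutate(), select()
library(tidyr)   # gather(), nest(), unnest()
library(purrr)   # map()
library(broom)   # glance()
library(MASS)    # rlm(), psi.bisquare

# Reshape to long format, then fit one robust regression per SNP
df %>%
    gather(snp, value, -mean_age) %>%
    nest(-snp) %>%
    mutate(model = map(data, ~rlm(mean_age ~ value, data = ., method="MM", psi=psi.bisquare, maxit=50)),
           summary = map(model, glance)) %>%
    dplyr::select(-data, -model) %>%
    unnest(summary) -> linear_regression_results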

This however throws the well-known rlm singular error:

Error in rlm.default(x, y, weights, method = method, wt.method = wt.method,  : 
  'x' is singular: singular fits are not implemented in 'rlm' 

Are there any suggestions on how to resolve this error?
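A minimal sketch that reproduces the error on a toy data frame, assuming the cause is a predictor that is constant across all rows (as column 1_942451_T/C is in the printout above). The names `toy`, `X`, and `msg` are made up for illustration; only `rlm()` from MASS and base R functions are used.

```r
library(MASS)  # provides rlm()

# Toy data: 'value' is constant, like 1_942451_T/C in the printout above
toy <- data.frame(
  value    = rep(1, 5),
  mean_age = c(56.3, 66.3, 69.7, 71.6, 72.5)
)

# The design matrix holds an intercept column of 1s and a 'value' column of 1s,
# so its two columns are identical and its rank is below its column count
X <- model.matrix(mean_age ~ value, data = toy)
qr(X)$rank < ncol(X)

# Capture the error message instead of stopping the script
msg <- tryCatch(
  {
    rlm(mean_age ~ value, data = toy, method = "MM")
    "fit succeeded"
  },
  error = function(e) conditionMessage(e)
)
print(msg)
```

If this is indeed the cause, any SNP column with a single distinct value will trigger the same error inside the `map()` call.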

RJF
  • Have you done any searching? If so you should summarize your efforts and provide links to what you found. – IRTFM Jul 08 '19 at 20:41
  • I saw this (https://stackoverflow.com/questions/32906388/r-rlm-model-error-x-is-singular-singular-fits-are-not-implemented-in-rlm) and other threads suggesting to remove missing values or using `unique` method. Neither appeared helpful! – RJF Jul 08 '19 at 21:01
  • You have not offered a means of problem-solving that we can participate in. Notice that the successful response to the similar question had enough data to examine. – IRTFM Jul 08 '19 at 21:35
  • My dataframe is exactly like what I have explained! – RJF Jul 08 '19 at 22:47
  • I don’t see any description. But if there were one, exactly how long do you think it might take to construct a possibly similar one, and why do you think this is _our_ responsibility? – IRTFM Jul 09 '19 at 20:45
  • The problem is due to duplicate observations in some columns (i.e. the values in some columns are 1 across all rows). I am aware that I can use `rnorm()` or `jitter()` to get around this error, but I am wondering whether this would affect the slope of the regression line. How could I use it appropriately to restructure my df? – RJF Jul 10 '19 at 16:05

1 Answer


This error can occur when a predictor column is constant, that is, when every row holds the same value. In the data frame above, column 1_942451_T/C is 1 in all rows, so it is collinear with the intercept and the design matrix is singular. A simple, ad hoc workaround is to jitter the values:

jittered_DF <- data.frame(lapply(df, jitter))

or

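# rnorm(x) alone would replace the data with random draws, so add small noise to the original values instead
# (sd = 1e-6 is an arbitrary small value, chosen here only for illustration)
r_DF <- data.frame(lapply(df, function(x) x + rnorm(length(x), sd = 1e-6)))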

It would be more precise to apply jitter() only to the columns with duplicate values rather than to the whole data frame.
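A minimal sketch of that selective approach in base R, shown on a toy data frame. The column names `snp_a` and `snp_const`, the helper objects `predictors` and `is_constant`, and the jitter amount of 1e-6 are all assumptions made for illustration, not values taken from the real data.

```r
# Toy stand-in for the real data; 'snp_const' mimics a column such as 1_942451_T/C
df <- data.frame(
  snp_a     = c(0.750000, 0.880000, 0.729167, 0.809524, 0.625000),
  snp_const = c(1, 1, 1, 1, 1),
  mean_age  = c(56.3, 66.3, 69.7, 71.6, 72.5)
)

# Predictors are every column except the response
predictors <- setdiff(names(df), "mean_age")

# Flag columns that hold a single distinct non-missing value
is_constant <- vapply(
  df[predictors],
  function(x) length(unique(na.omit(x))) == 1,
  logical(1)
)

# Option 1: jitter only the flagged columns, leaving all other values untouched
set.seed(1)  # makes the random noise reproducible
jittered_DF <- df
jittered_DF[predictors[is_constant]] <- lapply(
  df[predictors[is_constant]],
  jitter,
  amount = 1e-6  # assumed noise level; noise is uniform in [-amount, amount]
)

# Option 2: drop the flagged columns instead of perturbing them
informative_DF <- df[c(predictors[!is_constant], "mean_age")]
```

On the slope question raised in the comments: a predictor with no variation carries no information about the slope, so after jittering, the fitted slope for such a column is driven entirely by the added noise. Dropping those columns (option 2) or reporting them as not estimable may therefore be the safer choice.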

RJF