I have 569 rows of data related to breast cancer. In column A, each row either has a value of 'M' or 'B' in the cell (malignant or benign). In column B, the concavity of the nucleus of each tumour is given. I want to find the mean concavity for all malignant tumours, and for all benign tumours, separately.

Edit: first 25 rows of columns A and B given below as an example

> df2
    data2.diagnosis data2.concavity_mean
1                 M            0.3001000
2                 M            0.0869000
3                 M            0.1974000
4                 M            0.2414000
5                 M            0.1980000
6                 M            0.1578000
7                 M            0.1127000
8                 M            0.0936600
9                 M            0.1859000
10                M            0.2273000
11                M            0.0329900
12                M            0.0995400
13                M            0.2065000
14                M            0.0993800
15                M            0.2128000
16                M            0.1639000
17                M            0.0739500
18                M            0.1722000
19                M            0.1479000
20                B            0.0666400
21                B            0.0456800
22                B            0.0295600
23                M            0.2077000
24                M            0.1097000
25                M            0.1525000

How do I ask R to give me "the mean of rows in column B, given their value in column A is M" and then "given their value in column A is B"?

R_newb
  • Welcome to Stack Overflow! Can you please read and incorporate elements from [How to make a great R reproducible example?](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). Especially the aspects of using `dput()` for the input and then an explicit example of your expected dataset? – wibeasley Jan 16 '22 at 18:21
  • Are you asking just how to calculate the mean by groups? – camille Jan 17 '22 at 00:05

3 Answers

Assuming your variable A is a factor, a base R approach, shown here on a small example data frame, would be:

example <- data.frame(A = as.factor(c('M','B','M', 'B')), B=c(1,2,3,4))

mean(example$B[example$A == 'M'])
#> [1] 2

# for both factor levels simultaneously you can use 
by(example$B, example$A, mean)
#> example$A: B
#> [1] 3
# ---- #
#> example$A: M
#> [1] 2

Note. Created on 2022-01-16 by the reprex package (v2.0.1)
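
Applied to the column names shown in the question, the same grouping idea might look like the sketch below. It assumes the data frame is called `df2` and, to stay short, rebuilds only rows 17 to 22 of the posted sample; `tapply()` and `aggregate()` are base R alternatives to `by()` that return a named vector and a data frame respectively.

```r
# A six-row subset of the data shown in the question (rows 17 to 22).
df2 <- data.frame(
  data2.diagnosis      = c("M", "M", "M", "B", "B", "B"),
  data2.concavity_mean = c(0.07395, 0.1722, 0.1479, 0.06664, 0.04568, 0.02956)
)

# tapply() splits the values by diagnosis and applies mean() to each group,
# returning one named entry per diagnosis.
group_means <- tapply(df2$data2.concavity_mean, df2$data2.diagnosis, mean)
group_means

# aggregate() computes the same means but returns a data frame.
group_means_df <- aggregate(data2.concavity_mean ~ data2.diagnosis,
                            data = df2, FUN = mean)
group_means_df
```

With the full 569-row data frame the two calls should work unchanged, since neither depends on the number of rows.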

Pax

Reusing the example from one of the other answers (which are valid solutions too), here is an alternative solution using the tidyverse (specifically the dplyr package):

library(dplyr) # provides %>%, group_by() and summarize()

example <- data.frame(A = as.factor(c('M','B','M', 'B')), B=c(1,2,3,4))

# creates a new table with summarized values
example %>% # takes your data table
  group_by(A) %>% # groups it by the levels of column A
  summarize(mean_B = mean(B)) # finds the mean of B within each group

If you found this or any of the other answers helpful, please accept it as the answer.

alejandro_hagan

As pointed out in the comments, it would be nice to have a reproducible example and your data (or at least a subset of it) to see what you are dealing with.

Anyway, the solution to your problem should resemble the following (I am using simulated data):

set.seed(1986)

dta = data.frame("type" = c(rep("B", length = 5), rep("M", length = 5)), "nucleus" = rnorm(10))

mean(dta$nucleus[dta$type == "B"]) # Mean concavity for benign tumours.
mean(dta$nucleus[dta$type == "M"]) # Mean concavity for malignant tumours.

Basically, I am just applying the mean() function to two subsets of the data, by selecting rows with the [] operator.
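
To make the subsetting step more concrete, here is a small sketch that breaks it into pieces. The data frame and its values are made up for illustration (hypothetical, not the asker's data); only the column names mirror the simulated example.

```r
# Toy data with made-up values; column names mirror the simulated example.
dta <- data.frame(
  type    = c("B", "B", "M", "M", "M"),
  nucleus = c(1, 3, 2, 4, 6)
)

# Step 1: the comparison yields one TRUE/FALSE value per row.
is_benign <- dta$type == "B"
is_benign

# Step 2: indexing with that logical vector keeps only the TRUE positions.
benign_values <- dta$nucleus[is_benign]
benign_values

# Step 3: mean() is applied to the selected values only.
mean_benign    <- mean(benign_values)
mean_malignant <- mean(dta$nucleus[dta$type == "M"])
```

If the numeric column contains missing values, `mean(x, na.rm = TRUE)` skips them instead of returning `NA`.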

EDIT

Now that we have an idea of your actual data, I can provide a complete solution:

mean(df2$data2.concavity_mean[df2$data2.diagnosis == "B"]) # Mean concavity for benign tumours.
mean(df2$data2.concavity_mean[df2$data2.diagnosis == "M"]) # Mean concavity for malignant tumours.

riccardo-df
  • Thank you for this, but I don't quite understand what the first line of code in your example is doing/defining? I tried to replicate it with my actual data as follows: ```df2 = data.frame("type" = c(rep("B", length = 357), rep("M", length = 212)), "concavity_mean" = rnorm(569)) mean(df2$concavity_mean[df2$type == "B"]) # Mean concavity for benign. mean(df2$concavity_mean[df2$type == "M"]) # Mean concavity for malign. ``` How do I correctly write the first line to define the two subsets of data I want R to get the mean of? – R_newb Jan 16 '22 at 19:01
  • Do you mean `set.seed()`? That is needed when randomness is present in your code. If you provide your seed, others will be able to reproduce your exact results. Try running `rnorm(1)` several times, without setting any seed, and see what happens. Anyway, I tried the code you provided here, and it works for me. What do you not understand? – riccardo-df Jan 17 '22 at 08:58
  • Also, now that we have your actual data, I am going to edit my answer so as to use your column names. – riccardo-df Jan 17 '22 at 08:59
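
As a minimal sketch of the `set.seed()` point raised in the comments: resetting the same seed before a random draw reproduces the same numbers, while drawing again without resetting gives different ones. The variable names are illustrative only.

```r
set.seed(1986)
first_draw <- rnorm(3)

set.seed(1986)           # resetting the same seed...
second_draw <- rnorm(3)  # ...reproduces exactly the same numbers

identical(first_draw, second_draw)

third_draw <- rnorm(3)   # no reset: the generator has moved on
identical(first_draw, third_draw)
```

Note that `rnorm()` only generates simulated values. With a real data frame such as `df2` there is nothing to simulate, so neither `set.seed()` nor the `data.frame(...)` line is needed and the two `mean()` calls can be applied to the existing columns directly.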