Hi, I have some data that I am reading in from a CSV file. It is laid out in binary form:

   1 2 3 4...N
1  0 1 0 1...1
2  1 1 0 1...1
3  0 0 0 0...0
4  1 0 1 1...1
.  1 1 1 0...1
.  1 0 0 0...1
N  0 0 1 1...0

[Screenshot of str(data)]

I want to take a subset of this data containing only the rows whose sum is greater than some number x, say 10. The first column is a placeholder for the customer ID, so it needs to be excluded from the sum. Do you have any suggestions for how I could do this?

I've been trying various things like df=subset() but I've not been able to get the syntax correct.

Thanks in advance.

1 Answer

We can do this with `rowSums`:

df1[rowSums(df1) > 10, , drop = FALSE]
#  V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
#7  0  0  0  1  0  0  1  1  0   1   1   1   1   1   0   0   0   1   1   1
#9  1  1  1  1  0  0  1  0  0   0   0   1   1   0   0   1   1   1   0   1

Update

In the OP's dataset, the first column 'X' is not binary and holds larger numbers. So, when that variable is included, the rowSums would be greater than 10. It is the index ID and is not meant to be used in the calculation. Excluding it inside rowSums makes the subset work as intended:

df1[rowSums(df1[-1])> 10,]
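
As a minimal sketch of the same idea with an adjustable threshold: the data frame `toy`, its ID values, and the variable names `x`, `row_totals` and `kept` below are made up for illustration and are not from the OP's data. It assumes the ID column is called 'X', as in the OP's str output.

```r
# Hypothetical toy data: an ID column 'X' followed by 20 binary columns
set.seed(24)
toy <- data.frame(
  X = c(1, 3, 14, 21, 30),
  matrix(sample(0:1, 5 * 20, replace = TRUE), ncol = 20)
)

x <- 10  # threshold; change as needed

# Sum only the binary columns, leaving out the ID column by name
row_totals <- rowSums(toy[setdiff(names(toy), "X")])

# Keep the rows whose total exceeds the threshold; the ID column stays in the result
kept <- toy[row_totals > x, , drop = FALSE]
```

Dropping the column by name rather than by position (`toy[-1]`) should behave the same here, and keeps working if the ID column is not the first one.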

data

set.seed(24)
df1 <- as.data.frame(matrix(sample(0:1, 10* 20, replace = TRUE), ncol = 20))
akrun
  • Hi @akrun thanks for your response. I've had a go at your suggestion - I've read my data in: `data=read.csv("data.csv",header=TRUE)` I've then called the method you've suggested: `df=data[rowSums(data)>10]`. When I look at the structure of data and df, they are both the same size: 317 obs of 989 variables. I expected this to be smaller for df? – Andrew Buchanan Apr 03 '18 at 14:50
  • You didn't have the `,` i.e. `data[rowSums(data)>10,]`. Without it, R treats the index as a column index by default – akrun Apr 03 '18 at 14:52
  • I've added the commas in - still doesn't seem to work. I am wondering if it's the structure of the data frame or something. – Andrew Buchanan Apr 03 '18 at 15:04
  • @AndrewBuchanan What is the class of the columns. Is it numeric or not? – akrun Apr 03 '18 at 15:05
  • @AndrewBuchanan I added some example for you to look at it – akrun Apr 03 '18 at 15:07
  • Hi @akrun I've added a screenshot to my original post. – Andrew Buchanan Apr 03 '18 at 15:11
  • @AndrewBuchanan You have 989 variables, and I would guess that there would be more than 10 1's in each row. – akrun Apr 03 '18 at 15:15
  • @AndrewBuchanan Based on the str, the first column, i.e. X, is not binary. It would invariably create the problem – akrun Apr 03 '18 at 15:16
  • yes you'd think that, but not always! I've tried the same exercise in excel. It should remove 178 rows - the largest row has a sum of 287. I thought it might be the case - the first column is indicating the account number - I just left it blank but I'm guessing I should label this. As you can probably tell I am very new to R – Andrew Buchanan Apr 03 '18 at 15:18
  • @AndrewBuchanan Based on the image, the X variable values are 1, 3, 14, 21, etc., so the 3rd and 4th are already greater than 10. For the other two, the odds are high that the sum is greater than 10 if there are 989 variables – akrun Apr 03 '18 at 15:20
  • @AndrewBuchanan Try `df1[rowSums(df1[-1])> 10,]` – akrun Apr 03 '18 at 15:21
  • Or let's look at the rowSums i.e. `rowSums(df1[-1])` and see how many are greater than 10 – akrun Apr 03 '18 at 15:23
  • Yes that looks like it has worked! Thank you so much @akrun ! Are you happy to amend your answer so that I can accept it? – Andrew Buchanan Apr 03 '18 at 15:25
  • @AndrewBuchanan Thanks for the update. I updated the post – akrun Apr 03 '18 at 15:31
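
The comma issue discussed in the comments can be seen on a small example. This is only an illustrative sketch: the data frame `m` and the variable `keep` are made up, and the frame is deliberately square (4 rows, 4 columns) so that the logical index has a valid length in both forms.

```r
# Hypothetical 4 x 4 data frame of 0/1 values
m <- data.frame(
  a = c(1, 0, 1, 0),
  b = c(1, 0, 1, 0),
  c = c(1, 0, 1, 1),
  d = c(0, 0, 0, 0)
)

keep <- rowSums(m) > 2   # one TRUE/FALSE per row: TRUE FALSE TRUE FALSE

by_col <- m[keep]        # no comma: the logical vector selects COLUMNS a and c
by_row <- m[keep, ]      # with comma: the logical vector selects ROWS 1 and 3

dim(by_col)              # 4 rows, 2 columns
dim(by_row)              # 2 rows, 4 columns
```

Without the comma no rows are removed at all, which is why only the second form changes the number of observations.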