
I'm new to R and I'm trying to make my script more efficient. I have a data.frame of 25480 observations and 17 variables.

One of my variables is Subject and each subject has its number. However, the number of observations (lines) for each subject is not equal. I would like to separate my subjects into groups, according to their number. How can I do it?

Before I used this formula:

gaze <- subset(gaze, Subject != "261" & Subject != "270" & Subject != "275") 

But now I have too many subjects to repeat Subject each time. Is it possible to define an interval of subjects to cut or split? I tried this command, but it doesn't seem to work:

gazeS <- (gaze$Subject[112:216])
cut(gaze, seq(gaze, from = 112, to = 116))

Could you help me to fix this code, please?

Arun
Ewa Karolina P
  • Please provide a reproducible example (subset) of your data. Cut could be used if your data is interval, but we don't know what it is. – Tyler Rinker Feb 17 '13 at 18:55
  • Your terminology is vague. When you say "lines", do you mean rows? And you say the problem is an unequal number of observations, but you want to separate by number; wouldn't you want to separate by number of observations? There is a solution, but you would need to edit your question above to include a small example set of data which exemplifies what your issue is. – N8TRO Feb 17 '13 at 19:44
  • [Echoing Tyler](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example), please edit your question and include a copy of `dput(head(gaze))`. Adding `str(gaze)` may help as well. – Blue Magister Feb 17 '13 at 22:24

2 Answers


Since there is no ordering method for factor variables (even if they appear numeric), you need to convert them first for any ordering operation to work, and the R FAQ says to use:

as.numeric(as.character(fac))

So:

subset(gaze, !as.numeric(as.character(Subject)) %in% 260:280)

Or:

subset(gaze, !( as.numeric(as.character(Subject)) >= 260 &
            as.numeric(as.character(Subject)) <= 280)  )

Or:

subset( gaze, !Subject %in% as.character(260:280) )
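
As a minimal sketch of the last variant, the following builds a small made-up data frame (the values and the `ACC` column are placeholders, not the asker's real data) with `Subject` stored as a factor, then drops every subject in the 260 to 280 range. The call to `droplevels()` is an optional extra that is not part of the answer above; it only tidies up the unused factor levels.

```r
# Toy data for illustration only; Subject is deliberately a factor.
gaze <- data.frame(
  Subject = factor(c(200, 200, 259, 261, 270, 275, 280, 281)),
  ACC     = c(1, 1, 0, 1, 1, 0, 1, 1)
)

# Drop every subject whose ID lies in 260..280.
# %in% matches on the factor's labels, so no numeric conversion is needed.
kept <- subset(gaze, !Subject %in% as.character(260:280))

# subset() keeps the now-unused factor levels; droplevels() removes them.
kept <- droplevels(kept)

print(kept)
```

Matching on labels avoids any comparison operator on the factor itself, which is why this form works even before the column is converted.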
IRTFM

If I correctly understand what you need, you could use something like

gaze$Subject <- as.integer(as.character(gaze$Subject))
gaze <- subset(gaze, Subject >= 261 & Subject <= 280) 

It is important to convert the ID to character first; otherwise as.integer returns the factor's internal level codes, which follow the alphabetical order of the levels rather than the numeric values. The best way to avoid this, however, is to set the column classes directly when reading the data (e.g. with the colClasses parameter of read.table).
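
A rough sketch of the colClasses suggestion, assuming a whitespace-separated file with a header row. The inline text below is a made-up stand-in for the real file (only three of the columns are shown), and `filtered` is just an illustrative name:

```r
# Made-up sample in the shape of the asker's data; a real script would
# pass a file path instead of text = ...
txt <- "Subject CURRENT_FIX_DURATION EYE_USED
200 292 RIGHT
261 142 RIGHT
9 388 LEFT
280 162 RIGHT"

# Declare the column types up front so Subject never becomes a factor.
gaze <- read.table(text = txt, header = TRUE,
                   colClasses = c("integer", "integer", "character"))

str(gaze$Subject)

# Numeric comparisons now behave as expected.
filtered <- subset(gaze, !(Subject >= 260 & Subject <= 280))
print(filtered)
```

With the IDs read as integers, range comparisons such as `>=` and `<=` can be used directly, with no conversion step.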

nico
  • Hi, thanks a lot for your answer. I tried it, but unfortunately it doesn't work. I get a warning that the comparison is not meaningful for factor variables: `1: In Ops.factor(Subject, 212) : >= not meaningful for factors`. I also tried with quotation marks as before, but it doesn't help. If you have any other ideas, I would be grateful... – Ewa Karolina P Feb 17 '13 at 18:43
  • @EwaKarolinaP If your subject IDs are all numbers, you could do `gaze$Subject <- as.numeric(as.character(gaze$Subject))` before trying this. However, it is still unclear what you want to achieve. – Roland Feb 17 '13 at 18:51
  • @EwaKarolinaP: evidently the IDs have been read as strings rather than numbers. I have updated the answer, does it work like that? – nico Feb 17 '13 at 19:06
  • Ok, sorry. I'll try to be more specific. Here is just a little part of my data as an example:

        Subject  CURRENT_FIX_DURATION  CURRENT_FIX_START  CURRENT_FIX_END  CURRENT_FIX_X  CURRENT_FIX_Y  class  Condition  EYE_USED  ACC
        200      292                   4                  294              531            395.4          CP     TH         RIGHT     1
        200      142                   364                504              202.5          97.8           CP     TH         RIGHT     1
        200      388                   522                908              251.6          101.3          CP     TH         RIGHT     1
        200      162                   950                1110             495.3          369.5          CP     TH         RIGHT     1
        200      510                   1126               1634             530.4          391.2          CP     TH         RIGHT     1
        200      184                   1680               1862             290            167.4          CP     TH

    – Ewa Karolina P Feb 17 '13 at 19:19
  • Sorry, I can't manage to paste my table properly... There are some subjects that I would like to delete. Before, I used to do it with the subset function: `gaze <- subset(gaze, Subject != "261" & Subject != "270" & Subject != "275")`. Now, basically, what I want to do is `gaze <- subset(gaze, Subject != "261" & Subject != "262" & Subject != "263" & Subject != "264" & Subject != "265" & Subject != "266" ……..)` and many more like that. I would like to know whether it is possible to give an interval to delete, for example `Subject != ["260:280"]`, instead of repeating `Subject != "261"` twenty times. I hope that's clearer. Thanks! – Ewa Karolina P Feb 17 '13 at 19:23
  • I disagree: Using as.integer will only be correct if the subject numbering starts at 1 and is dense in the integers, and even then will not work because "9" will be "greater than" "10". It might appear to work in the range under consideration but it is VERY UNSAFE. Please give safe advice, @nico. – IRTFM Feb 17 '13 at 20:57
  • @DWin: I don't see how it is unsafe. `as.integer("9") < as.integer("10")` returns TRUE as expected. Comparing them as strings does not make any sense. Also, can you elaborate on why it should only work if numbering starts at 1? – nico Feb 17 '13 at 21:42
  • I was under the impression that "comparing levels as strings" was in fact what factors did: `> as.integer(factor(as.character(1:20))) [1] 1 12 14 15 16 17 18 19 20 2 3 4 5 6 7 8 9 10 11 13` – IRTFM Feb 17 '13 at 21:58
  • @DWin: sure, you get an unordered array, but `as.integer` drops the levels, if you do `a[9]>a[10]` you get TRUE as expected (20>2). – nico Feb 18 '13 at 08:43
  • What is NOT expected is to find the twentieth item (which was the character value "20" being represented by the integer 13: `as.integer(factor(as.character(1:20) ))[20] # [1] 13` – IRTFM Feb 18 '13 at 09:11
  • @DWin: doesn't matter, when you import the data the IDs are not necessarily sequential anyway. The best thing is to directly import the id as integers of course. – nico Feb 18 '13 at 10:53
  • It doesn't matter that "20" is now going to be acted upon as 13??? That's a mighty strange view of computing. – IRTFM Feb 18 '13 at 16:57
  • @DWin: upon doing a bit of testing I see what you mean now. You are right. I corrected the code to avoid the problem (which lies in factor not in as.integer). – nico Feb 18 '13 at 18:19
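
The disagreement in the comments above comes down to what `as.integer` does to a factor. A small sketch with made-up IDs (the vector `ids` is hypothetical) shows the difference between a factor's internal level codes and the values its labels represent:

```r
# Hypothetical subject IDs stored as strings, then turned into a factor.
ids <- c("9", "10", "200", "261")
f <- factor(ids)

# Levels are sorted as strings, not as numbers.
print(levels(f))

# as.integer() on a factor returns the position of each value in levels(f),
# i.e. a code between 1 and nlevels(f), not the ID itself.
codes <- as.integer(f)
print(codes)

# Going through character first recovers the real IDs.
values <- as.numeric(as.character(f))
print(values)
```

Here the codes can never exceed 4, so a range filter such as `>= 260` applied to them would silently match nothing, whereas the same filter on `values` behaves as intended.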