
I apologize, because it seems this question has been asked many times, but I have read through several questions and answers, tried different solutions, and am still having problems, so I hope someone can help!

I have a dataframe with nearly 30 million observations (rows) and 6 variables (columns), and I want to delete the last ~5 million observations.

I have tried the following three procedures:

#read in the csv
data <-read.csv('mydata.csv')

#try this
#delete specified rows
dataresized <- data[-24579580:-29495496]

#try this instead
#keep only first 24579580 rows (X=id or rownumber)
dataresized2 <- subset(data, "X" < 24579581)

#try this instead
unwantedrows <- data %in% 24579580:29495496
dataresized3 <- data[!unwantedrows] 

The first code didn't seem to do anything, i.e., no rows were removed. The second option seemed to remove everything, i.e., no rows remained. The third option seemed to crash the system.

Any suggestions would be greatly appreciated! Thanks!

user3251223
    with data frames, you have to use both index arguments (e.g. `[-(1:10), ]`, notice the comma and the blank), otherwise you're removing columns. Also, for subset, you should just use the variable name. – BrodieG Jan 30 '14 at 23:30
  • Thank you! adding the comma and blank to the index argument worked perfectly :) – user3251223 Jan 31 '14 at 00:05
    Tho' in your case (removing the "bottom" of the dataframe), there's no real need to use negative indexing. `data<-data[1:N,]` will work just as well. – Carl Witthoft Jan 31 '14 at 02:01
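Putting the comments together, here is a sketch of the three attempts corrected, using a small stand-in data frame (100 rows instead of ~29.5 million, with an assumed `X` id column) so the indexing behavior is easy to check:

```r
# Small stand-in for the real data frame (assumed columns: X = row id, value).
data <- data.frame(X = 1:100, value = rnorm(100))

# Attempt 1, fixed: negative row indexing needs the comma (row, column) and
# parentheses around the negated range, otherwise it tries to drop columns.
dataresized <- data[-(91:100), ]

# Attempt 2, fixed: subset() takes the unquoted column name; the quoted
# string "X" was compared to a number, which is why everything vanished.
dataresized2 <- subset(data, X < 91)

# Simplest for dropping the tail (per the last comment): keep the first N rows.
dataresized3 <- data[1:90, ]

nrow(dataresized)
```

On the real data the equivalents would be `data[-(24579580:29495496), ]`, `subset(data, X < 24579580)`, or simply `data[1:24579579, ]`.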
