I am having trouble deleting rows from a dataset that seems to have only one column, so it behaves like a column vector. I am trying to do two things, and the order doesn't matter to me. Here is a sample of the data:
republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,y
democrat,y,y,y,n,n,n,y,y,y,n,n,n,n,n,?,?
I am bringing in the data like this:
sampledata <- read.table("house-votes-84.data")
but I want to turn the comma-separated string in each row into separate columns and give each column a name. I know I can name columns like this:
names(sampledata) <- c("col1" ...., "col17")
but that only works if there are 17 columns. The would-be columns are separated by commas.
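For illustration, here is a minimal sketch of reading comma-separated rows into 17 columns. It assumes the file is plain comma-delimited with no header. The two sample rows above stand in for the file through `text =`, and the `col1` to `col17` names are placeholders rather than the real variable names.

```r
# Two sample rows from the question, used in place of the real file
rows <- c(
  "republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,y",
  "democrat,y,y,y,n,n,n,y,y,y,n,n,n,n,n,?,?"
)

# sep = "," splits each line into 17 fields instead of a single column
sampledata <- read.table(text = rows, sep = ",", stringsAsFactors = FALSE)

# Placeholder names col1 ... col17 (not the real variable names)
names(sampledata) <- paste0("col", 1:17)

str(sampledata)
```

With the real file, the same call would presumably take the file name as the first argument in place of `text = rows`.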
Second, I am trying to get rid of the rows that contain a question mark (?).
What I have tried:
I have tried things such as the following (my data is called sampledata):
sampledata[apply(sampledata[, -1], MARGIN = 1, function(x) all(x != "?")), ]
That doesn't work. I am guessing this is because there is only one column, so MARGIN would need to be something that looks through each of the would-be columns. I tried -1 for MARGIN, but to no avail.
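Once the data actually has 17 columns, a row filter of this shape should behave as intended. A small sketch, assuming comma-delimited input and using the two sample rows in place of the file:

```r
# Two sample rows from the question, used in place of the real file
rows <- c(
  "republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,y",
  "democrat,y,y,y,n,n,n,y,y,y,n,n,n,n,n,?,?"
)
votes <- read.table(text = rows, sep = ",", stringsAsFactors = FALSE)

# For each row (MARGIN = 1), check that no vote column holds "?"
# votes[, -1] drops the party column so only the 16 votes are tested
keep <- apply(votes[, -1], MARGIN = 1, function(x) all(x != "?"))
no_question_marks <- votes[keep, ]
```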
I have tried changing the ?'s to NAs and then using
na.omit(sampledata)
That doesn't work either.
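One way the NA route might work is to have `read.table` treat `?` as missing at read time through its `na.strings` argument, so that `na.omit` has real NA values to drop. A sketch, assuming comma-delimited input and using the two sample rows in place of the file:

```r
# Two sample rows from the question, used in place of the real file
rows <- c(
  "republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,y",
  "democrat,y,y,y,n,n,n,y,y,y,n,n,n,n,n,?,?"
)

# na.strings = "?" turns every "?" field into NA while reading
votes_na <- read.table(text = rows, sep = ",", na.strings = "?",
                       stringsAsFactors = FALSE)

# na.omit drops every row that contains at least one NA
complete_votes <- na.omit(votes_na)
```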
I have tried splitting on commas, like this:
splitting <- strsplit(as.character(sampledata$V1), split=",")
where V1 is the single column name. That is the most interesting result as I get
435 of the following (there are 435 rows of data)
[[435]]
 [1] "republican" "n" "y" "n" "y"
 [6] "y" "y" "n" "n" "n"
[11] "y" "n" "y" "y" "y"
[16] "NA" "n"
But when I try to change the names, I get:
Error in names(sampledata) <- c("col1", "col2", "col3", "col4", "col5", : 'names' attribute [17] must be the same length as the vector [1]
I have tried other things, such as trying to turn it into a dataset. This, however, seems to turn all the values into numbers that look random (not something such as 0, 1, or 99 for the ?, but values up to 100, maybe more).
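A possible explanation, offered as an assumption since the exact conversion is not shown: if V1 was read as a factor, converting it to numbers returns the internal level codes (1 up to the number of distinct rows), which can look arbitrary. A small sketch with made-up values:

```r
# Made-up stand-ins for three whole-line values held in one factor column
v1 <- factor(c("republican,n,y", "democrat,y,y", "democrat,n,?"))

# Converting a factor to integer yields level codes, not the data itself
codes <- as.integer(v1)

# as.character recovers the original strings
original <- as.character(v1)

print(codes)
print(original)
```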
I am just trying to get the data in the correct format so that I can run a regression on the samples that don't have question marks.
The pages I have had the best luck with are also on Stack Exchange. Here:
subset rows with all / any columns larger than a specific value
And Here:
Convert comma separated entry to columns
With the first, I can get it to work, but there the data for the 3 columns is generated in the code itself, and I can't get the same code to work on my ?'s. I can, however, get R to remove the rows with question marks using:
X <- data.frame(Variable1=c(11,"?",12,15),Variable2=c(2,3,1,4))
X[X$Variable1!="?", ]
I have been trying to find a way to make the code do the same thing, row by row, for the imported data, since I am also pulling it in as a data.frame. I realize I only have one column, called V1, so I changed the code accordingly to use
X$V1
sampledata <- read.table("house-votes-84NaN.data.txt")
splitdat = do.call("rbind", strsplit(sampledata$V1, ","))
But I get
**Error in strsplit(sampledata$V1, ",") : non-character argument**
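This error is what `strsplit` reports when it receives a factor rather than a character vector, which is how `read.table` stored strings by default in R versions before 4.0. Below is a sketch of the split with an explicit `as.character`, assuming that is the cause here. `stringsAsFactors = TRUE` is set only to reproduce the older default, and the two sample rows stand in for the file.

```r
# Two sample rows from the question, used in place of the real file
rows <- c(
  "republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,y",
  "democrat,y,y,y,n,n,n,y,y,y,n,n,n,n,n,?,?"
)

# Default whitespace separator: each line lands in one factor column, V1
sampledata <- read.table(text = rows, stringsAsFactors = TRUE)

# Convert the factor to character before splitting on commas
parts <- strsplit(as.character(sampledata$V1), ",")

# Stack the pieces row by row, then turn the matrix into a data frame
splitdat <- as.data.frame(do.call("rbind", parts), stringsAsFactors = FALSE)
```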
I think I need more arguments in read.table, since the examples I found use more, but I don't understand which ones are needed.
Any help would be very much appreciated.
Thank you,
Brian