R: Parsing a data set to remove rows containing a certain value and to split a string into columns on a character

Question

I am having issues trying to delete rows from a dataset that seems to have only one column, so it is like a column vector. I am trying to do two things, and the order doesn't matter to me. Here is a sample of the data:

republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,y
democrat,y,y,y,n,n,n,y,y,y,n,n,n,n,n,?,?

I am bringing in the data like this:

sampledata <- read.table("house-votes-84.data",)

But I want to turn the string of data in each row into columns and give each column a name. I know I can assign names with the following:

names(sampledata) <- c("col1" ...., "col17")

but that only works if there are 17 columns. The would-be columns are separated by commas.

Second, I am trying to get rid of the rows that contain a question mark (?).

What have I tried?

I have tried things such as (with my data called sampledata):

sampledata[apply(sampledata[, -1], MARGIN = 1, function(x) all(x != "?")), ]

That doesn't work. I am guessing this is because there is only one column, so MARGIN would have to be something that looks through each of the columns (I tried -1 for MARGIN, but to no avail).

I have tried changing the ?'s to NA's and then using

na.omit(sampledata)

That doesn't work either.

I have tried splitting on commas, such as

splitting <- strsplit(as.character(sampledata$V1), split=",")

where V1 is the single column name. That gives the most interesting result: I get 435 of the following (there are 435 rows of data):

[[435]]
 [1] "republican" "n"          "y"          "n"          "y"
 [6] "y"          "y"          "n"          "n"          "n"
[11] "y"          "n"          "y"          "y"          "y"
[16] "NA"         "n"

But when I try to change the names, I get:

Error in names(sampledata) <- c("col1", "col2", "col3", "col4", "col5",  : 'names' attribute [17] must be the same length as the vector [1]

I have tried other things, such as trying to turn it into a dataset. This, however, seems to turn all the values into numbers that look randomized (not something such as 0, 1, or 99 for the ?, but values up to 100, maybe more).

I am just trying to get the data in the correct format so that I can run a regression on the samples that don't have question marks.

The pages I have had the best luck with are also on Stack Exchange, here:

subset rows with all / any columns larger than a specific value

And here:

Convert comma separated entry to columns

With the first, I can get it to work, but even then I am generating the data in the 3 columns in the code itself. I can't seem to get the same code to work on my ?'s, although I can get the program to remove the rows with question marks using:

X <- data.frame(Variable1=c(11,"?",12,15),Variable2=c(2,3,1,4))
X[X$Variable1!="?", ]

I have been trying to figure out a way to make the code do the same thing, row by row, for the imported data, as I am pulling it in as a data.frame also. I realize I only have 1 column, and the column is called V1, so I changed the code accordingly to

X$V1

sampledata <- read.table("house-votes-84NaN.data.txt")
splitdat = do.call("rbind", strsplit(sampledata$V1, ","))

But I get

**Error in strsplit(sampledata$V1, ",") : non-character argument**

I realize that I probably need more arguments in read.table, as the linked examples use more, but I don't understand what needs to go in.

Any help would be very much appreciated.

Thank you,

Brian

No matter what I tried I couldn't get this thing to let me post more details, said something about _uncommented code_ — Relative0, Dec 21 '12 at 18:55

Sven Hohenstein · Answer 1 · 2016-05-28T05:31:11.520

First, read your data with the function read.csv and the arguments header = FALSE and row.names = 1:

sampledata <- read.csv(text="republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,y
democrat,y,y,y,n,n,n,y,y,y,n,n,n,n,n,?,?", header = FALSE, row.names = 1)

Then you can transpose the data frame with t:

t(sampledata)

The result:

    republican democrat
V2  "n"        "y"     
V3  "y"        "y"     
V4  "n"        "y"     
V5  "y"        "n"     
V6  "y"        "n"     
V7  "y"        "n"     
V8  "n"        "y"     
V9  "n"        "y"     
V10 "n"        "y"     
V11 "n"        "n"     
V12 "n"        "n"     
V13 "y"        "n"     
V14 "y"        "n"     
V15 "y"        "n"     
V16 "n"        "?"     
V17 "y"        "?"

You can remove the columns with question marks using

dat <- as.data.frame(t(sampledata))

dat[!apply(dat == "?", 2, any)]

    republican
V2           n
V3           y
V4           n
V5           y
V6           y
V7           y
V8           n
V9           n
V10          n
V11          n
V12          n
V13          y
V14          y
V15          y
V16          n
V17          y

I want to do this for all of the rows in the file - the two above were just examples. I need to do this automated. — Relative0, Dec 21 '12 at 19:04
I think I have found part of the answer!: d = read.table("house-votes-84.data", sep=",", ) — Relative0, Dec 21 '12 at 19:31

IRTFM · Accepted Answer · 2012-12-21T19:14:26.147

I think you probably do need to be clearer about the order of the transpose and the removal operations. This does the removal first, but would give you a different result if you transposed first.

dat <- read.table(text="republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,y
democrat,y,y,y,n,n,n,y,y,y,n,n,n,n,n,?,?", sep=",")
dat
#--------------------
          V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17
1 republican  n  y  n  y  y  y  n  n   n   n   n   y   y   y   n   y
2   democrat  y  y  y  n  n  n  y  y   y   n   n   n   n   n   ?   ?
#--------------
dat[ ! apply(dat, 1, function (x) any(x=="?") ), ]
#----------------
          V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17
1 republican  n  y  n  y  y  y  n  n   n   n   n   y   y   y   n   y
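
Since the question mentions trying na.omit, here is a minimal sketch of that route, assuming the "?" entries are meant as missing values. The na.strings argument of read.table turns them into NA on import, after which na.omit drops the incomplete rows. The names dat_na and complete_rows are only illustrative.

```r
# Sketch: treat "?" as missing at read time, then drop rows with any NA.
dat_na <- read.table(text = "republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,y
democrat,y,y,y,n,n,n,y,y,y,n,n,n,n,n,?,?",
                     sep = ",", na.strings = "?")

# na.omit() keeps only the rows without missing values.
complete_rows <- na.omit(dat_na)
complete_rows
```

With the two sample rows, only the republican row should remain, the same rows the apply version keeps.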

I'm not sure why you would want to transpose this, but you can do so with the t function (transpose).

> t( dat[ ! apply(dat, 1, function (x) any(x=="?") ), ] )
    1           
V1  "republican"
V2  "n"         
V3  "y"         
V4  "n"         
V5  "y"         
V6  "y"         
V7  "y"         
V8  "n"         
V9  "n"         
V10 "n"         
V11 "n"         
V12 "n"         
V13 "y"         
V14 "y"         
V15 "y"         
V16 "n"         
V17 "y"

With the data in party-row order, you could eliminate the questions with any "?" response by subsetting columns instead: put the apply call in the column position and use 2 as the MARGIN argument:

> dat[ , ! apply(dat, 2, function (x) any(x=="?") ) ]
          V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15
1 republican  n  y  n  y  y  y  n  n   n   n   n   y   y   y
2   democrat  y  y  y  n  n  n  y  y   y   n   n   n   n   n
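
The same row and column filters can also be written without apply, by comparing the whole data frame to "?" once and counting the matches per row or per column. This is only a sketch; has_q, kept_rows and kept_cols are names made up for illustration.

```r
# Sketch: vectorised version of the two apply() filters above.
dat <- read.table(text = "republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,y
democrat,y,y,y,n,n,n,y,y,y,n,n,n,n,n,?,?",
                  sep = ",", stringsAsFactors = FALSE)

# Logical matrix with TRUE wherever a cell is exactly "?".
has_q <- dat == "?"

# Keep rows (members) with no "?" in any column.
kept_rows <- dat[rowSums(has_q) == 0, ]

# Keep columns (questions) with no "?" in any row.
kept_cols <- dat[, colSums(has_q) == 0]
```

On data with many rows this avoids calling an R function once per row, but the result should match the apply versions.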

I tried importing my data: sampledata <- read.table("house-votes-84NaN.data.txt") and putting sampledata in sampledata[apply(sampledata[, -1], MARGIN = 1, function(x) all(x != "?")), ] but I can not get it to work! — Relative0, Dec 21 '12 at 19:17
Your example data has commas and that is not the default separator. So add the `sep=","` to the read function. The second part of your code works to remove the "democrat" line if the data is in row-party order. — IRTFM, Dec 21 '12 at 19:20
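
Combining the sep="," advice with the naming step from the question, a sketch for a file on disk might look like the following. The temporary file is a stand-in for house-votes-84.data and holds only the two sample rows; the col1 to col17 names follow the question, and path and clean are illustrative names.

```r
# Sketch: write the two sample rows to a temporary file so this runs on its own.
path <- tempfile(fileext = ".data")
writeLines(c("republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,y",
             "democrat,y,y,y,n,n,n,y,y,y,n,n,n,n,n,?,?"),
           path)

# sep = "," splits each line into 17 columns; col.names names them on import.
sampledata <- read.table(path, sep = ",",
                         col.names = paste0("col", 1:17),
                         stringsAsFactors = FALSE)

# The filter from the question: keep rows whose vote columns contain no "?".
clean <- sampledata[apply(sampledata[, -1], MARGIN = 1,
                          function(x) all(x != "?")), ]
clean
```

Once the file is read with the right separator, names(sampledata) has length 17, so the assignment that failed in the question would also work.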
