
So, I'm trying to figure out a larger problem, and I think it may stem from exactly what's happening when I import data from a .txt file. My regular beginning commands are:

data <- read.table("mydata.txt", header = TRUE)
attach(data)

So if my data has say, 3 columns with headers "Var1", "Var2" and "Var3", how exactly is everything imported? It seems as though it is imported as 3 separate vectors, then bound together, similar to using cbind().

My larger issue is modifying the data. If a row in my data frame has an empty spot (in any column) I need to remove it:

data <- data[complete.cases(data),]

Perfect - now say that the original data frame had 100 rows, 5 of which had an empty slot. My new data frame should have 95 rows, right? Well if I try:

> length(Var1)
[1] 100
> length(data$Var1)
[1] 95

So it seems like the original column labelled Var1 is unaffected by the line where I rewrote the entire data frame. This is why I believe that when I import the data, I really just have 3 separate columns stored somewhere called Var1, Var2 and Var3. As far as getting R to recognize that I want the modified version of the column, I think I need to do something along the lines of:

Var1 <- data$Var1 #Repeat for every variable

My issue with this is that I will need to write the above bit of code for every single variable. The data frame I have is large, and this way of coding seems tedious. Is there a better way for me to transform my data, then be able to call the modified variables, without needing to use the data$ precursor every time?

D'Arcy Mulder

2 Answers


read.table() reads the data into a data frame with a component (column) for each column (variable) in the text file. R's data frame is like an Excel spreadsheet: each column in the sheet can contain a different type of data (contrast that with a matrix, which in R can contain data of only a single type).

In effect, the result is as if the data were read in column by column and then bound together column-wise using the cbind.data.frame() method. This is not how it is done in practice, though. You have a single object, data, with three components, none of which can be accessed by typing its name (e.g. Var1). Try exactly this:

data <- read.table("mydata.txt", header = TRUE)
Var1

in a clean session (best if you start a new session to try this, just in case).

If you were to type ls() you would see only data listed (assuming a clean session). This is clear evidence against your thinking that you have three columns stored as individual objects.
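A minimal sketch of that check, assuming a small stand-in for mydata.txt (the temporary file and its contents are made up for illustration, not the asker's data):

```r
# Hypothetical stand-in for mydata.txt: three columns, one missing value
tmp <- tempfile(fileext = ".txt")
writeLines(c("Var1 Var2 Var3",
             "1 4 7",
             "2 NA 8",
             "3 6 9"), tmp)

data <- read.table(tmp, header = TRUE)

class(data)     # "data.frame": one object holding all three columns
names(data)     # the column names live inside that object
exists("Var1")  # FALSE in a clean session: there is no free-standing Var1
data$Var1       # the column is reached through the data frame
```

Using exists() rather than ls() keeps the check independent of whatever else happens to be in the workspace.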

The real problem here is attach() not read.table().

There are very few good uses of attach() and the one you show is not among them. attach(data) places a copy of data on the search path. The key point there is copy. What is on the search path is not the same thing as data in the global environment (your workspace). Any changes to the data in the global environment are not reflected in the copy on the search path, because these are two, completely separate objects.

R has a search path where it looks for named objects. Normally R doesn't look inside objects and hence Var1 etc will not be found whenever you type their name at the prompt or attempt to use the object directly. When you attach() an object you can think of this as opening the object up to R's search. But the thing that catches people out is that one is now looking inside a copy of the object and not the object itself.
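The mismatch described in the question can be reproduced with a small made-up data frame (the values here are assumptions chosen for illustration):

```r
# Toy data frame: rows 2 and 4 each contain a missing value
data <- data.frame(Var1 = c(1, NA, 3, 4), Var2 = c(5, 6, 7, NA))

attach(data)          # puts a copy of the columns on the search path
"data" %in% search()  # TRUE: the attached copy is its own search-path entry

# This replaces the workspace object only; the attached copy is untouched
data <- data[complete.cases(data), ]

n_attached  <- length(Var1)       # found in the attached copy: still 4
n_workspace <- length(data$Var1)  # the filtered data frame: 2

detach(data)          # tidy up the search path afterwards
```

The two lengths differ for the same reason as the 100 versus 95 in the question: Var1 is looked up in the stale attached copy, while data$Var1 comes from the modified data frame.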

In interactive sessions, there are useful helper functions that mean you don't need to be typing data$ all the time. See ?with, ?within, ?transform for example.
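As a rough sketch of how those helpers are typically used (the data frame and the new column names Var3 and Var4 are made up for illustration):

```r
data <- data.frame(Var1 = c(1, 2, 3), Var2 = c(4, 5, 6))

# with(): evaluate an expression using the columns, without attaching anything
m <- with(data, mean(Var1 + Var2))

# within(): returns a modified copy, so assign the result back
data <- within(data, Var3 <- Var1 * Var2)

# transform(): similar idea, with new columns given as name = expression
data <- transform(data, Var4 = Var3 / Var1)
```

Because each call looks the columns up in the current data frame, there is no stale copy to get out of sync. Note that mean() itself has no data argument, which is why the expression is wrapped in with().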

Really, don't use attach() just to save a bit of typing.

Gavin Simpson

I'm pretty sure R reads files row by row. (In fact, I think just about all programming languages work this way.) I wonder if you are attaching your data frame before removing the incomplete cases. The behavior you describe is fairly typical when people call attach(data) beforehand. In general, it is recommended that you do not use attach() at all in R. But if you must use it, call detach(data) first, then modify the data frame, and then (if you must) call attach(data) again. At that point, you will no longer have this problem.
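The detach, modify, re-attach sequence might look like the following sketch, again with a made-up data frame:

```r
data <- data.frame(Var1 = c(1, NA, 3), Var2 = c(4, 5, 6))
attach(data)                          # the situation you are starting from

detach(data)                          # 1. remove the old copy from the search path
data <- data[complete.cases(data), ]  # 2. modify the data frame
attach(data)                          # 3. attach a fresh copy (if you must)

n <- length(Var1)                     # now agrees with nrow(data)

detach(data)                          # clean up when finished
```

Every later change to data needs the same detach and re-attach dance, which is a large part of why attach() is discouraged.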

Note, it is also possible that your problem is something different. However, we cannot tell, based on the information provided thus far. You will want to provide a reproducible example so that people can help you more effectively, see here: how-to-make-a-great-r-reproducible-example.

gung - Reinstate Monica
  • "Do not use `attach()`" just about sums it up. – Ari B. Friedman Jul 22 '13 at 03:28
  • OK, I tried deleting my attach() line, but the problem persists. I've read a good article about not using attach() and instead using data=my.data as an argument to avoid using my.data$Var1. When I learned R in my biostats course they always used attach(), so now I know better! And about making reproducible code - I did try my very best, but I can't figure out how to produce a data frame in R from scratch. I will explain in a 2nd comment. – D'Arcy Mulder Jul 22 '13 at 03:41
  • I tried: `x <- c(1,2,3); y <- c(4,5,6); data <- data.frame(x,y)`. I can call `x` or `data$x`. If I `remove(x)`, I can then only call `data$x` (calling `x` returns an error: object not found). Whereas if I import some data (like in my original post) I can always just call `Var1` (no need to specify `data$Var1`). I think this is a symptom of me not understanding what happens when a .txt is imported. – D'Arcy Mulder Jul 22 '13 at 03:44
  • I appreciate your efforts, @darcy.mulder. I'm aware that courses often teach the `attach()` command. It can be useful, but it's also dangerous; it often trips people up, which is what I think happened to you. As far as deleting your `attach()` line, that wouldn't help if you've already run it. Try using the `detach()` etc. strategy I outline. – gung - Reinstate Monica Jul 22 '13 at 03:51
  • I cleared my workspace (using RStudio), ran the `detach()` code, and now I cannot simply call `Var1`. So now I understand what `attach()` does - it allows me to call columns of attached data frames. So if I don't attach my data, I cannot directly call `Var1`, but I need to specify `data$Var1`. I suspect this is why I was taught to use `attach()`. – D'Arcy Mulder Jul 22 '13 at 04:15
  • That's exactly right, @darcy.mulder. `attach()` attaches a *copy* of your variables to the search path. Think of printing out a word document that you are working on & putting it on your desk. You can change the `.doc` file, but the hardcopy on your desk *does not change*. – gung - Reinstate Monica Jul 22 '13 at 04:19
  • I think working through this thread has cleared up some confusion about what's going on when I import data. So, is it standard practice to write functions in the form `mean(Var1, data=my.data)` if I wish to avoid the use of `my.data$Var1`? – D'Arcy Mulder Jul 22 '13 at 04:21
  • As @Gavin Simpson notes, you can use `with()`, or other options, eg, depending on your situation. – gung - Reinstate Monica Jul 22 '13 at 04:25
  • Thanks for sticking with me...I learned a bunch of lines of code for my course, but not the inner workings of what I was doing. I'm trying to relearn from scratch so I can code more efficiently. – D'Arcy Mulder Jul 22 '13 at 04:33
  • You're welcome, @darcy.mulder. Learning to program by yourself in R isn't easy. You may want to get a good book and work through it. I think there's a question on that topic around here (or on SO) somewhere. – gung - Reinstate Monica Jul 22 '13 at 04:36
  • I've actually been coding for a few years now, but it was pretty patchwork. When I'm writing nested for loops and subsetting data, I know I'm doing something more complicated than need be. So I figured I'd just start back at the beginning and ensure I understood exactly what was going on at each step. – D'Arcy Mulder Jul 22 '13 at 05:01
  • Here is such a question on SO, if you're still interested: [how-to-learn-r-as-a-programming-language](http://stackoverflow.com/questions/1744861/). – gung - Reinstate Monica Jul 24 '13 at 02:02