
I'm new to R. Just learning via online tutorials. My question is:

1) Why does accessing the same column with different syntaxes have different output presentation?

Vertical Display:

> airquality["Ozone"]
    Ozone
1      41
2      36
3      12

Horizontal Display:

> airquality$Ozone
[1]  41  36  12  18  NA  28  23  19   8 
[46]  NA  21  37  20  12  13  NA  NA  NA
[91]  64  59  39   9  16  78  35  66 122

2) Why do the following have different data types?

> class(airquality["Ozone"])
[1] "data.frame"
> class(airquality$Ozone)
[1] "integer"
> class(airquality[["Ozone"]])
[1] "integer"
smci
Viki
  • http://stackoverflow.com/questions/1169456/the-difference-between-and-notations-for-accessing-the-elements-of-a-lis – jogo Apr 29 '17 at 11:50
  • 1
    @jogo: No, that question is about accessing lists, not dataframes. (Yeah we know dataframes are also implemented as lists, under-the-hood, but that'll blow new users minds. Anyway the syntax for slicing dataframes is different, it allows an optional row-index too) – smci Apr 29 '17 at 12:01
  • @jogo: actually, its title is misleading (I just fixed it), the body does say "list or dataframe". However it doesn't mention '$' so it's not a superset of this. Also it doesn't have any code example so it sucks as a canonical question. Sigh... How the hell has one of the most basic R questions been here for 8 years and nobody bothered to fix the title? – smci Apr 29 '17 at 12:11
  • And you can't do `vector$index`. But you can do `dataframe$colname` – smci Apr 29 '17 at 12:18
  • @smci I gave the link as additional information - not meaning that the question is a dupe of it. – jogo Apr 29 '17 at 12:49
  • @jogo oh I see. Isn't it weird that we can't seem to find any canonical question...? – smci Apr 29 '17 at 14:30

1 Answer


Same reason for both: airquality["Ozone"] returns a dataframe, whereas airquality$Ozone returns a vector. class() shows you their object types. str() is also good for succinctly showing you an object.
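To see this for yourself, here is a small sketch using the built-in airquality dataset (the names col_df and col_vec are just illustrative):

```r
# airquality ships with R's built-in datasets package
col_df  <- airquality["Ozone"]   # single bracket keeps the data frame wrapper
col_vec <- airquality$Ozone      # dollar pulls out the column itself

class(col_df)    # "data.frame"
class(col_vec)   # "integer"

str(col_df)      # compact summary of a one-column data frame
str(col_vec)     # compact summary of an integer vector
```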

See the help on the '[' operator, also known as 'extracting' (in Python/C++/Java and most other languages we'd call this 'indexing' or 'slicing'), or on the function getElement(). In R, you can get help on a special character or operator by surrounding it with quotes: ?'[' or ?'$'.
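As a rough illustration that these operators are ordinary functions, the sketch below calls '[[' by name and compares it with getElement() and '$' (the variable names are made up for the example):

```r
# Operators can be called like any function by quoting them with backticks
by_dollar   <- airquality$Ozone
by_function <- `[[`(airquality, "Ozone")
by_get      <- getElement(airquality, "Ozone")

identical(by_dollar, by_function)   # all three give the same vector
identical(by_dollar, by_get)

# In an interactive session, help("[") or ?"[" opens the Extract help page
```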

As to why they print differently: print(obj) in R dispatches, under the hood, to a class-specific print method. In this case that is print.data.frame, which prints the data frame column(s) vertically with row names, versus print.default for a vector, which prints the contents horizontally, with only a bracketed position index (like [1]) at the start of each line.
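A minimal sketch of that dispatch, assuming a default R session where the utils package is attached (capture.output() and getS3method() live there). It captures what print() writes for each object so the two layouts can be compared:

```r
df_col <- head(airquality["Ozone"], 3)   # one-column data frame, 3 rows
vec    <- head(airquality$Ozone, 3)      # integer vector, 3 values

# The method that print() dispatches to for data frames
print_df <- getS3method("print", "data.frame")
is.function(print_df)

# Capture the printed text: one header line plus one line per row,
# versus a single line for the short vector
df_lines  <- capture.output(print(df_col))
vec_lines <- capture.output(print(vec))
length(df_lines)
length(vec_lines)
```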

Now back to extraction with the '[' vs '$' operators:

The most important distinction between ‘[’, ‘[[’ and ‘$’ is that the ‘[’ can select more than one element, whereas the other two, ‘[[’ and ‘$’, select a single element.

There's also the '[[' extraction syntax, which, like '$', selects a single element (here, the column as a vector):

airquality[["Ozone"]]
[1]  41  36  12  18 
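A short sketch of the "more than one element" distinction, using two real airquality columns (the variable names are illustrative only):

```r
# '[' can take several names and always returns a data frame here
two_cols <- airquality[c("Ozone", "Temp")]
is.data.frame(two_cols)
ncol(two_cols)

# '[[' and '$' each pull out exactly one column, as a plain vector
one_col <- airquality[["Ozone"]]
same    <- identical(one_col, airquality$Ozone)
```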

The difference between [["colname"]] and $colname is that in the former the column name is an ordinary expression, so it can come from a variable, whereas in the latter it must be a literal name typed in the code (it is not evaluated). So [[varname]] lets you index different columns depending on the value of varname.
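For instance, a sketch with a hypothetical variable called varname holding the column name:

```r
varname <- "Temp"

by_variable <- airquality[[varname]]   # looks up the column named "Temp"
by_dollar   <- airquality$varname      # looks for a column literally called "varname"

identical(by_variable, airquality$Temp)
is.null(by_dollar)                     # no such column, so '$' gives NULL

# Handy when looping over column names
col_means <- sapply(c("Ozone", "Temp"),
                    function(nm) mean(airquality[[nm]], na.rm = TRUE))
```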

Read the doc (?Extract) about the exact and drop arguments. Note that drop only applies to the two-index [rows, cols] form, for matrices, arrays and data frames alike. With a single index on a data frame it is ignored, with a warning:

> airquality["Ozone", drop=TRUE]
Warning message:
In `[.data.frame`(airquality, "Ozone", drop = TRUE) :
  'drop' argument will be ignored
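By contrast, as far as I can tell the two-index form does honor drop, on data frames as well as matrices. A small sketch (variable names are illustrative):

```r
# Two-index form on a data frame: a single column collapses to a vector...
as_vector <- airquality[, "Ozone"]
# ...unless drop = FALSE asks to keep the data frame
as_frame  <- airquality[, "Ozone", drop = FALSE]

is.integer(as_vector)
is.data.frame(as_frame)

# Same idea on a matrix: drop = FALSE keeps the 153 x 1 shape
m        <- as.matrix(airquality)
m_column <- m[, "Ozone", drop = FALSE]
dim(m_column)
```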

It's all kinda confusing, offputting at first, eccentrically different and quirkily non-self-explanatory. But once you learn the syntax, it makes sense. Until then, it feels like hitting your head off a wall of symbols.

Please take at least a brief skim of R-intro and the Indexing section of R-lang (both available as HTML or PDF). Bookmark them and come back to them regularly. Read them on the bus or plane...

PS: as @Henry mentioned, a comma changes the meaning when indexing a data frame. With a single index and no comma, the data frame is treated as a list of columns, so airquality["Ozone"] and airquality[1] both return a one-column data frame. With a comma, the indices are [rows, columns]: airquality[, "Ozone"] and airquality[, 1] both return the Ozone column as a vector, whereas airquality[1, ] returns the first row (as a one-row data frame). So the no-comma form isn't R being clever about strings rarely being row indices; it is simply list-style indexing.
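A quick way to check what each numeric form returns (a sketch; the variable names are only for illustration):

```r
first_col_df  <- airquality[1]     # no comma: list-style, one-column data frame
first_col_vec <- airquality[, 1]   # comma, column position: plain vector
first_row     <- airquality[1, ]   # comma, row position: one-row data frame

class(first_col_df)
class(first_col_vec)
dim(first_row)
```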

Anyway, it's all in the doc... not necessarily all contiguous or clearly-explained... welcome to R :-)

smci
  • 2
    Also worth comparing what you get from `airquality["Ozone"]` (a dataframe) with what you get with an extra comma `airquality[, "Ozone"]` (a vector like `airquality$Ozone`) while both `airquality[c("Ozone","Temp")]` and `airquality[, c("Ozone","Temp")]` give dataframes – Henry Apr 29 '17 at 11:23
  • 1
    Thanks "smci". Your explanation makes a lot of sense. It's true that all these varieties are confusing, especially why and when each will be required. As of now, I think this variety is provided to give programmers more freedom over data manipulation, and that it will all make sense once I get into more complex data manipulations. Thanks for the document links. Much to go. – Viki Apr 29 '17 at 11:51
  • 1
    You're welcome. My suggested quickstart is skim the manuals, and read any of the hundreds of quickstart guides or tutorials out there (try a few, if you don't like one, read another). Also, know how to get `help()` on any operator, function or package. And look at `vignette('PackageName')` for a quickstart on a particular package. And use SO, R-bloggers, github, Kaggle etc. to find good code examples. – smci Apr 29 '17 at 11:55