I'm new to R programming. I have a CSV file that lists countries with their life expectancy and region, and I have to do the following:

  1. List the number of countries in each region and draw a bar chart
  2. Draw a boxplot for each region
  3. Cluster the countries by life expectancy using the k-means algorithm
  4. Name the countries that have the minimum and maximum life expectancy

input.csv

Country,LifeExpectancy,Region
India,60,Asia
Srilanka,62,Asia
Myanmar,61,Asia
USA,65,America
Canada,65,America
UK,68,Europe
Belgium,67,Europe
Germany,69,Europe
Switzerland,70,Europe
France,68,Europe

What I tried:

1.

mydata <- read.table("input.csv", header=TRUE, sep=",")
barplot(mydata$ncol(Region))

and I get this error: Error in barplot(mydata$ncol(Region)) : attempt to apply non-function

2. boxplot(LifeExpectancy~Region, data=mydata) ## This is correct

3. I have no idea how to do this!

4. min(mydata$LifeExpectancy); max(mydata$LifeExpectancy) ## This is correct

Ethan
  • Please provide part of your data, like: head(data) – FFI Apr 09 '14 at 06:47
  • This really deserves to be split into multiple questions (after doing due diligence and searching for existing answers here and elsewhere online). The way I see it, the primary question here is: "why am I getting these errors?" I would take a good look at `?barplot`. It expects a vector of bar heights. This can be created by aggregating (`?aggregate`, `?tapply`, `?by`) the country column by region, and applying the `?length` function. Your `data$ncol(Region)` is quite incorrect ;) – jbaums Apr 09 '14 at 07:04
  • I think the first error appears when importing the csv file. We're waiting for a small, reproducible example (which will help you more than us): http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Roman Luštrik Apr 09 '14 at 07:07
  • Next, what do you expect the boxplots to show? Life expectancy per region, I would imagine. It doesn't make sense to try to create boxplots of country per region (how would you calculate the median and quartiles of country, for example?). You probably want: `boxplot(Life.Expectancy~Region,data=data)` (you'll want to change the name of the Life Expectancy column to "Life.Expectancy"). – jbaums Apr 09 '14 at 07:08
  • @RomanLuštrik - the first error is triggered by `data$ncol(Region)`, since `data$ncol` is not a function. – jbaums Apr 09 '14 at 07:09
  • @jbaums I changed Life Expectancy to LifeExpectancy. `boxplot(LifeExpectancy~Region,data=data)` gives this error: `Error in as.data.frame.default(data, optional = TRUE):cannot coerce class ""function"" to a data.frame`. And `boxplot(Country~Region,data=data)` gives this error: `Error in boxplot.default(split(mf[[response]], mf[-response]), ...):adding class "factor" to an invalid object`. I have to create box plots for each region. That was the question asked and I can't change it :( `summary(data)` gives me everything, including the mean, median and quartiles of LifeExpectancy. – Ethan Apr 09 '14 at 08:15
  • Your first error occurs if your data frame is not actually called `data`. Sidenote: don't call your data "data" because `data` is a function in base R. Using names that are already in use leads to confusing error messages, and in some cases, masked objects. Regarding the second error, I mentioned already that it is nonsensical to plot `Country` as the response... you are specifying that you want to split `data` by `Region`, and then boxplot the `Country` vector for each of these splits. – jbaums Apr 09 '14 at 08:28
  • Okay, changed it. And I have the box plot now. Updated the question! Thanks for your input! I was being stupid there! Oh, I still am! – Ethan Apr 09 '14 at 08:41

1 Answer

As I pointed out in my comments, this question is really multiple questions, and does not reflect the title. In future, please try to keep questions manageable and discrete. I'm not going to attempt to answer your third point (about K-means clustering) here. Search SO and I'm sure you will find some relevant questions/answers.

Regarding your other questions, have a careful look at the following. If you don't understand what a particular function is doing, refer to ?function_name (e.g. ?tapply), and for further enlightenment, run nested code from the inside out (e.g. for foo(bar(baz(x))), you could examine baz(x), then bar(baz(x)), and finally foo(bar(baz(x)))). This is an easy way to get a handle on what's going on, and is also useful when debugging code that produces errors.
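As a small illustration of that inside-out approach, here is a sketch on a made-up vector. The `region` vector and the `step1`/`step2`/`step3` names are hypothetical and only there for the example; they are not part of the question's data.

```r
# Hypothetical toy data, used only to illustrate inside-out evaluation.
region <- c("Asia", "Asia", "Europe", "America", "Europe", "Europe")

# The full nested call: name of the region that occurs most often.
result <- names(which.max(table(region)))

# Step 1: innermost call, a table of counts per region.
step1 <- table(region)

# Step 2: wrap the next call around it, giving the position of the
# largest count as a named integer.
step2 <- which.max(step1)

# Step 3: the outermost call keeps only the name.
step3 <- names(step2)

print(step1)
print(step2)
print(step3)
```

Printing each intermediate object shows what every layer of the nested call contributes, which is usually enough to spot the layer that produces an error.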

d <- read.csv(text='Country,LifeExpectancy,Region
India,60,Asia
Srilanka,62,Asia
Myanmar,61,Asia
USA,65,America
Canada,65,America
UK,68,Europe
Belgium,67,Europe
Germany,69,Europe
Switzerland,70,Europe
France,68,Europe', header=TRUE)

barplot(with(d, tapply(Country, Region, length)), cex.names=0.8, 
        ylab='No. of countries', xlab='Region', las=1)

[Figure: bar plot of the number of countries per region]
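The question also asks for the counts themselves, not only the chart. One possible way to list them is `table()`, shown here as a sketch rather than the only approach; the names `counts` and `counts_df` are introduced just for this example.

```r
# Same sample data as in the question.
d <- read.csv(text = 'Country,LifeExpectancy,Region
India,60,Asia
Srilanka,62,Asia
Myanmar,61,Asia
USA,65,America
Canada,65,America
UK,68,Europe
Belgium,67,Europe
Germany,69,Europe
Switzerland,70,Europe
France,68,Europe', header = TRUE)

# Count how many rows (countries) fall in each region.
counts <- table(d$Region)
print(counts)

# Optional: the same counts as a two-column data frame.
counts_df <- as.data.frame(counts, stringsAsFactors = FALSE)
names(counts_df) <- c("Region", "Countries")
print(counts_df)
```

The `counts` object can also be passed straight to `barplot()`, since a one-dimensional table is treated as a named vector of bar heights.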

boxplot(LifeExpectancy ~ Region, data=d, las=1, 
        xlab='Region', ylab='Life expectancy')

[Figure: box plots of life expectancy by region]

d$Country[which.min(d$LifeExpectancy)]

# [1] India
# Levels: Belgium Canada France Germany India Myanmar Srilanka Switzerland UK USA

d$Country[which.max(d$LifeExpectancy)]

# [1] Switzerland
# Levels: Belgium Canada France Germany India Myanmar Srilanka Switzerland UK USA
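Note that `which.min` and `which.max` return only the first match, so ties would be dropped, and the `Levels:` line appears only when `Country` is read as a factor. A variant that returns every tied country as plain text might look like the sketch below; `le`, `min_countries` and `max_countries` are names made up for this example.

```r
# Same sample data as in the question.
d <- read.csv(text = 'Country,LifeExpectancy,Region
India,60,Asia
Srilanka,62,Asia
Myanmar,61,Asia
USA,65,America
Canada,65,America
UK,68,Europe
Belgium,67,Europe
Germany,69,Europe
Switzerland,70,Europe
France,68,Europe', header = TRUE)

le <- d$LifeExpectancy

# Compare against the extreme value so that all tied countries are kept;
# as.character() drops factor levels if Country was read as a factor.
min_countries <- as.character(d$Country[le == min(le)])
max_countries <- as.character(d$Country[le == max(le)])

print(min_countries)  # "India"
print(max_countries)  # "Switzerland"
```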
jbaums
  • Thank you very much! It's like... I can understand what's happening after looking at the answer, but can't do it alone! I'll try completing the 3rd and update the answer in my question. – Ethan Apr 09 '14 at 09:00
  • @Ethan for the 3rd you can simply use the function `kmeans`. So for instance `k<-kmeans(d$LifeExpectancy,2) ; split(d,k$cluster)` splits your countries into two groups based on their life expectancy. – plannapus Apr 09 '14 at 14:21
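The `kmeans` suggestion in the last comment can be expanded into a runnable sketch. The choice of two clusters comes from that comment; the seed, `nstart = 25` and the `Cluster` column are assumptions added here so the result is repeatable, and a different number of clusters may suit the real data better.

```r
# Same sample data as in the question.
d <- read.csv(text = 'Country,LifeExpectancy,Region
India,60,Asia
Srilanka,62,Asia
Myanmar,61,Asia
USA,65,America
Canada,65,America
UK,68,Europe
Belgium,67,Europe
Germany,69,Europe
Switzerland,70,Europe
France,68,Europe', header = TRUE)

# kmeans() starts from randomly chosen centres, so fix the seed
# (arbitrary value) to get the same result on every run.
set.seed(42)

# Cluster on the single LifeExpectancy column; nstart = 25 repeats the
# algorithm from 25 random starts and keeps the best solution.
k <- kmeans(d$LifeExpectancy, centers = 2, nstart = 25)

# Attach the cluster label to each country and list countries per cluster.
d$Cluster <- k$cluster
print(split(as.character(d$Country), d$Cluster))

# Cluster centres (mean life expectancy of each group).
print(k$centers)
```

The cluster numbers themselves are arbitrary labels; only the grouping of countries is meaningful.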