1

I have a panel dataset covering 180 countries in a .csv file, and I would like to create a subset of those countries and run regressions on it.

Here is a screenshot of my dataset: http://i.imgur.com/e3s3XVn.png

I have been toying with the subset function but I can't seem to get it to work correctly.

Ultimately, how should I go about creating a subset that includes only, for example, "Albania" and "United States", while keeping all the other columns the same?

Thank you for any suggestions.

joran
Sam Chu

2 Answers

7

This is very basic subsetting, and you can find several answers on SO and in any introductory manual.

Assuming you have read your csv file in as an object named "df", something like this should do the job:

df[df$country %in% c("United States", "Albania"), ]
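As a minimal sketch of the whole workflow: the file written below is a hypothetical stand-in for the real panel file, and the column names (country, year, gdp) are assumptions made only so the example is self-contained.

```r
# Hypothetical stand-in for the real .csv; column names and values are made up.
csv_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(country = c("Albania", "Albania", "France", "United States"),
             year    = c(2000, 2001, 2000, 2000),
             gdp     = c(1.0, 1.1, 2.0, 3.0)),
  csv_path, row.names = FALSE
)

# Read the file back in, keeping country as plain character strings.
df <- read.csv(csv_path, stringsAsFactors = FALSE)

# Keep only the rows for the countries of interest; every column is retained.
wanted <- c("Albania", "United States")
df_sub <- df[df$country %in% wanted, ]
```

The empty slot after the comma in `df[rows, ]` is what keeps all columns; only the rows are filtered.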

In the future:

  1. Screenshots of your data are of little use. Please use something like dput(head(yourdata)) instead.
  2. Show what you have tried. Don't simply write "I've been toying with the subset function". If you want to use the subset function in particular but haven't had success, it is helpful to show what you have done to help others troubleshoot.

A minimal example

Sample data:

set.seed(1)
df <- data.frame(country = sample(letters[1:5], 15, replace = TRUE),
                 somerandomvalue = rnorm(15),
                 anotherrandomvalue = rnorm(15),
                 stringsAsFactors = TRUE)  # factor column, so summary() below counts each country

A summary of the "country" column shows that there are five unique countries and 15 cases (rows) overall.

> summary(df$country)
a b c d e 
2 5 1 4 3 

Take just a subset:

> df[df$country %in% c("a", "b"), ]
   country somerandomvalue anotherrandomvalue
1        b    -0.005767173         0.80418951
2        b     2.404653389        -0.05710677
5        b    -1.147657009        -0.69095384
10       a    -0.891921127        -0.43331032
11       b     0.435683299        -0.64947165
12       a    -1.237538422         0.72675075
14       b     0.377395646         0.99216037

Or, using the subset function:

subset(df, country %in% c("a", "b"))
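
The logical test can also be stored once and reused, for instance to collect the rows that were left out as well. A minimal sketch, with toy data and object names (`toy`, `keep`, `chosen`, `others`) chosen here purely for illustration:

```r
# Toy data; the values are made up for illustration.
toy <- data.frame(country = c("a", "b", "c", "a", "d"),
                  value   = c(10, 20, 30, 40, 50))

# Logical vector: TRUE where the country is one of the chosen ones.
keep <- toy$country %in% c("a", "b")

chosen <- toy[keep, ]    # rows for "a" and "b"
others <- toy[!keep, ]   # every remaining row
```

Because both data frames come from the same logical vector, every row of `toy` ends up in exactly one of them.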
A5C1D2H2I1M1N2O1R2T1
  • One more related question, let's say I have picked out the countries I wanted using the subset function. How can I group the others that I left out into another data.frame? – Sam Chu Mar 11 '13 at 19:06
  • @SamChu, the exclamation mark is "not", so you can do something like `countriesIWant <- df[df$country %in% c("a", "b"), ]; allTheOtherCountries <- df[!df$country %in% c("a", "b"), ]`. Notice the `!` in there. Try it on the sample data I've provided in my answer and see how it works. – A5C1D2H2I1M1N2O1R2T1 Mar 11 '13 at 19:10
2

Try using the subset function:

   subset(YourData, country %in% c('Albania', 'United States'))

See ?subset for further details.

An example: (Edit thanks to @Roman Luštrik and Ananda's comments)

> Data <- data.frame(Country=rep(letters[1:6], each=3), random=rnorm(18))
> subset(Data, Country %in% c('a','b'))
  Country      random
1       a -1.02159357
2       a -0.88256998
3       a -0.24138579
4       b  0.35844584
5       b  0.05288194
6       b -1.09724481
> subset(Data, Country == "a" | Country == "b")
  Country      random
1       a -1.02159357
2       a -0.88256998
3       a -0.24138579
4       b  0.35844584
5       b  0.05288194
6       b -1.09724481

Here you will learn how to make a nice reproducible example for illustrating your question.

Jilber Urbina
  • Thanks, I do apologize for not making this easily readable/understandable. – Sam Chu Mar 11 '13 at 17:50
  • 3
    If you wanted to get away with subset like that, you should probably do `subset(Data, Country == "a" | Country == "b")`. Your current setup gives some funky recycling (hat tip to Ananda). – Roman Luštrik Mar 11 '13 at 18:15
  • @AnandaMahto and Roman Luštrik thanks for the comments I was wrong, I've just updated my answer ;) – Jilber Urbina Mar 12 '13 at 09:56