Creating a dummy variable according to data in a matrix in R

Question

I have a data frame with 1000 observations belonging to n different countries. Each country has more than one observation, and the number of observations per country differs. I need to create a column of numbers running from 1 to n-1, with each number corresponding to a different country. That is, I am creating a dummy variable and I don't care which country gets which number; I just need to create such dummies. My data look something like this:

  Region     x
1    be1 71615
4  be211 54288
5  be112 51158
6  it213 69856
8  it221 71412
9  uk222 79537
10 de101 94827
11 de10a 98273
12 dea10 92827
..    ..    ..

Each country has its own "code" in the column Region; for instance, beXXXX corresponds to Belgium, ukXXX to the United Kingdom, and so on. Hence I suppose I could exploit the first two letters of Region to create my dummies. I know from here that grep() could do the job, but I need a script that automatically moves on to the next number (from 1 to n-1) whenever the initial letters of Region change.

The expected output should be like this

 Region     x   Dummy
1    be1 71615      1
4  be211 54288      1
5  be112 51158      1
6  it213 69856      2
8  it221 71412      2
9  uk222 79537      3
10 de101 94827      4
11 de10a 98273      4
12 dea10 92827      4
..    ..    ..     ..

and in this case 1 corresponds to "be" (Belgium), 2 to "it" (Italy), and so on for the n countries in my sample.

You are right. I posted the expected output. I want to stress that the dataframe is ordered by region (and hence by country), that is I have first all the beXXX observations, then the itXXXX and so on. Maybe this can be exploited to make things simpler. — Bob, Sep 16 '13 at 10:38

Simon O'Hanlon · Accepted Answer · 2013-09-16T10:58:03.833

5

How about creating a factor variable? (You can show the underlying integer codes with as.integer.) We use regexec and regmatches to extract the letter codes that occur at the beginning of the Region variable (ignoring letters that occur later) and turn them into the factor:

#  Data with an extra row (row number 11)
df <- read.table( text = "  Region     x
1    be1 71615
4  be211 54288
5  be112 51158
6  it213 69856
8  it221 71412
9  uk222 79537
11  uk222a 79537
10 de101 94827" , header = TRUE , stringsAsFactors = FALSE )

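#  Extract the leading lowercase letters of each Region code.
#  regmatches() returns a list, so unlist() it to get a character vector.
levs <- unlist( regmatches( df$Region , regexec( "^[a-z]+" , df$Region ) ) )

#  Number the countries in order of first appearance
df$Country <- as.integer( factor( levs , levels = unique( levs ) ) )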

   Region     x Country
1     be1 71615       1
4   be211 54288       1
5   be112 51158       1
6   it213 69856       2
8   it221 71412       2
9   uk222 79537       3
11 uk222a 79537       3
10  de101 94827       4

unlist( regmatches( df$Region , regexec( "^[a-z]+" , df$Region ) ) )
[1] "be" "be" "be" "it" "it" "uk" "uk" "de"

edited Sep 16 '13 at 10:58

answered Sep 16 '13 at 10:41


Hello. Thank you for the insight. However, I still have a problem. I need the code to consider just the INITIAL letters, since I have regional codes as follows: uk218, uk219, uk21a (it goes from 1 to 9, then switches to letters). The code you provided creates a variable named (uk) and another named (uka). I just need (uk). About the order: that's not really important, but it can be of use to somebody else. Thank you – Bob Sep 16 '13 at 10:52
You are right. I have >1000 observations and forgot about it. Going to edit the question. Thank you – Bob Sep 16 '13 at 11:00
@Bob it should work now. You can just use the example data I put in my answer - it would save me from having to re-edit (again!). – Simon O'Hanlon Sep 16 '13 at 11:00
@SimonO101. With your data things are fine. With mine I get this error: `Error in regexec("^[a-z]+", code.nuts3$Region) : invalid 'text' argument`. I suppose it is my fault, though: can it be due to the fact that I have these kinds of codes: de111, de11a and dea11? Thank you again – Bob Sep 16 '13 at 11:13
@Bob note the `stringsAsFactors` argument in `read.table`. `Region` should be a string but it is being treated as a factor. Alternatively use `regexec("^[a-z]+", as.character( code.nuts3$Region ) )` – Simon O'Hanlon Sep 16 '13 at 11:28
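
A minimal sketch of the `as.character()` workaround from the comment above, assuming a data frame named `code.nuts3` whose `Region` column is stored as a factor. The sample codes are the ones Bob mentions; the data frame itself is made up for illustration. The pattern `^[a-z]+` would capture "dea" from a code such as dea11, so this sketch uses `^[a-z]{2}` instead, which assumes every country prefix is exactly two letters.

```r
# Hypothetical data: Region stored as a factor rather than as character
code.nuts3 <- data.frame(
  Region = factor(c("de111", "de11a", "dea11", "uk218", "uk21a"))
)

# Convert the factor to character before pattern matching
region <- as.character(code.nuts3$Region)

# Keep only the first two lowercase letters (assumes two-letter country codes)
prefix <- unlist(regmatches(region, regexec("^[a-z]{2}", region)))

# Number the countries in order of first appearance
code.nuts3$Country <- as.integer(factor(prefix, levels = unique(prefix)))
```

Here all three German codes share the prefix "de" and so receive the same integer, and the two UK codes receive the next one.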

agstudy · Answer 2 · 2013-09-16T11:26:03.827

2

Another option, using gsub:

gsub('.*(^[a-z]{2}).*','\\1',c('de111', 'de11a','dea11'))
[1] "de" "de" "de"

Then use factor and as.integer as shown in the previous answer.
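
Putting the two steps together might look like the sketch below. It rebuilds the sample rows from the question by hand and, as an assumption of this sketch rather than the answer's exact call, uses `sub()` with a simpler anchored pattern that extracts the same two-letter prefix.

```r
# Sample rows from the question, rebuilt by hand for illustration
df <- data.frame(
  Region = c("be1", "be211", "be112", "it213", "it221",
             "uk222", "de101", "de10a", "dea10"),
  x = c(71615, 54288, 51158, 69856, 71412, 79537, 94827, 98273, 92827),
  stringsAsFactors = FALSE
)

# Capture the first two letters and replace the whole string with them
country <- sub("^([a-z]{2}).*", "\\1", df$Region)

# Integer codes in order of first appearance: be = 1, it = 2, uk = 3, de = 4
df$Dummy <- as.integer(factor(country, levels = unique(country)))
```

Passing `levels = unique(country)` keeps the numbering in the order the countries appear in the data; without it, `factor()` would sort the levels alphabetically.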

edited Sep 16 '13 at 11:26

answered Sep 16 '13 at 11:19


+1 this will be fairly robust, assuming all country codes are two letters. – Simon O'Hanlon Sep 16 '13 at 11:52
Could you explain further the usage of '\\1', please? Thank you – Bob Sep 16 '13 at 13:32
\\1 refers to whatever was captured by the first pair of parentheses in the regex pattern. – agstudy Sep 16 '13 at 13:39
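
To illustrate the backreference, here is a small sketch with two capture groups; the pattern and the variable names are made up for this example. In an R string the backslash is itself escaped, so `"\\1"` reaches the regex engine as `\1`.

```r
x <- c("de111", "uk21a")

# Group 1: the two leading letters; group 2: the rest of the code
pattern <- "^([a-z]{2})([0-9a-z]*)$"

first   <- sub(pattern, "\\1", x)       # keep only group 1
second  <- sub(pattern, "\\2", x)       # keep only group 2
swapped <- sub(pattern, "\\2-\\1", x)   # reorder the two groups
```

With these inputs, `first` is "de" "uk", `second` is "111" "21a", and `swapped` is "111-de" "21a-uk".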
