I have a txt file with the following format:

(4, 'AF', 'AFG', 'Afghanistan'),
(248, 'AX', 'ALA', 'Aland Islands'),
               .
               .
               .

I want to extract the number and the country. My idea is to use gsub with "[^0-9]" to find the number and something like tail(strsplit()) to extract the last word, after, of course, removing all the special characters. Is there a quicker way?

Data:

structure(list(V1 = c("(4, 'AF', 'AFG', 'Afghanistan'),", "(248, 'AX', 'ALA', 'Aland Islands'),", 
"(8, 'AL', 'ALB', 'Albania'),", "(12, 'DZ', 'DZA', 'Algeria'),", 
"(16, 'AS', 'ASM', 'American Samoa'),", "(20, 'AD', 'AND', 'Andorra'),"
)), .Names = "V1", row.names = c(NA, 6L), class = "data.frame")
Mpizos Dimitris
  • try a `strsplit` on the `,` then take the first and the fourth column? – etienne Nov 17 '15 at 13:31
  • @mpizosdimitris can you put a dput of (the head of) your data in the question? Makes solving things easier. – Heroka Nov 17 '15 at 14:18
  • [How to make a great R reproducible example?](http://stackoverflow.com/questions/5963269) – zx8754 Nov 17 '15 at 14:31
  • I did not ask for a solution. I asked for an alternative method. I don't see a reason why you downvoted. Anyway, thanks for your feedback. – Mpizos Dimitris Nov 17 '15 at 14:33

1 Answer

If your data.frame is called df, here is a way using regex:

Get the first number:

sub("^\\((\\d+).*", "\\1", df$V1)
#[1] "4"   "248" "8"   "12"  "16"  "20"

Get the country:

sub("[^a-z]+([A-Z][a-z A-Z]+).+", "\\1", df$V1)
#[1] "Afghanistan"    "Aland Islands"  "Albania"        "Algeria"        "American Samoa" "Andorra"
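
An alternative sketch, along the lines of the `strsplit` suggestion in the comments, pulls both fields out in one pass. It assumes that no country name contains a comma or an apostrophe (either would break the split or be stripped). The names `cleaned`, `parts`, and `result` are only illustrative, and the first three rows of the question's data are rebuilt here so the block runs on its own:

```r
# First three rows of the data from the question
df <- data.frame(
  V1 = c("(4, 'AF', 'AFG', 'Afghanistan'),",
         "(248, 'AX', 'ALA', 'Aland Islands'),",
         "(8, 'AL', 'ALB', 'Albania'),"),
  stringsAsFactors = FALSE
)

# Drop the leading "(", the trailing ")," and every single quote
# (assumption: quotes never occur inside a country name)
cleaned <- gsub("^\\(|\\),?$|'", "", df$V1)

# Split each line on a comma followed by optional spaces
# (assumption: commas never occur inside a country name)
parts <- strsplit(cleaned, ", *")

# Keep the first field (number) and the fourth field (country)
result <- data.frame(
  number  = as.integer(vapply(parts, `[`, character(1), 1)),
  country = vapply(parts, `[`, character(1), 4),
  stringsAsFactors = FALSE
)
```

Unlike the two `sub` calls, this returns the number as an integer column and keeps the intermediate fields (`parts`) around in case the two- or three-letter codes are needed later.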
Cath