How to index outliers?

Question

I have the data below. How can I determine which author has the highest number of publications?

I tried this:

   which(status$researchers == max(status$publications))

but it doesn't seem to work.

#PUBLICATIONS

researchers = c("Smith", "Johnson", "Williams", "Brown", "Jones", "Miller", "Davis", "García", "Rodriguez", "Wilson", "Martinez", "Anderson", "Taylor", "Thomas", "Hernandez", "Moore", "Martin", "Jackson", "Thompson", "White", "Lopez", "Lee", "Gonzalez", "Harris", "Clark", "Lewis", "Robinson", "Walker", "Perez", "Hall", "Young", "Allen", "Sanchez", "Wright", "King", "Scott", "Green", "Baker", "Adams", "Nelson", "Hill", "Ramirez", "Campbell", "Mitchell", "Roberts", "Carter", "Phillips", "Evans", "Turner", "Stapel", "Torres", "Parker", "Collins", "Edwards", "Stewart", "Flores", "Morris", "Nguyen", "Murphy", "Rivera", "Cook", "Rogers", "Morgan", "Peterson", "Cooper", "Reed", "Bailey", "Bell", "Gomez", "Kelly", "Howard", "Ward", "Cox", "Diaz", "Richardson", "Wood", "Watson", "Brooks", "Bennett", "Gray", "James", "Reyes", "Cruz", "Hughes", "Price", "Myers", "Long", "Foster ", "Sanders", "Ross", "Morales", "Powell", "Sullivan", "Russell", "Ortiz", "Jenkins", "Gutierrez", "Perry", "Butler", "Barnes", "Fisher", "De Jong", "Jansen", "De Vries", "vd Berg", "Van Dijk", "Bakker", "Janssen", "Visser", "Smit", "Meijer", "De Boer", "Mulder", "De Groot", "Bos", "Smeesters", "Vos", "Peters", "Hendriks", "Van Leeuwen", "Dekker", "Brouwer", "De Wit", "Dijkstra", "Smits", "De Graaf", "Van der Meer", "Muller", "Schmidt", "Schneider", "Fischer", "Meyer", "Weber", "Schulz", "Wagner", "Becker", "Hoffmann", "Wagemakers",  "Molenaar", "Jansen", "White", "Bargh", "Dijksterhuis", "Poldermans", "Kanazawa", "Lynne", "Ling", "Vorst", "Borsboom", "Wicherts")

articles = data.frame(cbind(researchers, publications))
write.table(articles, file = "scientific status.txt", sep = " ")

status = read.table("scientific status.txt", header = TRUE, sep = "", quote = "\"'")

I don't think how you create the data is relevant here, and the `{write,read}.table` steps even less so. It would be a lot more useful if you gave a sample of your data; please refer to http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example — flodel, Dec 30 '12 at 12:21
Well, I thought it would be useful to be able to create the data. — mats, Dec 30 '12 at 12:45
But what are the contents of `status` ? Unless they are integers, you're unlikely to get any matches. Your `researchers` vector has no numbers so `max` is going to do interesting things with those character strings. — Carl Witthoft, Dec 30 '12 at 14:39

agstudy · Answer 1 · 2012-12-30T12:23:20.193

2

This is not a general solution, but for this data you only need to extract the duplicated names:

researchers[duplicated(researchers)]
[1] "Jansen" "White"  ## these 2 authors each have 1 publication more than the others!

To see the outliers, you can plot the counts, for example:

plot(table(researchers))

edited Dec 30 '12 at 12:23

answered Dec 30 '12 at 12:01

flodel · Answer 2 · 2012-12-30T12:18:33.093

2

It is not clear what your data represents. If it is already aggregated per author, i.e., there is one row per author and the publications column contains the number of publications, do:

status$researchers[which.max(status$publications)]
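As a minimal sketch of the aggregated case: the data frame below is hypothetical (names and counts are invented for illustration), but it shows why the attempt in the question returns nothing, since it compares author names against a number, and how `which.max` differs from a comparison against `max` when several authors tie for the top count.

```r
# Hypothetical aggregated data: one row per author (values invented for illustration)
status <- data.frame(
  researchers  = c("Smith", "Jansen", "White", "Bakker"),
  publications = c(12, 31, 7, 31),
  stringsAsFactors = FALSE
)

# which.max() returns the index of the first maximum only
top_first <- status$researchers[which.max(status$publications)]

# Comparing the counts against max() keeps every author tied for the top
top_all <- status$researchers[status$publications == max(status$publications)]

print(top_first)
print(top_all)
```

If ties matter for your analysis, the second form is probably the safer choice.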

If instead your data is not aggregated, i.e., there is one row per article, you can do:

tail(sort(table(status$researchers)), 1)
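A small sketch of the unaggregated case, assuming a hypothetical vector with one entry per article (the names are invented for illustration). `table` counts the articles per author, after which the largest count can be looked up by name.

```r
# Hypothetical unaggregated data: one element per article (invented for illustration)
researchers <- c("Smith", "Jansen", "White", "Jansen", "Smith", "Jansen")

# Number of articles per author
counts <- table(researchers)

# Name of the author with the most articles (first one if there is a tie)
top <- names(counts)[which.max(counts)]

# The corresponding number of articles
top_count <- as.integer(max(counts))

# Full ranking, most productive author first
ranking <- sort(counts, decreasing = TRUE)

print(top)
print(top_count)
print(ranking)
```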

edited Dec 30 '12 at 12:18

answered Dec 30 '12 at 12:07

Thanks. This helps. And what about the situation where I want to know the name of the researcher who published, say, 30 articles? – mats Dec 30 '12 at 12:19
If your data is already aggregated, `subset(status, publications >= 30)`. If it is not aggregated, `which(table(researchers) >= 30)`. – flodel Dec 30 '12 at 12:23
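
A hedged sketch of both threshold lookups, using small hypothetical data (names, counts, and the lower threshold of 3 in the second part are invented so the example stays short). Use `==` instead of `>=` if you want authors with exactly that many articles.

```r
# Aggregated case: hypothetical data with one row per author
status <- data.frame(
  researchers  = c("Smith", "Jansen", "White"),
  publications = c(12, 30, 45),
  stringsAsFactors = FALSE
)

# Authors with at least 30 publications
at_least_30 <- subset(status, publications >= 30)$researchers

# Authors with exactly 30 publications
exactly_30 <- status$researchers[status$publications == 30]

# Unaggregated case: hypothetical vector with one element per article
researchers <- c("Smith", "Jansen", "Smith", "Smith", "Jansen")
counts <- table(researchers)

# Authors with at least 3 articles (threshold lowered for this toy data)
at_least_3 <- names(counts)[as.vector(counts) >= 3]

print(at_least_30)
print(exactly_30)
print(at_least_3)
```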
