After running a survey on perceived problems per neighborhood, I ended up with the data frame below. The survey offered several fixed options plus an open one, so many of the answers to the open question are irrelevant (see below):

library(dplyr)
library(splitstackshape)
df = read.csv("http://pastebin.com/raw.php?i=tQKHWMvL")

# Splitting multiple answers into different rows.
df = cSplit(df, "Problems", ",", direction = "long")

df = df %>%
  group_by(Problems) %>%
  summarise(Total = n()) %>%
  mutate(freq = Total/sum(Total)*100) %>%
  arrange(desc(freq))  # sort by descending frequency

Resulting in this data frame:

> df
Source: local data table [34 x 3]

                       Problems Total       freq
1  Hurtos o robos sin violencia   245 25.6008359
2                        Drogas   232 24.2424242
3             Peleas callejeras   162 16.9278997
4               Ningún problema   149 15.5694880
5                    Agresiones    66  6.8965517
6           Robos con violencia    62  6.4785789
7            Quema contenedores     6  0.6269592
8                        Ruidos     5  0.5224660
9                         NS/NC     4  0.4179728
10                    Desempleo     2  0.2089864
..                          ...   ...        ...
>

As you can see, the results after row 9 are mostly irrelevant (only one or two respondents per option), so I'd like to group them into a single option (such as "Others") without losing their relation to the neighborhood (that's why I can't simply rename the values now). Any suggestions?

Steven Beaupré
ccamara
  • So did you decide what is your desired output yet? – David Arenburg Jul 26 '15 at 15:47
  • My desired output is a barplot of problems per neighborhood. However, because the source is an open questionnaire, there are many irrelevant answers (with just a few votes each), which I would like to aggregate into "Others" (while maintaining their relationship to the neighborhood), as well as other problems that are in fact synonyms. I have updated the info and broadened the question here: http://stackoverflow.com/questions/35813805/aggregating-and-mapping-observations-from-an-open-questionnaire – ccamara Mar 06 '16 at 16:50

1 Answer

The splitstackshape package imports data.table (so you don't even need to call library() on it) and assigns the data.table class to your data set, so I would simply proceed with data.table syntax from there, especially because nothing beats data.table when it comes to assignment on a subset.

In other words, instead of the long dplyr pipeline, you can simply run the following on the output of cSplit:

df[, freq := .N / nrow(df) * 100 , by = Problems]
df[freq < 6, Problems := "OTHER"]

And you're good to go.
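To illustrate why this row-level update keeps every answer tied to its neighborhood, here is a minimal sketch on made-up data. The `Neighborhood` column name, the values, and the threshold of 20 are all assumptions for illustration; the real column names in the survey file are not shown above.

```r
library(data.table)

# Toy data: one row per answer, as cSplit(..., direction = "long") would produce.
# `Neighborhood` is a hypothetical column name; all values are invented.
toy <- data.table(
  Neighborhood = c("A", "A", "A", "B", "B", "B", "B", "A", "B", "A"),
  Problems     = c("Drogas", "Drogas", "Ruidos", "Drogas", "Agresiones",
                   "Agresiones", "Agresiones", "Drogas", "Desempleo", "Agresiones")
)

# Share of all answers for each problem, stored on every row of that problem.
toy[, freq := .N / nrow(toy) * 100, by = Problems]

# Relabel rare problems in place; a cutoff of 20 only suits this toy data.
toy[freq < 20, Problems := "OTHER"]

# Every row still carries its neighborhood, so counts per neighborhood remain available.
per_neighborhood <- toy[, .N, by = .(Neighborhood, Problems)]
print(per_neighborhood)
```

Because the update happens on the long table rather than on a summary, no rows are dropped and the other columns are left untouched.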

You can check the new summary table using

df[, .(freq = .N/nrow(df) * 100), by = Problems][order(-freq)]
# 1: Hurtos o robos sin violencia 25.600836
# 2:                       Drogas 24.242424
# 3:            Peleas callejeras 16.927900
# 4:              Ningún problema 15.569488
# 5:                   Agresiones  6.896552
# 6:          Robos con violencia  6.478579
# 7:                        OTHER  4.284222
David Arenburg
  • Thanks for your answer, David. Although I could not reproduce it, I'm afraid that what you're suggesting is another way to achieve what I already have, which is not what I really want. By grouping all irrelevant values into "Others", I do not know how to relate answers to neighborhoods again. – ccamara Jul 24 '15 at 14:40
  • So instead of overwriting the `Problems` column, assign to a new one, say `Aggs` or something, using exactly the same code as above. Other than that, I have no idea what you want. Maybe add your desired output. – David Arenburg Jul 24 '15 at 14:55
  • Also, I did not do what you already did. I've assigned new values according to your condition and updated your data by reference. You just calculated frequencies. – David Arenburg Jul 24 '15 at 15:35
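
The suggestion in the comments, writing the aggregated label to a new column such as `Aggs` instead of overwriting `Problems`, might look like the sketch below. It uses invented data, a hypothetical `Neighborhood` column, and a cutoff of 20 chosen only for this toy example; it is one possible reading of the comment, not the answerer's exact code.

```r
library(data.table)

# Invented data; `Neighborhood` is a hypothetical column name.
toy <- data.table(
  Neighborhood = c("A", "A", "A", "B", "B", "B"),
  Problems     = c("Drogas", "Ruidos", "Drogas", "Drogas", "Desempleo", "Drogas")
)

# Share of all answers for each problem, stored on every row.
toy[, freq := .N / nrow(toy) * 100, by = Problems]

# Copy the labels into a new column, then overwrite only the rare ones there,
# so the original `Problems` values stay available.
toy[, Aggs := as.character(Problems)]
toy[freq < 20, Aggs := "OTHER"]

# Counts per neighborhood and aggregated label, e.g. as input for a bar plot.
counts <- toy[, .N, by = .(Neighborhood, Aggs)][order(Neighborhood, -N)]
print(counts)
```

Keeping both columns means the original answers can still be inspected or remapped later, for example to merge synonyms.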