0

I have a dataframe called 'Cdata' It has multiple columns with multiple strings in each one of them I'm looking for a way to choose from each columns the expression I need an then sum up the numbers from the last one which contains numbers.

Here's the excel file:

Database excel screenshot

i'm looking for a way to sort and sum it up. Example:

Select Cdata$Gender = "Female" & Cdata$Month = "2020-01" & Cdata$District = "North" & Cdata$Age = '25-34' &
Cdata$Religion = 'Christian' 

and sum column Jobseekers for all these values. Then I need to plot this data and show the difference between christian unemployed females from the north and south, or the difference between march and april and do statistic tests.

here's an example of outcome:

Month   District    Age    Gender  Religion    Occupation             JobSeekers
2020-01 North      25-34   Female  Christian   Unprofessional workers    3258

I tried to explain it with minimum lines so it'll be informative and directly instead of long and clumsy. Please consider me as a newbie here and be merciful if I made any mistakes.

Here's the dput for structure:

structure(
  list(
    Month = c(
      "2020-01",
      "2020-01",
      "2020-01",
      "2020-01",
      "2020-01",
      "2020-01"
    ),
    District = c("Dan", "Dan", "Dan", "Dan",
                 "Dan", "Dan"),
    Age = c("U17", "U17", "U17", "18-24", "18-24",
            "18-24"),
    Gender = c("Male", "Male", "Female", "Male", "Male",
               "Male"),
    Education = c("None", "None", "None", "None", "None",
                  "None"),
    Disability = c("None", "None", "None", "None", "None",
                   "None"),
    Religion = c("Jewish", "Muslims", "Other", "Jewish",
                 "Jewish", "Jewish"),
    Occupation = c(
      "Unprofessional workers",
      "Sales and costumer service",
      "Undefined",
      "Production and construction",
      "Academic degree",
      "Practical engineers and technicians"
    ),
    JobSeekers = c(2L,
                   1L, 1L, 1L, 1L, 1L),
    GMI = c(0L, 0L, 0L, 0L, 0L, 0L),
    ACU = c(0L,
            0L, 0L, 0L, 0L, 0L),
    NACU = c(2L, 1L, 1L, 1L, 1L, 1L),
    NewSeekers = c(0L,
                   0L, 0L, 0L, 0L, 1L),
    NewFiredSeekers = c(0L, 0L, 0L, 0L, 0L,
                        1L)
  ),
  row.names = c(NA, 6L),
  class = "data.frame"
)
Moshep
  • 19
  • 6
  • 1
    Welsome to SO, Moshep. It will be eassier to help us if you provide a simple self-contained example explaining what you want. [This post](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610) explains what I need. Here, we need the output from `dput(Cdata)` or `dput(head(Cdata))` so what we have access to your input. But the generic answer to your question is `filter` to selecte the data you want, `group_by` to greate the groupings you want and `summarise` to create the summaries you want. There any many previous answers and tutorials. – Limey Jun 10 '20 at 07:12
  • Hi @Limey, many thanks for your attention. So i edited the original post and add the dput as requested. I used filter, group by and summarise functions but i didn't succeed in any of those afforts. For some reason it won't get all the conditions together and when i run i get errors \ inaccurate numbers. I would be so greatful if you could show me an example of syntax that contain all of these conditions together. Also, if there are any similar posts I would love to get a link. Thanks again for your help, much appreciate! – Moshep Jun 10 '20 at 10:20

1 Answers1

0

Thank you, Moshep. I've put your toy data into a tibble called data. Having done that, I can filter to get the subset you're interested in like this:

library(tidyverse)

data <- as_tibble(data)
data %>% filter(
           Gender == "Female",
           Month == "2020-01",
           District == "Dan",
           Age == '25-34',
           Religion == 'Christian' 
         ) 

Unfortunately, based on your toy data, that gives me an empty dataset:

# A tibble: 0 x 14
# … with 14 variables: Month <chr>, District <chr>, Age <chr>, Gender <chr>, Education <chr>, Disability <chr>, Religion <chr>, Occupation <chr>,
#   JobSeekers <int>, GMI <int>, ACU <int>, NACU <int>, NewSeekers <int>, NewFiredSeekers <int>

But in principle I could then group and summarise the data as you wish. Based on all the toy data, this would be:

data %>% 
  group_by(District) %>% 
  summarise(JobSeekers=sum(JobSeekers))

Giving

# A tibble: 1 x 2
  District JobSeekers
  <chr>         <int>
1 Dan               7

I'll leave you to have a go at doing the tests and creating the plots.

Limey
  • 10,234
  • 2
  • 12
  • 32
  • Thanks for your help! I think i understood why the sum isn't accurate. I need to create another condition based onthe int columns, cause each column describes sector and there may be doubles values. I'll try to explain: I have Jobseekers, ACU, NACU, GMI and so on It works like some kind of matrix. If we have 5 Jobseekers than they may be 1 ACU, 2 NACU and 2 GMI So when i try summing Female & JobSeekers i get 3M while the real number is 500K I tried to use these filters but than i get the exact same dataframe without any filter. Could be a damaged csv or just my commands isn't correct? – Moshep Jun 10 '20 at 11:28
  • I'd suggest working with a small set of data to start with. Complex enough that it has the features you need to test, simple enough that you can check the results by hand. Once your happy your code is correct, then run it on the "real" data. Good luck! (And I'm happy to have helped.) – Limey Jun 10 '20 at 12:11