I have a medium-sized data set with 113K rows and 14 columns:

Month District   Age Gender Education Disability Religion                          Occupation JobSeekers
1 2020-01      Dan   U17   Male      None       None   Jewish              Unprofessional workers          2
2 2020-01      Dan   U17   Male      None       None  Muslims          Sales and costumer service          1
3 2020-01      Dan   U17 Female      None       None    Other                           Undefined          1
4 2020-01      Dan 18-24   Male      None       None   Jewish         Production and construction          1
5 2020-01      Dan 18-24   Male      None       None   Jewish                     Academic degree          1
6 2020-01      Dan 18-24   Male      None       None   Jewish Practical engineers and technicians          1
  GMI ACU NACU NewSeekers NewFiredSeekers
1   0   0    2          0               0
2   0   0    1          0               0
3   0   0    1          0               0
4   0   0    1          0               0
5   0   0    1          0               0
6   0   0    1          1               1

I grouped it into smaller tables that contain the data I need, using:

Sorta <- datac %>% 
  group_by(District, Month,Gender, Occupation) %>% 
  summarise(JobSeekers=sum(JobSeekers))

The outcome:

  District Month   Gender Occupation                    JobSeekers   GMI   ACU  NACU NewSeekers NewFiredSeekers
  <chr>    <chr>   <chr>  <chr>                              <int> <int> <int> <int>      <int>           <int>
1 Dan      2020-01 Female Academic degree                     4560   120  2622  1818        863             597
2 Dan      2020-01 Female Agriculture, forestry and fi~         14     9     2     3          1               0
3 Dan      2020-01 Female Machine Operators and drivers         57     6    10    41          9               6
4 Dan      2020-01 Female Managers                            1913    36   969   908        390             310
5 Dan      2020-01 Female Officials and clerks                1702   120   263  1319        344             243
6 Dan      2020-01 Female Practical engineers and tech~       2847    66  1125  1656        671             504

Then I tried to plot trends from this table, such as the number of unemployed people by district and the growth of unemployment over time. Every approach I tried produced errors about the character columns, so I'm asking for your help plotting character and numeric values together.

Here's the structure:

structure(
  list(
    District = c(
      "Dan",
      "Dan",
      "Dan",
      "Dan",
      "Dan",
      "Dan",
      "Dan",
      "Dan",
      "Dan",
      "Dan",
      "Dan",
      "Dan",
      "Dan",
      "Dan",
      "Dan",
      "Dan",
      "Dan",
      "Dan",
      "Dan",
      "Dan"
    ),
    Month = c(
      "2020-01",
      "2020-01",
      "2020-01",
      "2020-01",
      "2020-01",
      "2020-01",
      "2020-01",
      "2020-01",
      "2020-01",
      "2020-01",
      "2020-01",
      "2020-01",
      "2020-01",
      "2020-01",
      "2020-01",
      "2020-01",
      "2020-01",
      "2020-01",
      "2020-01",
      "2020-01"
    ),
    Gender = c(
      "Female",
      "Female",
      "Female",
      "Female",
      "Female",
      "Female",
      "Female",
      "Female",
      "Female",
      "Female",
      "Male",
      "Male",
      "Male",
      "Male",
      "Male",
      "Male",
      "Male",
      "Male",
      "Male",
      "Male"
    ),
    Occupation = c(
      "Academic degree",
      "Agriculture, forestry and fishing",
      "Machine Operators and drivers",
      "Managers",
      "Officials and clerks",
      "Practical engineers and technicians",
      "Production and construction",
      "Sales and costumer service",
      "Undefined",
      "Unprofessional workers",
      "Academic degree",
      "Agriculture, forestry and fishing",
      "Machine Operators and drivers",
      "Managers",
      "Officials and clerks",
      "Practical engineers and technicians",
      "Production and construction",
      "Sales and costumer service",
      "Undefined",
      "Unprofessional workers"
    ),
    JobSeekers = c(
      4560L,
      14L,
      57L,
      1913L,
      1702L,
      2847L,
      480L,
      3086L,
      893L,
      1985L,
      2605L,
      44L,
      1276L,
      2236L,
      247L,
      2249L,
      1258L,
      2233L,
      924L,
      2462L
    ),
    GMI = c(
      120L,
      9L,
      6L,
      36L,
      120L,
      66L,
      47L,
      396L,
      155L,
      998L,
      119L,
      26L,
      240L,
      101L,
      30L,
      111L,
      322L,
      359L,
      309L,
      1124L
    ),
    ACU = c(
      2622L,
      2L,
      10L,
      969L,
      263L,
      1125L,
      99L,
      392L,
      259L,
      52L,
      1549L,
      1L,
      49L,
      797L,
      44L,
      829L,
      102L,
      202L,
      124L,
      58L
    ),
    NACU = c(
      1818L,
      3L,
      41L,
      908L,
      1319L,
      1656L,
      334L,
      2298L,
      479L,
      935L,
      937L,
      17L,
      987L,
      1338L,
      173L,
      1309L,
      834L,
      1672L,
      491L,
      1280L
    ),
    NewSeekers = c(
      863L,
      1L,
      9L,
      390L,
      344L,
      671L,
      83L,
      622L,
      201L,
      325L,
      550L,
      5L,
      239L,
      469L,
      53L,
      525L,
      233L,
      432L,
      212L,
      324L
    ),
    NewFiredSeekers = c(
      597L,
      0L,
      6L,
      310L,
      243L,
      504L,
      60L,
      375L,
      123L,
      150L,
      447L,
      4L,
      196L,
      405L,
      41L,
      429L,
      162L,
      316L,
      124L,
      190L
    )
  ),
  row.names = c(NA,-20L),
  class = c("grouped_df", "tbl_df", "tbl", "data.frame"),
  groups = structure(
    list(
      District = c("Dan", "Dan"),
      Month = c("2020-01", "2020-01"),
      Gender = c("Female", "Male"),
      .rows = list(1:10, 11:20)
    ),
    row.names = c(NA,-2L),
    class = c("tbl_df", "tbl", "data.frame"),
    .drop = TRUE
  )
)

My second question is how I can make a map of 'hotspot' areas of unemployed people, by occupation or age.

please help!

Update:

dist.oc.mo <- Cdata %>% 
  group_by(District,Gender,Occupation,Month) %>% 
  summarise(JobSeekers=sum(JobSeekers),GMI=sum(GMI), ACU=sum(ACU), NACU=sum(NACU), NewSeekers=sum(NewSeekers), NewFiredSeekers=sum(NewFiredSeekers))


p <- ggplot(data = dist.oc.mo) +
  geom_bar(mapping = aes(x = Occupation, y = JobSeekers, fill=factor(District)), 
           stat = "identity", position = "dodge", alpha=0.7 ) + 
  labs(title = "March-April Jobseekers", subtitle = "This barchart describes unemployment trend for March and April sorted by jobseekers number and occupation type", fill = "District", 
       x = "Occupation", y = "JobSeekers") +
  scale_x_discrete(labels = wrap_format(10)) +
  scale_fill_brewer(palette="Set1") +
  theme(legend.position = "bottom")
p

[https://i.stack.imgur.com/v0R0V.jpg][1]

Regards, Moshe

Moshep
  • It would be helpful if you can provide a reprex to replicate your issue. Please see the link: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/16532098 – YBS Jun 13 '20 at 14:08
  • Hi @YBS, thanks for your attention. I pasted rows 1:20 from the table, which, by the way, is a tibble, if that makes any difference. I hope you can help me with it, and thanks again! – Moshep Jun 13 '20 at 15:35
  • What type of plot are you looking for? Would a bar chart suffice, or do you have something specific in mind? – YBS Jun 13 '20 at 19:41
  • I need one example of a bar chart and a “hotspot map” that shows districts with large numbers of unemployed people. I'd much appreciate your help! – Moshep Jun 13 '20 at 21:38

1 Answer

Call your data df. I added two dummy districts, Bob and John, and kept only the first five occupations for this example. The bar chart code is given below:

library(dplyr)    # mutate(), row_number(), ungroup()
library(ggplot2)
library(scales)   # wrap_format()

# Drop the grouping left over from summarise(); mutate() cannot overwrite a grouping column
df <- ungroup(df)

myoccupation <- c(
  "Academic degree",
  "Agriculture, forestry and fishing",
  "Machine Operators and drivers",
  "Managers",
  "Officials and clerks")

# Dummy districts: copy df and shift the numbers by row so the bars differ
df1 <- mutate(df, District="Bob", JobSeekers=(JobSeekers+50*row_number()*row_number()),
              GMI=(GMI-row_number()), ACU=(ACU+row_number()), NACU=(NACU-row_number()),
              NewSeekers=(NewSeekers+row_number()), NewFiredSeekers=(NewFiredSeekers+row_number()))
df2 <- mutate(df, District="John", JobSeekers=(JobSeekers+88*row_number()),
              GMI=(GMI-row_number()+25), ACU=(ACU+5*row_number()-1), NACU=(NACU-row_number()+1),
              NewSeekers=(NewSeekers+row_number()-2), NewFiredSeekers=(NewFiredSeekers+row_number()-3))

# Stack the real and dummy districts, then keep only the selected occupations
df3 <- rbind(df, df1, df2)
df4 <- df3[df3$Occupation %in% myoccupation, ]

p <- ggplot(data = df4) +
  geom_bar(mapping = aes(x = Occupation, y = JobSeekers, fill=factor(District)),
           stat = "identity", position = "dodge", alpha=0.7 ) +
  labs(title = "Bar Chart", fill = "District",
       x = "Occupation", y = "JobSeekers") +
  scale_x_discrete(labels = wrap_format(10)) +
  scale_fill_brewer(palette="Set1") +
  theme(legend.position = "bottom")
p

You will get the following output:

output

Please note that the Male and Female bars are drawn on top of each other in this plot, because Gender is not mapped to any aesthetic. The darker shade is the lower of the two values. You can plot them separately.
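One way to plot the genders separately is to facet on Gender. The following is only a minimal sketch: `toy` is a hypothetical stand-in for your data, with the Dan values taken from the sample in the question and the Bob values invented for illustration.

```r
library(ggplot2)

# Hypothetical toy data using the same column names as the question's table.
# Dan values come from the posted sample; Bob values are made up.
toy <- data.frame(
  District   = rep(c("Dan", "Bob"), each = 4),
  Gender     = rep(c("Female", "Male"), times = 4),
  Occupation = rep(rep(c("Managers", "Officials and clerks"), each = 2), times = 2),
  JobSeekers = c(1913, 2236, 1702, 247, 2113, 2436, 2102, 647)
)

# One panel per gender, so the Male and Female bars no longer overlap
p_gender <- ggplot(toy, aes(x = Occupation, y = JobSeekers, fill = District)) +
  geom_col(position = "dodge") +
  facet_wrap(~ Gender) +
  labs(x = "Occupation", y = "JobSeekers", fill = "District")
p_gender
```

Mapping Gender to another aesthetic (for example `alpha`) would also separate the bars, but facets tend to be easier to read.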

For hotspots you need to do some research on density plots and more. You need to use the raw data, not the summarized data, for that. A sample 2D density plot is given below:

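library(tibble)
library(ggplot2)
library(viridis)  # scale_fill_viridis()
dfa <- tibble(x_variable = rnorm(5000), y_variable = rnorm(5000))
p2d <- ggplot(dfa, aes(x = x_variable, y = y_variable)) +
  # after_stat(density) replaces the deprecated ..density.. notation
  stat_density_2d(aes(fill = after_stat(density)), contour = FALSE, geom = "tile") +
  scale_fill_viridis()
p2d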
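If district boundaries or coordinates are not at hand, a heatmap of the summarised counts can serve as a rough stand-in for a hotspot view. This is a different technique from the 2D density plot above and is not a geographic map. The sketch assumes a small made-up table `heat`: the Dan values are the Female plus Male sums from the question's sample, and the Bob and John values are invented for illustration.

```r
library(ggplot2)

# Hypothetical District x Occupation table (expand.grid varies District fastest)
heat <- expand.grid(
  District   = c("Dan", "Bob", "John"),
  Occupation = c("Academic degree", "Managers", "Officials and clerks"),
  stringsAsFactors = FALSE
)
# Dan: sums from the posted sample; Bob and John: invented numbers
heat$JobSeekers <- c(7165, 7400, 7300,
                     4149, 4500, 4300,
                     1949, 2300, 2100)

# Each tile is one District/Occupation pair; the fill encodes the count
p_heat <- ggplot(heat, aes(x = District, y = Occupation, fill = JobSeekers)) +
  geom_tile(color = "white") +
  scale_fill_viridis_c() +
  labs(x = "District", y = "Occupation", fill = "JobSeekers")
p_heat
```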

Please note that on SO we can only answer issues with your R code.

Update: Subset the data to include only Female as follows:

df5 <- subset(df4, Gender=="Female")

Then using df5 in the above ggplot code you get the following output:

Output2

Please note that I am using manual assignment of colors as scale_fill_manual(values=c("blue","green","purple")), since I know that there are 3 districts in my data.
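As a small sketch of the manual colour assignment, a named vector ties each colour to a specific district regardless of the order of the levels. The data frame `toy3` and its Bob and John values are hypothetical and only there to make the example self-contained.

```r
library(ggplot2)

# Named vector: each district gets a fixed colour
district_colors <- c(Dan = "blue", Bob = "green", John = "purple")

# Hypothetical one-occupation table; Bob and John values are made up
toy3 <- data.frame(
  District   = c("Dan", "Bob", "John"),
  Occupation = "Managers",
  JobSeekers = c(1913, 2113, 2089)
)

p_manual <- ggplot(toy3, aes(x = Occupation, y = JobSeekers, fill = District)) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = district_colors) +
  theme(legend.position = "bottom")
p_manual
```

If the data contain a district that is missing from the vector, its bars are drawn in the scale's `na.value` colour, so the vector should list every district.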

YBS
  • Hi again @YBS, I get an error when running this code saying the wrap_format function is unknown: "Error in wrap_format(10) : could not find function "wrap_format"". Two more questions, if you don't mind: 1. How can I use it more generally to do statistics on other values (like x = district, y = new seekers)? I understood everything except the row_number() function and the numbers 50, 88 and so on. 2. What does the District bar (the red one) actually represent? Sorry if my questions are stupid. I'm a student trying to do an above-average project in a data science course. Many thanks for your attention and work! – Moshep Jun 14 '20 at 11:29
  • I have created dummy data with districts Bob and John. To create the dummy data I used `row_number()` to change the values of some of the variables. `row_number()` gives the row number in your data frame. In your plot you should use your own data frame and not `df4`. Then you will have the districts that are present in your real data. If you have only one district, that should be fine too. That error means a package is missing: `wrap_format()` comes from the scales package, so please install and load it. – YBS Jun 14 '20 at 12:25
  • Hi @YBS, hope you remember the case. I managed to use your plot as described, but I can't get rid of the red District bar. It doesn't actually represent anything. How can I do that? One more thing: how do the darker and brighter bars end up on top of each other if the Gender column is never mentioned in the code? Let me thank you again for your help. Except for this issue there's nothing wrong with the method you provided, and I'm thankful for that! – Moshep Jun 16 '20 at 13:55
  • If red is the first color in the default list, whatever district you have will get that red color. If you want a different color you can specify which color you want. As District is the fill factor, you can specify one color or as many colors as the number of districts. Just add `scale_fill_manual(values=c("blue", "green")) ## this is for two colors, if you have two districts` – YBS Jun 16 '20 at 16:44
  • I've added an example from the code and a visual so you can see what I'm trying to explain. There's another category called "District", and it seems to calculate the sum of the other occupations and show it. I want to get rid of this value and leave the rest exactly as it is. – Moshep Jun 16 '20 at 20:05
  • District is the title of your legend. If you do not want the title, please replace `fill="District"` with `fill=NULL` in `labs()`. You can accept the answer, if I have answered your initial question. – YBS Jun 16 '20 at 20:32
  • That worked. Thanks so much for your attention. I accepted your answer already 2 days ago. – Moshep Jun 16 '20 at 21:41