I have a large dataset (800,000+ data points) with information about loans given by 5,000+ banks. I am trying to compare the number of loans disbursed by the top N banks (those that disburse the most loans) with the rest of the banks together. For that, I made the data frame banks, which is sorted by number of loans disbursed in descending order. I also added a column with the relative cumulative sum of loans disbursed. I was able to make a plot of this, but what I am trying to make is a histogram-style bar chart where the x axis is N, a number from 1 to 10, and the y axis is the percentage of loans disbursed by the top N banks. Each bar will be sectioned into different colors. For example, the first bar would have one color and include the value of the first bank only; the second bar would be the cumulative sum of the top 2 banks and would have two colors, one for each bank, starting from the top bank.

As a concrete example, let's say I have a set of 100 loans, where the top 5 banks disbursed 20, 14, 12, 12, 10 loans each.

Then the plot should look as follows for N going from 1 to 5: [image: example plot]

And, if possible, it would have the legends that say which bank corresponds to each color.

I tried using ggplot but it does not let me define the axes the specific way I want them.

I think this is not that hard, but I am a complete neophyte in R, so I made the example plot using Excel and Paint. Thank you so much!

I made the following test data frame for the example plot using dput(), as per @sindri_baldur's suggestion:

structure(list(Bank.Name = structure(1:16, .Label = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P"), class = "factor"), Loans = c(20, 14, 12, 12, 10, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1)), class = "data.frame", row.names = c(NA, -16L))
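For reference, here is a minimal sketch of how the relative cumulative sum described above could be computed on this test data. The object name `bnk` and the column names `Share` and `CumShare` are assumptions chosen for illustration; they are not the names used in the real dataset.

```r
# Test data rebuilt by hand (same values as the dput() output above)
bnk <- data.frame(
  Bank.Name = factor(LETTERS[1:16]),
  Loans = c(20, 14, 12, 12, 10, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1)
)

# Each bank's share of all loans, in percent
# (Share and CumShare are hypothetical column names)
bnk$Share <- 100 * bnk$Loans / sum(bnk$Loans)

# Cumulative share of the top N banks; this relies on the rows
# already being sorted by Loans in descending order
bnk$CumShare <- cumsum(bnk$Share)

head(bnk$CumShare, 5)
# 20 34 46 58 68
```

The height of the bar for N should therefore equal the N-th entry of `CumShare`.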

The Bosco
  • Create a simple example of your data in R and share it with `dput()` and I'm sure many people will help you. – s_baldur Oct 14 '19 at 08:41
  • Possible duplicate of [Stacked Bar Plot in R](https://stackoverflow.com/questions/20349929/stacked-bar-plot-in-r) – jay.sf Oct 14 '19 at 08:42
  • @sindri_baldur I added the sample data frame to the post with `dput()` @jay.sf Almost a duplicate, the difference is that that question deals with different groups, whereas I only have one group, each bar corresponds to the top N banks considered from the same dataset. – The Bosco Oct 14 '19 at 09:01
  • This can be solved by the ggplot-based answer at that post. You just don't need to melt the data first since yours is already in the right shape. For us to see why it's *not* a dupe of other similar questions, we'd need to see the code that you say hasn't worked – camille Oct 14 '19 at 16:03

1 Answer

Try the following code.

Your data is called `bnk` here.

library(dplyr)
library(ggplot2)  # needed for ggplot() and geom_bar() below
N <- 5
# create empty tibble
top_b <- tibble(topn=0, Bank.Name = '', Loans = 0) %>% 
  filter(topn>0)

for (i in 1:N) {
  top_b <- top_b %>% 
    bind_rows( bind_cols(topn = rep(i, i), head(bnk , i)))

}
# factor with opposite direction needed for graph you want
top_b$Bank.Name  <- factor(top_b$Bank.Name, 
                            levels = unique(top_b$Bank.Name)[N:1])

top_b %>% 
  ggplot(aes(x=topn, y=Loans, fill = Bank.Name))+
  geom_bar(stat = 'identity')
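As a variation on the code above, the sketch below builds the same long table without growing it inside a `for` loop, and plots percentages rather than raw counts, since the question asks for the percentage of loans on the y axis. The names `top_banks` and `Share` are assumptions introduced for illustration, and `bnk` is rebuilt by hand from the question's test data so that the block runs on its own.

```r
library(dplyr)
library(ggplot2)

# Test data from the question (already sorted by Loans, descending)
bnk <- data.frame(
  Bank.Name = factor(LETTERS[1:16]),
  Loans = c(20, 14, 12, 12, 10, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1)
)
N <- 5

# Percent of all loans per bank, then keep only the top N rows
top_banks <- bnk %>%
  mutate(Share = 100 * Loans / sum(Loans)) %>%
  head(N)

# For each n in 1..N, take the top n banks and tag them with topn = n
top_b <- bind_rows(lapply(seq_len(N), function(n) {
  mutate(head(top_banks, n), topn = n)
}))

# Reverse the factor levels, as in the code above, so that the
# stacking starts from the top bank
top_b$Bank.Name <- factor(top_b$Bank.Name,
                          levels = rev(as.character(top_banks$Bank.Name)))

p <- ggplot(top_b, aes(x = factor(topn), y = Share, fill = Bank.Name)) +
  geom_col() +
  labs(x = "Top N banks", y = "Loans disbursed (%)", fill = "Bank")
print(p)
```

Using `factor(topn)` on the x axis gives one discrete tick per N, and `geom_col()` is shorthand for `geom_bar(stat = 'identity')`. On the real dataset, `Share` is computed before `head(N)` so that the denominator is the total across all banks, not only the top N.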
Yuriy Barvinchenko