0

I've a dataframe with state level info, and I need to be able to extract that info and store it in a better format.

Goal: For each key, the total weight should be 100. If not, add the state as 'Other' with remaining weight. Some states may not have any weight, in such cases, it should be replaced with 100.

enter image description here

And the output required is

enter image description here

structure(list(Key = c("ATS", "AHP", "ATR", "BWBR",  "BAMT", "BMPR"),
STATE = c("TX-22.41% GA-19.70% DC-11.97% MA-10.52% ...",  "SC-13.46%
TX-13.37% NC-10.31% IA-9.54% ...", "TX-22.26% AR-16.47% MO-12.43%
NE-11.45% ...",  "NJ", "TN-86.25% NJ-13.75%", "MA")), row.names =
c(NA, 6L), class = "data.frame")

Wrote the following code and it works. But is there a better way of handling the same?

allStateInfo = NULL
for (n in 1:length(maindt$t)) {
    thisRecord = maindt[n,]
    print(thisRecord$STATE)
    sVec = unlist(strsplit(x = thisRecord$STATE, split = "% ", fixed = T))
    totalWeight = 0
    # If there is any data in State Name
    if (length(sVec) > 0) {
        for (z in 1:length(sVec)) {
            # Split with String by identifying the percentage with Ex: Tx-77.88%
            result = regexpr("[.][0-9]", sVec[z], perl = TRUE)
            for (i in 1:length(result)) {
                if (!is.na(result[i])) {
                    if (result[i] > 0) {
                        # Identify with - digit
                        position = regexpr("[-][0-9]", sVec[z], perl = TRUE)
                        state = substr(sVec[z], 1, position - 1)
                        wgt = substr(sVec[z], position + 1, 1000000L)
                    }
                    else {
                        # Get the state alone
                        state = sVec[z]
                        wgt = "0"
                    }
                } else {
                    state = sVec[z]
                    wgt = "0"
                }
                # IF the state is ...., replace it with other
                state = gsub(pattern = '...', replacement = 'Other', x = state, fixed = T)
                wgt = gsub(pattern = '%', replacement = '', x = wgt, fixed = T)
                wgt = as.numeric(wgt) / 100
                # If the weight is 0, then get the balance of remaining weight in the state
                if (wgt > 0) {
                    totalWeight = totalWeight + wgt
                } else {
                    wgt = 1 - totalWeight
                    totalWeight = totalWeight + wgt
                }
                pDf = data.frame(key = maind$key, tstateInd = z, final_state = state, weight = wgt)
                allStateInfo = rbind(allPropDf, pDf)
            }
        }
    } else {
        state = 'Other'
        wgt = 1
        totalWeight = totalWeight + wgt
        pDf = data.frame(key = maind$key, tstateInd = z, final_state = state, weight = wgt)
        allStateInfo = rbind(allPropDf, pDf)
    }

    if (totalWeight < 1) {
        wgt = 1 - totalWeight
        state = 'Other'
        z = length(sVec) + 1
        pDf = data.frame(key = maind$key, tstateInd = z, final_state = state, weight = wgt)
        allStateInfo = rbind(allPropDf, pDf)
    }
}
nsivakr
  • 1,565
  • 2
  • 25
  • 46
  • please add `dput(maindt)` to your question to make it reproducible. `read.table(text = gsub('[ %]', '\n', maindt$STATE), sep = '-', fill = TRUE)` gets you most of the way there I think – rawr Dec 15 '20 at 21:15
  • Please [make this question reproducible](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) by including example data in a plain text format - for example the output from `dput(yourdata)`. We cannot copy/paste data from images. – neilfws Dec 15 '20 at 21:15
  • Added dput to the above code. – nsivakr Dec 15 '20 at 21:26
  • @rawr, aren't you simply replacing % with new line. It is not that straight forward. – nsivakr Dec 15 '20 at 21:29
  • just said it does most of the work, [did you try it?](https://i.imgur.com/Zhbvy3K.png) all yo need to do is add the key and "other," I suppose you are doing other = 100 - sum of others from the same key? – rawr Dec 15 '20 at 21:37
  • @rawr, you are doing other = 100 - sum of others from the same key? Yes. Will try. – nsivakr Dec 15 '20 at 21:52

1 Answers1

1

You can use str_split to separate states, then unnest_wider and pivot_longer to get in the shape you need. Then separate state names from their values.

Update Per OP comments, Key value groups need to sum to 100.

library(tidyverse)

df %>%
  mutate(state_data = str_split(STATE, '%\\s')) %>%
  unnest_wider(state_data, names_sep = "_") %>%
  pivot_longer(starts_with('state_data')) %>%
  separate(value, into = c('state_abbrv', 'value'), sep = '-') %>%
  mutate(across(value, ~str_replace(., '%', '') %>% as.numeric),
         across(state_abbrv, str_replace, '...', 'Other')) %>%
  group_by(Key) %>%
  mutate(val_total = sum(value, na.rm = TRUE),
         value = if_else(state_abbrv == "Other", 100 - val_total, value),
         value = if_else(is.na(value), 100, value)) %>%
  filter(!state_abbrv == "<NA>") %>%
  select(Key, state = state_abbrv, value)

Output:

   Key   state  value
   <chr> <chr>  <dbl>
 1 ATS   TX     22.4 
 2 ATS   GA     19.7 
 3 ATS   DC     12.0 
 4 ATS   MA     10.5 
 5 ATS   Other  35.4 
 6 AHP   SC     13.5 
 7 AHP   TX     13.4 
 8 AHP   NC     10.3 
 9 AHP   IA      9.54
10 AHP   Other  53.3 
11 ATR   TX     22.3 
12 ATR   AR     16.5 
13 ATR   MO     12.4 
14 ATR   NE     11.4 
15 ATR   Other  37.4 
16 BWBR  NJ    100   
17 BAMT  TN     86.2 
18 BAMT  NJ     13.8 
19 BMPR  MA    100   
andrew_reece
  • 20,390
  • 3
  • 33
  • 58
  • there are two problems. a ) If the sum of weights (distribution) does not add up to 100, then I should be able to add 'Other' for the remaining. Please look at the example in the table. b) If states are indicated as ..., should be replaced with 'Other ' and add the weight accordingly. – nsivakr Dec 15 '20 at 21:50
  • please edit your post to describe in words what you're looking for, including these notes. this solution gets you most of the way there. i can make edits but it'd be helpful to have these edge cases spelled out clearly, not just in code. that will also be useful to others coming to learn from your question in the future. – andrew_reece Dec 15 '20 at 21:55
  • see updates, the remaining work was just groupby and filtering. – andrew_reece Dec 16 '20 at 01:09