-1

I am new to R programming and most of my experience thus far is with using highly structured rectangular data from a .csv or .xlsx. But now I've been handed about 30 spreadsheets of budget data that look like this:

Data starts like this

And in order to work with them, I'd like to get them into a more friendly format (not exactly tidy b/c of the Q1 to Q4 could/should be a single variable -- but I can fix that later with pivot_longer), like this:

Better, more useful format

Searching SO, the closest problem/solution I found was this: R importing semi-unstructured data CSV, but that example contains a series of structured tables that do not require the modification mine does, plus, it is a text file converting to character vectors, and I have Excel workbooks with multiple worksheets (I only need 1 of the sheets).

Here's what I've tried so far:

library(tidyverse)
library(xlsx)

# Straight read in the worksheet as-is
df <- read_xlsx(path = "filename.xlsx", sheet = "worksheet", col_names = FALSE)

# Get the location name into its own column, then delete row 1 since it's not needed
df <- df %>%
    mutate(location = df[[1,1]])
df <- df[-c(1),]

# Add a column and initialize it to "empty"
df <- df %>% 
  add_column(budget_type = "empty")
  

# Now loop through the dataframe in Column 1, search for the keyword(s) and place
# them in the last "budget_type" column
for (row in 1:nrow(df)){
  print(df[[row,1]])
  
  if (df[[row,1]] %in% c("Baseline","Scope Changes")){
    budget_type <- df[[row,1]]
  }
  
  if (!is.na(df[[row,1]])){
    if (str_detect(df[[row,1]], "[0-9]{4}") == TRUE){
      df[[row, "budget_type"]] <- budget_type
    }
  }
}

#  ...and from here I could write another loop going from bottom to top seeking
# the categories and placing them in another created column, and finally delete the rows
# that are empty, total rows, or unnecessary header rows.

My question is: Is there an obviously better way to do this in R in place of the loops and the general approach I'm describing to get my data in a tidy format?

Thank you in advance.

EDIT 6/7/2021:

It appears I cannot attach the Excel file, but if I'm following the "minimal reproducible example" guidelines correctly, here is the unprocessed data from dput() after reading in the data from Excel:

structure(list(...1 = c("Mehoopany", NA, "CLASS CODE", "Baseline", NA, "0201", "0300", "0301", NA, NA, "5500", "8245", "8260", NA, NA, "5710", "8224", "8235", NA, NA, NA, NA, "CLASS CODE", "Scope Changes", NA, "0201", "0300", "0301", NA, NA, "5500", "8245", "8260", NA, NA, "5710", "8224", "8235", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), ...2 = c(NA, NA, "Classification", NA, NA, "Cleaning, Spills, and Trash / Recycle Bin Pickup", "Specialized Cleaning", "Window Cleaning", "Cleaning", NA, "Gen Office Exp", "Wellness / Fitness Centers", "Meeting Rooms", "Employee Convenience", NA, "Reception", "Photographic Services", "Mail Room Services", "Mail Room & Reception", NA, "Total", NA, "Classification", NA, NA, "Cleaning, Spills, and Trash / Recycle Bin Pickup", "Specialized Cleaning", "Window Cleaning", "Cleaning", NA, "Gen Office Exp", "Wellness / Fitness Centers", "Meeting Rooms", "Employee Convenience", NA, "Reception", "Photographic Services", "Mail Room Services", "Mail Room & Reception", NA, "Total", NA, NA, NA, NA, NA, NA, NA, NA, NA), ...3 = c(NA, NA, "FY2021 Phasing", NA, "Q1", "1205", "0", "0", "1205", NA, "0", "0", "174", "174", NA, "0", "0", "1453.625", "1453.625", NA, "2832.625", NA, "FY2021 Phasing", NA, "Q1", "25", "0", "0", "25", NA, "0", "0", "37", "37", NA, "0", "17", "0", "17", NA, "79", NA, NA, NA, NA, NA, NA, NA, NA, NA), ...4 = c(NA, NA, NA, NA, "Q2", "1205", "0", "0", "1205", NA, "0", "0", "174", "174", NA, "0", "0", "1453.625", "1453.625", NA, "2832.625", NA, NA, NA, "Q2", "25", "0", "0", "25", NA, "0", "0", "37", "37", NA, "0", "17", "0", "17", NA, "79", NA, NA, NA, NA, NA, NA, NA, NA, NA), ...5 = c(NA, NA, NA, NA, "Q3", "1205", "0", "0", "1205", NA, "0", "0", "174", "174", NA, "0", "0", "1453.625", "1453.625", NA, "2832.625", NA, NA, NA, "Q3", "25", "0", "0", "25", NA, "0", "0", "37", "37", NA, "0", "17", "0", "17", NA, "79", NA, NA, NA, NA, NA, NA, NA, NA, NA), ...6 = c(NA, NA, NA, NA, "Q4", "1205", "0", "0", "1205", NA, "0", "0", "174", "174", NA, "0", "0", "1453.625", "1453.625", NA, "2832.625", NA, NA, NA, "Q4", "25", "0", "0", "25", NA, "0", "0", "37", "37", NA, "0", "17", "0", "17", NA, "79", NA, NA, NA, NA, NA, NA, NA, NA, NA), ...7 = c(NA, NA, NA, NA, "Total", "4820", "0", "0", "4820", NA, "0", "0", "696", "696", NA, "0", "0", "5814.5", "5814.5", NA, "11330.5", NA, NA, NA, "Total", "100", "0", "0", "100", NA, "0", "0", "148", "148", NA, "0", "68", "0", "68", NA, "316", NA, NA, NA, NA, NA, NA, NA, NA, NA)), row.names = c(NA, -50L), class = c("tbl_df", "tbl", "data.frame"))

ScottyJ
  • 945
  • 11
  • 16
  • 1
    Every dataset is "special" in it's own way. There are no good "general approaches" that always work. If you want help in this particular case, it's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. Do not post data as images. We can't copy/paste that into R for testing. If you want to get feedback on working code, use [codereview.se], otherwise make your sspecificprogramming question more clear. – MrFlick Jun 04 '21 at 19:28
  • 2
    Sadly there is no "good and general approach". Some packages allow you to specify the region to read, and you can use this to your advantage. A few allows you to interact with the cells themselves before reading the data. But it is my opinion that it is often easier to read the entire sheet and just work in R. – Oliver Jun 04 '21 at 19:29
  • Thank you, @MrFlick. I will check out Code Review, as that sounds perhaps more applicable. Obviously I can get through conversion with the rough code in my question, but I am just looking for better way to do the same thing I'm trying already as it seems cumbersome and a more experienced programmer would likely do it better. I appreciate the comments. – ScottyJ Jun 04 '21 at 19:42
  • 1
    The `unheadr` package may be helpful here too. – Jon Spring Jun 04 '21 at 20:29
  • Or at least edit the question so it's not "does my code look ok", and more this is the problem i'm trying to solve (again, having a reproducible example is very helpful). You can post your own solution to the problem below as an answer (there's nothing wrong with answering your own question here). Then people can either upvote it if they like it, or someone else can provide an alternative solution. That's a better fit for Stack Overflow. – MrFlick Jun 04 '21 at 20:34
  • 1
    Jenny Bryan and others have been doing a lot of work on spreadsheet wrangling: https://rfortherestofus.com/2019/12/how-to-clean-messy-data-in-r/ lists `tidycells` and `unpivotr` in addition to `unheadr` – Ben Bolker Jun 04 '21 at 20:37

1 Answers1

0

Here is the script I used -- it works -- with explanatory comments:

library(tidyverse)
library(xlsx)

file <- "C:/Path/To/Book1.xlsx"

names <- c("class_code", "classification", "Q1", "Q2", "Q3", "Q4",
           "Total", "Location", "Budget_Type", "Category")


# Read in the file, setting range to restrict columns ingested as some 
# scrap work exists in some files beyond column G; row 50 is well beyond 
# expected data range.
dframe <- read_xlsx(path = file, range = "'Original Data'!A1:G50", 
                col_names = FALSE)

# Output above data so it can be included in Stack Overflow as a 'minimal
#  reproducible example'.
dput(dframe)

# Move the location name ('Mehoopany') to it own column, then delete the row.
dframe <- dframe %>% 
  mutate(location = dframe[[1,1]])
dframe <- dframe[-c(1),]

# Add/define two additional columns which will be used in loops below.
dframe <- dframe %>% 
  add_column(budget_type = "empty", category = "empty")
  
# Loop 1:  Move *DOWN* the data set, labeling every line with 'CLASS CODE' as
# being either "Baseline" or "Scope Changes" in 'budget_type' column.
for (row in 1:nrow(dframe)){
  
  if (dframe[[row,1]] %in% c("Baseline","Scope Changes")){
    budget_type <- dframe[[row,1]]
  }
  
  if (!is.na(dframe[[row,1]])){
    if (str_detect(dframe[[row,1]], "[0-9]{4}") == TRUE){
      dframe[[row, "budget_type"]] <- budget_type
    }
  }
}

# Loop 2: Move *UP* the data set, labeling every line with 'CLASS CODE' with 
# it's respective roll-up category, and otherwise delete the line.
for (row in nrow(dframe):1){
  if ( dframe[[row,2]] == "Total" || 
       is.na(dframe[[row,2]]) || 
       dframe[[row, 2]] == "Classification" ) {
    # delete rows where the 2nd column is <blank>, 'Classification', or 'Total'. 
    dframe <- dframe[-row,]
  } else {    
      if ( !is.na(dframe[[row,2]]) && is.na(dframe[[row,1]]) ){
        # if row no 'CLASS CODE' but has value in 2nd column, assign value to
        # category then delete the row entirely.
        category <- dframe[[row,2]]
        dframe <- dframe[-row,]
        
      } else if ( str_detect(dframe[[row,1]], "[:digit:]{4}") ){
        # if row has 'CLASS CODE', then label the category column with the
        # stored value.
        dframe[[row, "category"]] <- category
      }
    }
}

# Assign the names from the character vector set at the beginning.
names(dframe) <- names

# Print out the resulting dataframe.
dframe
E_net4
  • 27,810
  • 13
  • 101
  • 139
ScottyJ
  • 945
  • 11
  • 16