I'm trying to use regex within str_detect() to map individual industry SIC codes to their parent classification, and I'm running into two issues. An example of the data is below:
SIC_Code  Description
8001      Healthcare (INDUSTRY SEGMENT)
3823      Process Control Instruments
3823      Process Control Instruments
3823      Process Control Instruments
9901      Undefined and Other (INDUSTRY SEGMENT)
2899      Chemical Preparations, Nec
The table above has the individual SIC codes and their descriptions. I want to classify each code into its parent group, which is defined by the first 2 digits of the code. I have another table with the parent groups that looks like this (I shortened it to include only the groupings in this example; the full table can be found here: https://www.naics.com/business-lists/counts-by-sic-code/#countsBySIC):
Parent  Description
20      Manufacturing
.
.
.
39      Manufacturing
80      Services
99      Public Administration
I'm using str_detect() to match the parent against the first 2 digits of the SIC_Code column, and then create a new column in the original table that holds the parent's description. For example, the 3 rows with 3823 and the row with 2899 should all be classified as "Manufacturing", while 8001 should be "Services" and 9901 should be "Public Administration".
The code that I've tried is:
test <- test %>%
  mutate(Parent_Desc =
    ifelse(str_detect(`SIC_Code`, '^[01-09]'), 'Agriculture, Forestry, and Fishing',
    ifelse(str_detect(`SIC_Code`, '^[10-14]'), 'Mining',
    ifelse(str_detect(`SIC_Code`, '^[15-17]'), 'Construction',
    ifelse(str_detect(`SIC_Code`, '^[20-39]'), 'Manufacturing',
    ifelse(str_detect(`SIC_Code`, '^[40-49]'), 'Transportation, Communications, Electric, Gas, and Sanitary Services',
    ifelse(str_detect(`SIC_Code`, '^[50-51]'), 'Wholesale Trade',
    ifelse(str_detect(`SIC_Code`, '^[52-59]'), 'Retail Trade',
    ifelse(str_detect(`SIC_Code`, '^[60-67]'), 'Finance, Insurance, and Real Estate',
    ifelse(str_detect(`SIC_Code`, '^[70-89]'), 'Services',
    ifelse(str_detect(`SIC_Code`, '^[90-99]'), 'Public Administration',
           'Other')))))))))))
I'm running into two issues with this. The first occurs with the pattern ^[01-09]. The error I get is:
Error in stri_detect_regex(string, pattern, opts_regex = opts(pattern)) :
In a character range [x-y], x is greater than y. (U_REGEX_INVALID_RANGE)
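The error message seems consistent with how bracket expressions work: inside [...], a hyphen spans two single characters, so '[01-09]' would be read as '0', the range '1' to '0', and '9', and that middle range runs backwards. A minimal sketch of this, assuming stringr is installed; the sample codes "0111" and "1011" are made up for illustration and are not from my data:

```r
library(stringr)

# '^[01-09]' is parsed as: '0', then the range '1'-'0', then '9'.
# The range '1'-'0' is descending, which the regex engine rejects.
res <- tryCatch(
  str_detect("0111", "^[01-09]"),
  error = function(e) conditionMessage(e)
)
print(res)  # the error message, returned as a string

# One possible way to express "the first two digits are 01 through 09":
# a literal '0' followed by a single digit from 1 to 9.
ok <- str_detect(c("0111", "1011"), "^0[1-9]")
print(ok)  # TRUE FALSE
```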
The second issue is that ^[20-39] labels some rows incorrectly. For example, it returns TRUE for the SIC_Code 9901, as if 99 fell within the 20-39 range, which it obviously does not.
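The mislabeling can be reproduced on the sample codes alone. The sketch below assumes the codes are stored as character strings; it suggests that '^[20-39]' tests only the first character against the set '2', '0' to '3', and '9', rather than testing the first two digits as a number. The alternative pattern is just one possible way to write "20 through 39":

```r
library(stringr)

codes <- c("8001", "3823", "9901", "2899")

# '[20-39]' is a set of single characters: '2', '0'..'3', and '9'.
# Any code whose FIRST character is in that set matches, including 9901.
as_written <- str_detect(codes, "^[20-39]")
print(as_written)  # FALSE TRUE TRUE TRUE

# A two-digit prefix from 20 to 39: first digit 2 or 3, second digit 0-9.
two_digit <- str_detect(codes, "^[23][0-9]")
print(two_digit)  # FALSE TRUE FALSE TRUE
```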
Can anyone point me in the direction of fixing the issues I'm seeing? I've tried this with the SIC_Code column as both numeric and character, and the results have been the same. Any assistance would be greatly appreciated.
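For reference, the result I'm after could also be sketched without regex ranges at all, by pulling out the first two digits and comparing them as numbers. This is only a rough sketch under some assumptions: SIC_Code is treated as character (a numeric column would lose any leading zero), the helper column name Parent is hypothetical, and the ranges are copied from my attempt above:

```r
library(dplyr)

# Sample rows from the example data, stored as character.
test <- tibble(SIC_Code = c("8001", "3823", "3823", "3823", "9901", "2899"))

test <- test %>%
  mutate(
    # Hypothetical helper column: first two digits as an integer.
    Parent = as.integer(substr(as.character(SIC_Code), 1, 2)),
    Parent_Desc = case_when(
      Parent %in% 1:9   ~ "Agriculture, Forestry, and Fishing",
      Parent %in% 10:14 ~ "Mining",
      Parent %in% 15:17 ~ "Construction",
      Parent %in% 20:39 ~ "Manufacturing",
      Parent %in% 40:49 ~ "Transportation, Communications, Electric, Gas, and Sanitary Services",
      Parent %in% 50:51 ~ "Wholesale Trade",
      Parent %in% 52:59 ~ "Retail Trade",
      Parent %in% 60:67 ~ "Finance, Insurance, and Real Estate",
      Parent %in% 70:89 ~ "Services",
      Parent %in% 90:99 ~ "Public Administration",
      TRUE              ~ "Other"  # anything outside the listed ranges
    )
  )

print(test)
```

With the six sample rows, this should give "Services" for 8001, "Manufacturing" for 3823 and 2899, and "Public Administration" for 9901.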