How to convert all numeric columns to intervals in R

Question

I have a data frame with 840 columns that I read from a .sav file. I convert all columns to factors using data <- haven::as_factor(data)

This is an example of the data just after reading the file, before converting to factors:

tenureType	localityType	monthlyRent
1	1	200
1	2	140
1	3	500
2	2	100
1	3	700
2	3	20

After data <- haven::as_factor(data):

tenureType	localityType	monthlyRent
Full ownership	Rural	200
Full ownership	Urban	140
Full ownership	Camp	500
For free	Urban	100
Full ownership	Camp	700
For free	Camp	20

I have to convert the data to its labels because I want to do some processing on the text.

I want to build a decision tree using the C50 library, so I want to convert every column whose values are still numeric after as_factor (like monthlyRent) into a factor of intervals.

For example, I want the data to look like this:

tenureType	localityType	monthlyRent
Full ownership	Rural	156-292
Full ownership	Urban	20-156
Full ownership	Camp	428-564
For free	Urban	20-156
Full ownership	Camp	564-700
For free	Camp	20-156

I need each numeric column to be converted to 5 categories.
The interval width is calculated as (max - min) / 5.

In the above sample: (700 - 20) / 5 = 136.
Intervals are: [20-156], [156-292], [292-428], [428-564], [564-700].

I have 840 columns, so I don't know the column names. I want the intervals to be computed dynamically, since some columns range from 0 to 10 and others from 0 to 10000.

I want the best approach for this. If there is a better approach than intervals calculated by (max - min) / 5, I'd like to know.

How will you select the intervals for each numeric column? i.e. where is the information that suggests 0-210, 210-600, 600-900 is the set of intervals for `monthlyRent`? — langtang, Jan 25 '23 at 01:18
The intervals are just an example; I don't have an idea of what the intervals will be, but I want them to be computed dynamically. I'm asking for the best approach for this — Obada Jaras, Jan 25 '23 at 01:23
The sample data don't give any indication at all of how to work out the intervals. You have two "Full ownership Camp" rows and they have different intervals. What is that based on? — John Polo, Jan 25 '23 at 01:27
I have edited the question and clarified this point. @langtang — Obada Jaras, Jan 25 '23 at 01:36
@JohnPolo I have edited the question and clarified this point. — Obada Jaras, Jan 25 '23 at 01:37
You can probably do something like this: `library(dplyr); mutate(df, across(where(is.numeric),cut,breaks=5))` — langtang, Jan 25 '23 at 01:55

score 2 · Answer 1 · answered Jan 25 '23 at 02:01

You could use mutate(across()) from the dplyr package, applying cut() with breaks=5 to each of the numeric columns:

library(dplyr)

mutate(df, across(where(is.numeric), function(x) cut(x, breaks = 5)))

Output:

      tenureType localityType monthlyRent
1 Full ownership        Rural   (156,292]
2 Full ownership        Urban  (19.3,156]
3 Full ownership         Camp   (428,564]
4       For free        Urban  (19.3,156]
5 Full ownership         Camp   (564,701]
6       For free         Camp  (19.3,156]
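
The outer limits 19.3 and 701 appear because cut(), when given a single number for breaks, widens the data range slightly so that the minimum and maximum fall inside an interval. If you would rather have the exact boundaries and the "20-156" label style from the question, one option is to build the breaks yourself. The following is only a sketch: to_intervals is a hypothetical helper name, not a function from dplyr or base R, and returning constant or all-missing columns as plain factors is an assumption about what you would want there.

```r
# Hypothetical helper: bin a numeric vector into n equal-width intervals
# that span exactly min(x)..max(x), labelled "lower-upper".
to_intervals <- function(x, n = 5) {
  x_ok <- x[!is.na(x)]
  # Assumption: a constant or all-missing column cannot be split,
  # so it is returned as an ordinary factor instead.
  if (length(unique(x_ok)) < 2) return(factor(x))
  breaks <- seq(min(x_ok), max(x_ok), length.out = n + 1)
  # Labels use the raw break values; round them first if they are not whole numbers
  labels <- paste(head(breaks, -1), tail(breaks, -1), sep = "-")
  # include.lowest = TRUE keeps the minimum value inside the first interval
  cut(x, breaks = breaks, labels = labels, include.lowest = TRUE)
}

rent <- c(200, 140, 500, 100, 700, 20)
binned <- to_intervals(rent)
as.character(binned)
# "156-292" "20-156" "428-564" "20-156" "564-700" "20-156"
```

The helper can be used in place of cut in the mutate(across()) call above, e.g. across(where(is.numeric), to_intervals).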

Input

df = structure(list(tenureType = c("Full ownership", "Full ownership", 
                              "Full ownership", "For free", "Full ownership", "For free"), 
               localityType = c("Rural", "Urban", "Camp", "Urban", "Camp", 
                                "Camp"), monthlyRent = c(200L, 140L, 500L, 100L, 700L, 20L
                                )), row.names = c(NA, -6L), class = "data.frame")
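
The question also asks whether there is a better approach than equal-width intervals. One common alternative is equal-frequency binning, where the breaks are quantiles so that each category holds roughly the same number of rows; this can help with skewed columns, where equal-width bins leave most rows in a single interval. Whether it actually gives a better tree depends on the data, so treat this as a sketch to experiment with. equal_freq_bins is a hypothetical helper name, not an existing function.

```r
# Hypothetical helper: equal-frequency (quantile-based) binning into up to n groups.
equal_freq_bins <- function(x, n = 5) {
  probs <- seq(0, 1, length.out = n + 1)
  # unique() drops repeated cut points, which occur when many values are tied;
  # in that case fewer than n categories are produced
  breaks <- unique(quantile(x, probs = probs, na.rm = TRUE, names = FALSE))
  # Assumption: a constant or all-missing column is returned as a plain factor
  if (length(breaks) < 2) return(factor(x))
  cut(x, breaks = breaks, include.lowest = TRUE)
}

rent <- c(200, 140, 500, 100, 700, 20)
table(equal_freq_bins(rent))  # number of rows per category
```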
