I have a data frame with 840 columns that I read from a .sav file. I convert all columns to factors using data <- haven::as_factor(data).

Here is an example of the data just after reading the file, before converting to factors:

tenureType localityType monthlyRent
1 1 200
1 2 140
1 3 500
2 2 100
1 3 700
2 3 20

After data <- haven::as_factor(data):

tenureType localityType monthlyRent
Full ownership Rural 200
Full ownership Urban 140
Full ownership Camp 500
For free Urban 100
Full ownership Camp 700
For free Camp 20

I need to convert the data to its labels because I want to do some processing on the text.

I want to build a decision tree using the C50 library, so I want to convert every column whose values are still numeric after the conversion (like monthlyRent) into a factor of intervals.

I want the data to be for example like this:

tenureType localityType monthlyRent
Full ownership Rural 156-292
Full ownership Urban 20-156
Full ownership Camp 428-564
For free Urban 20-156
Full ownership Camp 564-700
For free Camp 20-156
  • I need each numeric column to be converted to 5 categories.
  • The interval width is calculated as (max - min) / 5.

In the sample above: (700 - 20) / 5 = 136.
The intervals are: [20-156], [156-292], [292-428], [428-564], [564-700].
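
The arithmetic above can be checked with a few lines of base R. This is only a sketch on the six sample values; the names rent, width, and breaks are made up for illustration and do not come from the real data.

```r
# Sample monthlyRent values from the example table
rent <- c(200, 140, 500, 100, 700, 20)

# Width of each of the 5 equal intervals: (max - min) / 5
width <- (max(rent) - min(rent)) / 5

# 5 intervals need 6 boundaries, evenly spaced from min to max
breaks <- seq(min(rent), max(rent), length.out = 6)

print(width)
print(breaks)
```

Because the boundaries are derived from each column's own minimum and maximum, the same two lines work unchanged for a column ranging from 0 to 10 or from 0 to 10000.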

I have 840 columns, so I can't list the column names by hand. I want the intervals to be computed dynamically for each column, since some columns range from 0 to 10 and others from 0 to 10000.

I'm looking for the best approach for this. If there is a better approach than intervals calculated by (max - min) / 5, I'd like to know.

Mark Rotteveel
  • How will you select the intervals for each numeric column? i.e. where is the information that suggests 0-210, 210-600, 600-900 is the set of intervals for `monthlyRent`? – langtang Jan 25 '23 at 01:18
  • The intervals are just an example; I don't know yet what the intervals will be, but I want them to be computed dynamically. I'm asking for the best approach for this. – Obada Jaras Jan 25 '23 at 01:23
  • The sample data don't give any indication at all how to figure interval. You have two "Full ownership Camp" and they have different intervals. What is that based on? – John Polo Jan 25 '23 at 01:27
  • I have edited the question and clarified this point. @langtang – Obada Jaras Jan 25 '23 at 01:36
  • @JohnPolo I have edited the question and clarified this point. – Obada Jaras Jan 25 '23 at 01:37
  • You can probably do something like this: `library(dplyr); mutate(df, across(where(is.numeric),cut,breaks=5))` – langtang Jan 25 '23 at 01:55

1 Answer

You could use mutate(across()) from the dplyr package, applying cut() with breaks=5 to each of the numeric columns:

library(dplyr)
mutate(df, across(where(is.numeric), function(x) cut(x, breaks = 5)))

Output:

      tenureType localityType monthlyRent
1 Full ownership        Rural   (156,292]
2 Full ownership        Urban  (19.3,156]
3 Full ownership         Camp   (428,564]
4       For free        Urban  (19.3,156]
5 Full ownership         Camp   (564,701]
6       For free         Camp  (19.3,156]

Input:

df = structure(list(tenureType = c("Full ownership", "Full ownership", 
                              "Full ownership", "For free", "Full ownership", "For free"), 
               localityType = c("Rural", "Urban", "Camp", "Urban", "Camp", 
                                "Camp"), monthlyRent = c(200L, 140L, 500L, 100L, 700L, 20L
                                )), row.names = c(NA, -6L), class = "data.frame")
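
If the labels should read 20-156 rather than (19.3,156], one option is to pass explicit breaks and labels to cut(). The following is a minimal base R sketch, not a tested solution for the full 840-column data: bin_equal_width is a hypothetical helper name, and the handling of constant columns is an assumption about what is wanted there.

```r
# Hypothetical helper: bin a numeric vector into n equal-width intervals
# labelled "low-high", using the column's own min and max.
bin_equal_width <- function(x, n = 5) {
  rng <- range(x, na.rm = TRUE)
  # Assumption: a constant column cannot be split, so keep it as a
  # single-level factor instead of failing on non-unique breaks.
  if (rng[1] == rng[2]) return(factor(x))
  breaks <- seq(rng[1], rng[2], length.out = n + 1)
  lower <- format(breaks[-length(breaks)], trim = TRUE)
  upper <- format(breaks[-1], trim = TRUE)
  # include.lowest = TRUE puts the minimum value into the first interval
  cut(x, breaks = breaks, labels = paste(lower, upper, sep = "-"),
      include.lowest = TRUE)
}

# Same sample data as in the answer
df <- data.frame(
  tenureType   = c("Full ownership", "Full ownership", "Full ownership",
                   "For free", "Full ownership", "For free"),
  localityType = c("Rural", "Urban", "Camp", "Urban", "Camp", "Camp"),
  monthlyRent  = c(200L, 140L, 500L, 100L, 700L, 20L)
)

# Find the numeric columns without knowing their names, then bin each one
num_cols <- vapply(df, is.numeric, logical(1))
df[num_cols] <- lapply(df[num_cols], bin_equal_width)

print(df)
```

Unlike cut(x, breaks = 5), which widens the range slightly so the extremes fall inside the outer intervals, this version uses the exact minimum and maximum as the outer boundaries, so the labels match the (max - min) / 5 calculation in the question.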
langtang