I have a data frame with 840 columns that I read from a .sav file. I convert all labelled columns to factors using data <- haven::as_factor(data)
Here is an example of the data right after reading the file, before converting to factors:
tenureType | localityType | monthlyRent |
---|---|---|
1 | 1 | 200 |
1 | 2 | 140 |
1 | 3 | 500 |
2 | 2 | 100 |
1 | 3 | 700 |
2 | 3 | 20 |
After data <- haven::as_factor(data):
tenureType | localityType | monthlyRent |
---|---|---|
Full ownership | Rural | 200 |
Full ownership | Urban | 140 |
Full ownership | Camp | 500 |
For free | Urban | 100 |
Full ownership | Camp | 700 |
For free | Camp | 20 |
I need to convert the values to their labels because I want to do some processing on the text.
I want to build a decision tree using the C50 library, so I want to convert every column whose values are numeric (like monthlyRent) into a factor of intervals.
For example, I want the data to look like this:
tenureType | localityType | monthlyRent |
---|---|---|
Full ownership | Rural | 156-292 |
Full ownership | Urban | 20-156 |
Full ownership | Camp | 428-564 |
For free | Urban | 20-156 |
Full ownership | Camp | 564-700 |
For free | Camp | 20-156 |
- I need each numeric column to be converted to 5 categories
- The interval width is calculated as (max - min) / 5
In the sample above: (700 - 20) / 5 = 136.
The intervals are: [20-156], [156-292], [292-428], [428-564], [564-700].
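To make the arithmetic above concrete, here is a minimal sketch in base R for the monthlyRent sample only. The variable names (rent, n_bins, breaks, labels, rent_binned) are just for illustration. I'm assuming the intervals are closed on the right and that the minimum belongs to the first interval, which the description above does not actually pin down.

```r
# Sample monthlyRent values from the table above
rent <- c(200, 140, 500, 100, 700, 20)

n_bins <- 5

# n_bins + 1 equally spaced break points give n_bins intervals,
# each of width (max - min) / n_bins
breaks <- seq(min(rent), max(rent), length.out = n_bins + 1)

# Labels in the "lower-upper" style used in the desired output
labels <- paste(head(breaks, -1), tail(breaks, -1), sep = "-")

# Assumption: right-closed intervals; include.lowest = TRUE keeps
# the minimum value (20) inside the first interval
rent_binned <- cut(rent, breaks = breaks, labels = labels,
                   include.lowest = TRUE)

print(breaks)
print(rent_binned)
```

With these assumptions a value that falls exactly on a boundary (for example 156) goes into the lower interval.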
I have 840 columns, so I don't know the column names in advance. I want the intervals to be computed dynamically per column, since some columns range from 0 to 10 and others from 0 to 10000.
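Roughly, this is the kind of per-column behaviour I am after, as a sketch rather than a finished solution. The helper names bin_equal_width and bin_numeric_columns are hypothetical. I'm assuming the columns to bin are still numeric after haven::as_factor (that is, they carry no value labels); if they ended up as factors with numeric-looking levels they would have to be converted back with as.numeric(as.character(x)) first.

```r
# Hypothetical helper: bin one numeric vector into n_bins equal-width intervals
bin_equal_width <- function(x, n_bins = 5) {
  # An all-NA column has no range to split; return it as a plain factor
  if (all(is.na(x))) {
    return(factor(x))
  }
  rng <- range(x, na.rm = TRUE)
  # A constant column cannot be split into intervals either
  if (rng[1] == rng[2]) {
    return(factor(x))
  }
  # Break points are derived from this column's own min and max
  breaks <- seq(rng[1], rng[2], length.out = n_bins + 1)
  # signif() only shortens the labels; it does not change the break points
  shown <- signif(breaks, 6)
  labels <- paste(head(shown, -1), tail(shown, -1), sep = "-")
  cut(x, breaks = breaks, labels = labels, include.lowest = TRUE)
}

# Hypothetical helper: apply the binning to every numeric column,
# without needing to know any column names
bin_numeric_columns <- function(df, n_bins = 5) {
  is_num <- vapply(df, is.numeric, logical(1))
  df[is_num] <- lapply(df[is_num], bin_equal_width, n_bins = n_bins)
  df
}

# Small stand-in for the real 840-column data
data <- data.frame(
  tenureType = factor(c("Full ownership", "Full ownership", "Full ownership",
                        "For free", "Full ownership", "For free")),
  localityType = factor(c("Rural", "Urban", "Camp", "Urban", "Camp", "Camp")),
  monthlyRent = c(200, 140, 500, 100, 700, 20)
)

binned <- bin_numeric_columns(data)
str(binned)
```

Factor columns such as tenureType are left untouched, and each numeric column gets its own break points, so a 0 to 10 column and a 0 to 10000 column are handled the same way.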
What is the best approach for this? If there is a better approach than equal-width intervals of (max - min) / 5, I'd like to know.