The First csv file is called "CLAIM" and these are parts of data
The second csv file is called "CUSTOMER" and these are parts of data
- First, I wanted to merge two data based on the common column
- Second, I wanted to remove all columns including NA value
- Third, I wanted to remove the variables like 'SIU_CUST_YN, CTPR, OCCP_GRP_2, RECP_DATE, RESN_DATE'.
- Fourth, I wanted to remove the empty row of OCCP_GRP_1
Expecting form is
dim(data_fin)
## [1] 114886 11
head(data_fin)
## CUST_ID DIVIDED_SET SEX AGE OCCP_GRP_1 CHLD_CNT WEDD_YN CHANG_FP_YN
## 1 1 1 2 47 3.사무직 2 Y Y
## 2 1 1 2 47 3.사무직 2 Y Y
## 3 1 1 2 47 3.사무직 2 Y Y
## 4 1 1 2 47 3.사무직 2 Y Y
## 5 2 1 1 53 3.사무직 2 Y Y
## 6 2 1 1 53 3.사무직 2 Y Y
## DMND_AMT PAYM_AMT NON_PAY_RATIO
## 1 52450 52450 0.4343986
## 2 24000 24000 0.8823529
## 3 17500 17500 0.7272727
## 4 47500 47500 0.9217391
## 5 99100 99100 0.8623195
## 6 7817 7500 0.8623195
str(data_fin)
## 'data.frame': 114886 obs. of 11 variables:
## $ CUST_ID : int 1 1 1 1 2 2 2 3 4 4 ...
## $ DIVIDED_SET : int 1 1 1 1 1 1 1 1 1 1 ...
## $ SEX : int 2 2 2 2 1 1 1 1 2 2 ...
## $ AGE : int 47 47 47 47 53 53 53 60 64 64 ...
## $ OCCP_GRP_1 : Factor w/ 9 levels "","1.주부","2.자영업",..: 4 4 4 4 4 4 4 6 3 3 ...
## $ CHLD_CNT : int 2 2 2 2 2 2 2 0 0 0 ...
## $ WEDD_YN : Factor w/ 3 levels "","N","Y": 3 3 3 3 3 3 3 2 2 2 ...
## $ CHANG_FP_YN : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 1 2 ...
## $ DMND_AMT : int 52450 24000 17500 47500 99100 7817 218614 430000 200000 120000 ...
## $ PAYM_AMT : int 52450 24000 17500 47500 99100 7500 218614 430000 200000 120000 ...
## $ NON_PAY_RATIO: num 0.434 0.882 0.727 0.922 0.862 ...
so I wrote down the code like
#gc(reset=T); rm(list=ls())
getwd()
setwd("/Users/Hong/Downloads")
getwd()
CUSTOMER <- read.csv("CUSTOMER.csv", header=T)
CLAIM <- read.csv("CLAIM.csv", header=T)
#install.packages("dplyr")
library("dplyr")
merge(CUSTOMER, CLAIM, by='CUST_ID', all.y=TRUE)
merged_data <- merge(CUSTOMER, CLAIM)
omitted_data <- na.omit(merged_data)
deducted_data <- head(select(omitted_data, -SIU_CUST_YN, -CTPR, -OCCP_GRP_2, -RECP_DATE, -RESN_DATE), 115327)
data_fin <- head(filter(deducted_data, OCCP_GRP_1 !=""), 115327)
dim(data_fin)
head(data_fin)
str(data_fin)
Next, 1) I should get top 3 (OCCP_GRP_1) that has high non_pay_ratio 2) I should get the (CUST_ID) over 600,000 of DMND_AMT Value
I don't know how to write it down