The first CSV file is called "CLAIM"; this is part of the CLAIM data.

The second CSV file is called "CUSTOMER"; this is part of the CUSTOMER data.

  1. First, I wanted to merge the two data sets on their common column.
  2. Second, I wanted to remove all rows that contain an NA value.
  3. Third, I wanted to remove the variables 'SIU_CUST_YN, CTPR, OCCP_GRP_2, RECP_DATE, RESN_DATE'.
  4. Fourth, I wanted to remove the rows where OCCP_GRP_1 is empty.

The expected result is:

dim(data_fin)
## [1] 114886     11
head(data_fin)
##   CUST_ID DIVIDED_SET SEX AGE OCCP_GRP_1 CHLD_CNT WEDD_YN CHANG_FP_YN
## 1       1           1   2  47   3.사무직        2       Y           Y
## 2       1           1   2  47   3.사무직        2       Y           Y
## 3       1           1   2  47   3.사무직        2       Y           Y
## 4       1           1   2  47   3.사무직        2       Y           Y
## 5       2           1   1  53   3.사무직        2       Y           Y
## 6       2           1   1  53   3.사무직        2       Y           Y
##   DMND_AMT PAYM_AMT NON_PAY_RATIO
## 1    52450    52450     0.4343986
## 2    24000    24000     0.8823529
## 3    17500    17500     0.7272727
## 4    47500    47500     0.9217391
## 5    99100    99100     0.8623195
## 6     7817     7500     0.8623195
str(data_fin)
## 'data.frame':    114886 obs. of  11 variables:
##  $ CUST_ID      : int  1 1 1 1 2 2 2 3 4 4 ...
##  $ DIVIDED_SET  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ SEX          : int  2 2 2 2 1 1 1 1 2 2 ...
##  $ AGE          : int  47 47 47 47 53 53 53 60 64 64 ...
##  $ OCCP_GRP_1   : Factor w/ 9 levels "","1.주부","2.자영업",..: 4 4 4 4 4 4 4 6 3 3 ...
##  $ CHLD_CNT     : int  2 2 2 2 2 2 2 0 0 0 ...
##  $ WEDD_YN      : Factor w/ 3 levels "","N","Y": 3 3 3 3 3 3 3 2 2 2 ...
##  $ CHANG_FP_YN  : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 1 2 ...
##  $ DMND_AMT     : int  52450 24000 17500 47500 99100 7817 218614 430000 200000 120000 ...
##  $ PAYM_AMT     : int  52450 24000 17500 47500 99100 7500 218614 430000 200000 120000 ...
##  $ NON_PAY_RATIO: num  0.434 0.882 0.727 0.922 0.862 ...

So I wrote the following code:

# gc(reset = TRUE); rm(list = ls())
setwd("/Users/Hong/Downloads")
getwd()  # confirm the working directory
CUSTOMER <- read.csv("CUSTOMER.csv", header = TRUE)
CLAIM <- read.csv("CLAIM.csv", header = TRUE)
# install.packages("dplyr")
library("dplyr")
# With no `by`, merge() joins on every column name the two data frames share
merged_data <- merge(CUSTOMER, CLAIM)
# na.omit() drops every row that contains at least one NA
omitted_data <- na.omit(merged_data)
# Drop the unwanted variables (head() caps the result at 115327 rows)
deducted_data <- head(select(omitted_data, -SIU_CUST_YN, -CTPR, -OCCP_GRP_2, -RECP_DATE, -RESN_DATE), 115327)
# Keep only the rows where OCCP_GRP_1 is not empty
data_fin <- head(filter(deducted_data, OCCP_GRP_1 != ""), 115327)
dim(data_fin)
head(data_fin)
str(data_fin)

Next, I need to 1) get the top 3 OCCP_GRP_1 groups with the highest NON_PAY_RATIO, and 2) get the CUST_ID values whose DMND_AMT is over 600,000.

I don't know how to write the code for this.
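For illustration, here is a minimal base R sketch of the two tasks. It rests on assumptions not stated in the question: "high NON_PAY_RATIO" is taken to mean the highest mean ratio per OCCP_GRP_1 group, and "over 600,000" is taken to mean at least one single claim above that amount. The small `data_fin` below is a made-up stand-in with hypothetical group labels, not the real data.

```r
# Toy stand-in for data_fin (all values are made up for illustration)
data_fin <- data.frame(
  CUST_ID       = c(1, 1, 2, 3, 4, 5),
  OCCP_GRP_1    = c("A", "A", "B", "C", "D", "B"),
  DMND_AMT      = c(52450, 700000, 99100, 650000, 7817, 430000),
  NON_PAY_RATIO = c(0.40, 0.60, 0.90, 0.20, 0.70, 0.70)
)

# 1) Mean NON_PAY_RATIO per occupation group (assumed definition of "high"),
#    sorted in descending order, keeping the first three groups
ratio_by_occp <- aggregate(NON_PAY_RATIO ~ OCCP_GRP_1, data = data_fin, FUN = mean)
top3_occp <- head(ratio_by_occp[order(-ratio_by_occp$NON_PAY_RATIO), ], 3)

# 2) Customers with at least one claim whose DMND_AMT exceeds 600,000
#    (assumed reading; a per-customer total would need a sum first)
big_claim_ids <- unique(data_fin$CUST_ID[data_fin$DMND_AMT > 600000])

print(top3_occp)
print(big_claim_ids)
```

Since dplyr is already loaded in the code above, the first task could presumably also be written as `data_fin %>% group_by(OCCP_GRP_1) %>% summarise(mean_ratio = mean(NON_PAY_RATIO)) %>% arrange(desc(mean_ratio)) %>% head(3)`. If "over 600,000" refers to each customer's total instead, `aggregate(DMND_AMT ~ CUST_ID, data = data_fin, FUN = sum)` followed by a filter on the summed column would be the analogous step.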

r2evans
Hailey
    I edited the question so that images are visible instead of requiring clicks. This is a temporary fix: please replace the images with *copyable text we can actually use*. I'm not going to spend any time transcribing an image, the impetus is on you to make it easy for us. Please see [reproducible examples](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) (hint: `dput`, copy, paste, `ctrl-k`). Additionally, this can easily be *reduced*, please see [minimal examples](http://stackoverflow.com/help/mcve). – r2evans Apr 17 '17 at 19:43
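As a sketch of what the `dput` hint refers to: `dput()` prints R code that rebuilds an object, so its output can be pasted into a question as copyable text. The tiny data frame here is a made-up stand-in, not the real data.

```r
# Toy stand-in for the real data (made-up values)
data_fin <- data.frame(
  CUST_ID  = c(1L, 1L, 2L),
  DMND_AMT = c(52450L, 24000L, 99100L)
)

# Capture the text that dput() prints for the first rows;
# this text is what would be pasted into the question
dput_text <- paste(capture.output(dput(head(data_fin, 3))), collapse = "\n")
cat(dput_text, "\n")

# Anyone can rebuild the same object by evaluating that text
rebuilt <- eval(parse(text = dput_text))
```

For a large data set, applying `dput()` to `head()` of the data, or to a small hand-picked subset, keeps the pasted text short.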

0 Answers