1 SVM

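A support vector machine (SVM) is a supervised learning model in machine learning, used for pattern recognition and data analysis, mainly in classification and regression. Given a set of data points that each belong to one of two categories, the SVM algorithm builds a non-probabilistic binary linear classifier that decides which category a new data point falls into. The resulting classifier is expressed as a boundary in the space the data is mapped into, and the SVM algorithm looks for the boundary with the widest margin. SVMs can be used for nonlinear as well as linear classification. Nonlinear classification requires mapping the data into a high-dimensional feature space, and the kernel trick is often used to do this efficiently. [Excerpt from Wikipedia]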
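For reference, the "widest margin" idea in the excerpt is usually written as the optimization problem below. This is the textbook hard-margin formulation, not something taken from this post; the symbols w, b, x_i and y_i (with y_i in {-1, +1}) are notation introduced here. The second line is the radial (RBF) kernel, which is the kernel type reported by the model printout later in this post.

```latex
\min_{w,\,b}\ \frac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i\,(w^{\top} x_i + b) \ge 1, \qquad i = 1, \dots, n

K(x, x') = \exp\!\left(-\gamma\,\lVert x - x' \rVert^{2}\right)
```

The margin width is 2 / ||w||, so minimizing ||w|| maximizes the margin. The cost parameter tuned later belongs to the soft-margin variant, which allows some points to violate the constraint at a penalty.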

Learning

First, train an SVM with the default settings. Even without any further configuration, it predicts quite accurately.

library("e1071")

fit.svm.1 <- svm(LABELS ~ ., data = sc.imp.up.train)
print(fit.svm.1)
summary(fit.svm.1)

# evaluate the model on the test set
pred.svm.1 <- predict(fit.svm.1, sc.imp.up.test)

table(OBSERV = sc.imp.up.test$LABELS, PRED = pred.svm.1)

#       PRED
# OBSERV  -1   1
#     -1 433   5
#     1    0 438
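
The confusion matrix above can be reduced to a single accuracy figure. The following is a minimal base-R sketch that re-enters the four printed counts by hand; the object names cm and accuracy are made up for illustration and do not appear in the original script.

```r
# Confusion matrix re-entered from the printed table
# (rows = observed class, columns = predicted class; matrix() fills column by column)
cm <- matrix(c(433, 0, 5, 438), nrow = 2,
             dimnames = list(OBSERV = c("-1", "1"), PRED = c("-1", "1")))

# Accuracy = correctly classified cases (the diagonal) / all test cases
accuracy <- sum(diag(cm)) / sum(cm)
print(accuracy)
```

All five errors are observations of class -1 that were predicted as class 1; no class 1 observation was missed.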


# compute decision values:
pred.svm.dv.1 <- predict(fit.svm.1, sc.imp.up.test, decision.values = TRUE)

str(pred.svm.dv.1)
attr(pred.svm.dv.1, "decision.values")[1:4,]

#        13        14        15        19 
# 1.3331180 0.7904323 0.8577104 0.5051933 

# visualize (classes by color, SV by crosses):
plot(cmdscale(dist(sc.imp.up.train[,-ncol(sc.imp.up.train)])),
     col = as.integer(sc.imp.up.train[,ncol(sc.imp.up.train)]),
     pch = c("o","+")[seq_len(nrow(sc.imp.up.train)) %in% fit.svm.1$index + 1])

Figure 1: Visualization (classes by color, + marks a support vector)

Parameter Tuning

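With this test data distribution, training with tuned parameters makes little difference compared with training on the defaults. This section only walks through the procedure.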

Function	Description
tune	Parameter Tuning of Functions Using Grid Search
tune.control	Control Parameters for the Tune Function
tune.knn	Convenience Tuning Wrapper Functions
tune.nnet	Convenience Tuning Wrapper Functions
tune.randomForest	Convenience Tuning Wrapper Functions
tune.rpart	Convenience Tuning Wrapper Functions
tune.svm	Convenience Tuning Wrapper Functions

# Print the previously trained model.
fit.svm.1

# Call:
# svm(formula = LABELS ~ ., data = sc.imp.up.train)
# 
# 
# Parameters:
#    SVM-Type:  C-classification 
#  SVM-Kernel:  radial 
#        cost:  1 
#       gamma:  0.002159827 
# 
# Number of Support Vectors:  1038
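
The gamma value in the printout was not set by hand. As far as I know, e1071's svm() uses 1 / (number of data dimensions) as the default gamma, so the printed value can be traced back to the width of the training data. The sketch below assumes 463 predictor columns; that number is inferred from the printed gamma and is not stated anywhere in the post.

```r
# Assumption: the model matrix has 463 predictor columns (inferred, not stated in the post)
n_features <- 463

# Default gamma as documented for e1071::svm: 1 / (data dimension)
default_gamma <- 1 / n_features
print(signif(default_gamma, 7))
```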

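# Search a wide grid over gamma and cost.
# The result is stored as tuned.svm so that it does not shadow e1071's tune.svm() function.

tuned.svm <- tune(svm, LABELS ~ ., data = sc.imp.up.train, 
                  ranges = list(gamma = 2^(-10:1), cost = 2^(-2:4)),
                  tunecontrol = tune.control(sampling = "fix")
                  )

tuned.svm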

# Parameter tuning of ‘svm’:
# 
# - sampling method: fixed training/validation set 
# 
# - best parameters:
#      gamma cost
#  0.0078125  0.5
# - best performance: 0 
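
To get a feel for how much work the grid search does, the candidate grid can be enumerated in base R. This is only a sketch of the search space passed through ranges; the names grid, n_candidates and best_in_grid are illustrative.

```r
# Every (gamma, cost) pair that the ranges argument asks tune() to evaluate
grid <- expand.grid(gamma = 2^(-10:1), cost = 2^(-2:4))

# 12 gamma values x 7 cost values
n_candidates <- nrow(grid)
print(n_candidates)

# The reported best pair (gamma = 0.0078125 = 2^-7, cost = 0.5 = 2^-1) is one grid point
best_in_grid <- any(grid$gamma == 2^-7 & grid$cost == 2^-1)
print(best_in_grid)
```

Each candidate means one model fit, so widening either range multiplies the run time.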


fit.svm.2 <- svm(LABELS ~ ., data = sc.imp.up.train, cost = 0.5, gamma = 0.0078125)


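# evaluate the tuned model on the test set
pred.svm.2 <- predict(fit.svm.2, sc.imp.up.test)

# Note: the original script tabulated pred.svm.1 here, so the table below repeats the first model's result.
table(OBSERV = sc.imp.up.test$LABELS, PRED = pred.svm.2)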
#       PRED
# OBSERV  -1   1
#     -1 433   5
#     1    0 438

GBM

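Gradient boosting is a machine learning technique for regression and classification problems. It produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. Like other boosting methods, it builds the model stage by stage, and it generalizes them by allowing optimization of an arbitrary differentiable loss function. The gbm package also adopts the stochastic gradient boosting strategy, a small but important adjustment to the basic algorithm.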
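The stage-wise construction can be summarized by the usual textbook update rule. The notation here is introduced for illustration: F_m is the model after m trees, h_m is the tree fitted at stage m to the negative gradient of the loss L, and ν is the shrinkage (learning rate).

```latex
r_{i,m} = -\left.\frac{\partial L\bigl(y_i, F(x_i)\bigr)}{\partial F(x_i)}\right|_{F = F_{m-1}},
\qquad
F_m(x) = F_{m-1}(x) + \nu\, h_m(x)
```

Here h_m is fitted to the pseudo-residuals r_{i,m}. In the stochastic variant, each h_m is fitted on a random subsample of the training data.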

1 Classification learning: bernoulli

This section trains gbm with distribution = "bernoulli".

Before training, the data needs a little preprocessing so that gbm can accept it.

# training set
X1 <- sc.imp.up.train[, -length(sc.imp.up.train)]
Y1 <- sc.imp.up.train$LABELS

# the response must be 0 or 1
# as.numeric() on a factor returns the level codes (1, 2), so "-1" becomes 0 and "1" becomes 1
Y1 <- as.numeric(Y1)
Y1 <- ifelse(Y1 == 1, 0, 1)

# validation set
X2 <- sc.imp.up.test[, -length(sc.imp.up.test)]
Y2 <- sc.imp.up.test$LABELS
Y2 <- as.numeric(Y2)
Y2 <- ifelse(Y2 == 1, 0, 1)

# unique(Y2)
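
The recoding above relies on LABELS being a factor, which the C-classification printout in the SVM section suggests. The toy example below, with a made-up labels vector, shows what as.numeric() does to such a factor and gives an alternative that does not depend on the level order.

```r
# Toy factor with the same two levels as LABELS ("-1" and "1")
labels <- factor(c(-1, 1, 1, -1))

# as.numeric() returns the internal level codes, not the label values
codes <- as.numeric(labels)
print(codes)

# Recoding used in the post: code 1 (level "-1") -> 0, code 2 (level "1") -> 1
y01 <- ifelse(codes == 1, 0, 1)
print(y01)

# Alternative that compares the label text directly
y01_alt <- as.numeric(as.character(labels) == "1")
print(y01_alt)
```

If LABELS were stored as a plain numeric vector of -1 and 1 instead, the first recoding would flip the classes, so checking unique(Y2) as in the commented line above is a cheap safeguard.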

Train with the default parameters. The predictive performance drops to an unsatisfactory level.

library("gbm")

# fit.gbm.1 <- gbm.fit(x = X1, y = Y1, distribution = "huberized", n.trees = 500)
fit.gbm.1 <- gbm.fit(x = X1, y = Y1, distribution = "bernoulli")


best.iter <- gbm.perf(fit.gbm.1,method="OOB")
print(best.iter)

print(fit.gbm.1)
summary(fit.gbm.1,n.trees=best.iter)



pred.gbm.1 <- predict(fit.gbm.1, newdata = X2, n.trees = best.iter, type='response')

hist(pred.gbm.1)
table(OBSERV = Y2, PRED = ifelse(pred.gbm.1 > 0.5, 1, 0))
#       PRED
# OBSERV   0   1
#      0 359  79
#      1 203 235

1 - sum(ifelse(ifelse(pred.gbm.1 > 0.5, 1, 0) == Y2, 0, 1)) / length(Y2)
# [1] 0.6780822
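
The accuracy expression used throughout this post is equivalent to a much shorter one. The sketch below checks this on made-up vectors (obs and prob are illustrative, not taken from the data set).

```r
# Made-up observed classes and predicted probabilities
obs  <- c(0, 0, 1, 1, 1)
prob <- c(0.2, 0.7, 0.9, 0.4, 0.6)

# Threshold at 0.5, as in the post
pred <- ifelse(prob > 0.5, 1, 0)

# Expression used in the post: 1 - error rate
acc_long <- 1 - sum(ifelse(pred == obs, 0, 1)) / length(obs)

# Shorter equivalent: the mean of a logical vector is the share of TRUE values
acc_short <- mean(pred == obs)

print(c(acc_long, acc_short))
```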

Tuning

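The parameters most often used for tuning, and their characteristics, are as follows.

n.trees: The total number of trees. Default 100. Choose a large number, then prune the trees back after the model has been fitted.
shrinkage: Applied to each tree in the expansion. Default 0.001. Also known as the learning rate or step-size reduction. Smaller values generally give slightly better performance, at the cost of a longer run time.
interaction.depth: The maximum depth of variable interactions. Default 1. Use cross-validation to choose the interaction depth.
n.minobsinnode: The minimum number of observations in a terminal node of a tree. Default 10. Note that this is the actual number of observations, not the total weight. It has a strong effect on overfitting: decreasing it improves the in-sample fit but can lead to overfitting.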
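
To make the roles of n.trees and shrinkage concrete, here is a toy gradient boosting loop in base R for squared-error loss with depth-1 trees (stumps). It is a sketch for illustration only: the functions fit_stump, predict_stump and boost are hypothetical helpers, the data is simulated, and none of this reflects how the gbm package is implemented internally.

```r
# Fit a stump to residuals r: pick the split on x that minimizes squared error
fit_stump <- function(x, r) {
  xs <- sort(unique(x))
  cuts <- (xs[-length(xs)] + xs[-1]) / 2   # midpoints between neighbouring x values
  best <- NULL
  for (split_at in cuts) {
    left <- x <= split_at
    pred <- ifelse(left, mean(r[left]), mean(r[!left]))
    sse <- sum((r - pred)^2)
    if (is.null(best) || sse < best$sse) {
      best <- list(split_at = split_at, left = mean(r[left]),
                   right = mean(r[!left]), sse = sse)
    }
  }
  best
}

predict_stump <- function(s, x) ifelse(x <= s$split_at, s$left, s$right)

# Boosting loop: each stump is fitted to the current residuals
# (the negative gradient of squared-error loss) and added with weight `shrinkage`
boost <- function(x, y, n_trees, shrinkage) {
  f <- rep(mean(y), length(y))
  for (m in seq_len(n_trees)) {
    s <- fit_stump(x, y - f)
    f <- f + shrinkage * predict_stump(s, x)
  }
  mean((y - f)^2)   # training mean squared error
}

set.seed(1)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.1)

mse_few  <- boost(x, y, n_trees = 10,  shrinkage = 0.1)
mse_many <- boost(x, y, n_trees = 200, shrinkage = 0.1)
print(c(mse_few, mse_many))
```

With the same shrinkage, more trees always lower the training error in this toy setup. That is why a small shrinkage needs a large n.trees, and why test-set or cross-validated error, not training error, should decide where to stop.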

fit.gbm.2 <- gbm.fit(x = X1, y = Y1, distribution = "bernoulli"
                     , shrinkage = 0.0001 
                     , interaction.depth = 10
                     , n.minobsinnode = 5
                     , n.trees =  100)

best.iter <- gbm.perf(fit.gbm.2,method="OOB")
# print(best.iter)

# print(fit.gbm.2)
# summary(fit.gbm.2,n.trees=best.iter)

pred.gbm.2 <- predict(fit.gbm.2, newdata = X2, n.trees = best.iter, type='response')

table(OBSERV = Y2, PRED = ifelse(pred.gbm.2 > 0.5, 1, 0))
#       PRED
# OBSERV   0   1
#      0 372  66
#      1  24 414

1 - sum(ifelse(ifelse(pred.gbm.2 > 0.5, 1, 0) == Y2, 0, 1)) / length(Y2)
# [1] 0.8972603

# --------------------
# increase the number of iterations
fit.gbm.2.more <- gbm.more(fit.gbm.2, 900)
fit.gbm.2.more <- gbm.more(fit.gbm.2.more, 5000)

best.iter <- gbm.perf(fit.gbm.2.more,method="OOB")
pred.gbm.2.more <- predict(fit.gbm.2.more, newdata = X2, n.trees = best.iter, type='response')

table(OBSERV = Y2, PRED = ifelse(pred.gbm.2.more > 0.5, 1, 0))
#       PRED
# OBSERV   0   1
#      0 378  60
#      1  15 423

1 - sum(ifelse(ifelse(pred.gbm.2.more > 0.5, 1, 0) == Y2, 0, 1)) / length(Y2)
# [1] 0.9143836


# =======================================================================================================
# Train with n.trees = 3000.
# Tuning perspective: how quickly should TrainDeviance be driven to 0?

# As interaction.depth and n.minobsinnode grow, computation takes longer,
# and TrainDeviance converges to 0 in fewer n.trees iterations.

fit.gbm.3 <- gbm.fit(x = X1, y = Y1, distribution = "bernoulli"
                     , shrinkage = 0.001 
                     , interaction.depth = 20
                     , n.minobsinnode = 10
                     , n.trees =  3000)

best.iter <- gbm.perf(fit.gbm.3,method="OOB")
print(best.iter)

print(fit.gbm.3)
summary(fit.gbm.3, n.trees=best.iter)



pred.gbm.3 <- predict(fit.gbm.3, newdata = X2, n.trees = best.iter, type='response')


table(OBSERV = Y2, PRED = ifelse(pred.gbm.3 > 0.5, 1, 0))
#       PRED
# OBSERV   0   1
#      0 428  10
#      1   0 438

1 - sum(ifelse(ifelse(pred.gbm.3 > 0.5, 1, 0) == Y2, 0, 1)) / length(Y2)
# [1] 0.9885845

In these runs, increasing the number of gbm training iterations improved prediction accuracy far more than adjusting the other parameters did.

Cross-validation

Run 10-fold cross-validation. Because n.trees * cv.folds trees are built, the computation takes a long time.
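
As a rough picture of what 10-fold cross-validation involves, the base-R sketch below assigns a hypothetical set of 100 rows to 10 folds and counts the trees implied by the settings used in this section. The row count and all variable names are assumptions for illustration; gbm handles the fold assignment internally when cv.folds is set.

```r
set.seed(42)

n <- 100   # hypothetical number of training rows
k <- 10    # number of folds, matching cv.folds = 10

# Give every row a fold id 1..k, then shuffle the ids
folds <- sample(rep(seq_len(k), length.out = n))
fold_sizes <- table(folds)
print(fold_sizes)

# Each fold is held out once while a model is trained on the remaining rows
held_out_rows <- which(folds == 1)

# Trees built across the cross-validation fits with n.trees = 5000
total_trees <- 5000 * k
print(total_trees)
```

As far as I know, gbm also fits one more model on the full training set after the cross-validation fits, so the real total may be closer to n.trees * (cv.folds + 1).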


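X1 <- sc.imp.up.train[, -length(sc.imp.up.train)]
Y1 <- sc.imp.up.train$LABELS
# recode the factor labels to 0/1, as required by distribution = "bernoulli"
Y1 <- ifelse(as.numeric(Y1) == 1, 0, 1)

train <- cbind(X1, Y = Y1)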

fit.gbm.4 <- gbm(Y ~ ., data = train
                 , distribution = "bernoulli"
                 , shrinkage = 0.0001 
                 , interaction.depth = 5
                 , n.minobsinnode = 5
                 , n.trees =  5000
                 , cv.folds = 10)

best.iter <- gbm.perf(fit.gbm.4,method="cv")
print(best.iter)

print(fit.gbm.4)
summary(fit.gbm.4, n.trees=best.iter)



pred.gbm.4 <- predict(fit.gbm.4, newdata = X2, n.trees = best.iter, type='response')


table(OBSERV = Y2, PRED = ifelse(pred.gbm.4 > 0.5, 1, 0))
#       PRED
# OBSERV   0   1
#      0 360  78
#      1  33 405

1 - sum(ifelse(ifelse(pred.gbm.4 > 0.5, 1, 0) == Y2, 0, 1)) / length(Y2)
# [1] 0.8732877

2 Classification learning: adaboost

This distribution uses the AdaBoost exponential loss for 0/1 outcomes.
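
The exponential loss can be evaluated by hand on a few made-up scores. The form exp(-(2y - 1) f), with y in {0, 1} and f the model's raw score, is how I understand the gbm documentation to define the AdaBoost loss; treat it as an assumption. The vectors y and f are illustrative values.

```r
# Made-up 0/1 outcomes and raw model scores (positive score = leaning towards class 1)
y <- c(0, 1, 1, 0)
f <- c(-2.0, 1.5, -0.5, 0.3)

# (2y - 1) maps 0/1 to -1/+1, so the exponent is negative when score and class agree
exp_loss <- exp(-(2 * y - 1) * f)
print(round(exp_loss, 3))

# Cases where the sign of the score disagrees with the class get a loss above 1
misclassified <- exp_loss > 1
print(misclassified)
```

The loss grows exponentially with how confidently a case is misclassified, which makes this loss more sensitive to mislabeled or outlying cases than the bernoulli deviance.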

fit.gbm.3 <- gbm.fit(x = X1, y = Y1, distribution = "adaboost", n.trees = 100,
                     interaction.depth = 5,
                     shrinkage = 0.001,
                     n.minobsinnode = 10)

best.iter <- gbm.perf(fit.gbm.3,method="OOB")
print(best.iter)

#print(fit.gbm.3)
summary(fit.gbm.3,n.trees=best.iter)

pred.gbm.3 <- predict(fit.gbm.3, newdata = X2, n.trees = best.iter, type='response')

table(OBSERV = Y2, PRED = ifelse(pred.gbm.3 > 0.5, 1, 0))
#       PRED
# OBSERV   0   1
#      0 350  88
#      1  53 385

1 - sum(ifelse(ifelse(pred.gbm.3 > 0.5, 1, 0) == Y2, 0, 1)) / length(Y2)
# [1] 0.8390411


# -----------------------------------------------------------------------------------
# continue training

fit.gbm.3.more <- gbm.more(fit.gbm.3, 900)        # first continuation: 1,000 iterations in total
fit.gbm.3.more <- gbm.more(fit.gbm.3.more, 4000)  # continue up to 5,000 iterations
fit.gbm.3.more <- gbm.more(fit.gbm.3.more, 5000)  # continue up to 10,000 iterations
fit.gbm.3.more <- gbm.more(fit.gbm.3.more, 5000)  # continue up to 15,000 iterations

best.iter <- gbm.perf(fit.gbm.3.more, method="OOB")

# print the best iteration and plot the relative influence
print(best.iter)
summary(fit.gbm.3.more, n.trees=best.iter)

pred.gbm.3 <- predict(fit.gbm.3.more, newdata = X2, n.trees = best.iter, type='response')

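# result after 15,000 training iterations
# Note: the original script tabulated pred.gbm.2.more here, so the counts below are those of the earlier bernoulli model.
table(OBSERV = Y2, PRED = ifelse(pred.gbm.3 > 0.5, 1, 0))
#       PRED
# OBSERV   0   1
#      0 378  60
#      1  15 423  >> past a certain threshold, more iterations barely improve prediction accuracy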

1 - sum(ifelse(ifelse(pred.gbm.3 > 0.5, 1, 0) == Y2, 0, 1)) / length(Y2)
# [1] 0.9863014
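
The running totals in the gbm.more() comments above can be checked with a one-line cumulative sum. The vector increments simply lists the initial n.trees followed by each continuation; the name is illustrative.

```r
# Initial fit (n.trees = 100) followed by the four gbm.more() continuations
increments <- c(100, 900, 4000, 5000, 5000)

# Total number of trees in the model after each step
totals <- cumsum(increments)
print(totals)
```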

Figure 2: Output for print(best.iter). Training needs to continue until the plot forms a curve.

Figure 3: Plot produced by summary(fit.gbm.3.more, n.trees=best.iter), showing the relative influence of the variables


Learning from semiconductor manufacturing process data :: part 04 Others