깨고 싶은 알: Learning from Semiconductor Manufacturing Process Data

Overview (Abstract)

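This post uses the semiconductor manufacturing process data from the UCI repository (Data from a semi-conductor manufacturing process). The data has already been preprocessed. Apart from rpart, many models cannot be trained on data that contains missing values; randomForest, randomUniformForest, and the EM learning packages cannot use data with missing values either.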
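Before fitting any of these models it helps to confirm that no missing values remain. The following is a minimal sketch using only base R; the data frame `toy` and its columns are hypothetical stand-ins, not the actual sensor table.

```r
# Hypothetical toy data frame standing in for the sensor table
toy <- data.frame(
  V1 = c(1.2, NA, 3.4, 4.1),
  V2 = c(0.5, 0.7, NA, NA),
  V3 = c(10, 11, 12, 13)
)

# Number of missing values per column
na.per.col <- colSums(is.na(toy))
print(na.per.col)

# TRUE if any value at all is missing
has.na <- anyNA(toy)

# Number of rows without any missing value
n.complete <- sum(complete.cases(toy))
```

If `has.na` is FALSE on the imputed data, the table can be passed to models that reject missing values.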

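1 Creating the training data

The data we use is a minimal sample meant to reflect the real world as closely as possible, so if the data preparation steps are carried out differently, satisfactory predictions cannot be expected. To keep the data volume small, the observations that characterize each feature are present without duplication. In other words, if the splitting procedure is done incorrectly, the predictive performance will be disappointing.

Removing highly correlated predictors

Missing-data handling with Amelia II has already been completed. We use the data that resulted from the previous preprocessing step.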

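# Find highly correlated columns (findCorrelation() expects a correlation matrix)
fc <- findCorrelation(cor(as.matrix(sc.data.a.1$imputations$imp1)), cutoff = 0.90)
length(fc)
# 322

# Remove the highly correlated predictors.
sc.data.imp <- sc.data[, -fc]
dim(sc.data.imp)
#[1] 1567  268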

Highly correlated predictors can bias a model toward particular predictors. Highly correlated predictors should therefore be removed before a model is trained.
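The idea can be tried on a small example. The sketch below builds a hypothetical data frame (`toy`, with made-up columns `x1`, `x2`, `x3`) in which `x2` is nearly a copy of `x1`, and applies caret's findCorrelation() to its correlation matrix. It also guards against the case where no column is flagged, because indexing with an empty negative index would otherwise select zero columns.

```r
library(caret)

set.seed(1)
n <- 200
x1 <- rnorm(n)
toy <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(n, sd = 0.05),  # nearly a copy of x1
  x3 = rnorm(n)                   # independent of the others
)

# findCorrelation() works on a correlation matrix, not on the raw data
cm <- cor(toy)
drop.idx <- findCorrelation(cm, cutoff = 0.90)

# Drop the flagged columns; keep everything if nothing was flagged
toy.reduced <- if (length(drop.idx) > 0) toy[, -drop.idx, drop = FALSE] else toy
names(toy.reduced)
```

One of the two near-duplicate columns is dropped, while the independent column `x3` is kept.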

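Creating the training and test data

To keep the semiconductor sample as small as possible, the observations that characterize each feature are not duplicated. If the order of the following steps is changed, the model's predictions are therefore likely to be unsatisfactory.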

# Combine the data and the labels
sc.imp <- cbind(sc.data.a.1$imputations$imp1, LABELS = sc.label$V1)
sc.imp$LABELS <- factor(sc.imp$LABELS)


library(caret)

# Up-sample so that every class has as many rows as the majority class
set.seed(8000)
sc.imp.up <- upSample(x = sc.imp[, -length(sc.imp)],
                      y = sc.imp$LABELS,
                      yname = "LABELS")

# Split into training and test data
sc.imp.up.rs <- createDataPartition(sc.imp.up$LABELS, times = 5, p = 0.7)


# Training and test data
sc.imp.up.train <- sc.imp.up[sc.imp.up.rs$Resample2, ]
sc.imp.up.test <- sc.imp.up[-sc.imp.up.rs$Resample2, ]
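Because upSample() balances the classes by repeating minority-class rows, it may be worth checking the class counts after up-sampling and how many test rows also occur in the training set. The following is a sketch on hypothetical data (`toy`, `lab`, and the helper `key()` are made up for illustration); it is a diagnostic, not part of the original procedure.

```r
library(caret)

set.seed(8000)
# Hypothetical imbalanced toy data: 90 rows of class -1, 10 rows of class 1
toy <- data.frame(a = rnorm(100), b = rnorm(100))
lab <- factor(c(rep(-1, 90), rep(1, 10)))

# Up-sample the minority class
toy.up <- upSample(x = toy, y = lab, yname = "LABELS")
table(toy.up$LABELS)

# One stratified 70/30 split
idx <- createDataPartition(toy.up$LABELS, times = 1, p = 0.7)$Resample1
train <- toy.up[idx, ]
test  <- toy.up[-idx, ]

# Count test rows whose predictor values also occur in the training set
key <- function(d) paste(d$a, d$b)
n.shared <- sum(key(test) %in% key(train))
n.shared
```

When `n.shared` is large relative to the size of the test set, test accuracy can look better than it would on unseen data. One alternative is to split first and up-sample only the training portion.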

2 randomForest

Training

library(randomForest)

fit.rf.1 <- randomForest(LABELS ~ ., data = sc.imp.up.train, importance = TRUE)
attributes(fit.rf.1)
# $names
# [1] "call"            "type"            "predicted"       "err.rate"        "confusion"       "votes"     
# [7] "oob.times"       "classes"         "importance"      "importanceSD"    "localImportance" "proximity"
# [13] "ntree"           "mtry"           "forest"          "y"               "test"            "inbag"           
# [19] "terms"          

# $class
# [1] "randomForest.formula" "randomForest"   

# Check the mtry value
fit.rf.1$mtry 
# [1] 21

# Predict on the test data
pred.rf.1 <- predict(fit.rf.1, newdata = sc.imp.up.test)

# Confusion matrix of observed vs. predicted labels
table(OBSERV = sc.imp.up.test$LABELS, PRED = pred.rf.1)
#       PRED
# OBSERV  -1   1
#     -1 438   0
#     1    0 438
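The confusion matrix can be summarized numerically. The sketch below re-enters the counts from the table above by hand (the object names `cm`, `accuracy`, and `recall` are illustrative) and computes overall accuracy and per-class recall with base R.

```r
# Confusion matrix copied from the table above (rows: observed, columns: predicted)
cm <- matrix(c(438, 0,
               0, 438),
             nrow = 2, byrow = TRUE,
             dimnames = list(OBSERV = c("-1", "1"), PRED = c("-1", "1")))

# Overall accuracy: correctly classified cases over all cases
accuracy <- sum(diag(cm)) / sum(cm)

# Per-class recall: correct predictions for a class over its observed count
recall <- diag(cm) / rowSums(cm)

accuracy
recall
```

In practice the same expressions can be applied directly to the object returned by table().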

Check the variable importance and the evolution of the error with plots.

plot(fit.rf.1)
varImpPlot(fit.rf.1)

Figure 1: Error evolution

Figure 2: Variable importance

Parameter tuning

First, use the tuneRF() function to find the optimal mtry value.

tuneRF(x = sc.imp.up.train[, -length(sc.imp.up.train)], y = sc.imp.up.train$LABELS)

Figure 3: tuneRF() result. The mtry value of 21 is the same as the value printed earlier.
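tuneRF() returns a matrix of the mtry values it tried together with their OOB errors, so the best value can also be picked programmatically rather than read off the plot. The sketch below uses the built-in iris data purely as a stand-in, and the argument values are illustrative choices. For classification, randomForest's default mtry is floor(sqrt(p)), which is used here as the starting point.

```r
library(randomForest)

set.seed(42)
# iris ships with R and is used here only as a stand-in data set
x <- iris[, -5]
y <- iris$Species

# Search over mtry, starting from the classification default floor(sqrt(p))
res <- tuneRF(x = x, y = y,
              mtryStart = floor(sqrt(ncol(x))),
              ntreeTry = 100, stepFactor = 2, improve = 0.01,
              trace = FALSE, plot = FALSE)

# res has one row per mtry tried, with columns mtry and OOBError
best.mtry <- res[which.min(res[, "OOBError"]), "mtry"]
best.mtry
```

The selected value can then be passed as the `mtry` argument of randomForest().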

Training and predicting with adjusted parameters gives the same result.

fit.rf.2 <- randomForest(LABELS ~ ., data = sc.imp.up.train, importance = TRUE,
                         proximity = TRUE,
                         ntree = 200,
                         mtry = 11,
                         nodesize = 1)

pred.rf.2 <- predict(fit.rf.2, newdata = sc.imp.up.test )

table(OBSERV = sc.imp.up.test$LABELS, PRED = pred.rf.2)
#       PRED
# OBSERV  -1   1
#     -1 438   0
#     1    0 438

3 randomUniformForest

randomUniformForest is defined as an ensemble model that uses many randomized, unpruned binary decision trees built from the training data. It supports in-depth analysis of the variables and incremental learning on data that arrives continuously.

Training

Train with the parameter values that were tried when tuning randomForest.

library(randomUniformForest)

Y1 <- sc.imp.up.train$LABELS
X1 <- sc.imp.up.train[, -ncol(sc.imp.up.train)]

# run model with the parameter values used for randomForest above
fit.ruf.1 <- randomUniformForest(X = X1, Y = Y1,
                                 ntree=200,
                                 mtry = 11,
                                 nodesize = 1,
                                 importance = TRUE
                                 )

# Call:
# randomUniformForest.default(X = X1, Y = Y1, ntree = 200, mtry = 11, 
#     nodesize = 1)
# 
# Type of random uniform forest: Classification
# 
#                            paramsObject
# ntree                               200
# mtry                                 11
# nodesize                              1
# maxnodes                            Inf
# replace                            TRUE
# bagging                           FALSE
# depth                               Inf
# depthcontrol                      FALSE
# OOB                                TRUE
# importance                         TRUE
# subsamplerate                         1
# classwt                           FALSE
# classcutoff                       FALSE
# oversampling                      FALSE
# outputperturbationsampling        FALSE
# targetclass                          -1
# rebalancedsampling                FALSE
# randomcombination                 FALSE
# randomfeature                     FALSE
# categorical variables             FALSE
# featureselectionrule            entropy
# 
# Out-of-bag (OOB) evaluation
# OOB estimate of error rate: 0%
# OOB error rate bound (with 1% deviation): 0%
# 
# OOB confusion matrix:
#           Reference
# Prediction   -1    1 class.error
#         -1 1025    0           0
#         1     0 1025           0
# 
# OOB estimate of AUC: 1
# OOB estimate of AUPR: 1
# OOB estimate of F1-score: 1
# OOB (adjusted) estimate of geometric mean: 1 
# 
# Breiman's bounds
# Expected prediction error (under approximatively balanced classes): 0.5%
# Upper bound: 3.48%
# Average correlation between trees: 0.0151 
# Strength (margin): 0.8675 
# Standard deviation of strength: 0.1618 

# calling the summary() function gives some details about the forest and
# its global variable importance

Watch carefully for signs of overfitting. To do so, check Breiman's bounds and their details. Note that the OOB error lies within Breiman's bounds, which suggests that the model is well fitted (the OOB estimate of error rate is smaller than the Expected prediction error (under approximatively balanced classes)).
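The two figures printed under Breiman's bounds can be related to the printed strength and correlation. In Breiman's random forests paper the prediction error is bounded by var(margin) / s^2 and, in terms of the average correlation between trees, by rho * (1 - s^2) / s^2, where s is the strength. Plugging in the printed values reproduces 0.5% and 3.48% up to rounding. That the package computes its two figures in exactly this way is an assumption based on this arithmetic; it has not been checked against the package source.

```r
# Values copied from the printed output above
rho <- 0.0151   # average correlation between trees
s   <- 0.8675   # strength (margin)
sdv <- 0.1618   # standard deviation of strength

# Correlation/strength bound: rho * (1 - s^2) / s^2
bound.rho <- rho * (1 - s^2) / s^2

# Variance-based bound: var(margin) / s^2
bound.var <- sdv^2 / s^2

# Both expressed in percent
round(100 * bound.rho, 2)
round(100 * bound.var, 2)
```

The OOB error estimate of 0% lies below both values, which is the comparison described above.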

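Checking variable importance

Check the global variable importance. Running the summary() function produces console output and a plot. However, the information-gain-based global variable importance (percent.importance) is 0 for every variable, so no plot is shown.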

# global variable importance
summary(fit.ruf.1)

 
# Global Variable importance:
# Note: most predictive features are ordered by 'score' and plotted. Most discriminant ones
# should also be taken into account by looking 'class' and 'class.frequency'.
# 
# percent.importance is 0 for every variable in the global variable importance.
#     variables score class class.frequency percent percent.importance
# 1        V104    27    -1            0.60  100.00                  0
# 2        V424    23    -1            0.55   84.86                  0
# 3         V60    22    -1            0.54   82.20                  0
# 4        V148    20    -1            0.59   75.33                  0
# 5         V65    20     1            0.53   73.72                  0
# 6        V588    19    -1            0.56   71.83                  0
# 7        V471    19    -1            0.62   70.49                  0
# 8        V133    19    -1            0.58   69.87                  0
# 9        V127    19    -1            0.61   69.71                  0
# ......
#  [ reached getOption("max.print") -- omitted 297 rows ]
# 
# Average tree size (number of nodes) summary:  
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#     257     285     299     299     313     339 
# 
# Average Leaf nodes (number of terminal nodes) summary:  
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#     129     143     150     150     157     170 
# 
# Leaf nodes size (number of observations per leaf node) summary:  
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#    1.00    4.00    9.00   13.75   18.00  221.00 
# 
# Average tree depth : 8 
# 
# Theoretical (balanced) tree depth : 11

Visualizing interactions

fit.imp.ruf <- importance(fit.ruf.1, Xtest = X1)

# 1 - Global Variable Importance (30 most important based on information gain) :
# Note: most predictive features are ordered by 'score' and plotted. Most discriminant ones
# should also be taken into account by looking 'class' and 'class.frequency'.
# 
#    variables score class class.frequency percent percent.importance
# 1       V104    27    -1            0.60  100.00                  0
# 2       V424    23    -1            0.55   84.86                  0
# 3        V60    22    -1            0.54   82.20                  0
# 4       V148    20    -1            0.59   75.33                  0
# 5        V65    20     1            0.53   73.72                  0
# 6       V588    19    -1            0.56   71.83                  0
# 7       V471    19    -1            0.62   70.49                  0
# 8       V133    19    -1            0.58   69.87                  0
# 9       V127    19    -1            0.61   69.71                  0
# 10      V435    19    -1            0.55   69.12                  0
# 11        V1    18    -1            0.55   68.96                  0
# 12      V492    18    -1            0.60   67.90                  0
# 13      V406    18    -1            0.52   67.84                  0
# 14      V566    18    -1            0.60   67.77                  0
# 15      V130    18    -1            0.71   66.96                  0
# 16      V239    18    -1            0.58   66.88                  0
# 17      V512    18    -1            0.61   66.70                  0
# 18      V447    18    -1            0.62   65.47                  0
# 19       V17    18    -1            0.59   65.31                  0
# 20      V432    17    -1            0.51   65.01                  0
# 21      V350    17    -1            0.55   64.48                  0
# 22      V113    17    -1            0.64   64.39                  0
# 23      V577    17    -1            0.64   64.09                  0
# 24      V561    17    -1            0.55   63.88                  0
# 25      V157    17    -1            0.56   63.85                  0
# 26      V511    17    -1            0.55   63.76                  0
# 27      V220    17    -1            0.51   63.67                  0
# 28      V317    17    -1            0.62   63.30                  0
# 29      V564    17    -1            0.67   63.04                  0
# 30      V349    17    -1            0.58   62.95                  0
# 
# 
# 2 - Local Variable importance
# Variables interactions (10 most important variables at first (columns) and second (rows) order) :
# For each variable (at each order), its interaction with others is computed.
# 
# The correlation coefficients between every variable involved in prediction and the 3 predictors that interact with it most are printed here. (omitted)
# 
# 
# Variable Importance based on interactions (10 most important) :
#    V57     V1    V16   V131    V34    V56    V35   V118   V130   V424 
# 0.0269 0.0239 0.0178 0.0162 0.0132 0.0127 0.0127 0.0122 0.0113 0.0112 
# 
# Variable importance over labels (10 most important variables conditionally to each label) :
#      Class -1 Class 1
# V57      0.11    0.00
# V131     0.04    0.00
# V118     0.04    0.00
# V16      0.04    0.01
# V130     0.04    0.00
# V46      0.00    0.04
# V572     0.00    0.03
# V1       0.02    0.03
# V33      0.01    0.03
# V424     0.01    0.03
# 
# 
# See ...$localVariableImportance$obsVariableImportance to get variable importance for each observation.
# 
# Call clusterAnalysis() function to get a more compact and complementary analysis.
#  Type '?clusterAnalysis' for help.
# 
# Call partialDependenceOverResponses() function to get partial dependence over responses
# for each variable. Type '?partialDependenceOverResponses' for help.

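Running plot(fit.imp.ruf, Xtest = X1) shows four plots based on interactions and one box plot. The plots for the top four predictors can be viewed by answering the prompt.

Partial dependence

The distribution of a predictor against the class can be examined with box plots. The plots are hard to interpret, so they are omitted here.

Tree plot

You may want to compare how splits are chosen within a single decision tree. Another important feature would be to provide rules that are easy to understand. In Random Uniform Forests, however, the trees are strongly randomized (unlike CART, which is deterministic), so visualizing a single tree is not suitable for interpretation, and no tree is better than any other.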

tree_100 = getTree.randomUniformForest(fit.ruf.1,100)

plotTree(tree_100, xlim = c(1,80), ylim = c(1,11))

Figure 4: A tree from the forest

4 ExtraTrees

Extremely Randomized Trees(ExtraTrees)

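In terms of how its functions are called, extraTrees (extremely randomized trees) is similar to the randomForest package. The important difference is that, at each node, RandomForest chooses the optimal split threshold for a feature, whereas ExtraTrees chooses the split threshold (uniformly) at random. What they have in common is that the feature with the largest gain is selected after the split thresholds have been determined.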
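The difference between the two threshold rules can be illustrated on a single feature. The following base R sketch is not how either package is implemented. The helpers `gini()` and `gain()` and the toy data are hypothetical, and Gini impurity is used as the gain measure for simplicity. In the real algorithms this comparison is repeated over several randomly chosen candidate features at each node, and the feature with the largest gain is kept.

```r
# Gini impurity of a vector of class labels
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Impurity decrease obtained by splitting x at threshold t
gain <- function(x, y, t) {
  left <- y[x < t]
  right <- y[x >= t]
  if (length(left) == 0 || length(right) == 0) return(0)
  gini(y) - (length(left) * gini(left) + length(right) * gini(right)) / length(y)
}

set.seed(7)
# Hypothetical one-feature data: class 1 tends to have larger x
x <- c(rnorm(50, mean = 0), rnorm(50, mean = 2))
y <- factor(rep(c(-1, 1), each = 50))

# Random-forest style: scan candidate thresholds and keep the best one
xs <- sort(unique(x))
candidates <- (head(xs, -1) + tail(xs, -1)) / 2   # midpoints between sorted values
gains <- sapply(candidates, function(t) gain(x, y, t))
t.best <- candidates[which.max(gains)]

# ExtraTrees style: draw one threshold uniformly between min(x) and max(x)
t.rand <- runif(1, min(x), max(x))

c(best = max(gains), random = gain(x, y, t.rand))
```

The random threshold can never beat the scanned optimum on this feature, but it is much cheaper to obtain and adds variety among the trees.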

To run this package, the rJava package must be installed, and Java must be installed as well.

library(extraTrees)

Y1 <- sc.imp.up.train$LABELS
X1 <- sc.imp.up.train[, -ncol(sc.imp.up.train)]

fit.et.1 <- extraTrees(x = X1, y = Y1)

# attributes(fit.et.1)

Y1.test <- sc.imp.up.test$LABELS
X1.test <- sc.imp.up.test[, -ncol(sc.imp.up.test)]

pred.et.1  <- predict(fit.et.1, X1.test)

## accuracy
mean(Y1.test == pred.et.1)
# [1] 1

table(OBSERV = Y1.test, PRED = pred.et.1)
#       PRED
# OBSERV  -1   1
#     -1 438   0
#     1    0 438


## class probabilities
pred.et.prob.1 = predict(fit.et.1, X1.test, probability=TRUE)
head(pred.et.prob.1)

# Check the probability distribution with a histogram, for the defective class (1).
hist(pred.et.prob.1[,2])

Learning from Semiconductor Manufacturing Process Data :: part 03 randomForests family