ISLR Note

· ☕ 34 min read · ✍️ Hoontaek Lee
🏷️
  • #2020
  • #Book Review
  • #R
  • 2. Statistical Learning

    2.1. What Is Statistical Learning?

    Y = f(X) + e
    Here, the function f is the systematic information about Y that X carries.
    e is the random error term, the part that f cannot capture.
    e is independent of X and has mean zero.

    Statistical learning refers to the set of approaches for estimating the function f above.

    2.1.1. Why Estimate f?

    • prediction

      • Predict Y from the measured X ($\hat{Y}$ = $\hat{f}$(X)).
      • f may remain a black box, as long as the predictions of Y are accurate.
      • Two kinds of error arise: reducible and irreducible.
      • reducible: can be reduced by estimating f better.
      • irreducible: arises because some important variable was not measured,
        or because unmeasurable factors are involved (e.g., a patient's state of mind when a drug is given).
    • inference

      • We want to understand exactly how Y is related to each X.
      • f must not remain a black box.
      • Which predictors are most strongly associated with Y, whether the relationships are linear or non-linear, and so on.

    Depending on the problem,

    • prediction alone, inference alone, or both may be of interest.
    • a simple linear model may be good enough, or a complicated non-linear model may be required.
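    A minimal simulation sketch of Y = f(X) + e (assuming, for illustration only, f(x) = sin(x) and Normal noise): even when the true f is known, the irreducible error remains.

    set.seed(1)
    n <- 200
    x <- runif(n, 0, 2 * pi)
    e <- rnorm(n, sd = 0.5)        # irreducible error, independent of x
    y <- sin(x) + e                # Y = f(X) + e with f = sin

    mean((y - sin(x))^2)           # ~ Var(e) = 0.25 even with the true f
    mean((y - mean(y))^2)          # a poor f_hat (here, a constant) adds reducible error on top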

    2.1.2. How Do We Estimate f?

    There are many ways to estimate f, and they share a few common features.

    • training data: {(x~1~, y~1~), …, (x~n~, y~n~)}
    • Every method is either parametric or non-parametric.
    • parametric methods
      • First assume a form for f (an equation, a distribution, …), which fixes the parameters to be estimated,
      • then estimate those parameter values to determine f.
      • The problem of estimating f is reduced to estimating a handful of parameters (much easier).
      • If the assumed form of f is wrong, the error can be large.
      • To guard against that, the assumed form can be made more flexible, but that requires more parameters
        and therefore raises the probability of overfitting.
    • non-parametric methods (a small sketch contrasting the two follows this list)
      • No particular form is assumed for f.
      • They therefore have the potential to estimate f accurately whatever its true shape.
      • However, because the problem has not been reduced to the easier one of parameter estimation,
        they need far more observations than parametric methods.
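    A small sketch of the contrast, under the same sin(x) assumption as above: the parametric fit commits to a form for f (here a straight line, two parameters), while the non-parametric smoother does not.

    set.seed(1)
    x <- runif(200, 0, 2 * pi)
    y <- sin(x) + rnorm(200, sd = 0.5)

    param_fit    <- lm(y ~ x)       # parametric: assumes f is linear
    nonparam_fit <- loess(y ~ x)    # non-parametric: no fixed functional form

    mean((y - fitted(param_fit))^2)     # large: the assumed linear form is wrong
    mean((y - fitted(nonparam_fit))^2)  # close to Var(e) = 0.25, but needs more data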

    2.1.3. The Trade-Off Between Prediction Accuracy and Model Interpretability

    restrictive –> interpretable (well suited to inference)
    Even when prediction is the goal, however, the most flexible approach is not necessarily the best (covered in 2.2).

    2.1.4. Supervised Versus Unsupervised Learning

    supervised: the responses Y are observed <-> unsupervised: they are not

    2.1.5. Regression Versus Classification Problems

    The response variable is
    quantitative (continuous) –> regression
    qualitative (categorical) –> classification

    The type of the predictors matters less for this distinction, because predictors can generally be coded suitably whether they are quantitative or qualitative.

    2.2. Assessing Model Accuracy

    No single model works best in every situation.
    The most suitable model changes with the problem to be solved and with the data at hand.

    So, in each situation, several models are compared.

    2.2.1. Measuring the Quality of Fit

    Mean squared error (MSE): the most widely used measure in the regression setting.

    MSE
    = $\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{f}(x_i))^2$
    = Ave$\left((y_i - \hat{f}(x_i))^2\right)$

    training MSE vs test MSE: what we actually want to minimize is the test MSE.

    (Figure 2.9)

    1. Training MSE decreases monotonically as flexibility increases.
    2. Test MSE has a U-shaped relationship with flexibility.
      That is, if flexibility is pushed up just to drive the training MSE down,
      the test MSE can end up higher than that of a more restrictive model (overfitting).

    The U shape in 2 shows that the test MSE has a minimum;
    methods for finding the model with the smallest test MSE come later (e.g., cross-validation in Chapter 5).
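    A sketch of the Figure 2.9 pattern (assuming, for illustration, a cubic true f plus noise): training MSE keeps falling with flexibility (polynomial degree), while test MSE is typically U-shaped.

    set.seed(1)
    n <- 100
    x_tr <- runif(n, -2, 2); y_tr <- x_tr^3 - x_tr + rnorm(n)
    x_te <- runif(n, -2, 2); y_te <- x_te^3 - x_te + rnorm(n)

    for (d in c(1, 3, 10)) {
      fit <- lm(y_tr ~ poly(x_tr, d))
      train_mse <- mean((y_tr - fitted(fit))^2)
      test_mse  <- mean((y_te - predict(fit, data.frame(x_tr = x_te)))^2)
      cat(sprintf("degree %2d  train MSE %.2f  test MSE %.2f\n", d, train_mse, test_mse))
    }
    # the degree-10 fit usually has the lowest training MSE, but not the lowest test MSE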

    2.2.2. The Bias-Variance Trade-Off

    expected test MSE = variance of $\hat{f}$(x~0~) + squared bias of $\hat{f}$(x~0~) + variance of the error term
    All three terms on the right are non-negative, so the expected test MSE can never fall below the variance of the error term (the irreducible error).

    flexible f –> high variance (the shape of $\hat{f}$ changes easily with the data), low bias (it follows the data closely)

    (Figure 2.12)
    As flexibility starts to increase,

    at first
    bias falls faster than variance rises –> test MSE decreases;

    past a certain point
    bias falls more slowly than variance rises –> test MSE increases.

    2.2.3. The Classification Setting

    MSE –> used in the regression setting.
    In the classification setting the error rate is used (and minimized).

    error rate = (number of observations whose class is predicted incorrectly) / (total number of observations)
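    A one-line sketch of the error rate in R (the toy vectors are made up for illustration):

    y_true <- factor(c("Yes", "No", "Yes", "Yes", "No"))
    y_hat  <- factor(c("Yes", "Yes", "Yes", "No", "No"))
    mean(y_hat != y_true)   # 0.4: two of the five observations are misclassified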

    5. Resampling Methods

    Resampling Methods

    • Repeatedly draw samples from the same training data and refit the model on each sample.
    • This gives information about the fitted model that a single fit cannot, such as the variability of its estimates.
    • cross-validation: used to choose the flexibility or to estimate the test error.
    • bootstrap: most often used to assess the accuracy of parameter estimates.

    model assessment: evaluating a model's performance
    model selection: choosing a model's flexibility

    5.1. Cross-Validation

    5.1.1. The Validation Set Approach

    Randomly split the data into two subsets of similar size: a training set and a validation set (or hold-out set).
    Fit the model on the training set, then compute the test error (e.g., the MSE) on the validation set.

    • Two drawbacks
      • The MSE changes every time the data are split differently.
      • Not all of the collected data can be used for fitting –> the test error tends to be overestimated.

    Cross-validation fixes these drawbacks (a small sketch of the first drawback follows).
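    A sketch of the first drawback using the Auto data, as in the Lab in 5.3.1: each random split gives a noticeably different MSE estimate.

    library(ISLR)
    sapply(1:5, function(seed) {
      set.seed(seed)
      train <- sample(392, 196)
      fit <- lm(mpg ~ horsepower, data = Auto, subset = train)
      mean((Auto$mpg - predict(fit, Auto))[-train]^2)  # validation-set MSE for this split
    })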

    5.1.2. Leave-One-Out Cross-Validation

    Train on the n-1 observations obtained by leaving out (x~1~, y~1~).
    Compute the test error (MSE~1~) on the held-out observation.
    Repeat in the same way, holding out each observation through (x~n~, y~n~), to obtain MSE~n~.

    test error estimate = CV~(n)~ = Ave(MSE~i~)

    • Advantages of LOOCV (= it fixes the drawbacks of the validation set approach)
      • Because almost all of the data are used for fitting, it overestimates the test error far less (lower bias) than the validation set approach.
      • The estimate does not depend on a random split, so it is always the same.

    LOOCV can be applied to most kinds of predictive models.
    For least squares linear (or polynomial) regression, CV~(n)~ can be computed from a single fit with the shortcut in eq. 5.2 (see the sketch below).
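    A sketch of that shortcut (eq. 5.2): for least squares, LOOCV can be computed from one fit using the leverage values h~i~, with no refitting.

    library(ISLR)
    fit <- lm(mpg ~ horsepower, data = Auto)
    h   <- hatvalues(fit)                                 # leverage of each observation
    cv_n <- mean(((Auto$mpg - fitted(fit)) / (1 - h))^2)  # eq. 5.2
    cv_n   # should match the first (standard) cv.glm()$delta value in the Lab, about 24.23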

    5.1.3. k-Fold Cross-Validation

    Similar to LOOCV, except that the data held out for testing is a group of observations at a time rather than a single observation.

    fold = group
    Randomly divide the data into k groups of equal size.
    Train on everything but the 1^st^ group and compute the error on the held-out group (MSE~1~);
    continue through the k^th^ group (MSE~k~).

    test error estimate = CV~(k)~ = Ave(MSE~i~)

    LOOCV can be viewed as k-fold CV with k = n.
    In practice k = 5 or 10 is used,

    • because the computation is much cheaper, and
    • because k = 5 or 10 identifies nearly the same flexibility-with-minimum-MSE as LOOCV.

    5.1.4. Bias-Variance Trade-Off for k-Fold Cross-Validation

    The choice of k in k-fold CV involves a bias-variance trade-off.

    As k approaches n,
    the bias of the test error estimate decreases (more data are used in each fit) while its variance increases.

    Why k = 5 or 10 is so common: these values give test error estimates that suffer neither from excessively high bias nor from very high variance.

    5.1.5. Cross-Validation on Classification Problems

    In the CV formula, the fraction of misclassified observations is used in place of the MSE.

    5.2. The Bootstrap

    A technique for creating new data sets from the observed data.
    Resample randomly, with replacement, to the same size as the original data set.

    Example:
    even with only three observations,
    create B new data sets of size 3,
    compute the estimate on each data set (B estimates in total), and compute their standard error.

    • It can be used to quantify the uncertainty associated with a given estimator or statistical learning method (e.g., the standard error of a coefficient estimated by linear regression).
    • It is easily applied to many models other than linear regression!
    (About eq. 5.8)

    Why is the quantity inside the parentheses not alpha - alpha_hat?

    Because the true value of alpha is assumed unknown, it cannot appear in the formula:
    eq. 5.8 centres the B bootstrap estimates at their own average, so it is simply the
    sample standard deviation of the B bootstrap estimates of alpha (see the sketch below).
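    A sketch of that reading of eq. 5.8, reusing the same alpha.fn as in the Lab (5.3.4): the standard error is just the sample standard deviation of the B bootstrap estimates.

    library(ISLR)   # for the Portfolio data

    alpha.fn <- function(data, index) {
      X <- data$X[index]; Y <- data$Y[index]
      (var(Y) - cov(X, Y)) / (var(X) + var(Y) - 2 * cov(X, Y))
    }

    set.seed(1)
    alpha_star <- replicate(1000, alpha.fn(Portfolio, sample(100, 100, replace = TRUE)))
    sd(alpha_star)   # sd() divides by B - 1, exactly eq. 5.8; ~0.09, as boot() reports below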
    

    5.3. Lab: Cross-Validation and the Bootstrap

    5.3.1. The Validation Set Approach

    
    library(ISLR)
    set.seed(1) # set seed to get the same results at a later time
    
    train = sample(x = 392, size = 196) # training subset
    
    
    # fit a simple linear model
    
    lm.fit = lm(mpg ~ horsepower, data = Auto, subset = train)
    attach(Auto)
    mean((mpg - predict(lm.fit, Auto))[-train] ^ 2) # compute MSE
    
    
    ## [1] 23.26601
    
    
    # fit a quadratic linear model 
    
    lm.fit2 = lm(mpg ~ poly(horsepower, 2), data = Auto, subset = train)
    mean((mpg - predict(lm.fit2, Auto))[-train] ^ 2) # compute MSE
    
    
    ## [1] 18.71646
    
    
    # fit a cubic linear model 
    
    lm.fit3 = lm(mpg ~ poly(horsepower, 3), data = Auto, subset = train)
    mean((mpg - predict(lm.fit3, Auto))[-train] ^ 2) # compute MSE
    
    
    ## [1] 18.79401
    
    
    plot(horsepower, mpg) # looks like a curve fit is better than a straight line
    
    

    5.3.2. Leave-One-Out Cross-Validation

    
    lm.fit = lm(mpg ~ horsepower, data = Auto)
    glm.fit = glm(mpg ~ horsepower, data = Auto)
    
    coef(lm.fit)
    
    ## (Intercept)  horsepower 
    ##  39.9358610  -0.1578447
    
    
    coef(glm.fit) # glm() w/o family argument --> lm()
    
    
    ## (Intercept)  horsepower 
    ##  39.9358610  -0.1578447
    
    
    ## cv for a simple linear fit
    
    library(boot)
    glm.fit = glm(mpg ~ horsepower, data = Auto)
    cv.err = cv.glm(Auto, glm.fit)
    cv.err$delta # CV(n) results: standard one, bias-corrected one
    
    
    ## [1] 24.23151 24.23114
    
    
    ## cv iteration for polynomial fit of order 1 to 5 
    
    cv.error = rep(0, 5)
    for(i in 1:5){
      glm.fit = glm(mpg ~ poly(horsepower, i), data = Auto)
      cv.error[i] = cv.glm(data = Auto, glmfit = glm.fit)$delta[1]
    }
    cv.error
    
    ## [1] 24.23151 19.24821 19.33498 19.42443 19.03321
    

    5.3.3. k-Fold Cross-Validation

    
    set.seed(17)
    cv.error.10 = rep(0, 10)
    for(i in 1:10){
      glm.fit = glm(mpg ~ poly(horsepower, i), data = Auto)
      cv.error.10[i] = cv.glm(data = Auto, glmfit = glm.fit, K = 10)$delta[1]
      # much shorter running time than that of LOOCV
    
    }
    
    # still cubic or higher order terms don't seem superior
    
    plot(cv.error.10, type = "b")
    

    5.3.4. The Bootstrap

    
    ## accuracy of a regression model --> standard errors of the estimated parameters
    
    ## bootstrap can be applied to compute the se
    
    
    library(boot)
    
    ## 1. create a function to compute statistics of interest
    
    alpha.fn = function(data, index){
      X = data$X[index]
      Y = data$Y[index]
      return((var(Y) - cov(X, Y)) / (var(X) + var(Y) - 2 * cov(X, Y)))
    }
    
    ## 2. boot()
    
    
    
    ## compute one estimate of alpha (the statistic of interest) from a bootstrap sample
    
    set.seed(1)
    alpha.fn(Portfolio, sample(100, 100, replace = TRUE)) 
    
    ## [1] 0.7368375
    
    
    ## repeat 1000 times with bootstrap
    
    boot(Portfolio, alpha.fn, R = 1000) 
    
    ## 
    ## ORDINARY NONPARAMETRIC BOOTSTRAP
    ## 
    ## 
    ## Call:
    ## boot(data = Portfolio, statistic = alpha.fn, R = 1000)
    ## 
    ## 
    ## Bootstrap Statistics :
    ##      original       bias    std. error
    ## t1* 0.5758321 -0.001695873  0.09366347
    
    
    # alpha_hat = 0.5758321, SE(alpha_hat) = 0.09366347 (the std. error printed above)
    
    
    
    ## linear regression coef. estimates
    
    boot.fn = function(data, index){
      return(coef(lm(mpg ~ horsepower, data = data, subset = index)))
    }
    
    boot.fn(data = Auto, index = 1:392)
    
    ## (Intercept)  horsepower 
    ##  39.9358610  -0.1578447
    
    
    set.seed(1)
    boot.fn(data = Auto, index = sample(392, 392, replace = TRUE))
    
    ## (Intercept)  horsepower 
    ##  40.3404517  -0.1634868
    
    
    boot.fn(data = Auto, index = sample(392, 392, replace = TRUE))
    
    ## (Intercept)  horsepower 
    ##  40.1186906  -0.1577063
    
    
    # with replacement --> the result can vary
    
    
    boot(Auto, boot.fn, 1000)
    
    ## 
    ## ORDINARY NONPARAMETRIC BOOTSTRAP
    ## 
    ## 
    ## Call:
    ## boot(data = Auto, statistic = boot.fn, R = 1000)
    ## 
    ## 
    ## Bootstrap Statistics :
    ##       original        bias    std. error
    ## t1* 39.9358610  0.0544513229 0.841289790
    ## t2* -0.1578447 -0.0006170901 0.007343073
    
    
    # compare to
    
    summary(lm(mpg ~ horsepower, data = Auto))$coef
    
    ##               Estimate  Std. Error   t value      Pr(>|t|)
    ## (Intercept) 39.9358610 0.717498656  55.65984 1.220362e-187
    ## horsepower  -0.1578447 0.006445501 -24.48914  7.031989e-81
    
    
    # the bootstrap SE estimates are more trustworthy here than those from lm() (see ISLR p. 196)


    ## the better the model fits, the closer the boot() and summary() SEs become
    
    boot.fn = function(data, index){
      coefficients(lm(mpg ~ horsepower + I(horsepower ^ 2), 
                      data = data, 
                      subset = index))
    }
    
    set.seed(1)
    boot(Auto, boot.fn, 1000)
    
    ## 
    ## ORDINARY NONPARAMETRIC BOOTSTRAP
    ## 
    ## 
    ## Call:
    ## boot(data = Auto, statistic = boot.fn, R = 1000)
    ## 
    ## 
    ## Bootstrap Statistics :
    ##         original        bias     std. error
    ## t1* 56.900099702  3.511640e-02 2.0300222526
    ## t2* -0.466189630 -7.080834e-04 0.0324241984
    ## t3*  0.001230536  2.840324e-06 0.0001172164
    
    
    summary(lm(mpg ~ horsepower + I(horsepower ^ 2), 
                      data = Auto))$coef
    
    ##                     Estimate   Std. Error   t value      Pr(>|t|)
    ## (Intercept)     56.900099702 1.8004268063  31.60367 1.740911e-109
    ## horsepower      -0.466189630 0.0311246171 -14.97816  2.289429e-40
    ## I(horsepower^2)  0.001230536 0.0001220759  10.08009  2.196340e-21
    

    7. Moving Beyond Linearity

    7.4. Regression Splines

    7.4.1. Piecewise Polynomials

    Split the range of the predictor at locations called knots and fit a separate polynomial regression within each region.
    The fitted curves can break apart abruptly at the knots.

    7.4.2. Constraints and Splines

    To obtain a spline that joins smoothly instead of breaking at the knots, add the constraints below (for a cubic fit); a short fitting sketch follows.

    • the function values at each knot,
    • the first derivatives at each knot,
    • the second derivatives at each knot
    must agree on both sides of the knot.

    For a piecewise linear fit, the first condition alone is enough.
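    A sketch of a cubic regression spline with splines::bs(), using the Wage data from the ISLR package (the knot locations 25, 40, 60 are just an example): the continuity constraints above are built into the basis.

    library(splines)
    library(ISLR)
    fit <- lm(wage ~ bs(age, knots = c(25, 40, 60)), data = Wage)  # cubic spline by default
    age_grid <- seq(min(Wage$age), max(Wage$age))
    plot(Wage$age, Wage$wage, col = "gray")
    lines(age_grid, predict(fit, data.frame(age = age_grid)), lwd = 2)  # smooth across knots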

    8. Tree-Based Methods

    • tree-based methods = decision tree methods: stratify the predictor space into a number of regions
    • can be used for both regression and classification
    • (relatively) easy to interpret
    • a simple tree on its own usually has lower predictive accuracy than other machine-learning methods
    • using bagging, random forests or boosting –> the model becomes more complex –> a trade-off between predictive accuracy and ease of interpretation

    8.1. The Basics of Decision Trees

    8.1.1. Regression Trees

    The terminology for the parts of a decision tree is easiest to picture by imagining the tree planted upside down.

    • terminal nodes (leaves): the predicted y value for each region, written at the bottom
    • internal node: a split point
    • branch: a segment connecting the nodes

    Splits located nearer the top of the tree have a larger influence on the predictions.
    Because the tree can be drawn and read off condition by condition as above, a regression tree is easy both to visualize and to interpret.


    Recursive Binary Splitting

    How a regression tree is built: 1. divide the predictor space into regions; 2. compute a prediction (the mean, …) for each region.
    So how should the regions be chosen? –> set up an objective function.

    • Choose the regions so that the sum of the region-wise RSS is as small as possible.

    Trying this calculation for every possible partition would be far too much work,
    so a top-down, greedy algorithm, recursive binary splitting, is used instead.

    • top-down: it starts from the top node (the state in which there is a single region)
    • greedy: at each split it takes the split that is best at that particular step, without looking ahead to splits that might pay off later

    ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ž‘๋™ํ•œ๋‹ค.

    1. ๋จผ์ € ์–ด๋–ค X~j~๋ฅผ ์–ด๋Š ๊ฐ’(s)์„ ๊ธฐ์ค€์œผ๋กœ ๋‚˜๋ˆ ์•ผ ๋‘ ๊ตฌ์—ญ์˜ RSS ํ•ฉ์ด ์ตœ์†Œ๊ฐ€ ๋ ์ง€ ๊ฒฐ์ •ํ•œ๋‹ค.
    2. ๋‘ ๊ตฌ์—ญ์œผ๋กœ ๋‚˜๋ˆ ์ง„๋‹ค.
    3. ๋‘ ๊ตฌ์—ญ ์ค‘ ์–ด๋Š ๊ตฌ์—ญ์„ ์–ด๋–ค j, s๋กœ ๋‚˜๋ˆ ์•ผ ์„ธ ๊ตฌ์—ญ์˜ RSS ํ•ฉ์ด ์ตœ์†Œ๊ฐ€ ๋ ์ง€ ๊ฒฐ์ •ํ•œ๋‹ค.
    4. ์„ธ ๊ตฌ์—ญ์ด ๋œ๋‹ค.
    5. ์„ธ ๊ตฌ์—ญ ์ค‘ ์–ด๋Š ๊ตฌ์—ญ์„ ์–ด๋–ค j, s๋กœ ๋‚˜๋ˆ ์•ผ ๋„ค ๊ตฌ์—ญ์˜ RSS ํ•ฉ์ด ์ตœ์†Œ๊ฐ€ ๋ ์ง€ ๊ฒฐ์ •ํ•œ๋‹ค.
    6. ๋„ค ๊ตฌ์—ญ์ด ๋œ๋‹ค.
    7. ํŠน์ • ์กฐ๊ฑด์„ ์ถฉ์กฑํ•  ๋•Œ๊นŒ์ง€ ๊ณ„์† ๊ตฌ์—ญ์„ ๋‚˜๋ˆˆ๋‹ค.
      • ํŠน์ • ์กฐ๊ฑด์˜ ์˜ˆ: ๋ชจ๋“  ๊ตฌ์—ญ์˜ ๊ด€์ธก๊ฐ’์ด 5๊ฐœ ๋ฏธ๋งŒ์ด๋‹ค
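    A sketch of one greedy step for a single predictor (best_split is a hypothetical helper written only for illustration): scan every cutpoint s and keep the one that minimizes the summed RSS of the two regions, each region predicting its own mean.

    best_split <- function(x, y) {
      cutpoints <- sort(unique(x))[-1]                  # drop the smallest so both sides are non-empty
      rss <- sapply(cutpoints, function(s) {
        left <- y[x < s]; right <- y[x >= s]
        sum((left - mean(left))^2) + sum((right - mean(right))^2)
      })
      cutpoints[which.min(rss)]                         # the s with the smallest summed RSS
    }

    set.seed(1)
    x <- runif(100); y <- ifelse(x < 0.4, 1, 5) + rnorm(100, sd = 0.3)
    best_split(x, y)   # recovers a cutpoint near 0.4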


    Tree Pruning

    The recursive binary splitting algorithm tends to overfit, so it can perform poorly on test data.
    One could add a rule such as "do not split unless the RSS drops by at least some threshold",
    but that would miss splits that look worthless now yet enable a large RSS drop afterwards.
    So instead, a very large tree is grown with recursive binary splitting and then **pruned** back.
    By what criterion should branches be removed?
    What we want to reduce is the test error rate; must we then compute it for every possible subtree and compare?

    –> Use cost complexity pruning (or weakest link pruning)!
    A term penalizing the number of terminal nodes ($\alpha$|T|) is added to the original objective (the RSS).
    |T| is the number of terminal nodes of the subtree.
    Each value of $\alpha$ corresponds to a particular subtree.
    When $\alpha$ is 0, the subtree is the original full-grown tree;
    as $\alpha$ grows above 0, |T| influences the objective more and more (increasing it),
    so unless a split reduces the RSS substantially, |T| has to shrink.

    1. Grow a large tree with recursive binary splitting.
    2. Use K-fold CV to choose the $\alpha$ that minimizes the test error estimate, and
    3. take the subtree corresponding to that $\alpha$ as the final model.

    8.1.2. Classification Trees

    Differences from a regression tree

    • The response variable is qualitative (the predictors can be qualitative as well).
    • A terminal node holds the most commonly occurring class in that region, not the region's mean.
    • The criterion computed for each region is something other than the RSS:
      • the fraction of observations that do not belong to the most common class (the classification error rate),
      • although the Gini index or the entropy (cross-entropy) is used more often than that fraction.
      • Both indices measure impurity: their value grows when a region contains a mix of classes.
    • A split can produce child nodes with the same predicted class (both resulting regions predict the same class).
      • The Gini index or the entropy is used as the splitting objective, and
      • such a split still lowers that objective (even though the error rate stays the same), so the algorithm adds the node.

    8.1.3. Trees Versus Linear Models

    Which is better depends on the situation.

    Either one may have the lower test error rate,
    and either one may be the easier to interpret.

    8.1.4. Advantages and Disadvantages of Trees

    (How is this different from 8.1.3?)

    (compared with classical regression methods)
    (from the point of view of tree methods)

    Advantages

    • easy to explain, even to non-experts
    • mirror human decision-making more closely
    • handle qualitative predictors more easily (much easier than working with dummy variables)

    Disadvantages

    • predictive accuracy tends to be low (addressed with bagging, random forests, boosting)
    • non-robust: the tree structure changes easily with the data

    8.2. Bagging, Random Forests, Boosting

    8.2.1. Bagging

    Bagging = bootstrap aggregation.
    It is a general-purpose procedure for reducing the variance of a statistical learning method,
    but it is introduced here because it is used so often with decision trees.

    How it works

    • The variance of a mean is the variance divided by n –> averaging lowers the variance!
    • Resample many times to create many training sets –> grow a tree on each set –> average the trees' predictions.
    • (The training sets are generated with the bootstrap.)
    • For a qualitative response, take the class predicted most often (give the trees voting rights!).
    • The number of training sets B is not critical: even a very large B does not cause overfitting. Just choose B sufficiently large.

    (How is this different from the bootstrap itself? The bootstrap supplies the resampled training sets; bagging is the use of those sets to build and average many trees.)


    Out-of-Bag Error Estimation

    How to estimate the test error of a bagged model: there is no need to run CV on each model and compare!
    Each bagged tree ends up using about 2/3 of the observations.
    The observations not included in that 2/3 are called the out-of-bag (OOB) observations for that tree.
    With B bagged trees in total, each observation is OOB for roughly B/3 of them, so about B/3 predictions can be obtained per observation.
    Averaging these B/3 values (for regression) or taking their majority vote (for classification) gives the OOB MSE or the OOB classification error.
    This is a valid estimate of the test error because each observation is evaluated only with trees that did not use it for fitting.

    (How does this differ from bootstrap + 3-fold CV? There are no fixed folds: each observation is simply scored by whichever trees happened not to sample it.)


    Variable Importance Measures

    Once bagging is performed, the model is no longer easy to interpret as a single tree.
    Instead, the bagged trees can be used to see which predictors had the most predictive power.

    • For each predictor, sum the decrease in RSS (or in the Gini index) due to every split on that predictor and compare across predictors.
    • If a predictor is used at several nodes, the decreases from all of those nodes are added up (and averaged over the B trees).

    8.2.2. Random Forests

    Bagging < Random Forests

    If some predictor is especially influential,
    bagging will split mostly on that predictor,
    so the bagged trees will be nearly identical and their predictions very similar.
    Averaging such highly correlated quantities does not reduce the variance by much.

    Random forests therefore

    • increase the variance reduction by decorrelating the trees.
      • At each split, a few predictors are chosen at random and only those are considered.
      • That is, the predictors are not all allowed to compete at every split.
      • As a result, weaker predictors get more chances to appear at a node,
      • and the resulting trees become more diverse (decorrelated).
      • Usually m = sqrt(p) predictors are sampled.
      • On average, (p-m)/p of the splits will not even consider the strongest predictor.
      • As with bagging, the number of trees is not critical; grow plenty of them.

    (Notes to self)
    Why use the less influential variables? Why does this make the RF results better?

    "Averaging many highly correlated quantities does not lead to as large of a reduction in variance as averaging many uncorrelated quantities. In particular, this means that bagging will not lead to a substantial reduction in variance over a single tree in this setting."

    So averaging correlated quantities barely reduces the variance?
    But then, isn't bagging highly correlated & (per tree) highly accurate,
    while RF is less correlated & (per tree) less accurate?
    So RF wins on variance reduction but bagging on accuracy? (In practice RF is usually the more accurate overall.)

    "thereby making the average of the resulting trees less variable and hence more reliable"

    Why is the average of decorrelated values less variable?
    (Because the variance of an average includes the covariance terms: with pairwise correlation $\rho$, the variance of the average of B trees is $\rho\sigma^2$ + (1-$\rho$)$\sigma^2$/B, so a smaller $\rho$ gives a smaller variance.)

    8.2.3. Boosting

    Like bagging, boosting can be applied to many methods besides decision trees.

    How it works

    • Start from the NULL model (f(x) = 0) and the NULL residuals (r~i~ = y~i~).
    • Grow one small tree (fit to the residuals), add it to the model, recompute the residuals.
    • Grow another small tree, add it to the existing model, recompute the residuals.
    • Grow another small tree, add it to the existing model, recompute the residuals, and so on.

    Rather than building many trees and combining them,
    it keeps growing a single model. Learn slowly.

    Three tuning parameters

    • B: the total number of trees to grow. Too large a B can overfit. B is chosen with CV.
    • $\lambda$: the shrinkage parameter. Each newly grown small tree is multiplied by this number before it is added to the model,
      so it controls the rate at which the model grows (learns).
      • Learning more slowly usually gives better final performance.
      • 0.01 or 0.001 is typically used.
    • d: the number of splits in each new small tree (the interaction depth).
      • Usually d = 1; each tree then consists of a single split, which is why it is called a stump.
    (Notes to self; a small sketch follows.)
    I roughly get how bagging, RF and boosting build their models,
    but I cannot yet picture the process sharply.

    Bagging and RF use an average as the final model — what does that mean exactly? (Each of the B trees predicts for x, and the final prediction is the mean of those B numbers, or the majority class.)
    In boosting each intermediate tree is "added" to the existing one — what does "add" mean? (The new tree's shrunken predictions are added to the running predictions of the current model.)
    Why are the residuals updated in boosting? They are not used as a stopping rule. (They are what the next tree is fit to: each new tree targets the part of the response the current model has not yet explained.)
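    A minimal sketch of those two points, assuming squared-error loss and shallow trees on the Boston data (boston_tr, small_tree and f_hat are illustration names): "adding" a tree means adding its shrunken predictions to the running model, and the residuals are updated so the next tree is fit to what is still unexplained.

    library(MASS)   # Boston data
    library(tree)

    set.seed(1)
    train  <- sample(1:nrow(Boston), nrow(Boston) / 2)
    lambda <- 0.01; B <- 100

    boston_tr   <- Boston[train, ]
    boston_tr$r <- boston_tr$medv          # r_i = y_i (null model f = 0)
    f_hat       <- rep(0, nrow(boston_tr)) # running predictions of the boosted model

    for (b in 1:B) {
      small_tree <- tree(r ~ lstat + rm, data = boston_tr,
                         control = tree.control(nobs = nrow(boston_tr), mindev = 0.1))
      step <- lambda * predict(small_tree, boston_tr)
      f_hat <- f_hat + step                # "add" the shrunken tree to the model
      boston_tr$r <- boston_tr$r - step    # next tree is fit to what is left over
    }
    mean((boston_tr$medv - f_hat)^2)       # training MSE keeps falling as B grows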
    

    8.3. Lab: Decision Trees

    8.3.1. Fitting Classification Trees

    
    library(tree)
    library(ISLR)
    attach(Carseats)
    
    ## create a "High" column
    
    High = ifelse(Sales <= 8, "No", "Yes")
    Carseats = data.frame(Carseats, High)
    str(Carseats)
    
    ## 'data.frame':	400 obs. of  12 variables:
    ##  $ Sales      : num  9.5 11.22 10.06 7.4 4.15 ...
    ##  $ CompPrice  : num  138 111 113 117 141 124 115 136 132 132 ...
    ##  $ Income     : num  73 48 35 100 64 113 105 81 110 113 ...
    ##  $ Advertising: num  11 16 10 4 3 13 0 15 0 0 ...
    ##  $ Population : num  276 260 269 466 340 501 45 425 108 131 ...
    ##  $ Price      : num  120 83 80 97 128 72 108 120 124 124 ...
    ##  $ ShelveLoc  : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
    ##  $ Age        : num  42 65 59 55 38 78 71 67 76 76 ...
    ##  $ Education  : num  17 10 12 14 13 16 15 10 10 17 ...
    ##  $ Urban      : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
    ##  $ US         : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
    ##  $ High       : Factor w/ 2 levels "No","Yes": 2 2 2 1 1 2 1 2 1 1 ...
    
    
    ## try to fit a classification tree
    
    tree.carseats = tree(High ~ . -Sales, Carseats)
    summary(tree.carseats) 
    
    ## 
    ## Classification tree:
    ## tree(formula = High ~ . - Sales, data = Carseats)
    ## Variables actually used in tree construction:
    ## [1] "ShelveLoc"   "Price"       "Income"      "CompPrice"   "Population" 
    ## [6] "Advertising" "Age"         "US"         
    ## Number of terminal nodes:  27 
    ## Residual mean deviance:  0.4575 = 170.7 / 373 
    ## Misclassification error rate: 0.09 = 36 / 400
    
    
    # training error rate = 9%
    
    # deviance: smaller is better
    
    # residual mean deviance: deviance / (n - |T|)
    
    
    plot(tree.carseats) # plot tree
    
    text(tree.carseats, pretty = 0) # add text
    
    

    
    ## performance on a held-out test set
    
    set.seed(2)
    train = sample(1:nrow(Carseats), 200)
    Carseats.test = Carseats[-train, ]
    High.test = High[-train]
    tree.carseats = tree(High ~ . -Sales, Carseats, subset = train)
    tree.pred = predict(object = tree.carseats, 
                        newdata = Carseats.test,
                        type = "class") # specifying the tree is for classification
    
    
    # fraction of correct test predictions = sum of the diagonal / total
    
    sum(diag(table(tree.pred, High.test))) / sum(table(tree.pred, High.test))
    
    ## [1] 0.77
    
    
    ## pruninig (cost complexity pruning)
    
    set.seed(3)
    cv.carseats = cv.tree(tree.carseats, 
                          FUN = prune.misclass) # for classification problem
    
    names(cv.carseats) 
    
    ## [1] "size"   "dev"    "k"      "method"
    
    
    # size: number of terminal nodes

    # k: alpha in eq 8.4 (the cost of tree complexity)

    # dev: the CV error (with prune.misclass, the number of CV misclassifications)
    
    cv.carseats
    
    ## $size
    ## [1] 21 19 14  9  8  5  3  2  1
    ## 
    ## $dev
    ## [1] 74 76 81 81 75 77 78 85 81
    ## 
    ## $k
    ## [1] -Inf  0.0  1.0  1.4  2.0  3.0  4.0  9.0 18.0
    ## 
    ## $method
    ## [1] "misclass"
    ## 
    ## attr(,"class")
    ## [1] "prune"         "tree.sequence"
    
    
    par(mfrow = c(1, 2))
    plot(cv.carseats$size, cv.carseats$dev, type = "b")
    plot(cv.carseats$k, cv.carseats$dev, type = "b")
    

    
    # following the book, the 9-leaf tree is taken as the best (note that in this run the lowest cv error, 74, is at size 21)

    # fit that tree explicitly

    prune.carseats = prune.misclass(tree.carseats, best = 9) 
    # best: the desired number of leaves, from the cost-complexity results
    
    
    plot(prune.carseats)
    text(prune.carseats, pretty = 0)
    
    tree.pred = predict(prune.carseats, Carseats.test, type = "class")
    sum(diag(table(tree.pred, High.test))) / sum(table(tree.pred, High.test))
    
    ## [1] 0.775
    
    
    # the pruned tree has a simpler structure and slightly higher accuracy (0.775 vs 0.77)
    
    

    8.3.2. Fitting Regression Trees

    
    ## fit a regression tree
    
    library(MASS)
    set.seed(1)
    train = sample(1:nrow(Boston), nrow(Boston) / 2)
    tree.boston = tree(medv ~ ., Boston, subset = train)
    
    summary(tree.boston)
    
    ## 
    ## Regression tree:
    ## tree(formula = medv ~ ., data = Boston, subset = train)
    ## Variables actually used in tree construction:
    ## [1] "rm"    "lstat" "crim"  "age"  
    ## Number of terminal nodes:  7 
    ## Residual mean deviance:  10.38 = 2555 / 246 
    ## Distribution of residuals:
    ##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
    ## -10.1800  -1.7770  -0.1775   0.0000   1.9230  16.5800
    
    
    plot(tree.boston)
    text(tree.boston)
    

    
    ## prune
    
    cv.boston = cv.tree(tree.boston)
    plot(cv.boston$size, cv.boston$dev, type = "b")
    

    
    prune.boston = prune.tree(tree.boston, best = 5)
    summary(prune.boston)
    
    ## 
    ## Regression tree:
    ## snip.tree(tree = tree.boston, nodes = 5L)
    ## Variables actually used in tree construction:
    ## [1] "rm"    "lstat"
    ## Number of terminal nodes:  5 
    ## Residual mean deviance:  13.69 = 3396 / 248 
    ## Distribution of residuals:
    ##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
    ## -10.1800  -1.9770  -0.1775   0.0000   2.4230  16.5800
    
    
    plot(prune.boston)
    text(prune.boston, pretty = 0)
    

    
    yhat = predict(tree.boston, newdata = Boston[-train, ])
    boston.test = Boston[-train, "medv"]
    plot(yhat, boston.test)
    abline(0, 1)
    

    
    mean((yhat - boston.test) ^ 2) # MSE
    
    
    ## [1] 35.28688
    
    
    # 35.29 is the test MSE of the unpruned tree; pruning to 5 leaves gives a simpler tree but a worse training fit (residual mean deviance 13.69 vs 10.38)
    
    

    8.3.3. Bagging and Random Forests

    • (compare with the decision tree results from 8.3.2)
    • The randomForest package is enough for both (bagging is the special case of a random forest with m = p).
    
    ## perform bagging
    
    library(randomForest)
    
    ## randomForest 4.6-14
    
    ## Type rfNews() to see new features/changes/bug fixes.
    
    
    set.seed(1)
    bag.boston = randomForest(medv ~ . , data = Boston, 
                              subset = train,
                              mtry = 13, # number of predictors considered at each split (m = p = 13 --> bagging)
    
                              importance = TRUE)
    bag.boston
    
    ## 
    ## Call:
    ##  randomForest(formula = medv ~ ., data = Boston, mtry = 13, importance = TRUE,      subset = train) 
    ##                Type of random forest: regression
    ##                      Number of trees: 500
    ## No. of variables tried at each split: 13
    ## 
    ##           Mean of squared residuals: 11.39601
    ##                     % Var explained: 85.17
    
    
    ## get test error rate
    
    yhat.bag = predict(bag.boston, newdata = Boston[-train, ])
    plot(yhat.bag, boston.test)
    abline(0, 1)
    

    
    mean((yhat.bag - boston.test) ^ 2) 
    
    ## [1] 23.59273
    
    
    # 23.59 (bagging) vs 35.29 (a single regression tree), using the test MSEs computed above
    
    
    ## change the number of tree to grow
    
    bag.boston = randomForest(medv ~ ., 
                              data = Boston,
                              subset = train,
                              mtry = 13,
                              ntree = 25)
    yhat.bag = predict(bag.boston, 
                       newdata = Boston[-train, ])
    mean((yhat.bag - boston.test) ^ 2) 
    
    ## [1] 23.66716
    
    
    ## random forests
    
    set.seed(1)
    rf.boston = randomForest(medv ~., 
                             data = Boston,
                             subset = train,
                             mtry = 6,
                             importance = TRUE)
    
    yhat.rf = predict(rf.boston,
                      newdata = Boston[-train, ])
    mean((yhat.rf - boston.test) ^ 2)
    
    ## [1] 19.62021
    
    
    # 19.62 (rf) vs 23.59 (bagging) vs 35.29 (a single regression tree)
    
    
    importance(rf.boston)
    
    ##           %IncMSE IncNodePurity
    ## crim    16.697017    1076.08786
    ## zn       3.625784      88.35342
    ## indus    4.968621     609.53356
    ## chas     1.061432      52.21793
    ## nox     13.518179     709.87339
    ## rm      32.343305    7857.65451
    ## age     13.272498     612.21424
    ## dis      9.032477     714.94674
    ## rad      2.878434      95.80598
    ## tax      9.118801     364.92479
    ## ptratio  8.467062     823.93341
    ## black    7.579482     275.62272
    ## lstat   27.129817    6027.63740
    
    
    # %IncMSE: mean increase in prediction MSE when that variable is permuted (i.e., effectively removed)

    # IncNodePurity: total decrease in node impurity from splits on that variable, averaged over all trees

    # node impurity: measured by the RSS for regression trees and by the deviance for classification trees
    
    varImpPlot(rf.boston)
    

    
    # rm and lstat are by far the most important variables
    
    

    8.3.4. Boosting

    
    ## fit a boosted model
    
    library(gbm)
    
    ## Loaded gbm 2.1.5
    
    
    set.seed(1)
    boost.boston = gbm(medv ~ ., 
                       data = Boston[train, ],
                       distribution = "gaussian", # binary classification --> "bernoulli"
    
                       n.trees = 5000, # # of tree to grow
    
                       interaction.depth = 4) # depth (= # of node layer) of each tree
    
    
    summary(boost.boston)
    

    ##             var    rel.inf
    ## rm           rm 43.9919329
    ## lstat     lstat 33.1216941
    ## crim       crim  4.2604167
    ## dis         dis  4.0111090
    ## nox         nox  3.4353017
    ## black     black  2.8267554
    ## age         age  2.6113938
    ## ptratio ptratio  2.5403035
    ## tax         tax  1.4565654
    ## indus     indus  0.8008740
    ## rad         rad  0.6546400
    ## zn           zn  0.1446149
    ## chas       chas  0.1443986
    
    
    # rel.inf: relative influence
    
    
    par(mfrow = c(1, 2))
    plot(boost.boston, i = "rm")
    

    
    plot(boost.boston, i = "lstat")
    

    
    ## predict
    
    yhat.boost = predict(boost.boston,
                         newdata = Boston[-train, ],
                         n.trees = 5000)
    mean((yhat.boost - boston.test) ^ 2)
    
    ## [1] 18.84709
    
    
    # 18.85 (boosting) vs 19.62 (rf) vs 23.59 (bagging) vs 35.29 (a single regression tree)
    
    
    ## manipulate shrinkage parameter
    
    boost.boston = gbm(medv ~ ., 
                       data = Boston[train, ],
                       distribution = "gaussian",
                       n.trees = 5000,
                       interaction.depth = 4,
                       shrinkage = 0.2, # default = 0.001
    
                       verbose = FALSE)
    
    yhat.boost = predict(boost.boston,
                         newdata = Boston[-train, ],
                         n.trees = 5000)
    mean((yhat.boost - boston.test) ^ 2) # vs 18.85 (with the default lambda = 0.001)
    
    
    ## [1] 18.33455
    

    9. Support Vector Machines

    Support vector machine (SVM)

    • Developed in the computer science community in the 1990s.
    • One of the best "out of the box" classifiers.
      • "out of the box": ready to use as delivered; a classifier that works well without special preparation.
    • A generalization of the maximal margin classifier.
      • maximal margin classifier: usable only when a linear separating boundary exists
      • support vector classifier: an extension of it
      • support vector machine: a further extension
    • Originally designed for classification problems with two classes.
    • It can, however, be applied to problems with more than two classes, and even to regression.

    9.1. Maximal Margin Classifier

    9.1.1. What Is a Hyperplane?

    • Hyperplane: in p-dimensional space, a flat affine subspace of dimension p-1.
      • affine space: a vector space with the notion of points added; no special origin exists.
      • For example, it is a line in two-dimensional space and a two-dimensional plane in three-dimensional space.
      • In four or more dimensions it is hard to visualize, but the concept is the same.
    • Algebraically, it is the set of all X = (X~1~, …, X~p~)^T^ satisfying $\beta$~0~+$\beta$~1~X~1~+$\beta$~2~X~2~+…+$\beta$~p~X~p~ = 0 (a small sketch follows this list).
      • A hyperplane divides p-dimensional space into two regions (where the left-hand side is greater than 0 and where it is less than 0).
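    A tiny sketch of that idea in p = 2 dimensions (the coefficients and points are arbitrary): the sign of the left-hand side tells which side of the hyperplane a point falls on.

    beta0 <- 1; beta <- c(2, -3)
    X <- matrix(c(1, 0,
                  0, 1,
                  2, 2), ncol = 2, byrow = TRUE)
    drop(sign(beta0 + X %*% beta))   # +1 / -1: the two regions created by the hyperplane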

    9.1.2. Classification Using a Separating Hyperplane

    • Index the classes in the two regions created by the hyperplane as 1 and -1.
    • separating hyperplane: a hyperplane that leaves only one class in each region.
      • It satisfies y~i~($\beta$~0~+$\beta$~1~x~i1~+$\beta$~2~x~i2~+…+$\beta$~p~x~ip~) > 0 for every observation (y~i~ is the class index).
      • This is because the class with left-hand side > 0 is indexed 1 and the class with left-hand side < 0 is indexed -1.
    • The larger the absolute value of the left-hand side, the farther the point lies from the hyperplane –> the more confidently it can be assigned to a region.

    9.1.3. The Maximal Margin Classifier

    Maximal margin classifier (or optimal separating hyperplane)

    • Once one separating hyperplane is found, infinitely many more can be created by nudging it slightly.
    • Which one should be chosen?
    • Rather than one that squeaks between the two groups, we want one that separates them as widely as possible.
    • Compute the distance from the hyperplane to each point; the smallest of these distances is the margin.
    • The hyperplane whose margin is largest = the maximal margin classifier.
    • It tends to overfit when p is large.

    Here, the support vectors are the points that lie exactly on the maximal margin.

    • support: because they support the choice of the maximal margin hyperplane
    • vector: because each point is a vector in p-dimensional space

    Consequently, the maximal margin hyperplane is affected only by the support vectors.
    Put the other way round, the other points can jump around anywhere outside the maximal margin without moving the hyperplane.

    9.1.4. Construction of the Maximal Margin Classifier

    The three conditions of the optimization problem used to construct the maximal margin classifier:

    1. maximize M
      • Maximize the margin; this is the definition of the maximal margin classifier.
    2. y~i~($\beta$~0~+$\beta$~1~x~i1~+$\beta$~2~x~i2~+…+$\beta$~p~x~ip~) >= M for every observation
    3. $\sum_{j=1}^{p}\beta_j^2$ = 1
      • Taken together, 2 and 3 say that every observation lies in its correct region, at least a margin M away from the hyperplane.
      • The distance from a point in p-dimensional space to the hyperplane has the left-hand side of 2 (without y~i~) as its numerator and sqrt($\sum_{j=1}^{p}\beta_j^2$) as its denominator.
      • Condition 3 is what allows the right-hand side of 2 to be M itself (without it, the right-hand side would have to be rescaled by that denominator).

    With these conditions in place, the optimization problem is solved (details omitted) to obtain the hyperplane.

    9.1.5. The Non-separable Case

    In many cases, not only the solution of that optimization problem (the maximal margin classifier)
    but even a separating hyperplane does not exist.

    –> Apply a soft margin, which separates the two regions approximately rather than exactly.
    –> Use the support vector classifier (the extension of the m.m.c. to the non-separable case), and so on.

    9.2. Support Vector Classifiers

    9.2.1. Overview of the Support Vector Classifier

    The maximal margin classifier tries to classify every observation exactly right.
    Not only may such a hyperplane fail to exist,
    but even when it does, it is a very sensitive classifier with these properties:

    • it reacts to every single observation (perfectionism),
    • the maximal margin can become very small (a hair's-breadth separation) –> uncomfortable,
    • there is a real risk of overfitting (perfectionism).

    What is needed instead is a classifier that

    • classifies most observations well while keeping the margin above some reasonable size,
    • and therefore does not change easily with individual observations (less sensitive, robust).

    –> the support vector classifier (or soft margin classifier)

    9.2.2. Details of the Support Vector Classifier

    The conditions of the optimization problem used to construct the support vector classifier:

    1. maximize M
      • Maximize the margin.
    2. $\sum_{j=1}^{p}\beta_j^2$ = 1
    3. y~i~($\beta$~0~+$\beta$~1~x~i1~+$\beta$~2~x~i2~+…+$\beta$~p~x~ip~) >= M(1-$\epsilon$~i~) for every observation
    4. $\epsilon$~i~ >= 0, $\sum_{i=1}^{n}\epsilon_i$ <= *C*

    Compared with the maximal margin classifier's conditions:

    • Conditions 1 and 2 are the same.
      • 1 maximizes the margin.
      • 2 is the auxiliary device that lets the right-hand side of condition 3 be written as M.
    • In 3 the right-hand side has changed to M(1-$\epsilon$~i~), and condition 4 is new.
      • The $\epsilon$~i~ in 3 are slack variables.
      • The C in 4 is the tolerance.

    The tuning parameter C (the tolerance) was added to obtain a more relaxed classifier than the maximal margin classifier.
    Once the hyperplane and the margin have been drawn and the data classified,

    • slack = 0: correctly classified, at least a margin away from the hyperplane (inside South Korea, nowhere near the demarcation line),
    • slack > 0: correctly classified but closer than the margin (in the DMZ, on the South Korean side),
    • slack > 1: misclassified (it defected: the North Korean DMZ or North Korean territory).

    C can be thought of as a budget for slack: there are C points available and the slack values use them up. With a budget of 0, every slack must be 0 (= the maximal margin classifier); the larger the budget, the more numerous and the larger the slacks that can occur. That is the sense in which C is called the tolerance of the SVC.

    In the support vector classifier, the observations that influence the hyperplane are again called support vectors, but they are a slightly different set from those of the maximal margin classifier.

    • Maximal margin classifier: the points lying exactly on the maximal margin ($\epsilon$ = 0)
    • Support vector classifier: the points lying on the margin or on the wrong side of it ($\epsilon$ >= 0)

    The larger C is, the more support vectors there are, so the hyperplane has lower variance and higher bias.
    The smaller C is, the fewer support vectors there are, so the hyperplane has higher variance and lower bias.
    In other words, C controls the classifier's bias-variance trade-off.
    As with other classifiers, C is chosen by cross-validation.

    Because the support vector classifier classifies using only a subset of the observations (the support vectors), it is less sensitive to individual observations than methods that use every observation (e.g., linear discriminant analysis).

    9.3. Support Vector Machines

    • 9.3.1: introduces the mechanism used when the decision boundary is non-linear
    • 9.3.2: introduces a way to carry out that mechanism automatically (the SVM)

    9.3.1. Classification with Non-linear Decision Boundaries

    key = enlarging the feature space

    • When a straight line is not enough, create curves (add quadratic terms, cubic terms, …).
    • The objective function does not change (only the predictors and their coefficients multiply).

    Adding terms, however, makes the amount of computation balloon.

    Using the SVM, the feature space can be enlarged in a way that keeps the problem computationally efficient.

    9.3.2. The Support Vector Machine

    The SVM enlarges the feature space using kernels (the detailed computations are involved; for now, think of it as carrying out the enlarge-the-feature-space idea).

    The objective of the support vector classifier involves the observations only through their inner products.

    • inner product = <x~i~, x~*i'*~> = $\sum_{j=1}^{p}$x~*ij*~x~*i'j*~

    This means the support vector classifier solution can be written using inner products alone:

    • f(x) = $\beta$~0~ + $\sum_{i=1}^{n}$$\alpha$~i~<*x*, *x*~*i*~>

    Here $\alpha$~i~ plays the role of a weight given to each training observation.
    An observation that is not a support vector has no influence on the hyperplane, so its $\alpha$~i~ = 0;
    for the support vectors, $\alpha$~i~ takes a definite non-zero value (in the original formulation these values determine the $\beta$ coefficients).

    Estimating the $\alpha$~i~ requires the $\binom{n}{2}$ inner products <x~i~, x~*i'*~> between all pairs of training observations.

    • the fit involves every pair of training observations, compared through these inner products

    Now generalize the inner-product term and write it as

    • K(x~i~, x~*i'*~)

    This K is the kernel. A kernel quantifies how similar (or how different) two observations are.
    The reason the kernel is described as a generalization of the inner product is that many inner-product-like expressions exist,
    which means an SVM can use many different kernels:

    • K(x~i~, x~*i'*~) = $\sum_{j=1}^{p}$x~*ij*~x~*i'j*~ (linear kernel)
    • K(x~i~, x~*i'*~) = (1 + $\sum_{j=1}^{p}$x~*ij*~x~*i'j*~)^*d*^ (polynomial kernel)
    • K(x~i~, x~*i'*~) = exp(-$\gamma$$\sum_{j=1}^{p}$(*x*~*ij*~ - *x*~*i'j*~)^2^) (radial kernel)
      • the larger $\gamma$, the more non-linear the fit

    A polynomial kernel with d = 1 reduces to the linear kernel (the 1 is a constant and is absorbed into $\beta$~0~).

    A classifier that can take any of these kernels is the support vector machine,
    and an SVM with a linear kernel (or a polynomial kernel with d = 1) is exactly the support vector classifier. A small sketch of these kernels as plain R functions follows.
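    The sketch below just writes the three kernels above as R functions to make the notation concrete (the test vectors are arbitrary; gamma and d are the tuning parameters from the text).

    linear_kernel <- function(x, z) sum(x * z)
    poly_kernel   <- function(x, z, d) (1 + sum(x * z))^d
    radial_kernel <- function(x, z, gamma) exp(-gamma * sum((x - z)^2))

    x1 <- c(1, 2); x2 <- c(2, 0)
    linear_kernel(x1, x2)            # 2
    poly_kernel(x1, x2, d = 1)       # 3 = 1 + 2: d = 1 reduces to the linear case
    radial_kernel(x1, x2, gamma = 1) # shrinks toward 0 as the two points move apart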

    (Notes to self)
    The SVC and the SVM seem to be explained from different angles, which is confusing.

    For the support vector classifier, the meaning of the objective was explained through the "distance" from the hyperplane to the support vectors.

    For the support vector machine, the form of the objective was explained through inner products.

    Moreover,
    the SVC discussion used the distance from one point to the hyperplane,
    while the SVM discussion works with quantities between pairs of points — and not only pairs of observations:
    x~i~ and x~i'~ are two training observations, whereas x and x~i~ are a new observation and a training point
    (and the training points that matter are, in effect, the support vectors that define the hyperplane/margin).

    If the same picture as the SVC (comparing distances with the margin) also applies to the SVM,
    it seems easiest to understand the SVM's kernel as
    "one of a family of functions expressing how close two points are".
    Just as the SVC uses a kernel that amounts to the distance from a point to a straight-line hyperplane,
    the SVM uses a (more complicated) kernel that amounts to the distance from a point to a curved boundary.
    

    9.3.3. An Application to the Heart Disease Data

    9.4. SVMs with More than Two Classes

    There are two main ways to extend the SVM to the K-class case:

    • the one-versus-one (or all-pairs) approach
    • the one-versus-all approach

    9.4.1. One-Versus-One Classification

    Pick two of the K classes at a time and perform a binary classification.

    • Build $\binom{K}{2}$ SVMs.
    • Each SVM solves the two-class problem for the kth and *k'*th classes.
    • Using all $\binom{K}{2}$ SVMs, record the class to which each observation is assigned (each SVM casts a vote).
    • The class that receives the most votes is the final classification for that observation.

    9.4.2. One-Versus-All Classification

    Pick the kth of the K classes and perform a binary classification of (the kth class) vs (the remaining K-1 classes).

    To classify an observation x,

    • build K SVMs, and
    • assign x to the class k whose fitted function evaluates to the largest value,
      • because a larger value means x belongs to that class with more confidence.
    (The text writes this only with the linear-kernel formula, presumably just as an example.)
    

    9.5. Relationship to Logistic Regression

    When the SVM first appeared in the 1990s, its use of hyperplanes and kernels was felt to be quite novel.
    It later turned out, however, that the SVM follows the same track as the classical classification approaches.

    The support vector classifier objective can also be written as:

    • minimize over $\beta$~0~, …, $\beta$~p~ { $\sum_{i=1}^{n}$ max[0, 1 - *y*~*i*~*f*(*x*~*i*~)] + $\lambda$$\sum_{j=1}^{p}$$\beta$^2^~j~ }
      • $\lambda$ is the tuning parameter that controls the variance-bias trade-off (the counterpart of the SVC's C).

    The quantity being minimized has the Loss + Penalty form.
    The loss measures how badly each observation violates the margin, and the penalty corresponds to the tolerance.
    A loss of this form is called the hinge loss; observations that are not support vectors contribute a loss of exactly 0. A small comparison with the logistic loss follows.
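    A small sketch comparing the hinge loss with the logistic-regression loss as a function of t = y~i~f(x~i~); the two curves are close, which is why the two classifiers often behave similarly.

    t <- seq(-3, 3, length.out = 200)
    hinge    <- pmax(0, 1 - t)          # exactly 0 once t >= 1 (non-support vectors)
    logistic <- log(1 + exp(-t))        # never exactly 0

    plot(t, hinge, type = "l", xlab = "y * f(x)", ylab = "loss")
    lines(t, logistic, lty = 2)
    legend("topright", c("hinge (SVC)", "logistic"), lty = 1:2)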

    Logistic regression, linear discriminant analysis, lasso regression and ridge regression are also built in the Loss + Penalty form.
    Non-linear kernels could be used with those methods as well; it is only for historical reasons that non-linear kernels are used more often with SVMs.

    Logistic regression and the SVC often give similar results,
    because their loss functions take similar values as functions of y~i~($\beta$~0~+$\beta$~1~x~i1~+$\beta$~2~x~i2~+…+$\beta$~p~x~ip~).

    • similar loss = similar classifications
    • The difference is that in the SVC the loss is exactly 0 for non-support vectors,
      whereas the logistic-regression loss is never exactly 0.

    Support vector regression vs least squares regression

    The loss functions differ.

    • least squares regression: choose $\beta$~0~, …, $\beta$~p~ to minimize the residual sum of squares.
    • support vector regression: only observations whose residuals exceed some absolute threshold affect the loss function (similar in spirit to the SVC).
    (I wish the book explained support vector regression in more detail –> a little appears in 9.6.3.)

    9.6. Lab: Support Vector Machines

    The e1071 package is used here.
    There is also the LiblineaR package, which is useful when the number of predictors is very large.

    9.6.1. Support Vector Classifier

    svc: e1071::svm()

    • kernel: with kernel = "linear", svm() fits a support vector classifier (it solves a slightly different formulation of the objective from the one in the book).
    • cost: the parameter that controls the tolerance. A small cost allows large slacks, so there are many support vectors.
    
    # generate two-dimensional example data

    set.seed(1)
    x = matrix(rnorm(20 * 2), ncol = 2)
    y = c(rep(-1, 10), rep(1, 10))
    x[y == 1, ] = x[y == 1, ] + 1 # shift the last 10 observations

    plot(x, 
         col = (3 - y)) # the two classes are not linearly separable 
    
    

    For a classification problem,
    the class variable must be coded as a factor.

    
    dat = data.frame(x = x, 
                     y = as.factor(y))
    library(e1071)
    svmfit = svm(y ~ ., 
                 data = dat,
                 kernel = "linear",
                 cost = 10, 
                 scale = FALSE) # don't do normalization
    
    plot(svmfit, dat)
    

    
    # crosses: support vectors
    
    # circles: observations that are not support vectors
    
    
    # indices of the support vectors
    
    svmfit$index
    
    ## [1]  1  2  5  7 14 16 17
    
    
    summary(svmfit)
    
    ## 
    ## Call:
    ## svm(formula = y ~ ., data = dat, kernel = "linear", cost = 10, scale = FALSE)
    ## 
    ## 
    ## Parameters:
    ##    SVM-Type:  C-classification 
    ##  SVM-Kernel:  linear 
    ##        cost:  10 
    ## 
    ## Number of Support Vectors:  7
    ## 
    ##  ( 4 3 )
    ## 
    ## 
    ## Number of Classes:  2 
    ## 
    ## Levels: 
    ##  -1 1
    
    
    # cost = 10
    
    # number of support vectors = 7 (4 in class "-1", 3 in class "1")
    
    

    SVC with a smaller cost.
    tolerance๊ฐ€ ๋” ๋†’์•„์ง€๊ณ  margin์ด ๋„“์–ด์ง€๊ณ  support vector๊ฐ€ ๋” ๋งŽ์•„์งˆ ๊ฒƒ์ด๋‹ค.

    
    svmfit = svm(y ~ .,
                 data = dat,
                 kernel = "linear",
                 cost = 0.1,
                 scale = FALSE)
    
    plot(svmfit, dat)
    

    
    svmfit$index
    
    ##  [1]  1  2  3  4  5  7  9 10 12 13 14 15 16 17 18 20
    
    
    summary(svmfit)
    
    ## 
    ## Call:
    ## svm(formula = y ~ ., data = dat, kernel = "linear", cost = 0.1, scale = FALSE)
    ## 
    ## 
    ## Parameters:
    ##    SVM-Type:  C-classification 
    ##  SVM-Kernel:  linear 
    ##        cost:  0.1 
    ## 
    ## Number of Support Vectors:  16
    ## 
    ##  ( 8 8 )
    ## 
    ## 
    ## Number of Classes:  2 
    ## 
    ## Levels: 
    ##  -1 1
    

    tune(): ์›ํ•˜๋Š” ๋ชจ์ˆ˜ ๊ฐ’ ๋ฒ”์œ„์—์„œ 10-fold CV๋ฅผ ์ˆ˜ํ–‰ํ•ด์ค€๋‹ค.

    • error: tune.control() error.fun์—์„œ ์ง€์ •ํ•œ ํ•จ์ˆ˜๋กœ ์—๋Ÿฌ๋ฅผ ๊ตฌํ•œ๋‹ค. ์ด๋ ‡๊ฒŒ ๊ตฌํ•œ ์—๋Ÿฌ๋ฅผ sampling.aggregate์—์„œ ์ง€์ •ํ•œ ํ•จ์ˆ˜๋กœ ์š”์•ฝํ•œ๋‹ค
      • error.fun ๊ธฐ๋ณธ๊ฐ’: classification์€ misclassification rate๊ฐ€, regression์—์„œ๋Š” MSE๊ฐ€ ์‚ฌ์šฉ๋œ๋‹ค
      • sampling.aggregate ๊ธฐ๋ณธ๊ฐ’: mean
    • dispersion: tune.control() error.fun์—์„œ ์ง€์ •ํ•œ ํ•จ์ˆ˜๋กœ ์—๋Ÿฌ๋ฅผ ๊ตฌํ•œ๋‹ค. ์ด๋ ‡๊ฒŒ ๊ตฌํ•œ ์—๋Ÿฌ๋ฅผ sampling.dispersion์—์„œ ์ง€์ •ํ•œ ํ•จ์ˆ˜๋กœ ์š”์•ฝํ•œ๋‹ค
      • sampling.dispersion ๊ธฐ๋ณธ๊ฐ’: sd
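
    For reference, a hedged sketch of overriding these defaults via tune.control(); the argument names follow the e1071 documentation as I remember it, so double-check ?tune.control before relying on them.

    library(e1071)

    # 10-fold CV; per-fold errors aggregated with mean, their spread with sd
    # (these are the defaults, written out here only to make them explicit)
    ctrl = tune.control(sampling = "cross",
                        cross = 10,
                        sampling.aggregate = mean,
                        sampling.dispersion = sd)

    # tune.out = tune(svm, y ~ ., data = dat, kernel = "linear",
    #                 ranges = list(cost = c(0.01, 0.1, 1, 10)),
    #                 tunecontrol = ctrl)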
    
    set.seed(1)
    tune.out = tune(svm, 
                    y ~ .,
                    data = dat,
                    kernel = "linear",
                    ranges = list(cost = c(0.001, 0.01, 0.1, 1, 5, 10, 100)))
    summary(tune.out)
    
    ## 
    ## Parameter tuning of 'svm':
    ## 
    ## - sampling method: 10-fold cross validation 
    ## 
    ## - best parameters:
    ##  cost
    ##   0.1
    ## 
    ## - best performance: 0.05 
    ## 
    ## - Detailed performance results:
    ##    cost error dispersion
    ## 1 1e-03  0.55  0.4377975
    ## 2 1e-02  0.55  0.4377975
    ## 3 1e-01  0.05  0.1581139
    ## 4 1e+00  0.15  0.2415229
    ## 5 5e+00  0.15  0.2415229
    ## 6 1e+01  0.15  0.2415229
    ## 7 1e+02  0.15  0.2415229
    
    
    # extract the best model found by tune()
    
    bestmod = tune.out$best.model
    summary(bestmod)
    
    ## 
    ## Call:
    ## best.tune(method = svm, train.x = y ~ ., data = dat, ranges = list(cost = c(0.001, 
    ##     0.01, 0.1, 1, 5, 10, 100)), kernel = "linear")
    ## 
    ## 
    ## Parameters:
    ##    SVM-Type:  C-classification 
    ##  SVM-Kernel:  linear 
    ##        cost:  0.1 
    ## 
    ## Number of Support Vectors:  16
    ## 
    ##  ( 8 8 )
    ## 
    ## 
    ## Number of Classes:  2 
    ## 
    ## Levels: 
    ##  -1 1
    

    predict(): ๋งŒ๋“ค์–ด๋†“์€ svm์œผ๋กœ ์ƒˆ๋กœ์šด ๋ฐ์ดํ„ฐ์˜ class๋ฅผ ์˜ˆ์ธก(regression์—์„œ๋Š” ๊ฐ’์„ ์˜ˆ์ธก)ํ•œ๋‹ค.

    
    set.seed(1)
    xtest = matrix(rnorm(20 * 2),
                   ncol = 2)
    ytest = sample(c(-1, 1), 20, rep = TRUE)
    xtest[ytest == 1, ] = xtest[ytest == 1, ] + 1
    testdat = data.frame(x = xtest,
                         y = as.factor(ytest))
    
    ypred = predict(bestmod,
                    testdat)
    table(predict = ypred,
          truth = testdat$y)
    
    ##        truth
    ## predict -1 1
    ##      -1  9 3
    ##      1   2 6
    
    
    # with another cost (0.01)
    
    
    svmfit = svm(y ~ .,
                 data = dat,
                 kernel = "linear",
                 cost = 0.01,
                 scale = FALSE)
    ypred = predict(svmfit, testdat)
    table(predict = ypred,
          truth = testdat$y)
    
    ##        truth
    ## predict -1  1
    ##      -1 10  4
    ##      1   1  5
    

    The linearly separable case

    
    x[y == 1, ] = x[y == 1, ] + 0.5
    plot(x, 
         col = (y + 5) / 2,
         pch = 19)
    

    Now the two classes are linearly separable.
    Let's classify them very strictly by raising the cost (i.e., making the tolerance very small).

    
    dat = data.frame(x = x,
                     y = as.factor(y))
    svmfit = svm(y ~ .,
                 data = dat,
                 kernel = "linear",
                 cost = 1e5)
    summary(svmfit)
    
    ## 
    ## Call:
    ## svm(formula = y ~ ., data = dat, kernel = "linear", cost = 1e+05)
    ## 
    ## 
    ## Parameters:
    ##    SVM-Type:  C-classification 
    ##  SVM-Kernel:  linear 
    ##        cost:  1e+05 
    ## 
    ## Number of Support Vectors:  3
    ## 
    ##  ( 1 2 )
    ## 
    ## 
    ## Number of Classes:  2 
    ## 
    ## Levels: 
    ##  -1 1
    
    
    plot(svmfit, dat)
    

    training ๋ฐ์ดํ„ฐ๋ฅผ ์™„๋ฒฝํžˆ ๋ถ„๋ฅ˜ํ•ด๋ƒˆ์ง€๋งŒ,
    margin์ด ๋งค์šฐ ์ž‘๋‹ค(ํŒŒ๋ž€์ƒ‰ support vector์™€ ๊ฐ€์žฅ ๊ฐ€๊นŒ์šด ๋™๊ทธ๋ผ๋ฏธ ์‚ฌ์ด ๊ฑฐ๋ฆฌ๊ฐ€ ๋งค์šฐ ๊ฐ€๊น๋‹ค).
    ๋•Œ๋ฌธ์— test error๋Š” ๋†’์„ ์ˆ˜ ์žˆ๋‹ค.
    cost๋ฅผ ๋‚ฎ์ถฐ๋ณด์ž.

    
    svmfit = svm(y ~ ., 
                 data = dat,
                 kernel = "linear",
                 cost = 1)
    summary(svmfit)
    
    ## 
    ## Call:
    ## svm(formula = y ~ ., data = dat, kernel = "linear", cost = 1)
    ## 
    ## 
    ## Parameters:
    ##    SVM-Type:  C-classification 
    ##  SVM-Kernel:  linear 
    ##        cost:  1 
    ## 
    ## Number of Support Vectors:  7
    ## 
    ##  ( 4 3 )
    ## 
    ## 
    ## Number of Classes:  2 
    ## 
    ## Levels: 
    ##  -1 1
    
    
    plot(svmfit,
         dat)
    

    ๋ถ„ํ™์ƒ‰ ํ•œ ๊ฐœ๊ฐ€ ํ•˜๋Š˜์ƒ‰์œผ๋กœ ์ž˜๋ชป ๋ถ„๋ฅ˜๋์ง€๋งŒ, ๊ทธ ๋Œ€์‹  margin์ด ๋” ์ปค์กŒ๋‹ค.
    cost๊ฐ€ ๋งค์šฐ ๋†’์€ ๊ฒฝ์šฐ๋ณด๋‹ค test error๋Š” ๋” ๋‚ฎ์„ ๊ฒƒ์ด๋‹ค!

    9.6.2. Support Vector Machine

    Non-linear kernel์„ ์‚ฌ์šฉํ•˜๋ ค๋ฉด
    svm() ํ•จ์ˆ˜์˜ kernel์ธ์ž๋กœ “linear"๋Œ€์‹  ๋‹ค๋ฅธ ๊ฒƒ์„ ์‚ฌ์šฉํ•œ๋‹ค.

    • “polynomial”: d์ธ์ž๋ฅผ ์ด์šฉํ•ด ๋‹คํ•ญ์˜ ์ฐจ์ˆ˜๋ฅผ ์กฐ์ ˆํ•œ๋‹ค
    • “radial”: ```gamma``์ธ์ž๋ฅผ ์ด์šฉํ•ด $\lambda$๋ฅผ ์กฐ์ ˆํ•œ๋‹ค($\lambda$๊ฐ€ ์ž‘์„ ์ˆ˜๋ก ์„ ํ˜•์— ๊ฐ€๊น๋‹ค)
    
    library(e1071)
    
    set.seed(1)
    x = matrix(rnorm(200 * 2), ncol = 2)
    x[1:100, ] = x[1:100, ] + 2
    x[101:150, ] = x[101:150, ] - 2
    y = c(rep(1, 150), rep(2, 50))
    dat = data.frame(x = x, 
                     y = as.factor(y))
    plot(x, col = y) # non-linear boundary
    
    

    
    train = sample(200, 100)
    svmfit = svm(y ~ .,
                 data = dat[train, ],
                 kernel = "radial",
                 gamma = 1,
                 cost = 1)
    plot(svmfit, dat[train, ])
    

    
    summary(svmfit)
    
    ## 
    ## Call:
    ## svm(formula = y ~ ., data = dat[train, ], kernel = "radial", gamma = 1, 
    ##     cost = 1)
    ## 
    ## 
    ## Parameters:
    ##    SVM-Type:  C-classification 
    ##  SVM-Kernel:  radial 
    ##        cost:  1 
    ## 
    ## Number of Support Vectors:  31
    ## 
    ##  ( 16 15 )
    ## 
    ## 
    ## Number of Classes:  2 
    ## 
    ## Levels: 
    ##  1 2
    

    Cost๋ฅผ ๋†’์—ฌ๋ณด์ž.

    • ๋”์šฑ ๊ตฌ๋ถˆ๊ตฌ๋ถˆํ•ด์ง€๊ณ 
    • overfitting ๊ฐ€๋Šฅ์„ฑ์ด ๋†’์•„์ง„๋‹ค
    
    svmfit = svm(y ~ .,
                 data = dat[train, ],
                 kernel = "radial",
                 gamma = 1,
                 cost = 1e5)
    plot(svmfit, dat[train, ])
    

    SVC์—์„œ ํ—€๋˜ ๊ฒƒ์ฒ˜๋Ÿผ,
    tune() ํ•จ์ˆ˜(10-fold cross validation ์ˆ˜ํ–‰ํ•ด์ค€๋‹ค)๋ฅผ ์ด์šฉํ•ด
    ์ตœ์ ์˜ $\lambda$์™€ cost๊ฐ’์„ ์ฐพ์•„๋ณด์ž.

    
    set.seed(1)
    tune.out = tune(svm,
                    y ~ .,
                    data = dat[train, ],
                    kernel = "radial",
                    ranges = list(cost = c(0.1, 1, 10, 100, 1000),
                                  gamma = c(0.5, 1, 2, 3, 4)))
    
    help(tune)
    
    ## starting httpd help server ... done
    
    
    summary(tune.out)
    
    ## 
    ## Parameter tuning of 'svm':
    ## 
    ## - sampling method: 10-fold cross validation 
    ## 
    ## - best parameters:
    ##  cost gamma
    ##     1   0.5
    ## 
    ## - best performance: 0.07 
    ## 
    ## - Detailed performance results:
    ##     cost gamma error dispersion
    ## 1  1e-01   0.5  0.26 0.15776213
    ## 2  1e+00   0.5  0.07 0.08232726
    ## 3  1e+01   0.5  0.07 0.08232726
    ## 4  1e+02   0.5  0.14 0.15055453
    ## 5  1e+03   0.5  0.11 0.07378648
    ## 6  1e-01   1.0  0.22 0.16193277
    ## 7  1e+00   1.0  0.07 0.08232726
    ## 8  1e+01   1.0  0.09 0.07378648
    ## 9  1e+02   1.0  0.12 0.12292726
    ## 10 1e+03   1.0  0.11 0.11005049
    ## 11 1e-01   2.0  0.27 0.15670212
    ## 12 1e+00   2.0  0.07 0.08232726
    ## 13 1e+01   2.0  0.11 0.07378648
    ## 14 1e+02   2.0  0.12 0.13165612
    ## 15 1e+03   2.0  0.16 0.13498971
    ## 16 1e-01   3.0  0.27 0.15670212
    ## 17 1e+00   3.0  0.07 0.08232726
    ## 18 1e+01   3.0  0.08 0.07888106
    ## 19 1e+02   3.0  0.13 0.14181365
    ## 20 1e+03   3.0  0.15 0.13540064
    ## 21 1e-01   4.0  0.27 0.15670212
    ## 22 1e+00   4.0  0.07 0.08232726
    ## 23 1e+01   4.0  0.09 0.07378648
    ## 24 1e+02   4.0  0.13 0.14181365
    ## 25 1e+03   4.0  0.15 0.13540064
    

    9.6.3. ROC Curves

    This section uses the ROCR package.

    
    library(ROCR)
    
    # a function that takes a numerical score (pred) and class labels (truth)
    
    # and draws the ROC curve
    
    rocplot = function(pred, truth, ...){
      predob = prediction(pred, truth)
      perf = performance(predob, "tpr", "fpr")
      plot(perf, ...)
    }
    

    Support vector regressor vs support vector machine

    • the support vector regressor uses $\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p$ as the predicted value
    • the support vector machine uses the sign of this quantity to assign the class label
    • if the response passed to svm() is numeric you get support vector regression; if it is a factor you get a support vector classifier (see the sketch below)

    svm()์—์„œ๋Š”…

    • decision.values = TRUE๋กœ ๋‘๋ฉด, class๊ฐ€ ์•„๋‹Œ fitted value๋ฅผ ์–ป์„ ์ˆ˜ ์žˆ๋‹ค
    • predict() ๊ฒฐ๊ณผ ์ค‘ decision.values ์†์„ฑ์— fitted value๊ฐ€ ๋‚˜ํƒ€๋‚˜ ์žˆ๋‹ค
    
    svmfit.opt = svm(y ~ .,
                     data = dat[train, ],
                     kernel = "radial",
                     gamma = 2,
                     cost = 1,
                     decision.values = TRUE)
    fitted = attributes(predict(svmfit.opt, 
                                dat[train, ], 
                                decision.values = TRUE))$decision.values
    
    rocplot(fitted, dat[train , "y"],
            main = "Training Data")
    

    It seems to fit fairly well. Increasing $\gamma$ here (more non-linear) can raise the training fit further.

    
    svmfit.flex = svm(y ~ .,
                      data = dat[train, ],
                      kernel = "radial",
                      gamma = 50,
                      cost = 1,
                      decision.values = TRUE)
    fitted.flex = attributes(predict(svmfit.flex,
                                     dat[train, ],
                                     decision.values = TRUE))$decision.values
    
    par(mfrow = c(1, 1))
    rocplot(fitted, dat[train , "y"],
            main = "Training Data")
    rocplot(fitted.flex, dat[train, "y"],
            add = TRUE, col = "red")
    

    $\lambda$๋ฅผ ๋†’์˜€์„ ๋•Œ ํ›จ์”ฌ ์ •ํ™•ํ•œ ๊ฒƒ์ฒ˜๋Ÿผ ๋ณด์ธ๋‹ค.
    ๊ทธ๋Ÿฌ๋‚˜ test error๋Š”…

    
    fitted.test = attributes(predict(svmfit.opt,
                                     dat[-train, ],
                                     decision.values = TRUE))$decision.values
    fitted.flex.test = attributes(predict(svmfit.flex,
                                          dat[-train, ], 
                                          decision.values = TRUE))$decision.values
    rocplot(fitted.test, dat[-train, "y"],
            main = "Test Data")
    rocplot(fitted.flex.test, dat[-train, "y"],
            add = TRUE, col = "red")
    

    9.6.4. SVM with Multiple Classes

    With three or more classes, svm() uses the one-versus-one approach by default.

    
    set.seed(1)
    x = rbind(x, 
              matrix(rnorm(50 * 2), ncol = 2))
    y = c(y, rep(0, 50))
    x[y == 0, 2] = x[y == 0, 2] + 2
    dat = data.frame(x = x , y = as.factor(y))
    par(mfrow = c(1, 1))
    plot(x, col = (y + 1))
    

    
    svmfit = svm(y ~ .,
                 data = dat,
                 kernel = "radial",
                 cost = 10,
                 gamma = 1)
    plot(svmfit, dat)
    

    9.6.5. Application to Gene Expression Data

    
    library(ISLR)
    names(Khan)
    
    ## [1] "xtrain" "xtest"  "ytrain" "ytest"
    
    
    str(Khan)
    
    ## List of 4
    ##  $ xtrain: num [1:63, 1:2308] 0.7733 -0.0782 -0.0845 0.9656 0.0757 ...
    ##   ..- attr(*, "dimnames")=List of 2
    ##   .. ..$ : chr [1:63] "V1" "V2" "V3" "V4" ...
    ##   .. ..$ : NULL
    ##  $ xtest : num [1:20, 1:2308] 0.14 1.164 0.841 0.685 -1.956 ...
    ##   ..- attr(*, "dimnames")=List of 2
    ##   .. ..$ : chr [1:20] "V1" "V2" "V4" "V6" ...
    ##   .. ..$ : NULL
    ##  $ ytrain: num [1:63] 2 2 2 2 2 2 2 2 2 2 ...
    ##  $ ytest : num [1:20] 3 2 4 2 1 3 4 2 3 1 ...
    
    
    table(Khan$ytrain)
    
    ## 
    ##  1  2  3  4 
    ##  8 23 12 20
    

    The number of features (2,308) is far larger than the number of training observations (63).
    In this case there is no need for a non-linear kernel.

    • with few observations and many features, it is easy to find a separating hyperplane (so the book says…)
    
    dat = data.frame(x = Khan$xtrain,
                     y = as.factor(Khan$ytrain))
    out = svm(y ~ .,
              data = dat,
              kernel = "linear",
              cost = 10)
    summary(out)
    
    ## 
    ## Call:
    ## svm(formula = y ~ ., data = dat, kernel = "linear", cost = 10)
    ## 
    ## 
    ## Parameters:
    ##    SVM-Type:  C-classification 
    ##  SVM-Kernel:  linear 
    ##        cost:  10 
    ## 
    ## Number of Support Vectors:  58
    ## 
    ##  ( 20 20 11 7 )
    ## 
    ## 
    ## Number of Classes:  4 
    ## 
    ## Levels: 
    ##  1 2 3 4
    
    
    table(out$fitted, dat$y) # no training error
    
    
    ##    
    ##      1  2  3  4
    ##   1  8  0  0  0
    ##   2  0 23  0  0
    ##   3  0  0 12  0
    ##   4  0  0  0 20
    
    
    dat.te = data.frame(x = Khan$xtest,
                        y = as.factor(Khan$ytest))
    pred.te = predict(out, newdata = dat.te)
    table(pred.te, dat.te$y) # test error: 2 of 20 misclassified
    
    
    ##        
    ## pred.te 1 2 3 4
    ##       1 3 0 0 0
    ##       2 0 6 2 0
    ##       3 0 0 4 0
    ##       4 0 0 0 5
    