๊ด€๋ฆฌ ๋ฉ”๋‰ด

yeon's ๐Ÿ‘ฉ๐Ÿป‍๐Ÿ’ป

[CS231n] 3. Loss Functions and Optimization ๋ณธ๋ฌธ

AIFFEL ๐Ÿ‘ฉ๐Ÿป‍๐Ÿ’ป

[CS231n] 3. Loss Functions and Optimization

yeon42 2022. 1. 10. 13:20

์–ด๋–ค W๊ฐ€ ๊ฐ€์žฅ ์ข‹์€์ง€ -> ์ง€๊ธˆ ๋งŒ๋“  W๊ฐ€ ์ข‹์€์ง€ ๋‚˜์œ์ง€๋ฅผ ์ •๋Ÿ‰ํ™” ํ•  ๋ฐฉ๋ฒ• ํ•„์š”

- W๋ฅผ ์ž…๋ ฅ์œผ๋กœ ๋ฐ›์•„ ๊ฐ ์Šค์ฝ”์–ด๋ฅผ ํ™•์ธํ•˜๊ณ , ์ด W๊ฐ€ ์ง€๊ธˆ ์–ผ๋งˆ๋‚˜ bad ํ•œ์ง€ ์ •๋Ÿ‰์ ์œผ๋กœ ๋งํ•ด์ฃผ๋Š” ๊ฒƒ: Loss Function

- ํ–‰๋ ฌ W๊ฐ€ ๋  ์ˆ˜ ์žˆ๋Š” ๋ชจ๋“  ๊ฒฝ์šฐ์˜ ์ˆ˜ ์ค‘ ๊ฐ€์žฅ ๋œ bad ํ•œ W๊ฐ€ ๋ฌด์—‡์ธ์ง€ : Optimization

 

* ์ •๋‹ต score๊ฐ€ ๋‹ค๋ฅธ score๋ณด๋‹ค ๋†’์œผ๋ฉด ์ข‹๋‹ค !!!

 

 

Multi-class SVM loss

: ์—ฌ๋Ÿฌ ํด๋ž˜์Šค๋ฅผ ๋‹ค๋ฃจ๊ธฐ ์œ„ํ•œ ์ด์ง„ SVM์˜ ์ผ๋ฐ˜ํ™”๋œ ํ˜•ํƒœ

- ์†์‹คํ•จ์ˆ˜ L_i์„ ๊ตฌํ•˜๊ธฐ ์œ„ํ•ด์„  ์šฐ์„  'true์ธ ์นดํ…Œ๊ณ ๋ฆฌ'๋ฅผ ์ œ์™ธํ•œ '๋‚˜๋จธ์ง€ ์นดํ…Œ๊ณ ๋ฆฌ Y'์˜ ํ•ฉ์„ ๊ตฌํ•œ๋‹ค.

 (์ฆ‰, ๋งž์ง€ ์•Š์€ ์นดํ…Œ๊ณ ๋ฆฌ ์ „๋ถ€๋ฅผ ํ•ฉ์นจ)

- and ์˜ฌ๋ฐ”๋ฅธ ์นดํ…Œ๊ณ ๋ฆฌ์˜ score๊ณผ ์˜ฌ๋ฐ”๋ฅด์ง€ ์•Š์€ ์นดํ…Œ๊ณ ๋ฆฌ์˜ score์„ ๋น„๊ตํ•œ๋‹ค.

- If ์˜ฌ๋ฐ”๋ฅธ ์นดํ…Œ๊ณ ๋ฆฌ score > ์˜ฌ๋ฐ”๋ฅด์ง€ ์•Š์€ ์นดํ…Œ๊ณ ๋ฆฌ score

   - ์ผ์ • safety margin ์ด์ƒ์ด๋ผ๋ฉด (ex. 1)

   - then, true score๊ฐ€ ๋‹ค๋ฅธ false ์นดํ…Œ๊ณ ๋ฆฌ๋ณด๋‹ค ํ›จ์”ฌ ํฌ๋‹ค๋Š” ๋œป

   - Loss = 0, ์ด๋ฏธ์ง€ ๋‚ด ์ •๋‹ต์ด ์•„๋‹Œ ์นดํ…Œ๊ณ ๋ฆฌ์˜ ๋ชจ๋“  ๊ฐ’์„ ํ•ฉ์น˜๋ฉด ์ด ์ด๋ฏธ์ง€์˜ ์ตœ์ข… Loss๊ฐ€ ๋˜๋Š” ๊ฒƒ

   - then, ์ „์ฒด training data set์—์„œ ๊ทธ loss๋“ค์˜ ํ‰๊ท ์„ ๊ตฌํ•จ

 

 If true class์˜ score๊ฐ€ ์ œ์ผ ๋†’์œผ๋ฉด:
 	then, max(0, s_j - s_yi + 1)

 

 

 

hinge loss

- 0๊ณผ ๋‹ค๋ฅธ ๊ฐ’์˜ ์ตœ๋Œ“๊ฐ’, Max(0, value)์™€ ๊ฐ™์€ ์‹์œผ๋กœ ์†์‹คํ•จ์ˆ˜๋ฅผ ๋งŒ๋“ฌ

(์ถœ์ฒ˜: Standford University CS231n, Spring 2017)

์ •๋‹ต ์นดํ…Œ๊ณ ๋ฆฌ์˜ ์ ์ˆ˜๊ฐ€ ๋†’์•„์งˆ์ˆ˜๋ก Loss๊ฐ€ ์„ ํ˜•์ ์œผ๋กœ ์ค„์–ด๋“ ๋‹ค.

- ์ด Loss๋Š” 0์ด ๋œ ์ดํ›„์—๋„ Safety margin์„ ๋„˜์–ด์„ค ๋•Œ๊นŒ์ง€ ๋” ์ค„์–ด๋“ ๋‹ค.

  (Loss = 0 means the class was classified well.)

 

* s: ๋ถ„๋ฅ˜๊ธฐ์˜ output์œผ๋กœ ๋‚˜์˜จ ์˜ˆ์ธก๋œ score

ex) 1(๊ณ ์–‘์ด), 2(๊ฐœ)

   -> S_1(๊ณ ์–‘์ด score), S_2(๊ฐœ score)

* y_i: ์ด๋ฏธ์ง€์˜ ์‹ค์ œ ์ •๋‹ต category (์ •์ˆ˜ ๊ฐ’)

* s_y_i: training ์…‹์˜ i๋ฒˆ ์งธ ์ด๋ฏธ์ง€์˜ ์ •๋‹ต class์˜ score

 

 

 

 

(1) Cat์ด ์ •๋‹ต class ์ผ ๊ฒฝ์šฐ

(์ถœ์ฒ˜: Standford University CS231n, Spring 2017)

Cat์˜ Loss = 5.1 - 3.2 + 1(margin)

                   = 2.9 : Loss๊ฐ€ ๋ฐœ์ƒํ–ˆ๋‹ค!

Car์˜ Loss = 3.2

Frog์˜ Loss = -1.7

 

=> Cat์˜ score๋Š” Frog์˜ score๋ณด๋‹ค ํ›จ์”ฌ ํฌ๋ฏ€๋กœ Loss๋Š” 0์ด๋ผ๊ณ  ํ•  ์ˆ˜ ์žˆ๋‹ค.

 

 

๊ณ ์–‘์ด image์˜ Multiclass-SVM Loss๋Š” ์ด๋Ÿฐ ํด๋ž˜์Šค ์Œ์˜ Loss์˜ ํ•ฉ์ด ๋˜๋ฉฐ, ์ฆ‰ 2.9 + 0 = 2.9

(์ถœ์ฒ˜: Standford University CS231n, Spring 2017)

- ์—ฌ๊ธฐ์„œ 2.9๋ผ๋Š” ์ˆซ์ž๋Š” '์–ผ๋งˆ๋‚˜ ๋ถ„๋ฅ˜๊ธฐ๊ฐ€ ์ด ์ด๋ฏธ์ง€๋ฅผ badํ•˜๊ฒŒ ๋ถ„๋ฅ˜ํ•˜๋Š”์ง€' ์— ๋Œ€ํ•œ ์ฒ™๋„๊ฐ€ ๋  ๊ฒƒ

 

 

 

(2) Car์ด ์ •๋‹ต class ์ผ ๊ฒฝ์šฐ

(์ถœ์ฒ˜: Standford University CS231n, Spring 2017)

- Car vs. Cat term -> Loss = 0

- Car vs. Frog term -> Loss = 0

 

(์ถœ์ฒ˜: Standford University CS231n, Spring 2017)

 

 

 

(3) Frog์ด ์ •๋‹ต class ์ผ ๊ฒฝ์šฐ

(์ถœ์ฒ˜: Standford University CS231n, Spring 2017)
(์ถœ์ฒ˜: Standford University CS231n, Spring 2017)

 

 

 

(4) ์ตœ์ข… Loss

(์ถœ์ฒ˜: Standford University CS231n, Spring 2017)

- ์ „์ฒด training set์˜ ์ตœ์ข… Loss๋Š” ๊ฐ training image์˜ Loss๋“ค์˜ ํ‰๊ท 

: (2.9 + 0 + 12.9) / 3 = 5.27

- ์šฐ๋ฆฌ์˜ ๋ถ„๋ฅ˜๊ธฐ๊ฐ€ 5.3์ ๋งŒํผ ์ด training set์„ badํ•˜๊ฒŒ ๋ถ„๋ฅ˜ํ•˜๊ณ  ์žˆ๋‹ค๋Š” '์ •๋Ÿ‰์  ์ง€ํ‘œ' !

 

 

 

Q. margin์€ ์–ด๋–ป๊ฒŒ ์ •ํ•˜๋Š” ๊ฒƒ์ธ๊ฐ€?

- ์šฐ๋ฆฌ๋Š” ์‹ค์ œ loss function์˜ score๊ฐ€ ์ •ํ™•ํžˆ ๋ช‡์ธ์ง€๋ฅผ ์‹ ๊ฒฝ์“ฐ๋Š” ๊ฒƒ์€ ์•„๋‹ˆ๋‹ค.

- ์ค‘์š”ํ•œ ๊ฒƒ์€ ์—ฌ๋Ÿฌ score๊ฐ„์˜ ์ƒ๋Œ€์ ์ธ ์ฐจ์ด !

- ์ •๋‹ต score๊ฐ€ ๋‹ค๋ฅธ score์— ๋น„ํ•ด ์–ผ๋งˆ๋‚˜ ๋” ํฐ score์„ ๊ฐ€์ง€๊ณ  ์žˆ๋А๋ƒ! ๊ฐ€ ์ค‘์š”

 

 

 

Q1. What happens to loss if car scores change a bit?

A. Car์˜ score๋ฅผ ์กฐ๊ธˆ ๋ฐ”๊พธ๋”๋ผ๋„ Loss๊ฐ€ ๋ฐ”๋€Œ์ง€ ์•Š์„ ๊ฒƒ์ด๋‹ค.

  - SVM loss๋Š” ์˜ค์ง ์ •๋‹ต score๊ณผ ๊ทธ ์ด์™ธ์˜ score๋งŒ ๊ณ ๋ คํ–ˆ๋‹ค.

  - Car์˜ score๊ฐ€ ์ด๋ฏธ ๋‹ค๋ฅธ score๋“ค๋ณด๋‹ค ์—„์ฒญ ๋†’๊ธฐ ๋•Œ๋ฌธ์— ๊ฒฐ๊ตญ Loss๋Š” ๋ณ€ํ•˜์ง€ ์•Š๊ณ  0์ผ ๊ฒƒ์ด๋‹ค.

 

Q2. What is the min/max possible loss?

A. min = 0

    max = infinity  (you can see this from the shape of the hinge loss)

 

Q3. At initialization W is small so all s=0, what is the loss?

- ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์ดˆ๊ธฐํ™”ํ•˜๊ณ  ์ฒ˜์Œ๋ถ€ํ„ฐ ํ•™์Šต์‹œ๊ธธ ๋•Œ ๋ณดํ†ต W๋ฅผ ์ž„์˜์˜ ์ž‘์€ ์ˆ˜๋กœ ์ดˆ๊ธฐํ™”์‹œํ‚ค๋Š”๋ฐ, 

- ์ด๋ ‡๊ฒŒ ๋˜๋ฉด ์ฒ˜์Œ ํ•™์Šต ์‹œ ๊ฒฐ๊ณผ score๊ฐ€ ์ž„์˜์˜ ์ผ์ • ๊ฐ’์„ ๊ฐ–๊ฒŒ ๋œ๋‹ค.

A. number of classes - 1

- Loss๋ฅผ ๊ณ„์‚ฐํ•  ๋•Œ ์ •๋‹ต์ด ์•„๋‹Œ class๋“ค์„ ์ˆœํšŒํ•œ๋‹ค. = C-1 class๋“ค์„ ์ˆœํšŒํ•จ

- ๋น„๊ตํ•˜๋Š” ๋‘ score๊ฐ€ ๊ฑฐ์˜ ๋น„์Šทํ•˜๋‹ˆ margin ๋•Œ๋ฌธ์— 1 score๋ฅผ ์–ป๊ฒŒ ๋  ๊ฒƒ

- Loss = C-1
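
A quick sanity check of this point, as a minimal sketch assuming 3 classes and all scores exactly 0 (the variable names are mine):

num_classes = 3
scores = [0.0] * num_classes          # all scores are ~0 right after initialization
y = 0                                 # index of the correct class
loss = sum(max(0, s - scores[y] + 1) for j, s in enumerate(scores) if j != y)
print(loss)                           # 2.0, i.e. number of classes - 1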

 

Q4. What if the sum was over all classes? (including j=y_i)

(SVM์€ ์ •๋‹ต์ธ class ๋นผ๊ณ  ๋‹ค ๋”ํ•œ ๊ฒƒ์ธ๋ฐ, ์ •๋‹ต์ธ class๋„ ๋”ํ•˜๋ฉด ์–ด๋–ป๊ฒŒ ๋ ๊นŒ?)

A. Loss์— 1์ด ๋” ์ฆ๊ฐ€ํ•  ๊ฒƒ

 

Q5. What if we used mean instead of sum?

A. ์˜ํ–ฅ์„ ์•ˆ ๋ฏธ์นœ๋‹ค. (scale๋งŒ ๋ณ€ํ•  ๋ฟ)

 

Q6. What if we used the squared hinge loss, the sum of max(0, s_j - s_y_i + 1)^2, instead?

A. The result would be different: squaring penalizes large margin violations much more heavily.

- Squared hinge loss is actually used in practice sometimes.

 

 

 

 

Multiclass SVM Loss: Example code

import numpy as np

def L_i_vectorized(x, y, W):
    scores = W.dot(x)
    margins = np.maximum(0, scores - scores[y] + 1)
    margins[y] = 0  # zero out only the correct-class entry of the max result,
                    # a vectorized trick that avoids looping over the classes explicitly
    loss_i = np.sum(margins)
    return loss_i

- numpy๋ฅผ ์ด์šฉํ•˜๋ฉด loss function์„ ์ฝ”๋”ฉํ•  ์ˆ˜ ์žˆ๋‹ค. (easy)

- max๋กœ ๋‚˜์˜จ ๊ฒฐ๊ณผ์—์„œ ์ •๋‹ต class๋งŒ 

 

 

E.g. Suppose that we found a W such that L = 0. Is that W unique?

A. No, other W's exist as well, e.g. 2W also has L = 0.

-> the margins would simply be doubled (a quick check follows below).

 

(์ถœ์ฒ˜: Standford University CS231n, Spring 2017)

 

 

์šฐ๋ฆฌ๋Š” ์˜ค์ง data์˜ loss์—๋งŒ ์‹ ๊ฒฝ์„ ์“ฐ๊ณ  ์žˆ๊ณ , ๋ถ„๋ฅ˜๊ธฐ์—๊ฒŒ training data์— ๊ผญ ๋งž๋Š” W๋ฅผ ์ฐพ์œผ๋ผ๊ณ  ๋งํ•˜๋Š” ๊ฒƒ๊ณผ ๊ฐ™๋‹ค.

-> ์‹ค์ œ ์šฐ๋ฆฌ๋Š” training data์— ์–ผ๋งˆ๋‚˜ ๊ผญ ๋งž๋Š”์ง€๋Š” ์ „ํ˜€ ์‹ ๊ฒฝ์“ฐ์ง€ x

-> training data๋ฅผ ์ด์šฉํ•ด ์–ด๋–ค ๋ถ„๋ฅ˜๊ธฐ๋ฅผ ์ฐพ๊ณ , ์ด๋ฅผ test data์— ์ ์šฉํ•  ๊ฒƒ

-> test data์—์„œ์˜ ์„ฑ๋Šฅ์ด ์ค‘์š”ํ•˜๋‹ค.

-> training data์—์„œ์˜ Loss๋งŒ ์‹ ๊ฒฝ์“ด๋‹ค๋ฉด ์ข‹์ง€ x

 

 

 

Regularization

(์ถœ์ฒ˜: Standford University CS231n, Spring 2017)

- ๋ชจ๋ธ์ด ์ข€ ๋” ๋‹จ์ˆœํ•œ W์„ ์ฐพ๋„๋ก ํ•จ

- Loss Function์€ Data Loss์™€ Regularization์˜ ๋‘ ๊ฐ€์ง€ ํ•ญ์„ ๊ฐ€์ง„๋‹ค.

   - ๋žŒ๋‹ค: ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ; ์‹ค์ œ ๋ชจ๋ธ ํ›ˆ๋ จํ•  ๋•Œ ๊ณ ๋ คํ•ด์•ผ ํ•  ์ค‘์š”ํ•œ ์š”์†Œ

 

์ข…๋ฅ˜

(์ถœ์ฒ˜: Standford University CS231n, Spring 2017)

 

- Regularization์€ ๋ชจ๋ธ์ด training data set์— ์™„๋ฒฝํžˆ fitํ•˜์ง€ ๋ชปํ•˜๋„๋ก ๋ชจ๋ธ์˜ ๋ณต์žก๋„์— penalty๋ฅผ ๋ถ€์—ฌํ•˜๋Š” ๋ฐฉ๋ฒ•

(overfitting์„ ๋ฐฉ์ง€ํ•˜๋Š” ๊ฒƒ์ด๋ผ๊ณ  ์•Œ๊ณ  ์žˆ์Œ)

 

 

 

 

Softmax

(์ถœ์ฒ˜: Standford University CS231n, Spring 2017)

 - score์„ ์ „๋ถ€ ์‚ฌ์šฉํ•˜๋Š”๋ฐ, score๋“ค์— ์ง€์ˆ˜๋ฅผ ์ทจํ•ด ์–‘์ˆ˜๊ฐ€ ๋˜๊ฒŒ ๋งŒ๋“ ๋‹ค.

- ๊ทธ ์ง“๋“ค์˜ ํ•ฉ์œผ๋กœ ๋‹ค์‹œ ์ •๊ทœํ™” ์‹œํ‚ด

=> softmax ํ•จ์ˆ˜๋ฅผ ๊ฑฐ์น˜๊ฒŒ ๋˜๋ฉด ํ™•๋ฅ  ๋ถ„ํฌ๋ฅผ ์–ป์„ ์ˆ˜ ์žˆ์Œ

   (ํ•ด๋‹น ํด๋ž˜์Šค์ผ ํ™•๋ฅ )

 

(์ถœ์ฒ˜: Standford University CS231n, Spring 2017)

- softmax์—์„œ ๋‚˜์˜จ ํ™•๋ฅ ์ด ์ •๋‹ต class์— ํ•ด๋‹นํ•˜๋Š” ํ™•๋ฅ ์„ 1๋กœ ๋‚˜ํƒ€๋‚˜๊ฒŒ ํ•˜๋Š” ๊ฒƒ

- ์šฐ๋ฆฌ๊ฐ€ ์›ํ•˜๋Š” ๊ฒƒ) ์ •๋‹ต class์— ํ•ด๋‹นํ•˜๋Š” class์˜ ํ™•๋ฅ ์ด 1์— ๊ฐ€๊น๊ฒŒ ๊ณ„์‚ฐ๋˜๋Š” ๊ฒƒ

   - Loss๋Š” '-log(์ •๋‹ตclassํ™•๋ฅ )' ์ด ๋  ๊ฒƒ

   - Loss function์€ '์–ผ๋งˆ๋‚˜ ์ข‹์€์ง€'๊ฐ€ ์•„๋‹ˆ๋ผ '์–ผ๋งˆ๋‚˜ badํ•œ์ง€'๋ฅผ ์ธก์ •ํ•˜๋Š” ๊ฒƒ์ด๊ธฐ ๋•Œ๋ฌธ์— ๋งˆ์ด๋„ˆ์Šค(-)๋ฅผ ๋ถ™์ธ๋‹ค.

 

 

(์ถœ์ฒ˜: Standford University CS231n, Spring 2017)

- score๊ฐ€ ์žˆ์œผ๋ฉด softmax๋ฅผ ๊ฑฐ์น˜๊ณ , ๋‚˜์˜จ ํ™•๋ฅ  ๊ฐ’์— -log๋ฅผ ์ถ”๊ฐ€ํ•ด์ฃผ๋ฉด ๋œ๋‹ค.

 

 

 

 

๊ณ ์–‘์ด ์˜ˆ์ œ๋กœ ๋Œ์•„๊ฐ„๋‹ค๋ฉด, (linear classifier์˜ output์œผ๋กœ ๋‚˜์˜จ / SVM Loss์˜ ๊ฒฐ๊ณผ) score ์ž์ฒด๋ฅผ ์“ฐ๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๋ผ ์ง€์ˆ˜ํ™” ์‹œ์ผœ ์“ฐ์ž.

-> ํ•ฉ์ด 1์ด ๋˜๋„๋ก ์ •๊ทœํ™” ์‹œ์ผœ ์ฃผ๊ธฐ

-> ์ •๋‹ต score์—๋งŒ -log๋ฅผ ์ทจํ•ด์ฃผ๊ธฐ : "softmax" / "multinomial logistic regression"

(์ถœ์ฒ˜: Standford University CS231n, Spring 2017)

 

Q. What is the min/max possible loss L_i?

A. min = 0, max = infinity

 

- ์šฐ๋ฆฌ๋Š” ์ •๋‹ต class์˜ ํ™•๋ฅ ์€ 1์ด ๋˜๊ธฐ๋ฅผ, ์ •๋‹ต์ด ์•„๋‹Œ class์˜ ํ™•๋ฅ ์€ 0์ด ๋˜๊ธธ ์›ํ•œ๋‹ค.

- ์ฆ‰, log ์•ˆ์˜ ์–ด๋–ค ๊ฐ’์€ 1์ด ๋˜์–ด์•ผ ํ•œ๋‹ค. => -log(1) = 0

   - ๊ณ ์–‘์ด๋ฅผ ์™„๋ฒฝํ•˜๊ฒŒ ๋ถ„๋ฅ˜ํ–ˆ๋‹ค๋ฉด Loss=0 ์ผ ๊ฒƒ

 

Q. Loss=0์ด๋ผ๋ฉด ์‹ค์ œ score์€ ์–ด๋–ป๊ฒŒ ๋˜์–ด์•ผ ํ• ๊นŒ?

A. score๋Š” infinity์— ๊ฐ€๊นŒ์›Œ์•ผ ํ•  ๊ฒƒ์ด๋‹ค.

 

- ๋งŒ์•ฝ ์ •๋‹ต class์˜ ํ™•๋ฅ ์ด 0 => -log(0) : + infinity (๋ถˆ๊ฐ€๋Šฅ, ์ง€์ˆ˜=0์ด ๋  ์ˆ˜ ์—†๋‹ค.)

- '์œ ํ•œ ์ •๋ฐ€๋„'๋ฅผ ๊ฐ€์ง€๊ณ  ์ตœ๋Œ“๊ฐ’(๋ฌดํ•œ๋Œ€), ์ตœ์†Ÿ๊ฐ’(0)์— ๋„๋‹ฌํ•  ์ˆ˜ ์—†๋‹ค.

 

 

*

์ตœ์ข… Loss Function์ด ์ตœ์†Œ๊ฐ€ ๋˜๊ฒŒ ํ•˜๋Š” ๊ฐ€์ค‘์น˜ ํ–‰๋ ฌ์ด์ž ํŒŒ๋ผ๋ฏธํ„ฐ์ธ ํ–‰๋ ฌ W๋ฅผ ๊ตฌํ•˜๊ฒŒ ๋˜๋Š” ๊ฒƒ

-> ์–ด๋–ป๊ฒŒ Loss๋ฅผ ์ค„์ด๋Š” W๋ฅผ ์ฐพ๋Š” ๊ฒƒ ??

-> "์ตœ์ ํ™”(Optimization)"

 

 

Optimization

(1) A first very bad idea solution: Random search

import numpy as np

# assume X_train is the data where each column is an example (e.g. 3073 x 50,000)
# assume Y_train are the labels (e.g. 1D array of 50,000)
# assume the function L evaluates the loss function

bestloss = float("inf")  # Python assigns the highest possible float value
for num in range(1000):
    W = np.random.randn(10, 3073) * 0.0001  # generate random parameters
    loss = L(X_train, Y_train, W)
    if loss < bestloss:  # keep track of the best solution
        bestloss = loss
        bestW = W
    print('in attempt %d the loss was %f, best %f' % (num, loss, bestloss))

- ์ž„์˜๋กœ ์ƒ˜ํ”Œ๋งํ•œ W๋“ค์„ ์—„์ฒญ ๋งŽ์ด ๋ชจ์•„๋‘๊ณ  Loss๋ฅผ ๊ณ„์‚ฐํ•˜์—ฌ ์–ด๋–ค W๊ฐ€ ์ข‹์€์ง€ ์‚ดํŽด๋ณด๋Š” ๊ฒƒ

 

Let's see how well this works on the test set ...

# Assume X_test is [3073 x 10000], Y_test [10000 x 1]
scores = Wbest.dot(Xte_cols) # 10 x 10000, the class scores for all test examples

# find the index with max score in each column (the predicted class)
Yte_predict = np.argmax(scores, axis=0)

# and calculate accuracy (fraction of predictions that are correct)
np.mean(Yte_predict == Yte)

# returns 0.1555

- CIFAR-10์—์„œ class๋Š” 10๊ฐœ์ด๋ฏ€๋กœ ์ž„์˜ ํ™•๋ฅ ์€ 10%๊ฐ€ ๋˜๊ณ ,

  random search๋ฅผ ๊ฑฐ์น˜๊ฒŒ ๋˜๋ฉด  15%์˜ ์ •ํ™•๋„๋ฅผ ๋ณด์ž„

 

 

(2) Follow the slope: Local geometry

- This is the approach generally used to train neural networks and linear classifiers.

 

* ๊ฒฝ์‚ฌ(slope): 1์ฐจ์› ๊ณต๊ฐ„์—์„œ๋Š” ์–ด๋–ค ํ•จ์ˆ˜์— ๋Œ€ํ•œ ๋ฏธ๋ถ„๊ฐ’

= ๋„ํ•จ์ˆ˜(derivative)

(์ถœ์ฒ˜: Standford University CS231n, Spring 2017)

-> ๋‹ค๋ณ€์ˆ˜ ํ•จ์ˆ˜์—์„œ๋„ ํ™•์žฅ ๊ฐ€๋Šฅ

- ๋‹ค๋ณ€์ˆ˜์—์„œ ๋ฏธ๋ถ„์œผ๋กœ ์ผ๋ฐ˜ํ™” ์‹œํ‚จ ๊ฒƒ์ด gradient

 

gradient์˜ ๋ฐฉํ–ฅ: ํ•จ์ˆ˜์—์„œ '๊ฐ€์žฅ ๋งŽ์ด ์˜ฌ๋ผ๊ฐ€๋Š” ๋ฐฉํ–ฅ'

 

 

 

ํฌ๊ณ  ๋ณต์žกํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ํ•™์Šต์‹œํ‚ค๋Š” ๋ฐฉ๋ฒ•

Gradient Descent

while True:
    weights_grad = evaluate_gradient(loss_func, data, weights)
    weights += - step_size * weights_grad # perform parameter update

- ์šฐ์„  W๋ฅผ ์ž„์˜์ด ๊ฐ’์œผ๋กœ ์ดˆ๊ธฐํ™”์‹œํ‚ด

- then, Loss์™€ gradient๋ฅผ ๊ณ„์‚ฐํ•œ ๋’ค ๊ฐ€์ค‘์น˜๋ฅผ gradient์˜ ๋ฐ˜๋Œ€ ๋ฐฉํ–ฅ์œผ๋กœ update

   - gradient๊ฐ€ ํ•จ์ˆ˜์—์„œ ์ฆ๊ฐ€ํ•˜๋Š” ๋ฐฉํ–ฅ์ด๊ธฐ ๋•Œ๋ฌธ์— -gradient๋ฅผ ํ•ด์•ผ ๋‚ด๋ ค๊ฐ€๋Š” ๋ฐฉํ–ฅ์ด ๋จ

   - ๋ฐ˜๋ณตํ•˜๋‹ค๋ณด๋ฉด ๊ฒฐ๊ตญ ์ˆ˜๋ ดํ•  ๊ฒƒ

- step size๋Š” ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ; gradient ๋ฐฉํ–ฅ์œผ๋กœ ์–ผ๋งˆ๋‚˜ ๋‚˜์•„๊ฐ€์•ผ ํ•˜๋Š”์ง€๋ฅผ ์•Œ๋ ค์คŒ

   - learning rate๋ผ๊ณ ๋„ ํ•˜๋ฉฐ, ๋งค์šฐ ์ค‘์š” !

 

 

Stochastic Gradient Descent (SGD)

- Loss Function์—์„œ ์šฐ๋ฆฌ๋Š” ์ „์ฒด training set๋“ค์˜ loss์˜ ํ‰๊ท ์„ ๊ตฌํ–ˆ๋‹ค.

- but, ์‹ค์ œ๋กœ N์ด ์—„์ฒญ๋‚˜๊ฒŒ ์ปค์งˆ ์ˆ˜ ์žˆ๋‹ค.

(์ถœ์ฒ˜: Standford University CS231n, Spring 2017)

-> Loss๋ฅผ ๊ณ„์‚ฐํ•˜๋Š” ๊ฒƒ์ด ๊ต‰์žฅํžˆ ์˜ค๋ž˜ ๊ฑธ๋ฆฌ๋Š” ์ž‘์—…์ด ๋จ (์ˆ˜๋ฐฑ๋งŒ ๋ฒˆ์˜ ๊ณ„์‚ฐ)

# vanilla minibatch gradient descent

while True:
    data_batch = sample_training_data(data, 256) # sample 256 examples
    weights_grad = evaluate_gradient(loss_func, data_batch, weights)
    weights += - step_size * weights_grad # perform parameter update

- ์ „์ฒด data set์˜ gradient์™€ loss๋ฅผ ๊ณ„์‚ฐํ•˜๊ธฐ ๋ณด๋‹ค๋Š”, minibatch๋ผ๋Š” ์ž‘์€ training sample ์ง‘ํ•ฉ์œผ๋กœ ๋‚˜๋ˆ„์–ด ํ•™์Šตํ•จ

- ๋ณดํ†ต 2์˜ ์Šน์ˆ˜ (32, 64, 128)

- ์ด ์ž‘์€ minibatch๋ฅผ ์ด์šฉํ•ด loss์˜ ์ „์ฒด ํ•ฉ์˜ '์ถ”์ •์น˜'์™€ ์‹ค์ œ gradient์˜ '์ถ”์ •์น˜'๋ฅผ ๊ณ„์‚ฐ

 
