
yeon's 👩🏻‍💻

Going over decision trees (Decision Tree) again to celebrate taking the class, hoho

Computer 💻/Machine Learning


yeon42 2021. 11. 3. 19:00

https://bkshin.tistory.com/entry/%EB%A8%B8%EC%8B%A0%EB%9F%AC%EB%8B%9D-4-%EA%B2%B0%EC%A0%95-%ED%8A%B8%EB%A6%ACDecision-Tree

 

Machine Learning - 4. Decision Tree

A decision tree (Decision Tree; in Korean also 의사결정트리 or 의사결정나무) is a supervised learning model that can handle both classification and regression. Like a game of twenty questions, a decision tree takes yes/no questions ...

bkshin.tistory.com

Studying by copying out the blog post above

 

* All text and images here come from the blog post above.

 


 

* Decision Tree

 

: a model that separates data according to specific criteria (questions)

- Each split divides the variable space into two regions

 

Source: ratsgo's blog

 

Source: TensorFlow Blog (텐서 플로우 블로그)

- The criteria are the questions that separate the data best

- Splitting too many times leads to overfitting, as in the figure above.

   - Fitting a decision tree without setting any parameters leads to overfitting

 

 

 

 

Pruning

: a strategy for preventing overfitting

 

- Limit the maximum depth of the tree, the maximum number of terminal nodes, or the minimum number of samples a node needs in order to split

- The min_samples_split parameter sets the minimum number of samples a node must hold before it can be split

ex. with min_samples_split=10, a node holding fewer than 10 samples is not split any further

- max_depth sets the maximum depth

ex. with max_depth=4, the tree never grows branches deeper than 4
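
To make the two limits concrete, here is a minimal sketch of the stopping check in plain Python. The function name `may_split` is made up for illustration, and the meaning of `min_samples_split` (the minimum number of samples a node needs before it may split) is assumed to follow the scikit-learn convention; a real library does more than this.

```python
def may_split(n_samples, depth, max_depth=4, min_samples_split=10):
    """Return True if a node is still allowed to split under both limits.

    Hypothetical helper: only the two pruning limits from the notes are checked.
    """
    if depth >= max_depth:             # the node already sits at the depth limit
        return False
    if n_samples < min_samples_split:  # too few samples left to justify a split
        return False
    return True


print(may_split(n_samples=25, depth=2))  # True
print(may_split(n_samples=9, depth=2))   # False
print(may_split(n_samples=25, depth=4))  # False
```

A larger min_samples_split or a smaller max_depth stops the tree earlier, which is the whole point of pruning.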

 

 

 

 

 

 

Entropy, Impurity

 

- Impurity: how much different kinds of data are mixed together within a category

Source: ratsgo's blog

- In the figure above, the upper category has low impurity and the lower category has high impurity.

 

- Impurity is lowest when a category holds only one kind of data, and highest when two different kinds are mixed exactly half and half

- A decision tree learns in the direction that minimizes impurity

 

- Entropy: a numerical measure of impurity

- High entropy = high impurity / low entropy = low impurity

- Entropy of 1 = maximum impurity for two classes (the two kinds of data are mixed exactly half and half)

- Entropy of 0 = minimum impurity (the category holds only one kind of data)

 

Source: the blog above

(Pi = the proportion of the data inside a region that belongs to category i)
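
The formula image did not survive here, so as a reminder, this is the standard Shannon entropy that matches the definition of Pi above and the values used in these notes (0 for a pure category, 1 for a 50/50 mix of two classes). The symbol k, the number of categories, is my own notation.

```latex
\text{Entropy} = -\sum_{i=1}^{k} p_i \log_2 p_i
```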

 

 

 

(Example)

Source: the blog above

- P_slow = 2/4 = 0.5

- P_fast = 2/4 = 0.5

->> Entropy: 1 (exactly half and half.)
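
The same number can be checked with a few lines of Python. This is only a sketch: the function name `entropy` and the list-of-labels input are my own choices, and the log is base 2 so that a 50/50 mix of two classes comes out as 1.

```python
import math
from collections import Counter


def entropy(labels):
    """Base-2 Shannon entropy of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    # Each class contributes -p * log2(p), where p is its share of the node.
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


print(entropy(["slow", "slow", "fast", "fast"]))  # 1.0
```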

 

 

 

 

 

Information gain

: the entropy before a split minus the entropy after the split

- If the entropy goes from 1 to 0.7, the information gain is 0.3

 

Information gain = entropy(parent) - [weighted average]entropy(children)

 

- entropy(parent): the entropy before the split

- entropy(children): the entropy after the split

- [weighted average]entropy(children): the weighted average of entropy(children)

 

- Why the post-split entropy is a weighted average: a split breaks the data into two or more categories (child nodes)

- With a single category, the entropy formula above gives the entropy directly, but

- with two or more categories, the post-split entropy is the weighted average of the children's entropies, where each child is weighted by its share of the parent's data

 

 

- The decision tree algorithm learns in the direction that maximizes information gain.

- At each step it works out which feature, at which split point, gives the largest information gain, and splits there
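
The formula above can be sketched in Python as follows. The names `entropy` and `information_gain` and the "list of child label lists" input are assumptions made for illustration, not anything taken from the source blog.

```python
import math
from collections import Counter


def entropy(labels):
    """Base-2 Shannon entropy of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def information_gain(parent, children):
    """entropy(parent) minus the weighted average entropy of the children.

    Each child is weighted by the fraction of the parent's samples it holds.
    """
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["a", "a", "b", "b"]
print(information_gain(parent, [["a", "a"], ["b", "b"]]))  # perfect split
print(information_gain(parent, [["a", "b"], ["a", "b"]]))  # useless split
```

A split that separates the classes cleanly recovers all of the parent's entropy, while a split that leaves every child as mixed as the parent gains nothing.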

 

 

 

 

 


 

 

 

(Split on slope)

Source: the blog above

- There are 3 steep samples, with speeds slow, slow, fast

- There is 1 flat sample, with speed fast

   --->>> entropy(flat) = 0

 

* On the steep branch,

- 2 slow --> P_slow=2/3

- 1 fast --> P_fast=1/3

   --->>> entropy(steep) = 0.9184 (using the formula)
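
Written out, the "using the formula" step looks like this, assuming the usual base-2 entropy. It agrees with the value above to three decimal places.

```latex
\begin{aligned}
\text{entropy(steep)} &= -\tfrac{2}{3}\log_2\tfrac{2}{3} - \tfrac{1}{3}\log_2\tfrac{1}{3} \\
&\approx 0.390 + 0.528 \\
&\approx 0.918
\end{aligned}
```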

 

 

* What is the weighted average over the nodes after the split?

[weighted average] entropy(children) = weight of steep * entropy(steep) + weight of flat * entropy(flat)

= 3/4 * 0.9184 + 1/4 * 0

= 0.6888

 

Information gain = entropy(parent) - [weighted average] entropy(children)

= 1 - 0.6888

= 0.3112

 

- In other words, splitting on the slope feature yields an information gain of 0.3112

 

 

 

 

(Split on surface)

Source: the blog above

- There are 2 bumpy samples, with speeds slow, fast

- There are 2 smooth samples, with speeds slow, fast

-->> Each category holds the two kinds of data exactly half and half, so entropy(bumpy) = entropy(smooth) = 1

 

information gain = entropy(parent) - [weighted average] entropy(children)

= 1 - (2/4) * 1 - (2/4) * 1

= 0

 

- So splitting on surface yields no information gain at all

 

 

 

 

 

(์†๋„ ์ œํ•œ ๊ธฐ์ค€ ๋ถ„๊ธฐ)

์ถœ์ฒ˜: ์œ„ ๋ธ”๋กœ๊ทธ

- yes์˜ ๊ฒฝ์šฐ slow, slow

- no์˜ ๊ฒฝ์šฐ fast, fast

--->> entropy(yes) = entropy(no) = 0

 

์ฆ‰, information gain = 1

 

 


 

 

- Splitting on slope, surface, and speed limit gives information gains of 0.3112, 0, and 1 respectively.

- A decision tree learns in the direction of the largest information gain, so it picks speed limit as the first split.

- It keeps splitting this way until it reaches the limits set by max_depth or min_samples_split.
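
The whole comparison can be reproduced with a short sketch. The four rows below are reconstructed from the splits described above (the original table was an image, so the row order and the key names `slope`, `surface`, `speed_limit`, `speed` are my assumptions), and the helper names are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Assumed reconstruction of the 4-sample example: 3 steep / 1 flat,
# 2 bumpy / 2 smooth, 2 with a speed limit / 2 without.
ROWS = [
    {"slope": "steep", "surface": "bumpy",  "speed_limit": "yes", "speed": "slow"},
    {"slope": "steep", "surface": "smooth", "speed_limit": "yes", "speed": "slow"},
    {"slope": "flat",  "surface": "bumpy",  "speed_limit": "no",  "speed": "fast"},
    {"slope": "steep", "surface": "smooth", "speed_limit": "no",  "speed": "fast"},
]


def entropy(labels):
    """Base-2 Shannon entropy of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def information_gain(rows, feature, target="speed"):
    """Entropy of the parent minus the weighted entropy of the children."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature]].append(row[target])  # one child per feature value
    parent = [row[target] for row in rows]
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(parent) - weighted


gains = {f: information_gain(ROWS, f) for f in ("slope", "surface", "speed_limit")}
best = max(gains, key=gains.get)  # feature with the largest gain

for name, gain in gains.items():
    print(f"{name}: {gain:.4f}")
print("first split:", best)
```

Without intermediate rounding the slope gain comes out at about 0.3113; the 0.3112 in the hand calculation comes from rounding entropy(steep) to 0.9184 first. Either way, speed limit wins the first split.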

 

And that is the overall decision tree algorithm !!

 

 

 

 

 

 


์ฒ˜์Œ ์œ„ ๋ธ”๋กœ๊ทธ๋กœ ๊ฒฐ์ • ํŠธ๋ฆฌ๋ฅผ ๊ณต๋ถ€ํ–ˆ์„ ๋•Œ, ์ •๋ณด ํš๋“ ๋ถ€๋ถ„์ด ์†”์งํžˆ ์ „ํ˜€ ์ดํ•ด๊ฐ€ ๊ฐ€์ง€ ์•Š์•˜์—ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‹ค ์ตœ๊ทผ ํ•™๊ต ์ˆ˜์—…์—์„œ ๋ฐฐ์šฐ๊ณ  ๋‹ค์‹œ ๋ณต์Šต์„ ํ•ด๋ณด๋‹ˆ ์ดํ•ด๊ฐ€ ์™์™ .. ๋„˜ ์žฌ๋ฐŒ๋‹ค ์ด๋Ÿฐ ๊ณต๋ถ€ ํ—ˆํ—ˆ

 
