Naive Bayes (Naive Bayes Classification)

yeon42 2021. 11. 5. 18:13

https://bkshin.tistory.com/entry/%EB%A8%B8%EC%8B%A0%EB%9F%AC%EB%8B%9D-1%EB%82%98%EC%9D%B4%EB%B8%8C-%EB%B2%A0%EC%9D%B4%EC%A6%88-%EB%B6%84%EB%A5%98-Naive-Bayes-Classification

 

Machine Learning - 1. Naive Bayes Classification (bkshin.tistory.com)

Study notes transcribed from the blog above.

* All text and images are taken from the blog above.

 


 

 

Naive Bayes is a classification technique widely used for spam filtering, text classification, sentiment analysis, recommender systems, and similar tasks.

 

With machine learning, given a photo of an animal, we can tell whether the animal is a dog, a cat, or a zebra.

Beforehand, the model is trained on a large number of dog, cat, and zebra photos covering a variety of poses, expressions, appearances, fur colors, and so on.

The trained model can then classify dogs, cats, and zebras accurately, not only in the photos used for training but also in new photos.

Training a model on previously collected, labeled data in this way is called supervised learning.

 

The first step in supervised learning is to identify the features and the label.

- Label: the classification result we want (ex. dog, cat, zebra)

- Feature: a factor that influences the label (ex. the animal's pose, expression, appearance, fur color)

 

In other words, the task is to classify an animal as a dog, a cat, or a zebra (label) based on its pose, expression, appearance, and fur color (features).

 

 

Naive Bayes classification is also a form of supervised learning.

- It therefore needs features and a label.

- Its defining trait is that it uses Bayes' theorem to classify the label from the features.

- It also assumes that all features are independent of one another.
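In symbols, the last two points can be sketched as follows. The notation is an assumption for illustration and does not appear in the source: $y$ stands for the label and $x_1, \dots, x_n$ for the features.

```latex
P(y \mid x_1, \dots, x_n)
  = \frac{P(x_1, \dots, x_n \mid y)\, P(y)}{P(x_1, \dots, x_n)},
\qquad
P(x_1, \dots, x_n \mid y) = \prod_{i=1}^{n} P(x_i \mid y)
```

The first equation is Bayes' theorem; the second is the independence assumption, applied within each label.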

 

 

 

 

Classification Workflow

 

- The first step in classification is to identify the features and the label.

 

label -> whether or not the mail is spam

feature -> advertising words, profanity, sexual terms, and the like appearing in the subject and body of the mail
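As a rough illustration of turning a mail into features, the sketch below counts keyword hits in the subject and body. The keyword lists and the function name `extract_features` are hypothetical, not from the source; a real filter would use a much larger vocabulary.

```python
import re

# Hypothetical keyword lists, for illustration only.
AD_WORDS = {"free", "discount", "winner"}
PROFANITY = {"damn"}

def extract_features(subject: str, body: str) -> dict:
    """Count how many tokens of the mail fall into each keyword group."""
    tokens = re.findall(r"[a-z']+", f"{subject} {body}".lower())
    return {
        "ad_word_count": sum(t in AD_WORDS for t in tokens),
        "profanity_count": sum(t in PROFANITY for t in tokens),
    }

features = extract_features("You are a WINNER", "Claim your free discount now")
label = "spam"  # the label comes from a human annotator; it is not computed
print(features, label)
```

Each training example is then one (features, label) pair.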

 

 

 

Classification is split into two phases: a training phase and a test phase.

- In the training phase, the classifier model is trained on the given training data set;

  in the test phase, the performance of the classifier model is evaluated.

- Performance can be measured with accuracy, precision, recall, and similar metrics.
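The three metrics can be computed directly from true and predicted labels. The following is a minimal sketch with made-up labels; the function name `evaluate` and the sample data are assumptions for illustration.

```python
def evaluate(y_true, y_pred, positive="spam"):
    """Return (accuracy, precision, recall) for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0  # of predicted positives, how many are right
    recall = tp / (tp + fn) if tp + fn else 0.0     # of actual positives, how many are found
    return accuracy, precision, recall

# Made-up test-phase labels and predictions.
y_true = ["spam", "ham", "spam", "ham", "spam"]
y_pred = ["spam", "spam", "spam", "ham", "ham"]
accuracy, precision, recall = evaluate(y_true, y_pred)
print(accuracy, precision, recall)
```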

 

Source: DataCamp

 

 

 

 


 

 

 

What is a Naive Bayes classifier?

 

- Naive Bayes classification is a statistical classification technique based on Bayes' theorem.

- It is one of the simplest supervised learning methods.

- It is fast and often reasonably accurate, and it stays fast on large datasets.

 

- Naive Bayes requires the features to be independent of one another.

- That is, in spam classification, the number of advertising words and the number of profane words must not be related to each other.

 

 

 

 

How Naive Bayes works

 

ex. Data on the weather and whether a soccer match was played -> compute the probability of playing or not playing soccer based on the weather

 

Source: DataCamp

 

- Leftmost table: past data showing whether soccer was played under each weather condition

- The goal is to train a model on this past data, then use the model to decide whether soccer will be played for a given weather condition

 

- Frequency Table: the past data expressed as counts

- Likelihood Table 1: the probability of each feature value (weather) and of each label (played or not)

- Likelihood Table 2: the conditional probability of each feature value given the label (the source calls this the posterior probability)
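The tables can be rebuilt from raw observations with a few counters. The table image is missing here, so the 14 rows below are an assumed reconstruction chosen to match the counts quoted in these notes (4 Overcast days, 9 Yes, 5 No); only the Overcast figures are confirmed by the text.

```python
from collections import Counter

# Assumed reconstruction of the 14 past observations (not confirmed by the source).
weather = ["Sunny", "Sunny", "Overcast", "Rainy", "Rainy", "Rainy", "Overcast",
           "Sunny", "Sunny", "Rainy", "Sunny", "Overcast", "Overcast", "Rainy"]
play = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
        "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]

n = len(play)
frequency = Counter(zip(weather, play))  # Frequency Table: (weather, label) counts
weather_counts = Counter(weather)
label_counts = Counter(play)

# Likelihood Table 1: probability of each weather value and of each label.
p_weather = {w: c / n for w, c in weather_counts.items()}
p_label = {y: c / n for y, c in label_counts.items()}

# Likelihood Table 2: P(weather | label).
likelihood = {(w, y): frequency[(w, y)] / label_counts[y]
              for w in weather_counts for y in label_counts}

print(frequency[("Overcast", "Yes")], p_label["Yes"], likelihood[("Overcast", "No")])
```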

 

 

 

* Naive Bayes classification with a single feature

 

Q1. What is the probability of playing when the weather is overcast?

P(Yes|Overcast) = P(Overcast|Yes)*P(Yes) / P(Overcast) <- by Bayes' theorem

 

1. Prior probabilities

P(Overcast) = 4/14 = 0.29

P(Yes) = 9/14 = 0.64

 

2. Likelihood (the conditional probability of the feature given the label)

P(Overcast|Yes) = 4/9 = 0.44

 

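3. Plug into Bayes' theorem

P(Yes|Overcast) = 0.44*0.64 / 0.29 = 0.98 in the source (about 0.97 with these rounded inputs; exactly 1 with the unrounded fractions, since (4/9)*(9/14) / (4/14) = 1)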

 

 

 

Q2. What is the probability of not playing when the weather is overcast?

P(No|Overcast) = P(Overcast|No)*P(No) / P(Overcast)

 

1. Prior probabilities

P(Overcast) = 4/14 = 0.29

P(No) = 5/14 = 0.36

 

2. Likelihood

P(Overcast|No) = 0/5 = 0

 

3. Plug into Bayes' theorem

P(No|Overcast) = 0*0.36 / 0.29 = 0

 

 

-> P(Yes|Overcast) = 0.98, P(No|Overcast) = 0

That is, when the weather is Overcast, the probability of playing soccer is 0.98 and the probability of not playing is 0.

-> Compare the two probabilities and classify as the label with the higher one.

-> The Naive Bayes classifier predicts that soccer will be played when the weather is Overcast.
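The same single-feature calculation can be checked with exact fractions, which avoids the two-decimal rounding used in the hand calculation. This is a minimal sketch using only the counts quoted above; the variable names are illustrative.

```python
from fractions import Fraction

# Counts quoted in the notes above.
p_overcast = Fraction(4, 14)
p_yes, p_no = Fraction(9, 14), Fraction(5, 14)
p_overcast_given_yes = Fraction(4, 9)
p_overcast_given_no = Fraction(0, 5)

# Bayes' theorem for each label.
posterior = {
    "Yes": p_overcast_given_yes * p_yes / p_overcast,
    "No": p_overcast_given_no * p_no / p_overcast,
}

# Classify as the label with the larger posterior.
prediction = max(posterior, key=posterior.get)
print(posterior, prediction)
```

Without rounding, P(Yes|Overcast) comes out as exactly 1 rather than 0.98, and the two posteriors sum to 1 as they should.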

 

 

 

 

 

* Naive Bayes classification with multiple features

Source: DataCamp

 

Q1. What is the probability of playing when the weather is overcast and the temperature is mild?

P(Yes | Overcast, Mild) = P(Overcast, Mild | Yes)*P(Yes) / P(Overcast, Mild)

 

P(Overcast, Mild | Yes) = P(Overcast|Yes) * P(Mild|Yes)

P(Overcast, Mild) = P(Overcast) * P(Mild) = (4/14) * (6/14) = 0.1224 <- an approximation that treats the two features as independent regardless of the label

 

1. Prior probability

P(Yes) = 9/14 = 0.64

 

2. Likelihoods

P(Overcast|Yes) = 4/9 = 0.44

P(Mild|Yes) = 4/9 = 0.44

 

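3. Plug into Bayes' formula

P(Overcast, Mild | Yes) = P(Overcast|Yes) * P(Mild|Yes) = 0.44 * 0.44 = 0.1936

 

P(Yes | Overcast, Mild) = 0.1936 * 0.64 / 0.1224 = 1.01, taken as 1 (the quotient slightly exceeds 1 because P(Overcast, Mild) was approximated by P(Overcast) * P(Mild))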

 

 

 

Q2. What is the probability of not playing when the weather is overcast and the temperature is mild?

P(No | Overcast, Mild) = P(Overcast, Mild | No) * P(No) / P(Overcast, Mild)

 

P(Overcast, Mild | No) = P(Overcast | No) * P(Mild | No)

 

1. Prior probability

P(No) = 5/14 = 0.36

 

2. Likelihoods

P(Overcast | No) = 0/5 = 0

P(Mild | No) = 2/5 = 0.4

 

3. Plug into Bayes' formula

P(Overcast, Mild | No) = 0 * 0.4 = 0

P(No | Overcast, Mild) = 0 * 0.36 / 0.1224 = 0

 

 

-> The probability of playing soccer is 1, and the probability of not playing is 0.

- Because the probability of playing is larger, the classifier predicts that soccer will be played when the weather is overcast and the temperature is mild.

 

- In this way, Naive Bayes uses Bayes' theorem to classify as the label with the larger probability.
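With several features, a common way to avoid estimating P(Overcast, Mild) at all is to compute P(label) times the product of the per-feature likelihoods for every label, then divide by the sum of those scores. The sketch below does this with the counts quoted above; the function name `classify` and the dictionary layout are assumptions for illustration.

```python
from fractions import Fraction
from math import prod

# Priors and per-feature likelihoods quoted in the notes above.
prior = {"Yes": Fraction(9, 14), "No": Fraction(5, 14)}
likelihood = {
    "Yes": {"Overcast": Fraction(4, 9), "Mild": Fraction(4, 9)},
    "No": {"Overcast": Fraction(0, 5), "Mild": Fraction(2, 5)},
}

def classify(observed):
    """Return (predicted label, normalized posteriors) for the observed feature values."""
    # Unnormalized score per label: P(label) * product of P(feature | label).
    scores = {y: prior[y] * prod(likelihood[y][x] for x in observed) for y in prior}
    # Normalize so the posteriors sum to 1 (assumes at least one score is non-zero).
    total = sum(scores.values())
    posterior = {y: s / total for y, s in scores.items()}
    return max(posterior, key=posterior.get), posterior

label, posterior = classify(["Overcast", "Mild"])
print(label, posterior)
```

Normalizing this way gives P(Yes | Overcast, Mild) = 1 exactly, whereas the hand calculation 0.1936 * 0.64 / 0.1224 lands at about 1.01.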

 

 

 

 


 

 

Pros and cons of Naive Bayes

* Pros

1. A simple, fast, and often accurate model

2. Low computation cost (fast)

3. Well suited to large datasets

4. Performs better on discrete data than on continuous data

5. Can also be used for multi-class prediction

 

* Cons

It requires the features to be independent of one another.

- However, in real data it is rare for all features to be independent.

(Features being independent means that no feature carries information about another; in particular, they are uncorrelated.)

- So it can be hard to apply directly to real-world problems.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 
