๊ด€๋ฆฌ ๋ฉ”๋‰ด

yeon's ๐Ÿ‘ฉ๐Ÿป‍๐Ÿ’ป

Frequent Patterns, Association Rules, Closed Pattern, Max Pattern (Data Mining) ๋ณธ๋ฌธ

Computer ๐Ÿ’ป/Machine Learning

Frequent Patterns, Association Rules, Closed Pattern, Max Pattern (Data Mining)

yeon42 2021. 11. 19. 17:26
728x90

https://analytics4everything.tistory.com/95

 

Frequent pattern mining

 ๋ณธ ํฌ์ŠคํŒ…์€ ์—ฐ์„ธ๋Œ€ํ•™๊ต ์‚ฐ์—…๊ณตํ•™๊ณผ ๊น€์šฐ์ฃผ ๊ต์ˆ˜๋‹˜์˜ ๊ฐ•์˜๋ฅผ ์ฐธ๊ณ ํ•˜์—ฌ ๊ฒŒ์‹œํ•œ ๊ธ€์ž…๋‹ˆ๋‹ค. ์šฐ์œ ๋ฅผ ์ž์ฃผ ์‚ฌ๋Š” ๊ณ ๊ฐ์—๊ฒŒ ์–ด๋–ค ์•„์ดํ…œ์„ ์ถ”์ฒœํ•  ์ˆ˜ ์žˆ์„๊นŒ? ์ด๋Ÿฌํ•œ ์งˆ๋ฌธ์— ๋Œ€๋‹ตํ•˜๋ ค๋ฉด, ์šฐ์œ ๋ฅผ ์‚ฌ๋Š”

analytics4everything.tistory.com

์œ„ ๋ธ”๋กœ๊ทธ๋ฅผ ํ•„์‚ฌํ•˜์—ฌ ๊ณต๋ถ€ํ•œ ๊ธ€์ž…๋‹ˆ๋‹ค.

 

* ๋ชจ๋“  ํ…์ŠคํŠธ์™€ ์ด๋ฏธ์ง€์˜ ์ถœ์ฒ˜๋Š” ์œ„ ๋ธ”๋กœ๊ทธ์ž…๋‹ˆ๋‹ค.

 


 

์šฐ์œ ๋ฅผ ์ž์ฃผ ์‚ฌ๋Š” ๊ณ ๊ฐ์—๊ฒŒ ์–ด๋–ค ์•„์ดํ…œ์„ ์ถ”์ฒœํ•  ์ˆ˜ ์žˆ์„๊นŒ?

-> ์šฐ์œ ๋ฅผ ์‚ฌ๋Š” ์‚ฌ๋žŒ๋“ค์ด ์ฃผ๋กœ ๋ฌด์—‡์„ ๋” ์‚ฌ๋Š”์ง€ ์•Œ์•„์•ผ ํ•จ

-> ์šฐ์œ ๋ฅผ ์ƒ€๋˜ ์‚ฌ๋žŒ์ด ์–ด๋–ค ์•„์ดํ…œ์„ ๋” ๊ตฌ๋งคํ–ˆ๋Š”์ง€ ํŒจํ„ด์„ ํŒŒ์•…ํ•ด์•ผ ํ•จ

 

 

* Frequent Pattern(FP; ๋นˆ๋ฐœํŒจํ„ด)์ด๋ž€

๋ฐ์ดํ„ฐ ์ง‘ํ•ฉ์—์„œ ์ž์ฃผ ๋ฐœ์ƒํ•˜๋Š” ํŒจํ„ด์œผ๋กœ ์šฐ์œ ์™€ ์‹œ๋ฆฌ์–ผ ๊ฐ™์ด transaction ๋ฐ์ดํ„ฐ ์ง‘ํ•ฉ์—์„œ ๋นˆ๋ฒˆํ•˜๊ฒŒ ๋ฐœ์ƒํ•˜๋Š” ํ•ญ๋ชฉ์ง‘ํ•ฉ

 

- ๋ฐ์ดํ„ฐ์˜ ์—ฐ๊ด€๊ด€๊ณ„, ์ƒ๊ด€๊ด€๊ณ„, ๋‹ค๋ฅธ ๊ด€์‹ฌ๋Œ€์ƒ ๊ด€๊ณ„๋ฅผ ๋ถ„์„ํ•˜๋Š”๋ฐ ์žˆ์–ด, ์ค‘์š”ํ•œ ์—ญํ• ์„ ํ•œ๋‹ค.

- ๋ฐ์ดํ„ฐ ๋ถ„๋ฅ˜, ๊ตฐ์ง‘ํ™” ๋“ฑ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ๋งˆ์ด๋‹ ์ž‘์—…์—๋„ ๋„์›€์ด ๋œ๋‹ค.

 

 

 


 

 

Frequent Pattern Data Mining

 

- FP ๋งˆ์ด๋‹์ด๋ž€, ๋ฐ์ดํ„ฐ ์ง‘ํ•ฉ์—์„œ ์žฌ๊ท€๊ด€๊ณ„๋ฅผ ์ฐพ์•„๋‚ด๋Š” ๋ฐฉ๋ฒ•

 

ex. ์žฅ๋ฐ”๊ตฌ๋‹ˆ ๋ถ„์„

-> ์ง‘์–ด ๋„ฃ๋Š” ํ•ญ๋ชฉ์ด ๋งŽ์„ ๋•Œ, ํ•ญ๋ชฉ ๊ฐ„์˜ ์—ฐ๊ด€๊ด€๊ณ„๋ฅผ ํŒŒ์•…ํ•˜์—ฌ, ๊ตฌ๋งค์Šต๊ด€์„ ๋ถ„์„ํ•ด๋‚ด๊ณ ์ž ํ•˜๋Š” ๊ฒƒ

 

 

๊ทœ์น™ ๊ด€์‹ฌ๋„(interestingness)์— ๋Œ€ํ•œ ์ถ•๋„

- ์ง€์ง€๋„(Support)

- ์‹ ๋ขฐ๋„(Confidence)

 

- support๊ฐ€ 2%๋ผ๋ฉด, ๋ถ„์„ ๋Œ€์ƒ ์ „์ฒด ํŠธ๋žœ์žญ์…˜์—์„œ ํŠน์ • ์•„์ดํ…œ์„ ๊ตฌ๋งคํ•œ ๋น„์œจ์ด 2%์ž„

- Computer => anti-virus software์˜ confidence๊ฐ€ 60%๋ผ๋ฉด,

  ์ปดํ“จํ„ฐ๋ฅผ ๊ตฌ๋งคํ•œ ๊ณ ๊ฐ ์ค‘ S/W๋ฅผ ๊ตฌ๋งคํ•œ ๋น„์œจ์ด 60%์ž„์„ ์˜๋ฏธ

 

- ์ผ๋ฐ˜์ ์œผ๋กœ, ์—ฐ๊ด€๊ทœ์น™์ด ์ตœ์†Œ ์ง€์ง€๋„์˜ ์ž„๊ณ„๊ฐ’๊ณผ ์ตœ์†Œ ์‹ ๋ขฐ๋„์˜ ์ž„๊ณ„๊ฐ’์„ ๋งŒ์กฑํ•˜๋ฉด 'ํฅ๋ฏธ๊ฐ€ ์žˆ๋‹ค' ๋ผ๊ณ  ํŒ๋‹จํ•œ๋‹ค.

(์ด ์ตœ์†Œ ์ž„๊ณ„๊ฐ’์ด minimup support/confidence threshold, ์ฆ‰ minsup์™€ minconf ์ธ๋“ฏ!)

 

์ถœ์ฒ˜: ์œ„ ๋ธ”๋กœ๊ทธ

(์—ฌ๊ธฐ์„œ ์™œ ๊ต์ง‘ํ•ฉ์ด ์•„๋‹ˆ๋ผ ํ•ฉ์ง‘ํ•ฉ์œผ๋กœ ๋‚˜ํƒ€๋‚œ์ง€ ๋ชจ๋ฅด๊ฒ ๋‹ค.)

 

 

 


 

- itemset(ํ•ญ๋ชฉ ์ง‘ํ•ฉ): ์—ฌ๋Ÿฌ ํ•ญ๋ชฉ์˜ ์ง‘ํ•ฉ. ํŠธ๋žœ์žญ์…˜์—์„œ ๊ฐ€์งˆ ์ˆ˜ ์žˆ๋Š” ์š”์†Œ๋“ค์˜ ์ง‘ํ•ฉ์„ ์˜๋ฏธ

์ถœ์ฒ˜: ์œ„ ๋ธ”๋กœ๊ทธ

 

- TID(Transaction ID): ๊ฐ ํŠธ๋žœ์žญ์…˜์€ TID๋ผ๊ณ  ํ•˜๋Š” ์‹๋ณ„์ž๋ฅผ ๊ฐ–๋Š”๋‹ค.

 

- Association rule(์—ฐ๊ด€ ๊ทœ์น™): A=>B

  - itemset A๋Š” ๊ณต์ง‘ํ•ฉ์ด ์•„๋‹ˆ๋ฉฐ, B๋„ ๊ณต์ง‘ํ•ฉ์ด ์•„๋‹ˆ๋‹ค.

  - itemset A์™€ B์˜ ๊ณต์ง‘ํ•ฉ์€ ์—†์œผ๋ฉฐ, ๊ฐ itemset์ด ์ „์ฒด itemset์ธ I์˜ ๋ถ€๋ถ„์ง‘ํ•ฉ์— ํฌํ•จ๋œ๋‹ค๋ฅผ ์˜๋ฏธ

 

- ๊ฐ•ํ•œ ๊ทœ์น™: minsup๊ณผ minconf๋ฅผ ๋ชจ๋‘ ๋งŒ์กฑํ•˜๋Š” ๊ทœ์น™

 

- K-itemset(K ํ•ญ๋ชฉ์ง‘ํ•ฉ); K๊ฐœ์˜ ํ•ญ๋ชฉ์„ ๊ฐ–๋Š” itemset

  ex) {์šฐ์œ , ๋นต}: 2-itemset, {์šฐ์œ , ๋นต, ํ† ๋งˆํ† }: 3-itemset

 

- Frequency(๋นˆ๋„): support ๊ฐœ์ˆ˜

  - K-itemset์˜ ๋ฐœ์ƒ ๋นˆ๋„๋Š” K-itemset์ด ๋ฐœ์ƒํ•œ ํŠธ๋žœ์žญ์…˜

 

์ฆ‰,

1. ์ง€์ง€๋„: A=>B,

Support(A=>B) = P(AUB)

 

2. ์‹ ๋ขฐ๋„: A=>B,

Confidence(A=>B) = P(B|A)

 

 

 


 

Association Rule ๋ถ„์„์˜ 2๋‹จ๊ณ„

 

1. ๋ชจ๋“  FP ๋ฐœ๊ฒฌํ•˜๊ธฐ; ์ •์˜์— ์˜ํ•ด ์‚ฌ์ „์— ์„ค์ •ํ•œ minsup๋ณด๋‹ค ๋งŽ์ด ๋ฐœ์ƒํ•˜๋Š” ๋ชจ๋“  ํ•ญ๋ชฉ ์ง‘ํ•ฉ ์ฐพ์•„๋‚ด๊ธฐ

2. FP์—์„œ ๊ฐ•ํ•œ ์—ฐ๊ด€๊ทœ์น™ ์ƒ์„ฑํ•˜๊ธฐ; ์ •์˜์— ์˜ํ•ด minsup๊ณผ minconf๋ฅผ ๋งŒ์กฑํ•˜๋Š” ๊ทœ์น™ ์ฐพ์•„๋‚ด๊ธฐ

 

๋ชจ๋“  FP์˜ ์ง‘ํ•ฉ์„ ๋ฐœ๊ฒฌํ•˜๋Š” ๊ฒƒ์€ ๋งค์šฐ ๋น„์šฉ์ด ํฐ ์ž‘์—…์ด๋‹ค. (= ๋ชจ๋“  frequent itemset์„ ์ฐพ๋Š” ๊ฒƒ์€ ๋น„ํšจ์œจ์ )

๋ฐ˜๋ฉด ๋‘ ๋ฒˆ์งธ ๋‹จ๊ณ„๋Š” ์ƒ๋Œ€์ ์œผ๋กœ ์ ์€ ๋น„์šฉ์ด ๋“ ๋‹ค.

-> ๋”ฐ๋ผ์„œ association rule์˜ ์ „์ฒด ์„ฑ๋Šฅ์€ ์ฒซ ๋ฒˆ์งธ ๋‹จ๊ณ„์—์„œ ๊ฒฐ์ •๋จ

 

์ด๋Ÿฌํ•œ ์ฒซ ๋‹จ๊ณ„์˜ ๊ณ„์‚ฐ์—์„œ ์• ๋กœ์ ์€ ๋Œ€๊ทœ๋ชจ ํ•ญ๋ชฉ์ง‘๋‹จ์ด ๋ฐœ์ƒํ•œ๋‹ค๋Š” ๊ฒƒ์ธ๋ฐ, ์œ„์—์„œ ๋งํ•œ ์žฌ๊ท€์ ์ธ ํŒจํ„ด์„ ๋ถ„์„ํ•œ๋‹ค๋Š” ๊ฒƒ๊ณผ ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ ๋ชจ๋“  ํ•ญ๋ชฉ์— ๋Œ€ํ•ด ํ•˜์œ„ํŒจํ„ด๋„ ํŒŒ์•…ํ•œ๋‹ค.

๋”ฐ๋ผ์„œ ์ƒ๋Œ€์ ์œผ๋กœ ์งง์€ ์—ฌ๋Ÿฌ ๊ฐœ์˜ FP ์ง‘ํ•ฉ๋„ ์กฐํ•ฉ์„ ๊ฐ–๋Š”๋‹ค.

ex. ๊ธธ์ด๊ฐ€ 100์ธ FP {a1, a2, ..., a100}์€ 100C1์ด๋ฏ€๋กœ 100๊ฐœ์˜ ํ•ญ๋ชฉ์ด,

     2๊ฐœ์˜ ํ•ญ๋ชฉ ์ง‘ํ•ฉ์€ 100C2์ด๋ฏ€๋กœ ์ ์  ๋งŽ์•„์ง„๋‹ค.

-> 100C1 ~ 100C100๊นŒ์ง€ ๋ชจ๋‘ ๋”ํ•˜๋ฉด ์•ฝ 2^100 - 1

(์ „์ฒด itemset์˜ ์›์†Œ์˜ ๊ฐœ์ˆ˜๊ฐ€ ๋งŽ์•„์ง€๋ฉด ๊ฒฐ๊ตญ exponentialํ•˜๊ฒŒ ๋œ๋‹ค.)

 

๋”ฐ๋ผ์„œ, ์ด๋Ÿฌํ•œ ๋น„์šฉ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ๊ณ ์•ˆํ•ด๋‚ธ ๋ฐฉ๋ฒ•์ด ๋ฐ”๋กœ Closed Pattern๊ณผ Max Pattern.

 

 

 


 

 

* Closed Pattern

: An itemset X is closed if X is fequent & there exists no super-pattern Y⊃X, with the same support as X

- dataset D์—์„œ X์™€ ๋™์ผํ•œ support ๊ฐœ์ˆ˜๋ฅผ ๊ฐ–๋Š” super pattern Y๊ฐ€ ์—†๋‹ค๋ฉด, itemset X๋Š” dataset D์—์„œ closed ๋˜์—ˆ๋‹ค๊ณ  ํ•œ๋‹ค.

 

ex)

X: {1, 2} = 60%

Y1: {1, 2, 3} = 50%

Y2: {1, 2, 4} = 50% ์ธ ์ƒํ™ฉ์—์„œ minsup์ด 50%์ธ ํŒจํ„ด์„ frequent-pattern์ด๋ผ๊ณ  ์ •์˜ํ•˜์ž.

 

X๋ฅผ ํฌํ•จํ•˜๋Š” ๋” ํฐ pattern์€ Y1, Y2๋ฐ–์— ์—†๋Š”๋ฐ,

Y1, Y2๊ฐ€ X์™€ ๊ฐ™์€ support์ธ super-pattern(Y1, Y2)๊ฐ€ ์—†์œผ๋ฏ€๋กœ X๋ฅผ closed-pattern์ด๋ผ๊ณ  ํ•œ๋‹ค.

์ฆ‰, X์™€ ๊ฐ™์€ support์ธ Y1, Y2๊ฐ€ ์—†๊ธฐ ๋•Œ๋ฌธ์— closed-pattern์ด๋ผ๊ณ  ํ•จ

 

 

 

* Max Pattern

: An itemset X is a max-pattern if X is frequent & there exists no frequent super-pattern Y⊃X

- X๊ฐ€ frequentํ•˜๊ณ , Y๋„ fequentํ•˜๋ฉฐ, Y๊ฐ€ X๋ฅผ ํฌํ•จํ•˜๋Š” super-pattern Y๊ฐ€ ์กด์žฌํ•˜์ง€ ์•Š๋Š”๋‹ค๋ฉด, itemset๋Š” ์ง‘ํ•ฉ D์—์„œ max-pattern์ด๋ผ๊ณ  ํ•œ๋‹ค.

 

(์ตœ์™ธ๊ฐ ์ง‘ํ•ฉ๋“ค์˜ ๋ถ€๋ถ„์ง‘ํ•ฉ์„ max-pattern์ด๋ผ๊ณ  ํ•˜์ž.

 ์ž๊ธฐ๋Š” frequent, but ์ž๊ธฐ๋ฅผ ํฌํ•จํ•˜๋Š” ์–ด๋–ค superset๋„ frequent x)

 

ex)

X: {1, 2} = 60%

Y1: {1, 2, 3} = 50%

Y2: {1, 2, 4} = 50%

Y2: {1, 2, 3, 4} = 50% ์ธ ์ƒํ™ฉ์—์„œ minsup์ด 50%์ธ pattern์„ frequent patter ์ด๋ผ๊ณ  ์ •์˜ํ•˜์ž.

 

X๊ฐ€ frequent pattern์ด๊ณ , frequent super pattern์ด ์กด์žฌํ•˜์ง€ ์•Š๋Š” ๊ฒฝ์šฐ, Y3์€ max-pattern์ด๋‹ค.

 

 

- ๋ชจ๋“  max-pattern์€ closed-pattern์ด๋‹ค.

 

 

<Exercise>

DB = {<a1, a2, ..., a100>; <a1, a2, ..., a50>}

- min_sup = 1

 

* Closed pattern

- <a1, ..., a100>: 1

- <a1, ..., a50>: 2

 

* Max pattern

- <a1, ..., a100>: 1

 

 

 

 

- FP๋Š” ์ตœ์†Œ ์ง€์ง€๋„(threshold)์— ๋”ฐ๋ผ ๋งค์šฐ ๋ฏผ๊ฐํ•˜๊ฒŒ ๋ณ€ํ™”ํ•œ๋‹ค.

- minsup์ด ๋งค์šฐ ๋‚ฎ์œผ๋ฉด, ๊ต‰์žฅํžˆ ๋งŽ์€ ํ•ญ๋ชฉ์ง‘ํ•ฉ์ด ์ƒ์„ฑ๋œ๋‹ค.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Comments