[kaggle] Titanic | 8. EDA - Fare

yeon42 2021. 8. 7. 17:04

2.8 Fare

: Passenger fare (ticket price)

 

 

 

* Skewness

  - A measure of how far the shape of a distribution leans to one side of its mean

  - A value that tells whether the distribution is symmetric or not

  - How lopsided (asymmetric) the distribution is

    - skew = 0 : symmetric, as in a normal distribution

    - skew > 0 : values bunched on the left, with a long tail to the right

    - skew < 0 : values bunched on the right, with a long tail to the left
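
To make the sign convention above concrete, here is a minimal sketch in plain Python. The helper `sample_skewness` and the three toy lists are made up for illustration. It uses the simple moment-based definition, so its values will differ slightly from pandas' `.skew()`, which applies a bias correction.

```python
def sample_skewness(values):
    """Moment-based skewness: third central moment / (second central moment) ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # variance (population form)
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    if m2 == 0:
        raise ValueError("skewness is undefined for constant data")
    return m3 / m2 ** 1.5


# Hypothetical toy data, not taken from the Titanic set
symmetric = [1, 2, 3, 4, 5]
right_tail = [1, 1, 1, 2, 10]          # bunched on the left, long right tail
left_tail = [-x for x in right_tail]   # mirror image

print(sample_skewness(symmetric))       # 0.0
print(sample_skewness(right_tail) > 0)  # True
print(sample_skewness(left_tail) < 0)   # True
```

A single large value on the right is enough to push the skewness well above zero, which is the situation Fare is in.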

 

 

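import matplotlib.pyplot as plt
import seaborn as sns
# Plot the Fare distribution with its skewness in the legend (distplot is deprecated since seaborn 0.11)
fig, ax = plt.subplots(1, 1, figsize=(8, 8))
g = sns.distplot(df_train['Fare'], color='b', label='Skewness: {:.2f}'.format(df_train['Fare'].skew()), ax=ax)
g = g.legend(loc='best')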

 

  - The plot leans far too heavily to one side. -> If Fare goes into the model as it is, the model may learn the wrong thing.

  - To reduce the influence of outliers, let's take the log of Fare

 

 

 

 

# Log-transform Fare; fares of 0 stay 0 because log(0) is undefined
df_train['Fare'] = df_train['Fare'].map(lambda i: np.log(i) if i>0 else 0)
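
As a rough check of what this transform does, the sketch below applies the same rule (log for positive values, 0 otherwise) to a handful of made-up fares and compares the skewness before and after. The names `log_fare`, `sample_skewness`, and `toy_fares` are hypothetical, and the numbers are not from the Titanic data.

```python
import math


def sample_skewness(values):
    """Moment-based skewness (no bias correction, so not identical to pandas' .skew())."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


def log_fare(fare):
    """Same rule as the lambda in the post: log for positive fares, 0 otherwise."""
    return math.log(fare) if fare > 0 else 0.0


# Hypothetical fares: mostly small, with one very large value as the outlier
toy_fares = [0.0, 7.25, 7.9, 8.05, 13.0, 26.0, 71.3, 512.3]
logged = [log_fare(f) for f in toy_fares]

skew_before = sample_skewness(toy_fares)
skew_after = sample_skewness(logged)
print(f"before: {skew_before:.2f}, after: {skew_after:.2f}")
```

One thing to watch with this rule: a fare between 0 and 1 would come out negative after the log. `np.log1p`, which computes log(1 + x), is a common alternative that avoids this, though it is not what this post uses.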

 

 

 

 

 

# Re-plot Fare after the log transform to compare the skewness
fig, ax = plt.subplots(1, 1, figsize=(8, 8))
g = sns.distplot(df_train['Fare'], color='b', label='Skewness: {:.2f}'.format(df_train['Fare'].skew()), ax=ax)
g = g.legend(loc='best')

 

  - Taking the log reduced the asymmetry (skewness).

  -> Feature Engineering : deliberately transforming features to improve the model's performance!

 

 
