[kaggle] Titanic | 11. Feature Engineering - Pearson

yeon42 2021. 8. 7. 23:15

3.3 Change Initial, Embarked and Sex (string to numerical)

 

 

* Converting Initial to numerical data

 

df_train.Initial.unique()

  - Initial currently holds five values: Mr, Mrs, Miss, Master, and Other.

 

 

 

- Categorical data like this has to be converted to numbers before it can be fed to a model as input.

 

df_train['Initial'] = df_train['Initial'].map({'Master': 0, 'Miss': 1, 'Mr': 2, 'Mrs': 3, 'Other': 4})
df_test['Initial'] = df_test['Initial'].map({'Master': 0, 'Miss': 1, 'Mr': 2, 'Mrs': 3, 'Other': 4})
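
Since the same mapping is applied to both the train and the test set, one option is to define the dict once and loop over the frames, so the two cannot drift apart. A minimal sketch on toy data; `df_train_toy`, `df_test_toy` and `initial_codes` are names made up for illustration and are not part of the original notebook:

```python
import pandas as pd

# Toy stand-ins for df_train / df_test (hypothetical data, not the Titanic set).
df_train_toy = pd.DataFrame({'Initial': ['Mr', 'Miss', 'Master']})
df_test_toy = pd.DataFrame({'Initial': ['Mrs', 'Other']})

# Define the mapping once so both frames are guaranteed to use the same codes.
initial_codes = {'Master': 0, 'Miss': 1, 'Mr': 2, 'Mrs': 3, 'Other': 4}

for df in (df_train_toy, df_test_toy):
    # Assigning a column modifies the frame in place.
    df['Initial'] = df['Initial'].map(initial_codes)

print(df_train_toy['Initial'].tolist())
print(df_test_toy['Initial'].tolist())
```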

 

 

 

 


 

 

 

* Converting Embarked to numerical data

 

 

df_train.Embarked.unique()

unique() returns the distinct values as a NumPy array.

 

 

df_train['Embarked'].value_counts()

value_counts() returns the count of each value as a pandas Series.

  - Embarked contains three values: S, C, and Q.

 

 

 

 

df_train['Embarked'] = df_train['Embarked'].map({'C': 0, 'Q': 1, 'S': 2})
df_test['Embarked'] = df_test['Embarked'].map({'C': 0, 'Q': 1, 'S': 2})

 

 

 

 

 


 

 

 

df_train.Embarked.isnull().any()

  - any() : returns True if at least one value is True, and False if none are.

  - The conversion worked: the column contains no missing values.
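
The reason this check is useful after `map`: any value that is not a key in the dict is turned into NaN. A small sketch on a toy Series (the value 'X' is an invented example of an unexpected category, and the variable names are illustrative only):

```python
import pandas as pd

# Toy column; 'X' is deliberately not covered by the mapping.
embarked = pd.Series(['S', 'C', 'Q', 'X'])
mapped = embarked.map({'C': 0, 'Q': 1, 'S': 2})

# isnull() flags each NaN; any() collapses the flags into a single bool.
has_missing = bool(mapped.isnull().any())

# sum() counts how many values failed to map.
n_missing = int(mapped.isnull().sum())

print(has_missing, n_missing)
```

If `any()` comes back True, `isnull().sum()` tells you how many rows need attention.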

 

 

 

 


 

 

 

 

* Converting Sex to numerical data

 

df_train['Sex'] = df_train['Sex'].map({'female': 0, 'male': 1})
df_test['Sex'] = df_test['Sex'].map({'female': 0, 'male': 1})

 

 

 

 

 


 

  • Checking the correlation between features

* Pearson Correlation

  1 : perfect positive linear correlation

  -1 : perfect negative linear correlation

  0 : no linear correlation
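
To make the three cases concrete, here is a rough sketch that computes the Pearson coefficient by hand with NumPy. The helper name `pearson_r` and the toy arrays are assumptions for illustration. The third case also shows that 0 only rules out a linear relationship: there `y` is fully determined by `x`, yet the coefficient is 0.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

x = [1, 2, 3, 4, 5]
r_pos = pearson_r(x, [2, 4, 6, 8, 10])   # points on an increasing line
r_neg = pearson_r(x, [10, 8, 6, 4, 2])   # points on a decreasing line
r_zero = pearson_r(x, [4, 1, 0, 1, 4])   # symmetric U-shape: y = (x - 3) ** 2

print(r_pos, r_neg, r_zero)
```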

 

 

 

- We have several features, so we view all of their pairwise correlations at once as a single matrix in a heatmap plot.

 

import matplotlib.pyplot as plt
import seaborn as sns

# Columns to compare; all of them are numeric after the conversions above
heatmap_data = df_train[['Survived', 'Pclass', 'Sex', 'Fare', 'Embarked', 'FamilySize', 'Initial', 'Age_cat']]
colormap = plt.cm.RdBu
plt.figure(figsize=(12, 10))
plt.title('Pearson Correlation of Features', y=1.05, size=15)
# DataFrame.corr() uses the Pearson method by default
sns.heatmap(heatmap_data.astype(float).corr(), linewidths=1, vmax=1,
            square=True, cmap=colormap, linecolor='Blue', annot=True, annot_kws={'size': 16})

 

- Apart from each feature's correlation with itself, no pair of features has a correlation of 1 or -1.

=> There are no redundant features.

 (A correlation of 1 or -1 would mean the two features are perfectly linearly related, so keeping only one of them would be enough.)
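
Reading the heatmap by eye works for eight features. With more columns, the same check could be done in code. The sketch below uses a toy frame and an arbitrary cutoff of 0.95; both the data and the `threshold` value are assumptions for illustration, not something taken from the Titanic analysis.

```python
import pandas as pd

# Toy frame (hypothetical data): 'b' is exactly 2 * 'a', so it adds no new information.
toy = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 10],
    'c': [5, 3, 4, 1, 2],
})

corr = toy.corr()   # Pearson by default
threshold = 0.95    # arbitrary cutoff chosen for this example
cols = corr.columns

# Walk the upper triangle only, so each pair is visited once and the diagonal is skipped.
redundant_pairs = [
    (cols[i], cols[j])
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
    if abs(corr.iloc[i, j]) > threshold
]

print(redundant_pairs)
```

Any pair that shows up in `redundant_pairs` is a candidate for dropping one of its two columns.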
