[kaggle] Titanic | 11. Feature Engineering - Pearson
yeon42 2021. 8. 7. 23:15
3.3 Change Initial, Embarked and Sex (string to numerical)
* Converting Initial to numerical data
df_train.Initial.unique()
- Initial currently holds five values: Mr, Mrs, Miss, Master, and Other.
- To feed categorical data like this to a model as input, it first has to be converted to numbers.
df_train['Initial'] = df_train['Initial'].map({'Master': 0, 'Miss': 1, 'Mr': 2, 'Mrs': 3, 'Other': 4})
df_test['Initial'] = df_test['Initial'].map({'Master': 0, 'Miss': 1, 'Mr': 2, 'Mrs': 3, 'Other': 4})
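A minimal sketch of how `Series.map` behaves with a dictionary, using a small made-up frame instead of the Titanic data (the `toy` frame and its rows are assumptions for illustration). Any label that is missing from the dictionary comes back as NaN, which is why a null check after mapping is worthwhile.

```python
import pandas as pd

# Hypothetical toy data; 'Dr' is deliberately absent from the mapping.
toy = pd.DataFrame({'Initial': ['Mr', 'Miss', 'Master', 'Dr']})
initial_map = {'Master': 0, 'Miss': 1, 'Mr': 2, 'Mrs': 3, 'Other': 4}

# Each label is looked up in the dictionary; unknown labels become NaN.
encoded = toy['Initial'].map(initial_map)
n_unmapped = encoded.isnull().sum()

print(encoded.tolist())
print('unmapped rows:', n_unmapped)
```

Because of the NaN, the resulting column is float rather than int, so an unexpected float dtype after `map` is also a hint that some label was not covered.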
* Converting Embarked to numerical data
df_train.Embarked.unique()
df_train['Embarked'].value_counts()
- Embarked has three values: S, C, and Q.
df_train['Embarked'] = df_train['Embarked'].map({'C': 0, 'Q': 1, 'S': 2})
df_test['Embarked'] = df_test['Embarked'].map({'C': 0, 'Q': 1, 'S': 2})
df_train.Embarked.isnull().any()
- any(): returns True if at least one value is True, and False if none are.
- The check comes back False, so the conversion went well (there is no missing data).
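To make the `isnull().any()` check concrete, here is a small sketch on two made-up columns (both Series are hypothetical, not taken from the dataset). One column is fully labeled, the other has a missing port.

```python
import pandas as pd

embarked_map = {'C': 0, 'Q': 1, 'S': 2}

# Hypothetical toy columns: one fully labeled, one with a missing port.
complete = pd.Series(['S', 'C', 'Q', 'S']).map(embarked_map)
with_gap = pd.Series(['S', None, 'Q']).map(embarked_map)

# isnull() gives one boolean per row; any() collapses them into a single flag.
no_missing = complete.isnull().any()
has_missing = with_gap.isnull().any()

print(no_missing, has_missing)
```

A True result only says that a gap exists somewhere. `isnull().sum()` would tell you how many rows are affected.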
* Converting Sex to numerical data
df_train['Sex'] = df_train['Sex'].map({'female': 0, 'male': 1})
df_test['Sex'] = df_test['Sex'].map({'female': 0, 'male': 1})
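Since the same dictionaries are applied to both `df_train` and `df_test`, one possible way to keep the two in sync is to define each mapping once and loop over the columns. This is only a sketch: the helper name `encode`, the `MAPPINGS` table, and the toy frames are hypothetical, and raising on unmapped values is a design choice of this example rather than something the post does.

```python
import pandas as pd

# Mappings copied from the post; the helper name `encode` is hypothetical.
MAPPINGS = {
    'Initial': {'Master': 0, 'Miss': 1, 'Mr': 2, 'Mrs': 3, 'Other': 4},
    'Embarked': {'C': 0, 'Q': 1, 'S': 2},
    'Sex': {'female': 0, 'male': 1},
}

def encode(df, mappings=MAPPINGS):
    """Return a copy of df with each listed column mapped to integers."""
    out = df.copy()
    for col, mapping in mappings.items():
        mapped = out[col].map(mapping)
        # Fail loudly instead of silently carrying NaN into the model.
        if mapped.isnull().any():
            raise ValueError(f'unmapped or missing values in {col!r}')
        out[col] = mapped.astype(int)
    return out

# Toy stand-ins for df_train and df_test.
toy_train = pd.DataFrame({'Initial': ['Mr', 'Mrs'],
                          'Embarked': ['S', 'C'],
                          'Sex': ['male', 'female']})
toy_test = pd.DataFrame({'Initial': ['Miss'],
                         'Embarked': ['Q'],
                         'Sex': ['female']})

toy_train, toy_test = encode(toy_train), encode(toy_test)
print(toy_train)
```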
- Checking the correlation between features
* Pearson Correlation
1 : perfect positive (linear) correlation
-1 : perfect negative (linear) correlation
0 : no linear correlation
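For reference, the Pearson coefficient is the covariance of two variables divided by the product of their standard deviations. The sketch below computes it by hand on made-up numbers (the function name `pearson` and the data are illustrative only) to show where the values 1, -1, and 0 come from.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sums of products of deviations; the 1/n factors cancel in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

x = [1, 2, 3, 4, 5]
r_up = pearson(x, [2, 4, 6, 8, 10])    # y rises exactly in step with x
r_down = pearson(x, [10, 8, 6, 4, 2])  # y falls exactly as x rises
r_flat = pearson(x, [4, 1, 0, 1, 4])   # y = (x - 3)^2: dependent, but not linearly

print(r_up, r_down, r_flat)
```

The last case is worth noting: `y` is fully determined by `x`, yet the coefficient is 0, because Pearson only measures linear relationships.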
- We have several features, so we use a heatmap plot to view all of their pairwise correlations at once as a single matrix.
heatmap_data = df_train[['Survived', 'Pclass', 'Sex', 'Fare', 'Embarked', 'FamilySize', 'Initial', 'Age_cat']]
colormap = plt.cm.RdBu
plt.figure(figsize=(12, 10))
plt.title('Pearson Correlation of Features', y=1.05, size=15)
sns.heatmap(heatmap_data.astype(float).corr(), linewidths=1, vmax=1,
            square=True, cmap=colormap, linecolor='Blue', annot=True, annot_kws={'size': 16})
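- Apart from each feature's correlation with itself, none of the values is 1 or -1.
=> There are no unnecessary (redundant) features.
(This is because a value of 1 or -1 would mean a perfect correlation, that is, keeping just one of the two features would be enough!)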
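The same check can be done numerically instead of by reading the heatmap. A rough sketch on made-up data follows: the `toy` frame, the column names, and the 0.95 cutoff are all assumptions for illustration, not values from the post.

```python
import pandas as pd

# Hypothetical toy features: 'b' is an exact linear function of 'a'.
toy = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 10],
    'c': [5, 3, 4, 1, 2],
})

corr = toy.astype(float).corr()  # Pearson is the default method
threshold = 0.95                  # assumed cutoff, not from the post

# Walk the upper triangle so each pair is reported once and the diagonal is skipped.
cols = list(corr.columns)
redundant_pairs = [
    (cols[i], cols[j], corr.iloc[i, j])
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
    if abs(corr.iloc[i, j]) >= threshold
]

print(redundant_pairs)
```

Here only the pair (`a`, `b`) is flagged, so one of the two could be dropped. An empty list corresponds to the situation described above, where no feature is redundant.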