[kaggle] Titanic | 9. Feature Engineering - Fill Null in Age
yeon42 2021. 8. 7. 17:30
3. Feature Engineering
- Fill in the null data present in the dataset
-> For each feature that contains null data, use that feature's statistics to decide the fill values
- Feature engineering exists to prepare data for actual model training, so every step must be applied to test exactly as it is applied to train!
3.1 Fill Null
df_train['Age'].isnull().sum()
- Age contains 177 null values.
-> Let's use the title (Mr., Miss., ...) to fill them in!
df_train['Name']
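- Extract only the title from Name
# Capture the letters immediately before the first period (raw string keeps \. literal)
df_train['Initial'] = df_train['Name'].str.extract(r'([A-Za-z]+)\.', expand=False)
df_test['Initial'] = df_test['Name'].str.extract(r'([A-Za-z]+)\.', expand=False)
- A pandas Series has a str accessor that exposes vectorized string methods.
+ Its extract method applies a regular expression, which we use here to pull out the title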
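To make the pattern concrete, here is a minimal sketch of the same regular expression applied to a few names. The sample names are hypothetical stand-ins in the "Surname, Title. Given names" layout, not rows taken from the post.

```python
import pandas as pd

# Hypothetical sample names in the "Surname, Title. Given names" layout
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# Capture the run of letters that sits directly before the first period.
# expand=False returns a Series instead of a one-column DataFrame.
titles = names.str.extract(r'([A-Za-z]+)\.', expand=False)
print(titles.tolist())  # ['Mr', 'Miss', 'Countess']
```

The surname is skipped because it is followed by a comma rather than a period, so the first match is always the title.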
pd.crosstab(df_train['Initial'], df_train['Sex']).T.style.background_gradient(cmap='summer_r')
- Use the table above to see which initials are used by men and which by women
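As a rough illustration of what the cross-tabulation shows, the sketch below builds the same kind of table on a tiny hypothetical frame (the rows are made up for illustration) and leaves out the styling step.

```python
import pandas as pd

# Hypothetical miniature frame; the real df_train has many more rows
df = pd.DataFrame({
    'Initial': ['Mr', 'Mrs', 'Miss', 'Mr', 'Dr', 'Dr'],
    'Sex': ['male', 'female', 'female', 'male', 'male', 'female'],
})

# Rows become Sex and columns become Initial after the transpose
table = pd.crosstab(df['Initial'], df['Sex']).T
print(table)
```

A title with non-zero counts in both rows (here 'Dr') is used by both sexes, so mapping it to a single group is a judgment call.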
df_train['Initial'].replace(['Mlle', 'Mme', 'Ms', 'Dr', 'Major', 'Lady',
                             'Countess', 'Jonkheer', 'Col', 'Rev', 'Capt',
                             'Sir', 'Don', 'Dona'],
                            ['Miss', 'Miss', 'Miss', 'Mr', 'Mr', 'Mrs',
                             'Mrs', 'Other', 'Other', 'Other', 'Mr',
                             'Mr', 'Mr', 'Mr'], inplace=True)
df_test['Initial'].replace(['Mlle', 'Mme', 'Ms', 'Dr', 'Major', 'Lady',
                            'Countess', 'Jonkheer', 'Col', 'Rev', 'Capt',
                            'Sir', 'Don', 'Dona'],
                           ['Miss', 'Miss', 'Miss', 'Mr', 'Mr', 'Mrs',
                            'Mrs', 'Other', 'Other', 'Other', 'Mr',
                            'Mr', 'Mr', 'Mr'], inplace=True)
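- replace : substitutes specific values with the values you want
- inplace=True : applies the change directly, without reassigning the result to a variable
-> Note: in recent pandas versions, calling an inplace method on a selected column (df['Initial'].replace(..., inplace=True)) raises a chained-assignment warning, and from pandas 3.0 it no longer updates the original DataFrame. Assigning the result back to the column is the safer form.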
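One alternative, sketched here as an assumption rather than the post's own method, is to write the mapping as a dict and assign the result back to the column. The frame and the shortened `title_map` below are hypothetical and cover only a subset of the titles.

```python
import pandas as pd

# Hypothetical miniature frame standing in for df_train
df = pd.DataFrame({'Initial': ['Mr', 'Mlle', 'Dr', 'Countess', 'Rev', 'Master']})

# Subset of the title mapping, written as old -> new pairs
title_map = {'Mlle': 'Miss', 'Dr': 'Mr', 'Countess': 'Mrs', 'Rev': 'Other'}

# Assign the result back instead of relying on inplace=True
df['Initial'] = df['Initial'].replace(title_map)
print(df['Initial'].tolist())  # ['Mr', 'Miss', 'Mr', 'Mrs', 'Other', 'Master']
```

Keeping each old title next to its replacement makes it harder to misalign the two lists, and the same dict can be reused for both train and test.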
df_train.groupby('Initial').mean(numeric_only=True)
- The mean of Survived (the survival rate) is high for Mrs and Miss.
-> This confirms that women had a higher survival rate.
- Let's check the survival rate for each group
df_train.groupby('Initial')['Survived'].mean()
df_train.groupby('Initial')['Survived'].mean().plot.bar()
- Now, based on the above, let's fill in the null data in earnest
- We fill the null data in test using the statistics obtained from train.
df_train.groupby('Initial').mean(numeric_only=True)
- Fill the null data using the mean Age of each group
# Fill missing ages with the rounded mean age of each title group (means taken from train)
df_train.loc[(df_train['Age'].isnull()) & (df_train['Initial'] == 'Mr'), 'Age'] = 33
df_train.loc[(df_train['Age'].isnull()) & (df_train['Initial'] == 'Mrs'), 'Age'] = 36
df_train.loc[(df_train['Age'].isnull()) & (df_train['Initial'] == 'Master'), 'Age'] = 5
df_train.loc[(df_train['Age'].isnull()) & (df_train['Initial'] == 'Miss'), 'Age'] = 22
df_train.loc[(df_train['Age'].isnull()) & (df_train['Initial'] == 'Other'), 'Age'] = 46
# Apply the same train-derived values to test, masking on df_test's own columns
df_test.loc[(df_test['Age'].isnull()) & (df_test['Initial'] == 'Mr'), 'Age'] = 33
df_test.loc[(df_test['Age'].isnull()) & (df_test['Initial'] == 'Mrs'), 'Age'] = 36
df_test.loc[(df_test['Age'].isnull()) & (df_test['Initial'] == 'Master'), 'Age'] = 5
df_test.loc[(df_test['Age'].isnull()) & (df_test['Initial'] == 'Miss'), 'Age'] = 22
df_test.loc[(df_test['Age'].isnull()) & (df_test['Initial'] == 'Other'), 'Age'] = 46
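The hard-coded fills above could also be derived programmatically. The following is a minimal sketch under stated assumptions: `train` and `test` are hypothetical miniature frames, and the group means are left unrounded, whereas the post rounds them by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature train/test frames
train = pd.DataFrame({
    'Initial': ['Mr', 'Mr', 'Mr', 'Miss', 'Miss', 'Master'],
    'Age': [30.0, 36.0, np.nan, 20.0, 24.0, 5.0],
})
test = pd.DataFrame({
    'Initial': ['Mr', 'Miss', 'Master'],
    'Age': [np.nan, np.nan, 4.0],
})

# Mean age per title, computed from train only (NaN values are skipped)
age_by_title = train.groupby('Initial')['Age'].mean()

# Fill each missing age with the train mean of that row's title
for df in (train, test):
    df['Age'] = df['Age'].fillna(df['Initial'].map(age_by_title))

print(test['Age'].tolist())  # [33.0, 22.0, 4.0]
```

Computing `age_by_title` from train alone keeps test statistics out of the fill values, which matches the approach described above.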