yeon's 👩🏻‍💻

[kaggle] Titanic | 10. Feature Engineering - Embarked Feature

yeon42 2021. 8. 7. 22:01

3.1.2 Fill Null in Embarked

 

df_train['Embarked'].isnull().sum()

 

  - Embarked currently has 2 null values.

  -> Let's fill these null values with another value!

 

- S currently has the most passengers, so we'll fill the null data with S.

  • fillna
df_train['Embarked'] = df_train['Embarked'].fillna('S')  # assign back; inplace=True on a selected column is unreliable in recent pandas
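Rather than typing 'S' by hand, one option is to look up the most frequent value and fill with that. Here is a minimal sketch on a made-up toy frame (`toy` and its values are hypothetical, not the actual Titanic data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for df_train['Embarked']
toy = pd.DataFrame({'Embarked': ['S', 'C', 'S', np.nan, 'Q', 'S', np.nan]})

# Count passengers per port; NaN is excluded by default
counts = toy['Embarked'].value_counts()

# mode() returns a Series, so take its first entry as the most frequent value
most_common = toy['Embarked'].mode()[0]

# Fill the nulls and assign the result back to the column
toy['Embarked'] = toy['Embarked'].fillna(most_common)
```

On the real data this should give the same result as filling with 'S', as long as S really is the most common port.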

 

 

 

 


 

 

3.2 Change Age (continuous to categorical)

 

- Age is currently a continuous feature.

- Let's split Age into a few groups to turn it into a categorical feature.

 

 

 

 

  • Create a new column
df_train['Age_cat'] = 0

 

 

 

 

 

 

  • Split into 10-year bins using loc
df_train.loc[df_train['Age'] < 10, 'Age_cat'] = 0
df_train.loc[(10 <= df_train['Age']) & (df_train['Age'] < 20), 'Age_cat'] = 1
df_train.loc[(20 <= df_train['Age']) & (df_train['Age'] < 30), 'Age_cat'] = 2
df_train.loc[(30 <= df_train['Age']) & (df_train['Age'] < 40), 'Age_cat'] = 3
df_train.loc[(40 <= df_train['Age']) & (df_train['Age'] < 50), 'Age_cat'] = 4
df_train.loc[(50 <= df_train['Age']) & (df_train['Age'] < 60), 'Age_cat'] = 5
df_train.loc[(60 <= df_train['Age']) & (df_train['Age'] < 70), 'Age_cat'] = 6
df_train.loc[(70 <= df_train['Age']), 'Age_cat'] = 7

 

df_test.loc[df_test['Age'] < 10, 'Age_cat'] = 0
df_test.loc[(10 <= df_test['Age']) & (df_test['Age'] < 20), 'Age_cat'] = 1
df_test.loc[(20 <= df_test['Age']) & (df_test['Age'] < 30), 'Age_cat'] = 2
df_test.loc[(30 <= df_test['Age']) & (df_test['Age'] < 40), 'Age_cat'] = 3
df_test.loc[(40 <= df_test['Age']) & (df_test['Age'] < 50), 'Age_cat'] = 4
df_test.loc[(50 <= df_test['Age']) & (df_test['Age'] < 60), 'Age_cat'] = 5
df_test.loc[(60 <= df_test['Age']) & (df_test['Age'] < 70), 'Age_cat'] = 6
df_test.loc[(70 <= df_test['Age']), 'Age_cat'] = 7

 

 

 

 


 

 

However, hard-coding every bin like this is tedious.

-> Let's write a function and pass it to the apply method to make this a bit easier.

 

 

 

 

def category_age(x):
    # Map an age to its 10-year bin index (0 to 7)
    if x < 10:
        return 0
    elif x < 20:
        return 1
    elif x < 30:
        return 2
    elif x < 40:
        return 3
    elif x < 50:
        return 4
    elif x < 60:
        return 5
    elif x < 70:
        return 6
    else:
        return 7

df_train['Age_cat_2'] = df_train['Age'].apply(category_age)

 

 

 

 


 

 

 

  • ๋‘˜์„ ๋น„๊ตํ•ด๋ณด์ž !
(df_train['Age_cat'] == df_train['Age_cat_2']).all()

 

  - all() : returns True only when every value is True, and False if even one value is False
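As another option, pandas' `pd.cut` can express the same 10-year bins in a single call. The sketch below is only an illustration: `toy` and its ages are made up, `category_age_compact` is a hypothetical shorter restatement of the function above, and it assumes Age has no missing or negative values. `right=False` makes each bin [low, high), which matches the `<` comparisons used so far.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ages, including values that sit exactly on bin edges
toy = pd.DataFrame({'Age': [0.42, 9.0, 10.0, 25.0, 39.9, 40.0, 59.0, 65.0, 70.0, 80.0]})

def category_age_compact(x):
    # Same thresholds as category_age, written as a loop
    for cat, upper in enumerate([10, 20, 30, 40, 50, 60, 70]):
        if x < upper:
            return cat
    return 7

toy['Age_cat_apply'] = toy['Age'].apply(category_age_compact)

# Bin edges: [0, 10), [10, 20), ..., [60, 70), [70, inf)
bins = [0, 10, 20, 30, 40, 50, 60, 70, np.inf]
toy['Age_cat_cut'] = pd.cut(toy['Age'], bins=bins, labels=list(range(8)), right=False).astype(int)

# True only if every row agrees between the two approaches
same = (toy['Age_cat_apply'] == toy['Age_cat_cut']).all()
```

One thing to keep in mind: if Age still contained NaN, the if/elif function would fall through to 7 (every comparison with NaN is False), while `pd.cut` would leave NaN, so the approaches only agree once the nulls are filled.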

 

 

 

 

 

 

  • ์“ธ๋ฐ์—†๋Š” ๊ฒƒ์€ ๋‚ ๋ฆฌ๊ธฐ
df_train.drop(['Age', 'Age_cat_2'], axis=1, inplace=True)
df_test.drop(['Age'], axis=1, inplace=True)

  - axis=1 : drops columns (rather than rows)
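For reference, `drop` also accepts a `columns=` keyword, which does the same thing as passing labels with `axis=1` and can be a little easier to read. A small sketch with a hypothetical toy frame (`toy` and its values are made up):

```python
import pandas as pd

# Hypothetical toy frame with the same column names as above
toy = pd.DataFrame({'Age': [22.0, 38.0], 'Age_cat': [2, 3], 'Age_cat_2': [2, 3]})

# Two equivalent ways to drop columns; both return a new DataFrame
by_axis = toy.drop(['Age', 'Age_cat_2'], axis=1)
by_columns = toy.drop(columns=['Age', 'Age_cat_2'])
```

Without `inplace=True`, the original `toy` is left untouched, so the result has to be assigned to a name to be kept.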
