
yeon's 👩🏻‍💻

[kaggle] Titanic | 9. Feature Engineering - Fill Null in Age

yeon42 2021. 8. 7. 17:30

3. Feature Engineering

 

- Fill in the null data present in the dataset.

-> Use the statistics of the features that contain null data to decide what to fill in.

- Feature engineering prepares the inputs the model actually trains on, so every step must be applied to test as well as train!

 

 

 

 

 

3.1 Fill Null

 

df_train['Age'].isnull().sum()

  - Age has 177 null values.

  -> Let's fill them in using the title (Mr., Miss., ...) !
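
Since the same fix has to reach both frames, it can help to count the missing Age values in each one first. A minimal sketch on toy data follows; toy_train and toy_test are hypothetical stand-ins, not the Kaggle frames.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames standing in for df_train / df_test
toy_train = pd.DataFrame({"Age": [22.0, np.nan, 38.0, np.nan]})
toy_test = pd.DataFrame({"Age": [np.nan, 30.0]})

# Count missing Age values per frame
null_counts = {
    name: int(df["Age"].isnull().sum())
    for name, df in [("train", toy_train), ("test", toy_test)]
}
print(null_counts)
```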

 

 

 

 

 

 

df_train['Name']

  - Extract only the title from Name.

 

 

 

 

 

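# Raw string keeps the backslash literal; expand=False returns a Series
df_train['Initial'] = df_train['Name'].str.extract(r'([A-Za-z]+)\.', expand=False)
df_test['Initial'] = df_test['Name'].str.extract(r'([A-Za-z]+)\.', expand=False)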

- A pandas Series has a str accessor that exposes string methods applied element by element.

+ Its extract method applies a regular expression; here the pattern captures the run of letters right before a period, which is the title.
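
To see what the pattern captures, here is a small sketch on three names written in the dataset's "Surname, Title. Given names" format. The variable names are illustrative only.

```python
import pandas as pd

# A few names in the Titanic "Surname, Title. Given names" format
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
])

# ([A-Za-z]+)\. captures the first run of letters that is followed by a period
titles = names.str.extract(r"([A-Za-z]+)\.", expand=False)
print(titles.tolist())  # ['Mr', 'Miss', 'Mrs']
```

The surname is skipped because it is followed by a comma, not a period, so the first match is the title.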

 

 

 

 

 

 

pd.crosstab(df_train['Initial'], df_train['Sex']).T.style.background_gradient(cmap='summer_r')

  - Use the table above to tell apart the initials used by men and those used by women.
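
As a rough illustration of how the crosstab is read, the sketch below builds the same kind of table from a hypothetical toy frame. The styling call is left out so the example only depends on pandas.

```python
import pandas as pd

# Hypothetical toy data; the real table comes from df_train
toy = pd.DataFrame({
    "Initial": ["Mr", "Mrs", "Miss", "Mr", "Dr", "Dr"],
    "Sex": ["male", "female", "female", "male", "male", "female"],
})

# Rows become Sex and columns become Initial after the transpose
table = pd.crosstab(toy["Initial"], toy["Sex"]).T
print(table)
```

A title that appears in both rows (Dr in this toy data) cannot be assigned to one sex from the table alone.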

 

 

 

 

 

df_train['Initial'].replace(['Mlle', 'Mme', 'Ms', 'Dr', 'Major', 'Lady', 
                            'Countess', 'Jonkheer', 'Col', 'Rev', 'Capt', 
                           'Sir', 'Don', 'Dona'],
                            ['Miss', 'Miss', 'Miss', 'Mr', 'Mr', 'Mrs', 
                            'Mrs', 'Other', 'Other', 'Other', 'Mr', 
                            'Mr', 'Mr', 'Mr'], inplace=True)

df_test['Initial'].replace(['Mlle', 'Mme', 'Ms', 'Dr', 'Major', 'Lady', 
                            'Countess', 'Jonkheer', 'Col', 'Rev', 'Capt', 
                           'Sir', 'Don', 'Dona'],
                            ['Miss', 'Miss', 'Miss', 'Mr', 'Mr', 'Mrs', 
                            'Mrs', 'Other', 'Other', 'Other', 'Mr', 
                            'Mr', 'Mr', 'Mr'], inplace=True)

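  - replace : substitutes the given values with the values you want

  - inplace=True : modifies the object directly, with no reassignment needed. When it is called on a column selection such as df_train['Initial'], newer pandas versions with copy-on-write may not write the change back to the DataFrame, so assigning the result back to the column is the safer form.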
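
The same mapping can also be written once as a dict and applied to both frames with a plain reassignment. This is a sketch, not the post's own code: TITLE_MAP and normalize_titles are hypothetical names, and the pairs are copied from the two lists above.

```python
import pandas as pd

# Same title pairs as the two replace() lists above, as one dict
TITLE_MAP = {
    "Mlle": "Miss", "Mme": "Miss", "Ms": "Miss",
    "Dr": "Mr", "Major": "Mr", "Lady": "Mrs", "Countess": "Mrs",
    "Jonkheer": "Other", "Col": "Other", "Rev": "Other",
    "Capt": "Mr", "Sir": "Mr", "Don": "Mr", "Dona": "Mr",
}

def normalize_titles(df):
    """Return a copy of df with rare titles folded into the common ones."""
    out = df.copy()
    out["Initial"] = out["Initial"].replace(TITLE_MAP)
    return out

# Hypothetical toy frame; titles not in the map (Mr, Master) pass through
toy = pd.DataFrame({"Initial": ["Mlle", "Dr", "Rev", "Mr", "Master"]})
result = normalize_titles(toy)
print(result["Initial"].tolist())  # ['Miss', 'Mr', 'Other', 'Mr', 'Master']
```

Usage would be df_train = normalize_titles(df_train) and df_test = normalize_titles(df_test), which keeps the two frames in sync.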

 

 

 

 

 

# numeric_only=True skips text columns such as Name and Ticket
df_train.groupby('Initial').mean(numeric_only=True)

  - Mrs and Miss have high Survived rates.

  -> This confirms that women had a higher survival rate.

 

 

 

 


 

 

 

  • Check the survival rate for each group
df_train.groupby('Initial')['Survived'].mean()

 

 

 

 

 

df_train.groupby('Initial')['Survived'].mean().plot.bar()

 

 

 

 


 

 

 

- Now, based on the above, let's fill in the null data in earnest.

- We fill the null data in test using the statistics obtained from train.

 

 

 

 

 

 

# numeric_only=True skips text columns such as Name and Ticket
df_train.groupby('Initial').mean(numeric_only=True)

 

  - Fill the null data using the mean Age of each Initial group

 

 

 

 

 

df_train.loc[(df_train['Age'].isnull()) & (df_train['Initial'] == 'Mr'), 'Age'] = 33
df_train.loc[(df_train['Age'].isnull()) & (df_train['Initial'] == 'Mrs'), 'Age'] = 36
df_train.loc[(df_train['Age'].isnull()) & (df_train['Initial'] == 'Master'), 'Age'] = 5
df_train.loc[(df_train['Age'].isnull()) & (df_train['Initial'] == 'Miss'), 'Age'] = 22
df_train.loc[(df_train['Age'].isnull()) & (df_train['Initial'] == 'Other'), 'Age'] = 46

 

df_test.loc[(df_test['Age'].isnull()) & (df_test['Initial'] == 'Mr'), 'Age'] = 33
df_test.loc[(df_test['Age'].isnull()) & (df_test['Initial'] == 'Mrs'), 'Age'] = 36
df_test.loc[(df_test['Age'].isnull()) & (df_test['Initial'] == 'Master'), 'Age'] = 5
df_test.loc[(df_test['Age'].isnull()) & (df_test['Initial'] == 'Miss'), 'Age'] = 22
df_test.loc[(df_test['Age'].isnull()) & (df_test['Initial'] == 'Other'), 'Age'] = 46
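
The hard-coded ages above can also be derived from train and applied to both frames in one pass. The sketch below is one possible way to do that, not the post's method: fill_age_by_title is a hypothetical helper, the toy frames are made up, and rounding the group means is an assumption.

```python
import numpy as np
import pandas as pd

def fill_age_by_title(train, test):
    """Fill missing Age in both frames with train's rounded mean Age per Initial."""
    # Statistics come from train only, then get reused for test
    age_by_title = train.groupby("Initial")["Age"].mean().round()
    filled = []
    for df in (train, test):
        out = df.copy()
        # map() looks up each row's title; fillna() touches only the missing ages
        out["Age"] = out["Age"].fillna(out["Initial"].map(age_by_title))
        filled.append(out)
    return filled[0], filled[1]

# Hypothetical toy frames, not the Kaggle data
toy_train = pd.DataFrame({
    "Initial": ["Mr", "Mr", "Mr", "Miss", "Miss"],
    "Age": [30.0, 36.0, np.nan, 20.0, np.nan],
})
toy_test = pd.DataFrame({
    "Initial": ["Mr", "Miss", "Mrs"],
    "Age": [np.nan, np.nan, 40.0],
})

train_filled, test_filled = fill_age_by_title(toy_train, toy_test)
print(train_filled["Age"].tolist())  # [30.0, 36.0, 33.0, 20.0, 20.0]
print(test_filled["Age"].tolist())   # [33.0, 20.0, 40.0]
```

A title that appears in test but not in train has no mean to look up, so its missing ages would stay null and need a separate fallback.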