[kaggle] Titanic | 4. EDA - Age

yeon42 2021. 7. 27. 20:51

2.4 Age

 

print('Oldest passenger : {:.1f} Years'.format(df_train['Age'].max()))
print('Youngest passenger : {:.1f} Years'.format(df_train['Age'].min()))
print('Average passenger age : {:.1f} Years'.format(df_train['Age'].mean()))

-> So a column of data can give us its max(), min(), and mean() directly!
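One thing worth checking here: the Age column in the Titanic training set has missing values, and pandas skips them in these reductions by default. A minimal sketch on made-up values (the `ages` Series is a stand-in for `df_train['Age']`, not the real data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_train['Age']; the values are made up for illustration.
ages = pd.Series([22.0, 38.0, np.nan, 35.0, 0.42, 80.0], name='Age')

# pandas reductions skip NaN by default (skipna=True).
oldest = ages.max()
youngest = ages.min()
average = ages.mean()

# count() reports only the non-missing entries, so it can be smaller than len().
n_known = ages.count()
n_missing = ages.isna().sum()

print('max={:.1f}, min={:.2f}, mean={:.2f}'.format(oldest, youngest, average))
print('known={}, missing={}'.format(n_known, n_missing))
```

So the mean is taken over the known ages only, which is worth keeping in mind when reading the average.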

 

 

 

 

  • Histogram of Age by survival

* kdeplot (kernel density estimation)

  - A histogram is drawn as rigid steps.

  - A KDE plot draws the same distribution smoothly, as a single density function.

fig, ax = plt.subplots(1, 1, figsize=(9, 5))
sns.kdeplot(df_train[df_train['Survived'] == 1]['Age'], ax=ax)
sns.kdeplot(df_train[df_train['Survived'] == 0]['Age'], ax=ax)
plt.legend(['Survived == 1', 'Survived == 0'])
plt.show()
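To get a feel for what a kernel density estimate is, here is a small NumPy-only sketch: put a Gaussian bump on every sample and average the bumps. The function name `gaussian_kde_1d`, the sample ages, and the fixed bandwidth of 5 are all assumptions for illustration; seaborn picks its bandwidth automatically, so its curve will not match this one exactly.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on grid."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardized distance from every grid point to every sample:
    # shape (len(grid), len(samples))
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    return kernels.mean(axis=1) / bandwidth

ages = np.array([2.0, 4.0, 22.0, 24.0, 26.0, 30.0, 35.0, 54.0])  # made-up values
grid = np.linspace(-40.0, 100.0, 1401)
density = gaussian_kde_1d(ages, grid, bandwidth=5.0)

# A density should integrate to roughly 1 over a wide enough grid.
step = grid[1] - grid[0]
area = density.sum() * step
print('area under the curve: {:.4f}'.format(area))
```

A larger bandwidth gives a smoother, flatter curve; a smaller one follows the individual samples more closely.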

 

 

 

* Code walkthrough

df_train['Survived'] == 1

 

df_train[df_train['Survived'] == 1]

    - Keep only the rows of passengers who survived.

 

 

df_train[df_train['Survived'] == 1]['Age']

  - Take only the Age column of the survivors.

 

 

sns.kdeplot(df_train[df_train['Survived'] == 1]['Age'])

  - Pass that column to seaborn's kdeplot to draw its density curve.

 

 

df_train[df_train['Survived'] == 1]['Age'].hist()

  - This one draws a histogram instead.
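The filtering steps in this walkthrough can be tried on a tiny table. This is just a sketch on made-up rows (`toy` stands in for `df_train`); the `.loc` line at the end is an alternative spelling, not what the original kernel uses.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for df_train.
toy = pd.DataFrame({
    'Survived': [1, 0, 1, 0, 1],
    'Age': [29.0, 40.0, np.nan, 19.0, 4.0],
})

mask = toy['Survived'] == 1          # boolean Series, one True/False per row
survivors = toy[mask]                # keeps only the rows where mask is True
survivor_ages = survivors['Age']     # a single column, still carrying the row index

# .loc does the row filter and the column selection in one step.
same_ages = toy.loc[toy['Survived'] == 1, 'Age']

print(mask.tolist())
print(survivor_ages.tolist())
```

Note that the missing age is still in `survivor_ages`; filtering on Survived does not remove NaN values in Age.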

 

 

 

 

 

* Three ways to prepare the 'canvas' for a plot

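# Method 1: keep a handle to the figure, then draw with the pyplot interface
f = plt.figure(figsize=(5, 5))

a = np.arange(100)
b = np.sin(a)

plt.plot(b)

# Method 2: create the figure and its axes together, then draw on the axes
f, ax = plt.subplots(1, 1, figsize=(5, 5))

a = np.arange(100)
b = np.sin(a)

ax.plot(b)

# Method 3: create the figure without keeping a handle to it
plt.figure(figsize=(5, 5))

a = np.arange(100)
b = np.sin(a)

plt.plot(b)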

 

 

 

* Plot without the survival split

plt.figure(figsize=(8, 6))
df_train['Age'][df_train['Pclass'] == 1].plot(kind='kde')
df_train['Age'][df_train['Pclass'] == 2].plot(kind='kde')
df_train['Age'][df_train['Pclass'] == 3].plot(kind='kde')

plt.xlabel('Age')
plt.title('Age Distribution within classes')
plt.legend(['1st Class', '2nd Class', '3rd Class'])

 

- Why draw a KDE plot? -> Histograms drawn on top of each other hide one another.

plt.figure(figsize=(8, 6))
df_train['Age'][df_train['Pclass'] == 1].plot(kind='hist')
df_train['Age'][df_train['Pclass'] == 2].plot(kind='hist')
df_train['Age'][df_train['Pclass'] == 3].plot(kind='hist')

# here
plt.xlabel('Age')
plt.title('Age Distribution within classes')
plt.legend(['1st Class', '2nd Class', '3rd Class'])
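If you still want histograms, one common workaround is to make the bars semi-transparent and give all groups the same bin edges. A rough sketch on synthetic ages (the random data, the 5-year bins, and `alpha=0.4` are my own choices, not from the original kernel):

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
# Synthetic ages per class; this is not the Titanic data.
ages_by_class = {
    '1st Class': rng.normal(38, 14, 200).clip(0, 80),
    '2nd Class': rng.normal(30, 13, 180).clip(0, 80),
    '3rd Class': rng.normal(25, 12, 400).clip(0, 80),
}

fig, ax = plt.subplots(figsize=(8, 6))
bins = np.linspace(0, 80, 17)  # shared 5-year bins so the bars line up
counts = {}
for label, ages in ages_by_class.items():
    # alpha < 1 lets the bars behind show through
    n, _, _ = ax.hist(ages, bins=bins, alpha=0.4, label=label)
    counts[label] = n

ax.set_xlabel('Age')
ax.set_title('Age Distribution within classes')
ax.legend()
```

In a notebook, finish with `plt.show()`. Even with transparency, three overlapping groups get hard to read, which is why the KDE curves are the easier comparison here.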

 

 

* Plot by survival

fig, ax = plt.subplots(1, 1, figsize=(9, 5))
sns.kdeplot(df_train[(df_train['Survived'] == 0) & (df_train['Pclass'] == 1)]['Age'], ax=ax)
sns.kdeplot(df_train[(df_train['Survived'] == 1) & (df_train['Pclass'] == 1)]['Age'], ax=ax)
plt.legend(['Survived == 0', 'Survived == 1'])
plt.title('1st class')
plt.show()

--> We can see that the younger a passenger was, the higher the chance of survival.

 

 

* Checking whether that is really true

change_age_range_survival_ratio = []

# For each cutoff i, store the survival rate among passengers younger than i
for i in range(1, 80):
    change_age_range_survival_ratio.append(df_train[df_train['Age'] < i]['Survived'].sum() / len(df_train[df_train['Age'] < i]['Survived']))

# Plot once, after the loop has filled the list
plt.figure(figsize=(7, 7))
plt.plot(change_age_range_survival_ratio)
plt.title('Survival rate change depending on range of Age', y=1.02)
plt.ylabel('Survival rate')
plt.xlabel('Range of Age(0-x)')
plt.show()

  - I want to see how the survival ratio changes as the age range widens (cutoffs from 1 up to about 80 years).

  - i runs over the cutoffs in range(1, 80), i.e. 1 to 79. For example, when i=10 the expression gives the share of passengers younger than 10 who survived.

  - The plot confirms it: the younger the age range, the higher the survival rate!
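Each point on that curve includes everyone younger than the cutoff, so it is a cumulative view. As a complementary check (not part of the original kernel), the survival rate can also be computed per age bracket with `pd.cut`. A sketch on made-up rows, with 20-year brackets chosen arbitrarily:

```python
import pandas as pd

# Made-up rows standing in for df_train.
toy = pd.DataFrame({
    'Age':      [2, 8, 15, 22, 27, 33, 41, 48, 55, 63],
    'Survived': [1, 1, 0, 0, 1, 0, 1, 0, 0, 0],
})

# Half-open 20-year brackets: [0, 20), [20, 40), [40, 60), [60, 80)
brackets = pd.cut(toy['Age'], bins=[0, 20, 40, 60, 80], right=False)

# Survived is 0/1, so the mean within each bracket is its survival rate.
rate_by_bracket = toy.groupby(brackets, observed=False)['Survived'].mean()
print(rate_by_bracket)
```

Unlike the cumulative curve, each bracket here only counts the passengers inside it, so a drop in one bracket is not smoothed over by the younger groups.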

 

 


 

 

* Code walkthrough

i = 10
df_train[df_train['Age'] < i]

  - Returns the rows of df_train where Age is less than 10.

 

 

df_train[df_train['Age'] < i]['Survived']

  

  - Whether each of those passengers survived or not.

 

 

df_train[df_train['Age'] < i]['Survived'].sum()

  - How many of them survived in total.

  - Survivors are coded as 1, so the sum is the number of people who survived!

 

 

 

len(df_train[df_train['Age'] < i]['Survived'])

df_train[df_train['Age'] < i]['Survived'].sum() / len(df_train[df_train['Age'] < i]['Survived'])

  - The survivor count divided by the total number of people in the group: the survival rate (what we wanted to see).
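Since Survived only holds 0 and 1, the sum divided by the length is just the mean of the column. A quick sketch on made-up rows (`toy` stands in for `df_train`) showing that both spellings agree:

```python
import pandas as pd

# Made-up rows standing in for df_train.
toy = pd.DataFrame({
    'Age':      [4.0, 9.0, 12.0, 30.0, 45.0],
    'Survived': [1, 0, 1, 0, 1],
})

i = 10
under_i = toy[toy['Age'] < i]['Survived']

ratio_long = under_i.sum() / len(under_i)   # the expression from the walkthrough
ratio_short = under_i.mean()                # same value for a 0/1 column with no NaN

print(ratio_long, ratio_short)
```

Also note that a missing Age never satisfies `Age < i`, so passengers without a recorded age are left out of every cutoff.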
