KT AIVLE/Daily Review

240920

bestone888 2024. 9. 22. 03:23

240920

데이터 분석 순서 복습

1. 라이브러리 불러오기

In [45]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

2. 파일 읽어오기

In [48]:
employee = pd.read_csv('https://bit.ly/HRDBEmployee')
display(employee.head())

vacation = pd.read_csv('https://bit.ly/HRDBVacation')
display(vacation.head())
EmpIDEmpNameEngNameGenderMaritalYNHireDateRetireDateDeptIDPhoneEMailSalary01234
S0001 홍길동 hong M Y 2017-01-01 NaN SYS 010-1234-1234 hong@d-friends.co.kr 8500.0
S0002 일지매 jiemae M Y 2017-01-12 NaN GEN 017-111-1222 jimae@d-friends.co.kr 8300.0
S0003 강우동 NaN M N 2018-04-01 NaN SYS 010-1222-2221 woodong@d-friends.co.kr 6500.0
S0004 김삼순 samsam F Y 2018-08-01 NaN MKT 010-3212-3212 samsoon.kim@d-friends.co.kr 7000.0
S0005 오삼식 three M N 2019-01-01 2021-01-31 MKT 010-9876-5432 samsik@d-friends.co.kr 6400.0
VacationIDEmpIDBeginDateEndDateReasonDuration01234
1 S0001 2017-01-12 2017-01-12 감기몸살 1
2 S0001 2017-03-21 2017-03-21 글쎄요 1
3 S0001 2017-06-13 2017-06-14 글쎄요 2
4 S0001 2017-07-07 2017-07-07 중요 행사 준비 1
5 S0002 2017-07-21 2017-07-25 놀고싶어서 5

3. 데이터 탐색

In [50]:
employee.info()
print()
vacation.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   EmpID       20 non-null     object 
 1   EmpName     20 non-null     object 
 2   EngName     14 non-null     object 
 3   Gender      20 non-null     object 
 4   MaritalYN   20 non-null     object 
 5   HireDate    20 non-null     object 
 6   RetireDate  4 non-null      object 
 7   DeptID      20 non-null     object 
 8   Phone       20 non-null     object 
 9   EMail       20 non-null     object 
 10  Salary      19 non-null     float64
dtypes: float64(1), object(10)
memory usage: 1.8+ KB

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102 entries, 0 to 101
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   VacationID  102 non-null    int64 
 1   EmpID       102 non-null    object
 2   BeginDate   102 non-null    object
 3   EndDate     102 non-null    object
 4   Reason      102 non-null    object
 5   Duration    102 non-null    int64 
dtypes: int64(2), object(4)
memory usage: 4.9+ KB

4. 데이터 조회

In [52]:
# 필요한 열만 선택, emp 데이터 프레임으로 선언
my_cols = ['EmpID', 'EmpName', 'EngName', 'Gender', 'HireDate' ,'DeptID', 'Phone', 'Salary']
emp = employee.loc[:, my_cols].copy()

emp.head()
Out[52]:
EmpIDEmpNameEngNameGenderHireDateDeptIDPhoneSalary01234
S0001 홍길동 hong M 2017-01-01 SYS 010-1234-1234 8500.0
S0002 일지매 jiemae M 2017-01-12 GEN 017-111-1222 8300.0
S0003 강우동 NaN M 2018-04-01 SYS 010-1222-2221 6500.0
S0004 김삼순 samsam F 2018-08-01 MKT 010-3212-3212 7000.0
S0005 오삼식 three M 2019-01-01 MKT 010-9876-5432 6400.0

5. 전처리

In [54]:
emp.rename(columns = {'EngName': 'NickName'}, inplace = True)
emp.head()
Out[54]:
EmpIDEmpNameNickNameGenderHireDateDeptIDPhoneSalary01234
S0001 홍길동 hong M 2017-01-01 SYS 010-1234-1234 8500.0
S0002 일지매 jiemae M 2017-01-12 GEN 017-111-1222 8300.0
S0003 강우동 NaN M 2018-04-01 SYS 010-1222-2221 6500.0
S0004 김삼순 samsam F 2018-08-01 MKT 010-3212-3212 7000.0
S0005 오삼식 three M 2019-01-01 MKT 010-9876-5432 6400.0
In [55]:
emp.isnull().sum()
Out[55]:
EmpID       0
EmpName     0
NickName    6
Gender      0
HireDate    0
DeptID      0
Phone       0
Salary      1
dtype: int64
In [73]:
# 결측치 처리
emp['NickName'] = emp['NickName'].fillna('')

emp['Salary'] = emp['Salary'].fillna(500).astype(int)
emp
Out[73]:
EmpIDEmpNameNickNameGenderHireDateDeptIDPhoneSalary012345678910111213141516171819
S0001 홍길동 hong M 2017-01-01 SYS 010-1234-1234 8500
S0002 일지매 jiemae M 2017-01-12 GEN 017-111-1222 8300
S0003 강우동   M 2018-04-01 SYS 010-1222-2221 6500
S0004 김삼순 samsam F 2018-08-01 MKT 010-3212-3212 7000
S0005 오삼식 three M 2019-01-01 MKT 010-9876-5432 6400
S0006 김치국 kimchi M 2019-03-01 HRD 010-8765-8765 6000
S0007 안경태   M 2019-05-01 ACC 017-543-3456 6000
S0008 박여사 parks F 2019-08-01 HRD 010-2345-5432 6300
S0009 최사모 samoya F 2019-10-01 SYS 011-899-9988 5800
S0010 정효리   F 2020-01-01 MKT 010-9988-9900 5000
S0011 오감자 fivegamja M 2020-02-01 SYS 010-6655-7788 4700
S0012 최일국 ilgook M 2020-02-01 GEN 010-8703-7123 6500
S0013 한국인 korea M 2020-04-01 SYS 010-6611-1266 4500
S0014 이최고 first M 2020-04-01 MKT 010-2345-9886 5000
S0015 박치기   M 2020-06-01 MKT 010-8800-0010 4700
S0016 한사랑 onelove F 2020-06-01 HRD 010-3215-0987 7200
S0017 나도야 nado M 2021-12-01 ACC 010-3399-9933 4000
S0018 이리와   M 2022-01-01 HRD 010-5521-1252 5300
S0019 정주고   M 2022-01-01 SYS 010-7777-2277 6000
S0020 고소해 gogo F 2022-06-01 STG 010-9966-1230 500
In [79]:
# 날짜 분리 방법 2가지
# 방법 1
emp1 = emp.copy()
emp1['HireYear'] = emp1['HireDate'].str.split('-').str[0].astype(int)
emp1['HireMonth'] = emp1['HireDate'].str.split('-').str[1].astype(int)
emp1['HireDay'] = emp1['HireDate'].str.split('-').str[2].astype(int)

emp1.drop(columns = ['HireDate'], inplace = True)

emp1
Out[79]:
EmpIDEmpNameNickNameGenderDeptIDPhoneSalaryHireYearHireMonthHireDay012345678910111213141516171819
S0001 홍길동 hong M SYS 010-1234-1234 8500 2017 1 1
S0002 일지매 jiemae M GEN 017-111-1222 8300 2017 1 12
S0003 강우동   M SYS 010-1222-2221 6500 2018 4 1
S0004 김삼순 samsam F MKT 010-3212-3212 7000 2018 8 1
S0005 오삼식 three M MKT 010-9876-5432 6400 2019 1 1
S0006 김치국 kimchi M HRD 010-8765-8765 6000 2019 3 1
S0007 안경태   M ACC 017-543-3456 6000 2019 5 1
S0008 박여사 parks F HRD 010-2345-5432 6300 2019 8 1
S0009 최사모 samoya F SYS 011-899-9988 5800 2019 10 1
S0010 정효리   F MKT 010-9988-9900 5000 2020 1 1
S0011 오감자 fivegamja M SYS 010-6655-7788 4700 2020 2 1
S0012 최일국 ilgook M GEN 010-8703-7123 6500 2020 2 1
S0013 한국인 korea M SYS 010-6611-1266 4500 2020 4 1
S0014 이최고 first M MKT 010-2345-9886 5000 2020 4 1
S0015 박치기   M MKT 010-8800-0010 4700 2020 6 1
S0016 한사랑 onelove F HRD 010-3215-0987 7200 2020 6 1
S0017 나도야 nado M ACC 010-3399-9933 4000 2021 12 1
S0018 이리와   M HRD 010-5521-1252 5300 2022 1 1
S0019 정주고   M SYS 010-7777-2277 6000 2022 1 1
S0020 고소해 gogo F STG 010-9966-1230 500 2022 6 1
In [83]:
# 방법2 : pd.to_datetime함수 사용
emp2 = emp.copy()

emp2['HireYear'] = pd.to_datetime(emp2['HireDate']).dt.year
emp2['HireMonth'] = pd.to_datetime(emp2['HireDate']).dt.month
emp2['HireDay'] = pd.to_datetime(emp2['HireDate']).dt.day

emp2.drop(columns = ['HireDate'], inplace = True)

emp2
Out[83]:
EmpIDEmpNameNickNameGenderDeptIDPhoneSalaryHireYearHireMonthHireDay012345678910111213141516171819
S0001 홍길동 hong M SYS 010-1234-1234 8500 2017 1 1
S0002 일지매 jiemae M GEN 017-111-1222 8300 2017 1 12
S0003 강우동   M SYS 010-1222-2221 6500 2018 4 1
S0004 김삼순 samsam F MKT 010-3212-3212 7000 2018 8 1
S0005 오삼식 three M MKT 010-9876-5432 6400 2019 1 1
S0006 김치국 kimchi M HRD 010-8765-8765 6000 2019 3 1
S0007 안경태   M ACC 017-543-3456 6000 2019 5 1
S0008 박여사 parks F HRD 010-2345-5432 6300 2019 8 1
S0009 최사모 samoya F SYS 011-899-9988 5800 2019 10 1
S0010 정효리   F MKT 010-9988-9900 5000 2020 1 1
S0011 오감자 fivegamja M SYS 010-6655-7788 4700 2020 2 1
S0012 최일국 ilgook M GEN 010-8703-7123 6500 2020 2 1
S0013 한국인 korea M SYS 010-6611-1266 4500 2020 4 1
S0014 이최고 first M MKT 010-2345-9886 5000 2020 4 1
S0015 박치기   M MKT 010-8800-0010 4700 2020 6 1
S0016 한사랑 onelove F HRD 010-3215-0987 7200 2020 6 1
S0017 나도야 nado M ACC 010-3399-9933 4000 2021 12 1
S0018 이리와   M HRD 010-5521-1252 5300 2022 1 1
S0019 정주고   M SYS 010-7777-2277 6000 2022 1 1
S0020 고소해 gogo F STG 010-9966-1230 500 2022 6 1

6. 집계

In [91]:
# 입사연도 별 직원 수
hire_cnt = emp1.groupby(['HireYear'], as_index = False)[['EmpID']].count()
hire_cnt

# 열 이름 변경
hire_cnt.rename(columns = {'HireYear': 'Year', 'EmpID': 'Count'}, inplace = True)
hire_cnt
Out[91]:
YearCount012345
2017 2
2018 2
2019 5
2020 7
2021 1
2022 3

7. 시각화

In [96]:
plt.figure(figsize = (5,3))
plt.bar(hire_cnt['Year'], hire_cnt['Count'])
plt.show()
In [ ]:
 

데이터프레임 변경 (3)

  • 데이터프레임 합치기
  • concat 함수
  • merge 함수
In [107]:
# 결측치 자료형 : float
type(np.nan)
Out[107]:
float
In [122]:
path = 'https://raw.githubusercontent.com/Jangrae/csv/master/seoul_pop_h01.csv'
# 인덱스 설정 방법 1
pop01 = pd.read_csv(path, index_col = 'year')
pop01.index.name = None

display(pop01.head())
# 인덱스 설정 방법 2
pop02 = pd.read_csv(path)
pop02.set_index('year', inplace = True)
pop02.index.name = None
display(pop02.head())
k_malek_female19811982198319841985
4160 4191
4160 4191
4160 4191
4160 4191
4160 4191
k_malek_female19811982198319841985
4160 4191
4160 4191
4160 4191
4160 4191
4160 4191
In [125]:
path = 'https://raw.githubusercontent.com/Jangrae/csv/master/seoul_pop_h02.csv'
pop02 = pd.read_csv(path, index_col = 'year')
pop02.index.name = None

pop02.head()
Out[125]:
f_malef_female19851986198719881989
7 6
7 5
6 5
5 5
6 5
  • 열 합치기(concat함수)
  • 가로로 합 : axis = 1
  • 세로로 합 : axis = 0
  • join : 방법(outer, inner)
In [151]:
# outer
pop = pd.concat([pop01, pop02], join = 'outer', axis = 0)
display(pop.head())


# inner
pop = pd.concat([pop01, pop02],join = 'inner', axis = 1)
display(pop.head())
k_malek_femalef_malef_female19811982198319841985
4160.0 4191.0 NaN NaN
4160.0 4191.0 NaN NaN
4160.0 4191.0 NaN NaN
4160.0 4191.0 NaN NaN
4160.0 4191.0 NaN NaN
k_malek_femalef_malef_female19851986198719881989
4160 4191 7 6
4899 4888 7 5
5000 4979 6 5
5156 5120 5 5
5305 5261 6 5
  • merge 함수
  • how : 방법(inner, outer, left, right)
  • on : join을 어떤 열을 기준으로 할 것인가
In [158]:
path = 'https://raw.githubusercontent.com/Jangrae/csv/master/seoul_pop_h01.csv'
pop01 = pd.read_csv(path)

path = 'https://raw.githubusercontent.com/Jangrae/csv/master/seoul_pop_h02.csv'
pop02 = pd.read_csv(path)
In [167]:
pop = pd.merge(pop01, pop02, how = 'outer', on = 'year')
pop.head()
Out[167]:
yeark_malek_femalef_malef_female01234
1981 4160 4191 NaN NaN
1982 4160 4191 NaN NaN
1983 4160 4191 NaN NaN
1984 4160 4191 NaN NaN
1985 4160 4191 7.0 6.0

데이터프레임 변경 (4)

rolling()

  • rolling() 메서드로 일정 기간에 대한 집계
  • window : 몇 개를 기준으로 작업할 것인가
  • min_periods : 몇 개 이상일 때부터 작업할 것인가
In [198]:
path = 'https://raw.githubusercontent.com/Jangrae/csv/master/airquality.csv'
air = pd.read_csv(path)

air.head()
Out[198]:
OzoneSolar.RWindTempMonthDay01234
41.0 190.0 7.4 67 5 1
36.0 118.0 8.0 72 5 2
12.0 149.0 12.6 74 5 3
18.0 313.0 11.5 62 5 4
NaN NaN 14.3 56 5 5
In [202]:
# OZ_mean 열 추가
air['OZ_mean'] = air['Ozone'].mean()
air['OZ_mean'] = round(air['OZ_mean'],1)
air.head()
Out[202]:
OzoneSolar.RWindTempMonthDayOZ_mean01234
41.0 190.0 7.4 67 5 1 42.1
36.0 118.0 8.0 72 5 2 42.1
12.0 149.0 12.6 74 5 3 42.1
18.0 313.0 11.5 62 5 4 42.1
NaN NaN 14.3 56 5 5 42.1
In [208]:
# 3일간의 오존 평균 열 추가 OZ_mean3
air['OZ_mean3'] = air['Ozone'].rolling(window = 3, min_periods = 1).mean()
air['OZ_mean3'] = round(air['OZ_mean3'],1)

air['OZ_mean3']
Out[208]:
0      41.0
1      38.5
2      29.7
3      22.0
4      15.0
       ... 
148    17.0
149    22.0
150    22.0
151    16.0
152    17.3
Name: OZ_mean3, Length: 153, dtype: float64

shift()

  • shift() : 데이터를 행/열 방향으로 이동
In [222]:
air['OZ_lag1'] = air['Ozone'].shift(1)    # 1 만큼 밀어냄
air['OZ_lag2'] = air['Ozone'].shift(2)    # 2 만큼 밀어냄

air[['Ozone', 'OZ_lag1', 'OZ_lag2']].head()
Out[222]:
OzoneOZ_lag1OZ_lag201234
41.0 NaN NaN
36.0 41.0 NaN
12.0 36.0 41.0
18.0 12.0 36.0
NaN 18.0 12.0

pivot(), melt

  • pivot() 메서드로 피벗 형태로 변경
  • melt 함수로 언피벗 형태로 되돌림
In [233]:
bike = pd.read_csv('https://bit.ly/BikeFile')
bike['DateTime'] = pd.to_datetime(bike['DateTime'])    # 날짜 형식

bike.head()
Out[233]:
DateTimeTemperatureHumidityWindSpeedSeasonsHolidayCount01234
2017-12-01 00:00:00 -5.2 37 2.2 Winter No Holiday 254
2017-12-01 01:00:00 -5.5 38 0.8 Winter No Holiday 204
2017-12-01 02:00:00 -6.0 39 1.0 Winter No Holiday 173
2017-12-01 03:00:00 -6.2 40 0.9 Winter No Holiday 107
2017-12-01 04:00:00 -6.0 36 2.3 Winter No Holiday 78
In [255]:
bike['Year'] = bike['DateTime'].dt.year
bike['Month'] = bike['DateTime'].dt.month
bike['Day'] = bike['DateTime'].dt.day

bike_info = bike.groupby(['Year','Month', 'Day'], as_index =False)[['Count']].sum()
bike_info.head()
Out[255]:
YearMonthDayCount01234
2017 12 1 9539
2017 12 2 8523
2017 12 3 7222
2017 12 4 8729
2017 12 5 8307

pivot()

  • index : 움직이지 않는 열
  • columns : 새로운 열이 될 현재 열
  • values : 열의 값이 될 데이터
In [276]:
bike_info_w = bike_info.pivot(index = ['Year', 'Month'], columns = 'Day', values = 'Count')
bike_info_w.columns.name = None
bike_info_w.reset_index(drop = False, inplace = True)

bike_info_w.head()
Out[276]:
YearMonth12345678...2223242526272829303101234
2017 12 9539.0 8523.0 7222.0 8729.0 8307.0 6669.0 8549.0 8032.0 ... 7184.0 6624.0 2014.0 3966.0 5605.0 5351.0 6594.0 7663.0 4027.0 3423.0
2018 1 4290.0 6446.0 6512.0 6453.0 6967.0 5180.0 4714.0 5711.0 ... 4503.0 3981.0 3154.0 3113.0 2931.0 2693.0 2828.0 4088.0 3360.0 3830.0
2018 2 5377.0 5954.0 3290.0 2487.0 3932.0 3988.0 4697.0 5613.0 ... 7712.0 5797.0 6356.0 6500.0 9247.0 8862.0 3820.0 NaN NaN NaN
2018 3 5132.0 8433.0 12191.0 4688.0 8597.0 12003.0 11170.0 7711.0 ... 14007.0 14969.0 12243.0 10963.0 13084.0 14710.0 17295.0 17450.0 19301.0 19247.0
2018 4 17388.0 21585.0 21015.0 19995.0 2596.0 11520.0 11159.0 6666.0 ... 6852.0 977.0 20144.0 26285.0 25670.0 25861.0 24940.0 25349.0 25462.0 NaN

5 rows × 33 columns

melt

  • id_vars : 기준 열
  • value_vars: 현재 열 중 내려올 열
  • var_name: value_vars에 지정한 열이 값이 될 때 부여할 열 이름
  • value_name: 새로운 열이 되는 기존 값에 부여할 열 이름
In [288]:
bike_info_n = pd.melt(bike_info_w, 
                      id_vars=['Year', 'Month'],
                      # value_vars=range(1, 32),
                      var_name='Day',
                      value_name='Count')

bike_info_n.head()
Out[288]:
YearMonthDayCount01234
2017 12 1 9539.0
2018 1 1 4290.0
2018 2 1 5377.0
2018 3 1 5132.0
2018 4 1 17388.0
In [ ]:
 

'KT AIVLE > Daily Review' 카테고리의 다른 글

240924  (0) 2024.09.25
240923  (0) 2024.09.24
240919  (0) 2024.09.20
240913  (0) 2024.09.14
240912  (0) 2024.09.13