KT AIVLE/Daily Review

240924

bestone888 2024. 9. 25. 01:19

Mini Project 1

Cleaning the TOEIC diagnostic assessment data

In [122]:

import pandas as pd
import numpy as np

In [124]:

file = 'data04.xlsx'
data = pd.read_excel(file)

In [125]:

data.head()

Out[125]:

ID	Seq	Gender	Birth_Year	LC_Score	RC_Score	Total Score	학습목표	학습방법	강의 학습 교재 유형	학습빈도	기출문제 공부 횟수	취약분야 인지 여부	토익 모의테스트 횟수	Student ID
1	1	M	1973	181	173	354	자기계발	참고서	일반적인 영어 텍스트 기반 교재	주3-4회	6.0	알고 있지 않음	6	student1
1	2	M	1973	227	213	440	자기계발	오프라인강의	뉴스/이슈 기반 교재	주1-2회	3.0	알고 있음	5	student1
1	3	M	1973	345	336	681	승진	온라인강의	영상 교재	주5-6회	7.0	알고 있음	10	student1
2	1	F	1982	330	290	620	자기계발	오프라인강의	뉴스/이슈 기반 교재	매일(주 7회)	8.0	알고 있지 않음	19	student2
2	2	F	1982	354	339	693	승진	온라인강의	영상 교재	주5-6회	2.0	알고 있음	15	student2

Data preprocessing

In [127]:

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1500 entries, 0 to 1499
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   ID           1500 non-null   int64  
 1   Seq          1500 non-null   int64  
 2   Gender       1500 non-null   object 
 3   Birth_Year   1500 non-null   int64  
 4   LC_Score     1500 non-null   int64  
 5   RC_Score     1500 non-null   int64  
 6   Total Score  1500 non-null   int64  
 7   학습목표         1500 non-null   object 
 8   학습방법         1500 non-null   object 
 9   강의 학습 교재 유형  1500 non-null   object 
 10  학습빈도         1500 non-null   object 
 11  기출문제 공부 횟수   1497 non-null   float64
 12  취약분야 인지 여부   1500 non-null   object 
 13  토익 모의테스트 횟수  1500 non-null   int64  
 14  Student ID   1500 non-null   object 
dtypes: float64(1), int64(7), object(7)
memory usage: 175.9+ KB
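The `info()` output above shows 1497 non-null values for '기출문제 공부 횟수' against 1500 rows, so three values are missing. A minimal sketch of how such gaps could be located, using a small hypothetical frame (`toy` and its values are made up for illustration, not taken from `data04.xlsx`):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the data: one missing value in the practice-count column
toy = pd.DataFrame({
    'ID': [1, 1, 2, 2],
    'Seq': [1, 2, 1, 2],
    '기출문제 공부 횟수': [6.0, np.nan, 8.0, 2.0],
})

# Count missing values per column and keep only the columns that have gaps
missing_per_column = toy.isna().sum()
print(missing_per_column[missing_per_column > 0])

# Show the affected rows so they can be inspected before deciding how to handle them
rows_with_missing = toy[toy['기출문제 공부 횟수'].isna()]
print(rows_with_missing)
```

Whether to drop or fill such rows depends on the analysis; this notebook leaves them as NaN.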

In [128]:

# Drop the 'Student ID' column
data.drop(columns = ['Student ID'], inplace = True)
data.head()

Out[128]:

ID	Seq	Gender	Birth_Year	LC_Score	RC_Score	Total Score	학습목표	학습방법	강의 학습 교재 유형	학습빈도	기출문제 공부 횟수	취약분야 인지 여부	토익 모의테스트 횟수
1	1	M	1973	181	173	354	자기계발	참고서	일반적인 영어 텍스트 기반 교재	주3-4회	6.0	알고 있지 않음	6
1	2	M	1973	227	213	440	자기계발	오프라인강의	뉴스/이슈 기반 교재	주1-2회	3.0	알고 있음	5
1	3	M	1973	345	336	681	승진	온라인강의	영상 교재	주5-6회	7.0	알고 있음	10
2	1	F	1982	330	290	620	자기계발	오프라인강의	뉴스/이슈 기반 교재	매일(주 7회)	8.0	알고 있지 않음	19
2	2	F	1982	354	339	693	승진	온라인강의	영상 교재	주5-6회	2.0	알고 있음	15

In [129]:

# Personal information goes into df1; study-related information goes into df2
# df1 :'ID', 'Gender', 'Birth_Year'
# df2 : 'ID','Seq', 'LC_Score', 'RC_Score', 'Total Score', '학습목표', '학습방법', '강의 학습 교재 유형', '학습빈도', '기출문제 공부 횟수', '취약분야 인지 여부', '토익 모의테스트 횟수'

df1 = data[['ID', 'Gender', 'Birth_Year']].copy()
df2 = data[['ID','Seq', 'LC_Score', 'RC_Score', 'Total Score', '학습목표', '학습방법', '강의 학습 교재 유형', '학습빈도', '기출문제 공부 횟수', '취약분야 인지 여부', '토익 모의테스트 횟수']].copy()

In [130]:

df1

Out[130]:

ID	Gender	Birth_Year
1	M	1973
1	M	1973
1	M	1973
2	F	1982
2	F	1982
...	...	...
499	F	1990
499	F	1990
500	M	1984
500	M	1984
500	M	1984

1500 rows × 3 columns

In [131]:

# Remove duplicate rows from df1 (keep one row per ID)
df1.drop_duplicates(subset = 'ID', keep = 'first', inplace = True)
df1.reset_index(drop = True, inplace = True)
df1

Out[131]:

ID	Gender	Birth_Year
1	M	1973
2	F	1982
3	F	1995
4	M	1987
5	M	1994
...	...	...
496	M	2006
497	F	1988
498	M	2006
499	F	1990
500	M	1984

500 rows × 3 columns
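`drop_duplicates(subset='ID', keep='first')` keeps the first row per ID without checking that the other rows agree with it. A small sketch of a consistency check that could be run beforehand, on a hypothetical toy frame (`toy`, `combos_per_id`, `inconsistent_ids`, and `personal` are illustrative names):

```python
import pandas as pd

# Hypothetical miniature of the personal-information columns
toy = pd.DataFrame({
    'ID': [1, 1, 2, 2],
    'Gender': ['M', 'M', 'F', 'F'],
    'Birth_Year': [1973, 1973, 1982, 1982],
})

# Number of distinct (Gender, Birth_Year) combinations recorded for each ID
combos_per_id = toy.drop_duplicates().groupby('ID').size()

# IDs whose personal information differs between rows (empty for this toy data)
inconsistent_ids = combos_per_id[combos_per_id > 1].index.tolist()
print(inconsistent_ids)

# With no inconsistencies, keeping the first row per ID loses nothing
personal = toy.drop_duplicates(subset='ID', keep='first').reset_index(drop=True)
print(personal)
```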

In [132]:

df2

Out[132]:

ID	Seq	LC_Score	RC_Score	Total Score	학습목표	학습방법	강의 학습 교재 유형	학습빈도	기출문제 공부 횟수	취약분야 인지 여부	토익 모의테스트 횟수
1	1	181	173	354	자기계발	참고서	일반적인 영어 텍스트 기반 교재	주3-4회	6.0	알고 있지 않음	6
1	2	227	213	440	자기계발	오프라인강의	뉴스/이슈 기반 교재	주1-2회	3.0	알고 있음	5
1	3	345	336	681	승진	온라인강의	영상 교재	주5-6회	7.0	알고 있음	10
2	1	330	290	620	자기계발	오프라인강의	뉴스/이슈 기반 교재	매일(주 7회)	8.0	알고 있지 않음	19
2	2	354	339	693	승진	온라인강의	영상 교재	주5-6회	2.0	알고 있음	15
...	...	...	...	...	...	...	...	...	...	...	...
499	2	378	326	704	승진	온라인강의	뉴스/이슈 기반 교재	주5-6회	6.0	알고 있지 않음	12
499	3	422	370	792	자기계발	오프라인강의	비즈니스 시뮬레이션(Role Play)	주3-4회	4.0	알고 있음	7
500	1	169	188	357	자기계발	참고서	일반적인 영어 텍스트 기반 교재	주3-4회	8.0	알고 있지 않음	2
500	2	172	190	362	자기계발	참고서	뉴스/이슈 기반 교재	매일(주 7회)	10.0	알고 있음	16
500	3	235	226	461	승진	오프라인강의	비즈니스 시뮬레이션(Role Play)	주5-6회	7.0	알고 있음	15

1500 rows × 12 columns

In [149]:

# Reshape the per-sitting records so that each student occupies a single row

# 1st TOEIC sitting
# .copy() makes temp1 an independent DataFrame, which avoids the SettingWithCopyWarning
# that an in-place rename on a slice of df2 would raise
temp1 = df2.loc[df2['Seq'] == 1].copy()
temp1 = temp1.rename(columns = {'LC_Score':'1st_LC_Score', 'RC_Score':'1st_RC_Score', 'Total Score':'1st_Total_Score'})

# 2nd TOEIC sitting
temp2 = df2.loc[df2['Seq'] == 2, ['ID','LC_Score','RC_Score','Total Score']]
temp2 = temp2.rename(columns = {'LC_Score':'2nd_LC_Score', 'RC_Score':'2nd_RC_Score', 'Total Score':'2nd_Total_Score'})

# 3rd TOEIC sitting
temp3 = df2.loc[df2['Seq'] == 3, ['ID','LC_Score','RC_Score','Total Score']]
temp3 = temp3.rename(columns = {'LC_Score':'3rd_LC_Score', 'RC_Score':'3rd_RC_Score', 'Total Score':'3rd_Total_Score'})

In [151]:

# Merge temp1, temp2, and temp3
score_merged_data1 = pd.merge(temp1, temp2, how = 'outer', on = 'ID')
score_merged_data2 = pd.merge(score_merged_data1, temp3, how = 'outer', on = 'ID')
score_merged_data2

Out[151]:

ID	Seq	1st_LC_Score	1st_RC_Score	1st_Total_Score	학습목표	학습방법	강의 학습 교재 유형	학습빈도	기출문제 공부 횟수	취약분야 인지 여부	토익 모의테스트 횟수	2nd_LC_Score	2nd_RC_Score	2nd_Total_Score	3rd_LC_Score	3rd_RC_Score	3rd_Total_Score
1	1	181	173	354	자기계발	참고서	일반적인 영어 텍스트 기반 교재	주3-4회	6.0	알고 있지 않음	6	227	213	440	345	336	681
2	1	330	290	620	자기계발	오프라인강의	뉴스/이슈 기반 교재	매일(주 7회)	8.0	알고 있지 않음	19	354	339	693	380	368	748
3	1	367	309	676	취업	온라인강의	영상 교재	매일(주 7회)	9.0	알고 있지 않음	7	396	365	761	416	382	798
4	1	470	285	755	자기계발	온라인강의	뉴스/이슈 기반 교재	주1-2회	7.0	알고 있지 않음	4	495	341	836	495	397	892
5	1	273	372	645	승진	오프라인강의	비즈니스 시뮬레이션(Role Play)	주5-6회	3.0	알고 있지 않음	13	314	426	740	398	437	835
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
496	1	347	315	662	자기계발	온라인강의	뉴스/이슈 기반 교재	주1-2회	7.0	알고 있지 않음	1	349	321	670	364	336	700
497	1	112	250	362	자기계발	온라인강의	영상 교재	주5-6회	4.0	알고 있지 않음	10	120	251	371	187	252	439
498	1	252	150	402	자기계발	온라인강의	영상 교재	주3-4회	6.0	알고 있지 않음	15	254	158	412	255	167	422
499	1	371	324	695	자기계발	오프라인강의	비즈니스 시뮬레이션(Role Play)	주1-2회	NaN	알고 있지 않음	5	378	326	704	422	370	792
500	1	169	188	357	자기계발	참고서	일반적인 영어 텍스트 기반 교재	주3-4회	8.0	알고 있지 않음	2	172	190	362	235	226	461

500 rows × 18 columns
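The three filter-rename-merge steps turn a long table (one row per sitting) into a wide one (one row per student). As an alternative sketch, not the approach used in this notebook, `DataFrame.pivot` can do the same reshaping in one call. The frame `long_scores` is a hypothetical miniature, and the resulting names (`LC_Score_1` and so on) differ from the `1st_LC_Score` convention used above:

```python
import pandas as pd

# Hypothetical long-format scores: one row per (ID, Seq)
long_scores = pd.DataFrame({
    'ID': [1, 1, 1, 2, 2, 2],
    'Seq': [1, 2, 3, 1, 2, 3],
    'LC_Score': [181, 227, 345, 330, 354, 380],
    'RC_Score': [173, 213, 336, 290, 339, 368],
})

# One row per ID; the columns become a (score name, Seq) MultiIndex
wide = long_scores.pivot(index='ID', columns='Seq', values=['LC_Score', 'RC_Score'])

# Flatten the MultiIndex into names such as 'LC_Score_1' and 'RC_Score_3'
wide.columns = [f'{name}_{seq}' for name, seq in wide.columns]
wide = wide.reset_index()
print(wide)
```

`pivot` raises an error if an (ID, Seq) pair appears more than once, which doubles as a check that every student has at most one record per sitting.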

In [153]:

# Merge the personal-information data (df1) with the TOEIC score and study data (score_merged_data2)
baseline_data = pd.merge(df1, score_merged_data2, how = 'outer', on = 'ID')
baseline_data.head()

Out[153]:

ID	Gender	Birth_Year	Seq	1st_LC_Score	1st_RC_Score	1st_Total_Score	학습목표	학습방법	강의 학습 교재 유형	학습빈도	기출문제 공부 횟수	취약분야 인지 여부	토익 모의테스트 횟수	2nd_LC_Score	2nd_RC_Score	2nd_Total_Score	3rd_LC_Score	3rd_RC_Score	3rd_Total_Score
1	M	1973	1	181	173	354	자기계발	참고서	일반적인 영어 텍스트 기반 교재	주3-4회	6.0	알고 있지 않음	6	227	213	440	345	336	681
2	F	1982	1	330	290	620	자기계발	오프라인강의	뉴스/이슈 기반 교재	매일(주 7회)	8.0	알고 있지 않음	19	354	339	693	380	368	748
3	F	1995	1	367	309	676	취업	온라인강의	영상 교재	매일(주 7회)	9.0	알고 있지 않음	7	396	365	761	416	382	798
4	M	1987	1	470	285	755	자기계발	온라인강의	뉴스/이슈 기반 교재	주1-2회	7.0	알고 있지 않음	4	495	341	836	495	397	892
5	M	1994	1	273	372	645	승진	오프라인강의	비즈니스 시뮬레이션(Role Play)	주5-6회	3.0	알고 있지 않음	13	314	426	740	398	437	835
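Both sides of this merge are expected to hold exactly one row per ID. A minimal sketch of how that expectation could be enforced with the `validate` argument of `pd.merge`; the frames `people`, `scores`, and `duplicated_scores` are hypothetical stand-ins:

```python
import pandas as pd

# Hypothetical stand-ins for df1 and score_merged_data2
people = pd.DataFrame({'ID': [1, 2], 'Gender': ['M', 'F']})
scores = pd.DataFrame({'ID': [1, 2], '1st_Total_Score': [354, 620]})

# validate='one_to_one' raises pandas.errors.MergeError if either side repeats an ID
merged = pd.merge(people, scores, how='outer', on='ID', validate='one_to_one')

# A duplicated key on one side is rejected instead of silently multiplying rows
duplicated_scores = pd.concat([scores, scores], ignore_index=True)
try:
    pd.merge(people, duplicated_scores, on='ID', validate='one_to_one')
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True

print(merged.shape, duplicate_detected)
```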

In [158]:

# Difference between the 1st and 2nd total scores (1st minus 2nd), stored as 'Score_diff_total'
baseline_data['Score_diff_total'] = baseline_data['1st_Total_Score']-baseline_data['2nd_Total_Score']
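Because the line above subtracts the 2nd total from the 1st, a student who improved gets a negative value. A small sketch of that sign convention next to the opposite one; the `gain_*` column names are hypothetical and do not appear in the saved file:

```python
import pandas as pd

# Totals for two students, in the same column layout as baseline_data
scores = pd.DataFrame({
    '1st_Total_Score': [354, 620],
    '2nd_Total_Score': [440, 693],
    '3rd_Total_Score': [681, 748],
})

# Same convention as the notebook: 1st minus 2nd, so an improvement is negative
scores['Score_diff_total'] = scores['1st_Total_Score'] - scores['2nd_Total_Score']

# Hypothetical extra columns: later minus earlier, so an improvement is positive
scores['gain_1_to_2'] = scores['2nd_Total_Score'] - scores['1st_Total_Score']
scores['gain_2_to_3'] = scores['3rd_Total_Score'] - scores['2nd_Total_Score']
print(scores)
```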

In [163]:

# Save as a CSV file named data04_baseline.csv

baseline_data.to_csv('data04_baseline.csv', index = False)

Cleaning the TOEIC diagnostic assessment data 2

In [177]:

!pip install matplotlib
!pip install --upgrade matplotlib

import matplotlib.pyplot as plt

Requirement already satisfied: matplotlib in c:\users\user\anaconda3\lib\site-packages (3.9.2)
Requirement already satisfied: contourpy>=1.0.1 in c:\users\user\anaconda3\lib\site-packages (from matplotlib) (1.2.0)
Requirement already satisfied: cycler>=0.10 in c:\users\user\anaconda3\lib\site-packages (from matplotlib) (0.11.0)
Requirement already satisfied: fonttools>=4.22.0 in c:\users\user\anaconda3\lib\site-packages (from matplotlib) (4.51.0)
Requirement already satisfied: kiwisolver>=1.3.1 in c:\users\user\anaconda3\lib\site-packages (from matplotlib) (1.4.4)
Requirement already satisfied: numpy>=1.23 in c:\users\user\anaconda3\lib\site-packages (from matplotlib) (1.26.4)
Requirement already satisfied: packaging>=20.0 in c:\users\user\anaconda3\lib\site-packages (from matplotlib) (23.2)
Requirement already satisfied: pillow>=8 in c:\users\user\anaconda3\lib\site-packages (from matplotlib) (10.3.0)
Requirement already satisfied: pyparsing>=2.3.1 in c:\users\user\anaconda3\lib\site-packages (from matplotlib) (3.0.9)
Requirement already satisfied: python-dateutil>=2.7 in c:\users\user\anaconda3\lib\site-packages (from matplotlib) (2.9.0.post0)
Requirement already satisfied: six>=1.5 in c:\users\user\anaconda3\lib\site-packages (from python-dateutil>=2.7->matplotlib) (1.16.0)

In [178]:

plt.rc('font', family='Malgun Gothic')

In [206]:

data = pd.read_csv('data04_baseline.csv')

In [208]:

data.columns

Out[208]:

Index(['ID', 'Gender', 'Birth_Year', 'Seq', '1st_LC_Score', '1st_RC_Score',
       '1st_Total_Score', '학습목표', '학습방법', '강의 학습 교재 유형', '학습빈도', '기출문제 공부 횟수',
       '취약분야 인지 여부', '토익 모의테스트 횟수', '2nd_LC_Score', '2nd_RC_Score',
       '2nd_Total_Score', '3rd_LC_Score', '3rd_RC_Score', '3rd_Total_Score',
       'Score_diff_total'],
      dtype='object')

In [210]:

# Reorder the columns
data = data[['ID', 'Gender', 'Birth_Year', 'Seq', 
       '학습목표', '학습방법', '강의 학습 교재 유형', '학습빈도', '기출문제 공부 횟수',
       '취약분야 인지 여부', '토익 모의테스트 횟수','1st_LC_Score', '1st_RC_Score', '1st_Total_Score', '2nd_LC_Score', '2nd_RC_Score',
       '2nd_Total_Score', '3rd_LC_Score', '3rd_RC_Score', '3rd_Total_Score',
       'Score_diff_total']]

In [212]:

data

Out[212]:

ID	Gender	Birth_Year	Seq	학습목표	학습방법	강의 학습 교재 유형	학습빈도	기출문제 공부 횟수	취약분야 인지 여부	...	1st_LC_Score	1st_RC_Score	1st_Total_Score	2nd_LC_Score	2nd_RC_Score	2nd_Total_Score	3rd_LC_Score	3rd_RC_Score	3rd_Total_Score	Score_diff_total
1	M	1973	1	자기계발	참고서	일반적인 영어 텍스트 기반 교재	주3-4회	6.0	알고 있지 않음	...	181	173	354	227	213	440	345	336	681	-86
2	F	1982	1	자기계발	오프라인강의	뉴스/이슈 기반 교재	매일(주 7회)	8.0	알고 있지 않음	...	330	290	620	354	339	693	380	368	748	-73
3	F	1995	1	취업	온라인강의	영상 교재	매일(주 7회)	9.0	알고 있지 않음	...	367	309	676	396	365	761	416	382	798	-85
4	M	1987	1	자기계발	온라인강의	뉴스/이슈 기반 교재	주1-2회	7.0	알고 있지 않음	...	470	285	755	495	341	836	495	397	892	-81
5	M	1994	1	승진	오프라인강의	비즈니스 시뮬레이션(Role Play)	주5-6회	3.0	알고 있지 않음	...	273	372	645	314	426	740	398	437	835	-95
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
496	M	2006	1	자기계발	온라인강의	뉴스/이슈 기반 교재	주1-2회	7.0	알고 있지 않음	...	347	315	662	349	321	670	364	336	700	-8
497	F	1988	1	자기계발	온라인강의	영상 교재	주5-6회	4.0	알고 있지 않음	...	112	250	362	120	251	371	187	252	439	-9
498	M	2006	1	자기계발	온라인강의	영상 교재	주3-4회	6.0	알고 있지 않음	...	252	150	402	254	158	412	255	167	422	-10
499	F	1990	1	자기계발	오프라인강의	비즈니스 시뮬레이션(Role Play)	주1-2회	NaN	알고 있지 않음	...	371	324	695	378	326	704	422	370	792	-9
500	M	1984	1	자기계발	참고서	일반적인 영어 텍스트 기반 교재	주3-4회	8.0	알고 있지 않음	...	169	188	357	172	190	362	235	226	461	-5

500 rows × 21 columns

In [220]:

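# Recode Gender from ['M', 'F'] to [1, 2]
data['Gender'] = data['Gender'].replace({'M':1, 'F':2})

# Alternative using map(): data['Gender'].map({'M':1, 'F':2})
# astype() returns a new Series; the result is only displayed here, not assigned back
data['Gender'].astype(int)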

Out[220]:

0      1
1      2
2      2
3      1
4      1
      ..
495    1
496    2
497    1
498    2
499    1
Name: Gender, Length: 500, dtype: int32
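`replace()` and `map()` give the same result when every value is in the dictionary, but they differ on unexpected values: `replace()` leaves them untouched, while `map()` turns them into NaN. A short sketch with a hypothetical Series that contains one unexpected code ('X' is made up for illustration):

```python
import pandas as pd

# 'X' is a hypothetical unexpected value, not present in the real data
gender = pd.Series(['M', 'F', 'F', 'M', 'X'])

# map() turns anything missing from the dictionary into NaN, so the result is float
mapped = gender.map({'M': 1, 'F': 2})
print(mapped.isna().sum())

# The nullable integer dtype keeps whole numbers while still allowing missing values
mapped_int = mapped.astype('Int64')
print(mapped_int)
```

Counting the NaN values after `map()` is a quick way to confirm that the dictionary covers every category.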

In [222]:

data['Birth_Year']

Out[222]:

0      1973
1      1982
2      1995
3      1987
4      1994
       ... 
495    2006
496    1988
497    2006
498    1990
499    1984
Name: Birth_Year, Length: 500, dtype: int64

In [232]:

data['Birth_Year'].describe()

Out[232]:

count     500.000000
mean     1992.906000
std         8.224381
min      1973.000000
25%      1986.750000
50%      1992.500000
75%      2000.000000
max      2007.000000
Name: Birth_Year, dtype: float64

In [252]:

# Number of people per Birth_Year
data['Birth_Year'].value_counts()

Out[252]:

Birth_Year
1992    25
1990    24
1988    22
1995    21
1986    21
1989    19
1994    19
2006    19
1991    19
1993    19
2003    17
1985    17
1997    17
1999    16
2005    16
2001    16
1998    16
2007    16
1984    16
1983    16
1987    16
2002    15
2000    15
1982    14
2004    14
1996    14
1981    13
1980     8
1979     6
1978     3
1974     3
1973     3
1975     2
1977     2
1976     1
Name: count, dtype: int64
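`value_counts()` orders the result by frequency, as the output above shows. For a chart or table that reads chronologically, the counts can be re-sorted by year with `sort_index()`. A minimal sketch on a hypothetical Series (`birth_year` and its values are illustrative):

```python
import pandas as pd

# Hypothetical birth years
birth_year = pd.Series([1992, 1990, 1992, 1988, 1990, 1992], name='Birth_Year')

# Default order: most frequent value first
by_frequency = birth_year.value_counts()

# Chronological order: sort by the index, which holds the years
by_year = by_frequency.sort_index()
print(by_year)
```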

In [258]:

# There are so many ways to visualize this...
plt.bar(data['Birth_Year'].value_counts().index, data['Birth_Year'].value_counts())
plt.show()  # display the figure (plt.plot() with no arguments only returns an empty list)

In [269]:

# The select_dtypes() method
# Select only the object-dtype columns
data.select_dtypes(include = 'object')

Out[269]:

학습목표	학습방법	강의 학습 교재 유형	학습빈도	취약분야 인지 여부
자기계발	참고서	일반적인 영어 텍스트 기반 교재	주3-4회	알고 있지 않음
자기계발	오프라인강의	뉴스/이슈 기반 교재	매일(주 7회)	알고 있지 않음
취업	온라인강의	영상 교재	매일(주 7회)	알고 있지 않음
자기계발	온라인강의	뉴스/이슈 기반 교재	주1-2회	알고 있지 않음
승진	오프라인강의	비즈니스 시뮬레이션(Role Play)	주5-6회	알고 있지 않음
...	...	...	...	...
자기계발	온라인강의	뉴스/이슈 기반 교재	주1-2회	알고 있지 않음
자기계발	온라인강의	영상 교재	주5-6회	알고 있지 않음
자기계발	온라인강의	영상 교재	주3-4회	알고 있지 않음
자기계발	오프라인강의	비즈니스 시뮬레이션(Role Play)	주1-2회	알고 있지 않음
자기계발	참고서	일반적인 영어 텍스트 기반 교재	주3-4회	알고 있지 않음

500 rows × 5 columns

In [271]:

# Extract just the column names from that selection
data.select_dtypes(include = 'object').columns.values

Out[271]:

array(['학습목표', '학습방법', '강의 학습 교재 유형', '학습빈도', '취약분야 인지 여부'], dtype=object)

In [274]:

# Count how often each value of '학습목표' (study goal) occurs
data['학습목표'].value_counts()

Out[274]:

학습목표
자기계발    354
취업       81
승진       65
Name: count, dtype: int64

In [280]:

# Visualization
plt.figure(figsize = (5,3))
target = data['학습목표'].value_counts().sort_index()
target.plot(kind= 'bar')
plt.show()

In [300]:

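# Print the counts and proportions of each value of '강의 학습 교재 유형' (type of course material)

# Method 1: value_counts() with normalize=True
print(data['강의 학습 교재 유형'].value_counts())
print()
print(data['강의 학습 교재 유형'].value_counts(normalize  = True))
print()

# Method 2: divide the counts by their sum
s = data['강의 학습 교재 유형'].value_counts().sum()
print(data['강의 학습 교재 유형'].value_counts()/s)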

강의 학습 교재 유형
영상 교재                    138
일반적인 영어 텍스트 기반 교재        134
뉴스/이슈 기반 교재              134
비즈니스 시뮬레이션(Role Play)     94
Name: count, dtype: int64

강의 학습 교재 유형
영상 교재                    0.276
일반적인 영어 텍스트 기반 교재        0.268
뉴스/이슈 기반 교재              0.268
비즈니스 시뮬레이션(Role Play)    0.188
Name: proportion, dtype: float64

강의 학습 교재 유형
영상 교재                    0.276
일반적인 영어 텍스트 기반 교재        0.268
뉴스/이슈 기반 교재              0.268
비즈니스 시뮬레이션(Role Play)    0.188
Name: count, dtype: float64
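Both methods print the counts and the proportions as separate Series. As a small sketch, the two can also be placed side by side in one table. The Series `material` is a hypothetical miniature, and `summary` is an illustrative name:

```python
import pandas as pd

# Hypothetical miniature of the '강의 학습 교재 유형' column
material = pd.Series(
    ['영상 교재', '영상 교재', '뉴스/이슈 기반 교재', '영상 교재'],
    name='강의 학습 교재 유형',
)

# Both Series share the same index (the category labels), so they align row by row
summary = pd.DataFrame({
    'count': material.value_counts(),
    'proportion': material.value_counts(normalize=True),
})
print(summary)
```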