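Variable review
First, for convenience in the later steps, set the working directory and load the required packages. Check the variable names with names(), then use str() or summary() to get an overview of the whole dataset.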
library('readxl')
## Warning: package 'readxl' was built under R version 3.6.3
airbnb<-read_excel('data/airbnb.xlsx')
air<-airbnb
names(air)
## [1] "id" "log_price" "property_type"
## [4] "room_type" "amenities" "accommodates"
## [7] "bathrooms" "bed_type" "cancellation_policy"
## [10] "cleaning_fee" "city" "description"
## [13] "first_review" "host_has_profile_pic" "host_identity_verified"
## [16] "host_response_rate" "host_since" "instant_bookable"
## [19] "last_review" "latitude" "longitude"
## [22] "name" "neighbourhood" "number_of_reviews"
## [25] "review_scores_rating" "thumbnail_url" "zipcode"
## [28] "bedrooms" "beds"
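The str() and summary() calls mentioned above are not shown in the output. A minimal sketch of that overview step, run on a small made-up data frame (the `toy` object and its values are illustrative assumptions, not rows from airbnb.xlsx):

```r
# Toy stand-in for the real data; values are illustrative only.
toy <- data.frame(
  log_price     = c(5.01, 4.13, 4.98),
  property_type = c("Apartment", "House", "Loft"),
  bathrooms     = c(1, 1.5, NA),
  stringsAsFactors = FALSE
)

str(toy)      # column types and the first few values of each column
summary(toy)  # per-column summaries, including NA counts for numeric columns
```

On the real data the same two calls would be applied to `air`.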
Convert “property_type” into three categories: ‘House’, ‘Apartment’, and ‘Other’.
air$property_type<-ifelse(air$property_type=='House','House',ifelse(air$property_type=='Apartment','Apartment','Other'))
Convert “bed_type” into two categories: ‘Bed’ and ‘Other’.
air$bed_type<-ifelse(air$bed_type=='Airbed'|air$bed_type=='Real Bed','Bed','Other')
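One way to check a recoding like the two above is to cross-tabulate the original values against the recoded ones. The sketch below uses made-up category values (assumptions, not the actual levels in the data) and writes the same mapping with %in%. Note that %in% sends missing values to ‘Other’, whereas == propagates NA.

```r
# Illustrative values only; the real columns come from airbnb.xlsx.
property_type <- c("House", "Apartment", "Loft", "Condominium", "House")
bed_type      <- c("Real Bed", "Futon", "Airbed", "Couch", "Real Bed")

# Same mapping as the nested ifelse() calls above for non-missing values.
property_3 <- ifelse(property_type %in% c("House", "Apartment"),
                     property_type, "Other")
bed_2      <- ifelse(bed_type %in% c("Airbed", "Real Bed"), "Bed", "Other")

# Cross-tabulate original against recoded values to verify the mapping.
table(property_type, property_3)
table(bed_type, bed_2)
```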
Use only the observations with 11 or more “number_of_reviews” in the analysis.
air<-air[!(air$number_of_reviews<11),]
Create the ‘price ratio (price_ratio)’ variable.
library('dplyr')
## Warning: package 'dplyr' was built under R version 3.6.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
air$original_price<-exp(air$log_price)
city_price<-aggregate(original_price~city,air,mean)
names(city_price)[2]<-c('city_price')
air<-left_join(air,city_price,by='city')
air$price_ratio<-air$original_price/air$city_price*100
1. Report the mean and standard deviation of the “price ratio (price_ratio)” variable.
mean(air$price_ratio)
## [1] 100
sd(air$price_ratio)
## [1] 83.72338
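The mean of exactly 100 follows from how price_ratio is built: within each city the ratios sum to 100 times the number of listings in that city, so the overall mean is 100 whatever the prices are. A small sketch with made-up prices for two hypothetical cities (the `toy` data frame is an assumption for illustration):

```r
# Made-up prices for two hypothetical cities.
toy <- data.frame(
  city           = c("A", "A", "A", "B", "B"),
  original_price = c(100, 200, 300, 50, 150)
)

# ave() returns the group mean aligned to each row (a base-R alternative
# to the aggregate() + left_join() steps above).
toy$city_price  <- ave(toy$original_price, toy$city)
toy$price_ratio <- toy$original_price / toy$city_price * 100

mean(toy$price_ratio)  # 100 by construction
sd(toy$price_ratio)    # same value as sqrt(var(toy$price_ratio))
```

Only the standard deviation therefore carries information about how prices vary relative to the city average.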
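2. Perform a linear regression with “price ratio (price_ratio)” as the dependent variable.
Data preprocessing
Variables stored as t/f are converted to TRUE/FALSE. The character-valued amenities field is converted into the number of amenities each listing offers, computed as the number of commas plus 1. In addition, the first_review, host_since, and last_review fields often contain dates in the future, so they are treated as anomalous and are not used as regression variables.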
air$host_has_profile_pic<-ifelse(air$host_has_profile_pic=='t',TRUE,FALSE)
air$host_identity_verified<-ifelse(air$host_identity_verified=='t',TRUE,FALSE)
air$instant_bookable<-ifelse(air$instant_bookable=='t',TRUE,FALSE)
library('stringr')
## Warning: package 'stringr' was built under R version 3.6.3
air$amenity_num<-str_count(air$amenities,',')+1
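A short sketch of what the comma-count rule produces, using illustrative strings in a brace-delimited style that is assumed here for the amenities column. It uses base R so that it runs without extra packages; stringr::str_count(amenities, ",") gives the same comma counts. One edge case worth knowing: an entry with no amenities at all is still counted as 1.

```r
# Illustrative strings; the exact format of `amenities` is an assumption.
amenities <- c("{TV,Wifi,Kitchen}", "{Wifi}", "{}")

# Count commas per string, then add 1.
n_commas    <- lengths(regmatches(amenities, gregexpr(",", amenities, fixed = TRUE)))
amenity_num <- n_commas + 1
amenity_num  # the empty "{}" entry is also counted as 1

# Hypothetical adjustment: treat an empty list as zero amenities.
amenity_num_adj <- ifelse(amenities == "{}", 0, amenity_num)
amenity_num_adj
```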
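Removing unnecessary variables
Variables unrelated to the listing itself, such as id and the thumbnail URL, are excluded. The source columns of derived variables, such as amenities and original_price, are excluded as well. zipcode, the US postal code, looks like it could be a meaningful variable, but including it made the steps below too computationally heavy to complete, so it is excluded. Missing values make up only a small share of the data, so rows containing them are dropped. The steps below are carried out under the assumption that the variables are independent of one another.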
air2<-subset(air,select=-c(id,log_price,amenities,description,first_review,host_since,last_review,name,neighbourhood,thumbnail_url,zipcode,original_price,city_price))
table(is.na(air2))
##
## FALSE TRUE
## 578412 1908
air2<-na.omit(air2)
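table(is.na(...)) counts missing cells, while na.omit() drops whole rows, so the share of rows lost can be larger than the share of missing cells. A minimal sketch on a made-up data frame (the `toy` values are assumptions) showing how both can be checked before dropping rows:

```r
# Made-up data with one missing value in each column, on different rows.
toy <- data.frame(
  bathrooms            = c(1, NA, 2, 1),
  review_scores_rating = c(95, 88, NA, 100),
  beds                 = c(1, 2, 2, NA)
)

table(is.na(toy))           # counts cells, as in the output above
colSums(is.na(toy))         # missing cells per column
mean(!complete.cases(toy))  # share of rows that na.omit() would drop

nrow(na.omit(toy))          # rows remaining after the drop
```

In this toy case a quarter of the cells are missing but three quarters of the rows are removed; on the real data the same calls on `air2` would show the actual figures.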
Fitting the regression model
lm1<-lm(price_ratio~property_type+room_type+accommodates+bathrooms+bed_type+cancellation_policy+cleaning_fee+city+host_has_profile_pic+host_identity_verified+host_response_rate+instant_bookable+latitude+longitude+number_of_reviews+review_scores_rating+bedrooms+beds+amenity_num,data=air2)
steplm<-step(lm1, direction='both')
## Start: AIC=220683.8
## price_ratio ~ property_type + room_type + accommodates + bathrooms +
## bed_type + cancellation_policy + cleaning_fee + city + host_has_profile_pic +
## host_identity_verified + host_response_rate + instant_bookable +
## latitude + longitude + number_of_reviews + review_scores_rating +
## bedrooms + beds + amenity_num
##
## Df Sum of Sq RSS AIC
## - host_has_profile_pic 1 308 89030290 220682
## - bed_type 1 3035 89033017 220683
## <none> 89029982 220684
## - host_identity_verified 1 8336 89038318 220684
## - latitude 1 11553 89041535 220685
## - property_type 2 44049 89074031 220693
## - cleaning_fee 1 59222 89089204 220700
## - amenity_num 1 75567 89105549 220705
## - instant_bookable 1 77971 89107953 220706
## - host_response_rate 1 110888 89140870 220716
## - number_of_reviews 1 121535 89151517 220719
## - beds 1 169479 89199461 220734
## - cancellation_policy 4 751814 89781796 220905
## - review_scores_rating 1 1568201 90598183 221158
## - accommodates 1 2046720 91076702 221302
## - longitude 1 3156825 92186807 221632
## - city 5 4175441 93205423 221924
## - bedrooms 1 4679603 93709585 222079
## - bathrooms 1 6682660 95712642 222655
## - room_type 2 10202136 99232118 223638
##
## Step: AIC=220681.9
## price_ratio ~ property_type + room_type + accommodates + bathrooms +
## bed_type + cancellation_policy + cleaning_fee + city + host_identity_verified +
## host_response_rate + instant_bookable + latitude + longitude +
## number_of_reviews + review_scores_rating + bedrooms + beds +
## amenity_num
##
## Df Sum of Sq RSS AIC
## - bed_type 1 3024 89033314 220681
## <none> 89030290 220682
## - host_identity_verified 1 8218 89038508 220682
## - latitude 1 11581 89041871 220683
## + host_has_profile_pic 1 308 89029982 220684
## - property_type 2 44009 89074299 220691
## - cleaning_fee 1 59203 89089493 220698
## - amenity_num 1 75496 89105786 220703
## - instant_bookable 1 77933 89108223 220704
## - host_response_rate 1 110908 89141198 220714
## - number_of_reviews 1 121487 89151777 220717
## - beds 1 169245 89199535 220732
## - cancellation_policy 4 751671 89781961 220903
## - review_scores_rating 1 1568185 90598474 221156
## - accommodates 1 2046585 91076875 221300
## - longitude 1 3156518 92186808 221630
## - city 5 4175208 93205497 221922
## - bedrooms 1 4679824 93710114 222077
## - bathrooms 1 6682550 95712840 222653
## - room_type 2 10202679 99232969 223636
##
## Step: AIC=220680.8
## price_ratio ~ property_type + room_type + accommodates + bathrooms +
## cancellation_policy + cleaning_fee + city + host_identity_verified +
## host_response_rate + instant_bookable + latitude + longitude +
## number_of_reviews + review_scores_rating + bedrooms + beds +
## amenity_num
##
## Df Sum of Sq RSS AIC
## <none> 89033314 220681
## - host_identity_verified 1 8326 89041640 220681
## + bed_type 1 3024 89030290 220682
## - latitude 1 11534 89044849 220682
## + host_has_profile_pic 1 297 89033017 220683
## - property_type 2 44020 89077335 220690
## - cleaning_fee 1 59151 89092465 220697
## - amenity_num 1 75070 89108385 220702
## - instant_bookable 1 78672 89111986 220703
## - host_response_rate 1 110801 89144116 220713
## - number_of_reviews 1 121141 89154456 220716
## - beds 1 170681 89203995 220731
## - cancellation_policy 4 750121 89783435 220902
## - review_scores_rating 1 1569488 90602802 221155
## - accommodates 1 2047791 91081106 221299
## - longitude 1 3155985 92189299 221629
## - city 5 4174961 93208275 221920
## - bedrooms 1 4681040 93714354 222076
## - bathrooms 1 6679581 95712895 222651
## - room_type 2 10269581 99302895 223653
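The AIC column in the trace follows the definition that extractAIC() uses for linear models, n·log(RSS/n) + 2p, where p is the number of estimated coefficients. This differs from the value returned by AIC() by an additive constant, which does not affect the ranking of models. The selection stops when <none> is at the top of the table, that is, when no single addition or removal lowers the AIC further. A minimal check of the formula, using the built-in mtcars data purely as a stand-in:

```r
# mtcars ships with R and is used here only as a stand-in dataset.
fit <- lm(mpg ~ wt + hp, data = mtcars)

n   <- nrow(mtcars)
rss <- sum(resid(fit)^2)
p   <- length(coef(fit))  # estimated coefficients, including the intercept

aic_manual <- n * log(rss / n) + 2 * p
aic_step   <- extractAIC(fit)[2]  # the criterion that step() minimizes

all.equal(aic_manual, aic_step)
```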
summary(steplm)
##
## Call:
## lm(formula = price_ratio ~ property_type + room_type + accommodates +
## bathrooms + cancellation_policy + cleaning_fee + city + host_identity_verified +
## host_response_rate + instant_bookable + latitude + longitude +
## number_of_reviews + review_scores_rating + bedrooms + beds +
## amenity_num, data = air2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -412.68 -26.01 -3.81 18.17 992.64
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.406e+03 3.515e+02 -26.760 < 2e-16 ***
## property_typeHouse 1.025e+00 9.175e-01 1.117 0.264107
## property_typeOther 4.119e+00 1.122e+00 3.669 0.000244 ***
## room_typePrivate room -4.625e+01 8.756e-01 -52.819 < 2e-16 ***
## room_typeShared room -7.431e+01 2.529e+00 -29.383 < 2e-16 ***
## accommodates 8.371e+00 3.344e-01 25.031 < 2e-16 ***
## bathrooms 3.492e+01 7.724e-01 45.207 < 2e-16 ***
## cancellation_policymoderate 1.955e+00 1.164e+00 1.679 0.093071 .
## cancellation_policystrict 9.082e+00 1.117e+00 8.131 4.44e-16 ***
## cancellation_policysuper_strict_30 6.185e+01 1.090e+01 5.676 1.40e-08 ***
## cancellation_policysuper_strict_60 3.920e+02 4.051e+01 9.675 < 2e-16 ***
## cleaning_feeTRUE -4.334e+00 1.019e+00 -4.254 2.11e-05 ***
## cityChicago -2.073e+03 6.673e+01 -31.059 < 2e-16 ***
## cityDC -7.141e+02 3.036e+01 -23.522 < 2e-16 ***
## cityLA -5.817e+03 1.963e+02 -29.638 < 2e-16 ***
## cityNYC -3.306e+02 1.459e+01 -22.659 < 2e-16 ***
## citySF -6.358e+03 2.085e+02 -30.497 < 2e-16 ***
## host_identity_verifiedTRUE 1.379e+00 8.638e-01 1.596 0.110482
## host_response_rate -2.068e+01 3.551e+00 -5.822 5.86e-09 ***
## instant_bookableTRUE -3.813e+00 7.772e-01 -4.906 9.34e-07 ***
## latitude 9.694e+00 5.160e+00 1.879 0.060310 .
## longitude -1.247e+02 4.013e+00 -31.074 < 2e-16 ***
## number_of_reviews -4.446e-02 7.303e-03 -6.088 1.16e-09 ***
## review_scores_rating 1.759e+00 8.028e-02 21.914 < 2e-16 ***
## bedrooms 2.514e+01 6.644e-01 37.845 < 2e-16 ***
## beds -3.505e+00 4.850e-01 -7.226 5.09e-13 ***
## amenity_num 2.608e-01 5.441e-02 4.793 1.66e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 57.17 on 27241 degrees of freedom
## Multiple R-squared: 0.5354, Adjusted R-squared: 0.5349
## F-statistic: 1207 on 26 and 27241 DF, p-value: < 2.2e-16
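The overall F-test is significant (p < 2.2e-16), but R^2 is fairly low at 0.5354.
Checking normality and homoscedasticity
In the residual analysis, the Q-Q plot does not follow a straight line, and the residual plot shows a progressively widening spread. To correct this, the response variable Y is log-transformed.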
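The diagnostic plots themselves are not reproduced here. As a sketch of how such plots can be drawn and why a log transform helps, the example below simulates data with multiplicative noise (all names and parameter values are assumptions for illustration) and compares the residual and Q-Q plots before and after transforming the response:

```r
set.seed(1)
# Simulated data with multiplicative noise, so the spread grows with the mean.
x <- runif(500, 1, 10)
y <- exp(0.3 * x + rnorm(500, sd = 0.4))

fit_raw <- lm(y ~ x)       # response on the original scale
fit_log <- lm(log(y) ~ x)  # response on the log scale

op <- par(mfrow = c(2, 2))
plot(fit_raw, which = 1)  # residuals vs fitted, original scale
plot(fit_raw, which = 2)  # normal Q-Q, original scale
plot(fit_log, which = 1)  # residuals vs fitted, log scale
plot(fit_log, which = 2)  # normal Q-Q, log scale
par(op)
```

For the models in this report, the equivalent calls would be plot(steplm, which = 1) and plot(steplm, which = 2).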
Fitting the regression model after log transformation
air2$log_price_ratio<-log(air2$price_ratio)
lm2<-lm(log_price_ratio~property_type+room_type+accommodates+bathrooms+bed_type+cancellation_policy+cleaning_fee+city+host_has_profile_pic+host_identity_verified+host_response_rate+instant_bookable+latitude+longitude+number_of_reviews+review_scores_rating+bedrooms+beds+amenity_num,data=air2)
steplm2<-step(lm2, direction='both')
## Start: AIC=-53791.79
## log_price_ratio ~ property_type + room_type + accommodates +
## bathrooms + bed_type + cancellation_policy + cleaning_fee +
## city + host_has_profile_pic + host_identity_verified + host_response_rate +
## instant_bookable + latitude + longitude + number_of_reviews +
## review_scores_rating + bedrooms + beds + amenity_num
##
## Df Sum of Sq RSS AIC
## - host_has_profile_pic 1 0.05 3784.4 -53793
## - number_of_reviews 1 0.22 3784.6 -53792
## <none> 3784.4 -53792
## - host_identity_verified 1 0.32 3784.7 -53791
## - cleaning_fee 1 0.38 3784.7 -53791
## - latitude 1 0.51 3784.9 -53790
## - bed_type 1 1.32 3785.7 -53784
## - instant_bookable 1 5.99 3790.4 -53751
## - host_response_rate 1 6.23 3790.6 -53749
## - property_type 2 10.95 3795.3 -53717
## - amenity_num 1 13.23 3797.6 -53699
## - beds 1 22.06 3806.4 -53635
## - cancellation_policy 4 24.08 3808.4 -53627
## - bathrooms 1 75.10 3859.5 -53258
## - accommodates 1 130.37 3914.7 -52870
## - review_scores_rating 1 131.41 3915.8 -52863
## - bedrooms 1 200.86 3985.2 -52384
## - longitude 1 244.84 4029.2 -52084
## - city 5 292.63 4077.0 -51771
## - room_type 2 1748.75 5533.1 -43437
##
## Step: AIC=-53793.45
## log_price_ratio ~ property_type + room_type + accommodates +
## bathrooms + bed_type + cancellation_policy + cleaning_fee +
## city + host_identity_verified + host_response_rate + instant_bookable +
## latitude + longitude + number_of_reviews + review_scores_rating +
## bedrooms + beds + amenity_num
##
## Df Sum of Sq RSS AIC
## - number_of_reviews 1 0.22 3784.6 -53794
## <none> 3784.4 -53793
## - host_identity_verified 1 0.31 3784.7 -53793
## - cleaning_fee 1 0.38 3784.8 -53793
## + host_has_profile_pic 1 0.05 3784.4 -53792
## - latitude 1 0.51 3784.9 -53792
## - bed_type 1 1.32 3785.7 -53786
## - instant_bookable 1 5.98 3790.4 -53752
## - host_response_rate 1 6.23 3790.6 -53751
## - property_type 2 10.93 3795.4 -53719
## - amenity_num 1 13.22 3797.6 -53700
## - beds 1 22.03 3806.4 -53637
## - cancellation_policy 4 24.06 3808.5 -53629
## - bathrooms 1 75.09 3859.5 -53260
## - accommodates 1 130.36 3914.8 -52872
## - review_scores_rating 1 131.41 3915.8 -52865
## - bedrooms 1 200.81 3985.2 -52386
## - longitude 1 244.80 4029.2 -52086
## - city 5 292.60 4077.0 -51773
## - room_type 2 1748.81 5533.2 -43439
##
## Step: AIC=-53793.83
## log_price_ratio ~ property_type + room_type + accommodates +
## bathrooms + bed_type + cancellation_policy + cleaning_fee +
## city + host_identity_verified + host_response_rate + instant_bookable +
## latitude + longitude + review_scores_rating + bedrooms +
## beds + amenity_num
##
## Df Sum of Sq RSS AIC
## - host_identity_verified 1 0.26 3784.9 -53794
## <none> 3784.6 -53794
## + number_of_reviews 1 0.22 3784.4 -53793
## - cleaning_fee 1 0.34 3785.0 -53793
## + host_has_profile_pic 1 0.05 3784.6 -53792
## - latitude 1 0.53 3785.2 -53792
## - bed_type 1 1.33 3786.0 -53786
## - instant_bookable 1 6.14 3790.8 -53752
## - host_response_rate 1 6.43 3791.1 -53750
## - property_type 2 11.08 3795.7 -53718
## - amenity_num 1 13.13 3797.8 -53701
## - beds 1 22.08 3806.7 -53637
## - cancellation_policy 4 23.91 3808.6 -53630
## - bathrooms 1 75.48 3860.1 -53257
## - accommodates 1 130.17 3914.8 -52874
## - review_scores_rating 1 131.85 3916.5 -52862
## - bedrooms 1 202.70 3987.3 -52373
## - longitude 1 244.66 4029.3 -52088
## - city 5 292.53 4077.2 -51774
## - room_type 2 1749.47 5534.1 -43437
##
## Step: AIC=-53793.94
## log_price_ratio ~ property_type + room_type + accommodates +
## bathrooms + bed_type + cancellation_policy + cleaning_fee +
## city + host_response_rate + instant_bookable + latitude +
## longitude + review_scores_rating + bedrooms + beds + amenity_num
##
## Df Sum of Sq RSS AIC
## <none> 3784.9 -53794
## + host_identity_verified 1 0.26 3784.6 -53794
## - cleaning_fee 1 0.30 3785.2 -53794
## + number_of_reviews 1 0.18 3784.7 -53793
## + host_has_profile_pic 1 0.04 3784.9 -53792
## - latitude 1 0.52 3785.4 -53792
## - bed_type 1 1.32 3786.2 -53786
## - instant_bookable 1 6.39 3791.3 -53750
## - host_response_rate 1 6.42 3791.3 -53750
## - property_type 2 11.06 3796.0 -53718
## - amenity_num 1 13.55 3798.5 -53698
## - beds 1 22.07 3807.0 -53637
## - cancellation_policy 4 24.17 3809.1 -53628
## - bathrooms 1 75.44 3860.3 -53258
## - accommodates 1 130.16 3915.1 -52874
## - review_scores_rating 1 132.33 3917.2 -52859
## - bedrooms 1 202.68 3987.6 -52374
## - longitude 1 244.72 4029.6 -52088
## - city 5 292.28 4077.2 -51776
## - room_type 2 1749.32 5534.2 -43438
summary(steplm2)
##
## Call:
## lm(formula = log_price_ratio ~ property_type + room_type + accommodates +
## bathrooms + bed_type + cancellation_policy + cleaning_fee +
## city + host_response_rate + instant_bookable + latitude +
## longitude + review_scores_rating + bedrooms + beds + amenity_num,
## data = air2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5592 -0.2379 -0.0065 0.2303 2.5579
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.805e+01 2.291e+00 -34.064 < 2e-16 ***
## property_typeHouse -4.528e-02 5.979e-03 -7.573 3.75e-14 ***
## property_typeOther 1.722e-02 7.317e-03 2.353 0.01861 *
## room_typePrivate room -5.861e-01 5.711e-03 -102.628 < 2e-16 ***
## room_typeShared room -1.110e+00 1.674e-02 -66.303 < 2e-16 ***
## accommodates 6.672e-02 2.180e-03 30.608 < 2e-16 ***
## bathrooms 1.173e-01 5.033e-03 23.302 < 2e-16 ***
## bed_typeOther -4.623e-02 1.499e-02 -3.083 0.00205 **
## cancellation_policymoderate 3.010e-02 7.566e-03 3.979 6.93e-05 ***
## cancellation_policystrict 7.560e-02 7.254e-03 10.421 < 2e-16 ***
## cancellation_policysuper_strict_30 4.080e-01 7.103e-02 5.744 9.37e-09 ***
## cancellation_policysuper_strict_60 6.677e-01 2.641e-01 2.528 0.01148 *
## cleaning_feeTRUE -9.787e-03 6.613e-03 -1.480 0.13891
## cityChicago -1.826e+01 4.345e-01 -42.012 < 2e-16 ***
## cityDC -6.339e+00 1.976e-01 -32.075 < 2e-16 ***
## cityLA -5.137e+01 1.278e+00 -40.197 < 2e-16 ***
## cityNYC -2.992e+00 9.495e-02 -31.508 < 2e-16 ***
## citySF -5.602e+01 1.358e+00 -41.263 < 2e-16 ***
## host_response_rate -1.570e-01 2.309e-02 -6.797 1.09e-11 ***
## instant_bookableTRUE -3.420e-02 5.043e-03 -6.782 1.21e-11 ***
## latitude 6.535e-02 3.363e-02 1.943 0.05201 .
## longitude -1.097e+00 2.614e-02 -41.968 < 2e-16 ***
## review_scores_rating 1.614e-02 5.230e-04 30.862 < 2e-16 ***
## bedrooms 1.651e-01 4.322e-03 38.194 < 2e-16 ***
## beds -3.987e-02 3.164e-03 -12.604 < 2e-16 ***
## amenity_num 3.489e-03 3.533e-04 9.876 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3727 on 27242 degrees of freedom
## Multiple R-squared: 0.6468, Adjusted R-squared: 0.6464
## F-statistic: 1995 on 25 and 27242 DF, p-value: < 2.2e-16
The model is significant overall (p < 2.2e-16), and R^2 is 0.6468.
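Because the response is now log(price_ratio), each coefficient acts multiplicatively on the price ratio: a coefficient b corresponds to a change of (exp(b) − 1) × 100 percent, holding the other predictors fixed. A short sketch using three coefficients copied from the summary(steplm2) output above (the vector `b` is just a convenient container for them):

```r
# Coefficients copied from the summary(steplm2) output above.
b <- c(
  "room_typePrivate room" = -5.861e-01,
  "room_typeShared room"  = -1.110e+00,
  "bedrooms"              =  1.651e-01
)

# Percent change in price_ratio for a one-unit change (or versus the
# reference level for a factor), holding the other predictors fixed.
pct_change <- (exp(b) - 1) * 100
round(pct_change, 1)
```

For example, a private room is associated with a price ratio roughly 44% lower than the reference room type.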
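Checking normality and homoscedasticity
The Q-Q plot curves slightly at both ends, but not severely, so normality can be regarded as satisfied. Unlike the earlier plot, the residual plot shows a random pattern, which can be interpreted as satisfying homoscedasticity.
Evaluating predictive performance
Next, the data are split into training and test sets, the regression is fitted in the same way as above, and its predictive performance is checked. Evaluating the model on held-out data helps detect overfitting, so that the reported predictive performance is meaningful.
Splitting into training and test data
The ratio of training to test data is 7:3.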
nobs=nrow(air2)
set.seed(999)
i = sample(1:nobs, round(nobs*0.7))
train = air2[i,]
test = air2[-i,]
nrow(train);nrow(test)
## [1] 19088
## [1] 8180
Fitting the regression model
lm3<-lm(log_price_ratio~property_type+room_type+accommodates+bathrooms+bed_type+cancellation_policy+cleaning_fee+city+host_has_profile_pic+host_identity_verified+host_response_rate+instant_bookable+latitude+longitude+number_of_reviews+review_scores_rating+bedrooms+beds+amenity_num,data=train)
steplm3<-step(lm3, direction='both')
## Start: AIC=-37807.25
## log_price_ratio ~ property_type + room_type + accommodates +
## bathrooms + bed_type + cancellation_policy + cleaning_fee +
## city + host_has_profile_pic + host_identity_verified + host_response_rate +
## instant_bookable + latitude + longitude + number_of_reviews +
## review_scores_rating + bedrooms + beds + amenity_num
##
## Df Sum of Sq RSS AIC
## - host_has_profile_pic 1 0.00 2625.7 -37809
## - number_of_reviews 1 0.13 2625.8 -37808
## - latitude 1 0.14 2625.8 -37808
## <none> 2625.7 -37807
## - cleaning_fee 1 0.30 2626.0 -37807
## - host_identity_verified 1 0.37 2626.0 -37807
## - bed_type 1 1.27 2626.9 -37800
## - instant_bookable 1 4.71 2630.4 -37775
## - host_response_rate 1 5.29 2631.0 -37771
## - property_type 2 6.50 2632.2 -37764
## - amenity_num 1 12.09 2637.8 -37722
## - beds 1 15.43 2641.1 -37697
## - cancellation_policy 4 17.34 2643.0 -37690
## - bathrooms 1 52.60 2678.3 -37431
## - review_scores_rating 1 83.57 2709.3 -37211
## - accommodates 1 90.21 2715.9 -37164
## - bedrooms 1 132.90 2758.6 -36867
## - longitude 1 166.97 2792.6 -36632
## - city 5 201.62 2827.3 -36405
## - room_type 2 1232.58 3858.3 -30465
##
## Step: AIC=-37809.22
## log_price_ratio ~ property_type + room_type + accommodates +
## bathrooms + bed_type + cancellation_policy + cleaning_fee +
## city + host_identity_verified + host_response_rate + instant_bookable +
## latitude + longitude + number_of_reviews + review_scores_rating +
## bedrooms + beds + amenity_num
##
## Df Sum of Sq RSS AIC
## - number_of_reviews 1 0.13 2625.8 -37810
## - latitude 1 0.14 2625.8 -37810
## <none> 2625.7 -37809
## - cleaning_fee 1 0.30 2626.0 -37809
## - host_identity_verified 1 0.36 2626.0 -37809
## + host_has_profile_pic 1 0.00 2625.7 -37807
## - bed_type 1 1.27 2627.0 -37802
## - instant_bookable 1 4.71 2630.4 -37777
## - host_response_rate 1 5.29 2631.0 -37773
## - property_type 2 6.49 2632.2 -37766
## - amenity_num 1 12.09 2637.8 -37724
## - beds 1 15.43 2641.1 -37699
## - cancellation_policy 4 17.34 2643.0 -37692
## - bathrooms 1 52.60 2678.3 -37433
## - review_scores_rating 1 83.57 2709.3 -37213
## - accommodates 1 90.21 2715.9 -37166
## - bedrooms 1 132.96 2758.6 -36868
## - longitude 1 166.98 2792.7 -36634
## - city 5 201.63 2827.3 -36407
## - room_type 2 1232.60 3858.3 -30467
##
## Step: AIC=-37810.3
## log_price_ratio ~ property_type + room_type + accommodates +
## bathrooms + bed_type + cancellation_policy + cleaning_fee +
## city + host_identity_verified + host_response_rate + instant_bookable +
## latitude + longitude + review_scores_rating + bedrooms +
## beds + amenity_num
##
## Df Sum of Sq RSS AIC
## - latitude 1 0.15 2626.0 -37811
## <none> 2625.8 -37810
## - cleaning_fee 1 0.28 2626.1 -37810
## - host_identity_verified 1 0.33 2626.1 -37810
## + number_of_reviews 1 0.13 2625.7 -37809
## + host_has_profile_pic 1 0.00 2625.8 -37808
## - bed_type 1 1.27 2627.1 -37803
## - instant_bookable 1 4.82 2630.6 -37777
## - host_response_rate 1 5.43 2631.2 -37773
## - property_type 2 6.57 2632.4 -37767
## - amenity_num 1 12.04 2637.8 -37725
## - beds 1 15.46 2641.3 -37700
## - cancellation_policy 4 17.26 2643.1 -37693
## - bathrooms 1 52.85 2678.7 -37432
## - review_scores_rating 1 83.80 2709.6 -37213
## - accommodates 1 90.09 2715.9 -37168
## - bedrooms 1 134.16 2760.0 -36861
## - longitude 1 167.02 2792.8 -36635
## - city 5 201.73 2827.5 -36407
## - room_type 2 1233.32 3859.1 -30464
##
## Step: AIC=-37811.21
## log_price_ratio ~ property_type + room_type + accommodates +
## bathrooms + bed_type + cancellation_policy + cleaning_fee +
## city + host_identity_verified + host_response_rate + instant_bookable +
## longitude + review_scores_rating + bedrooms + beds + amenity_num
##
## Df Sum of Sq RSS AIC
## <none> 2626.0 -37811
## - cleaning_fee 1 0.29 2626.2 -37811
## - host_identity_verified 1 0.32 2626.3 -37811
## + latitude 1 0.15 2625.8 -37810
## + number_of_reviews 1 0.14 2625.8 -37810
## + host_has_profile_pic 1 0.00 2626.0 -37809
## - bed_type 1 1.28 2627.2 -37804
## - instant_bookable 1 4.79 2630.8 -37778
## - host_response_rate 1 5.43 2631.4 -37774
## - property_type 2 6.64 2632.6 -37767
## - amenity_num 1 12.13 2638.1 -37725
## - beds 1 15.45 2641.4 -37701
## - cancellation_policy 4 17.28 2643.2 -37694
## - bathrooms 1 52.92 2678.9 -37432
## - review_scores_rating 1 83.73 2709.7 -37214
## - accommodates 1 90.04 2716.0 -37170
## - bedrooms 1 134.04 2760.0 -36863
## - longitude 1 167.92 2793.9 -36630
## - city 5 204.41 2830.4 -36390
## - room_type 2 1233.24 3859.2 -30466
summary(steplm3)
##
## Call:
## lm(formula = log_price_ratio ~ property_type + room_type + accommodates +
## bathrooms + bed_type + cancellation_policy + cleaning_fee +
## city + host_identity_verified + host_response_rate + instant_bookable +
## longitude + review_scores_rating + bedrooms + beds + amenity_num,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.56204 -0.23730 -0.00908 0.22935 2.51367
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.477e+01 2.221e+00 -33.672 < 2e-16 ***
## property_typeHouse -4.147e-02 7.120e-03 -5.824 5.82e-09 ***
## property_typeOther 1.690e-02 8.687e-03 1.945 0.05179 .
## room_typePrivate room -5.897e-01 6.812e-03 -86.571 < 2e-16 ***
## room_typeShared room -1.122e+00 2.002e-02 -56.033 < 2e-16 ***
## accommodates 6.745e-02 2.638e-03 25.566 < 2e-16 ***
## bathrooms 1.166e-01 5.951e-03 19.601 < 2e-16 ***
## bed_typeOther -5.323e-02 1.747e-02 -3.047 0.00231 **
## cancellation_policymoderate 2.212e-02 9.048e-03 2.445 0.01450 *
## cancellation_policystrict 7.183e-02 8.701e-03 8.255 < 2e-16 ***
## cancellation_policysuper_strict_30 3.860e-01 7.989e-02 4.831 1.37e-06 ***
## cancellation_policysuper_strict_60 6.593e-01 2.633e-01 2.504 0.01229 *
## cleaning_feeTRUE -1.131e-02 7.864e-03 -1.439 0.15025
## cityChicago -1.818e+01 5.183e-01 -35.065 < 2e-16 ***
## cityDC -6.526e+00 1.861e-01 -35.065 < 2e-16 ***
## cityLA -5.162e+01 1.476e+00 -34.961 < 2e-16 ***
## cityNYC -3.073e+00 9.051e-02 -33.955 < 2e-16 ***
## citySF -5.599e+01 1.604e+00 -34.901 < 2e-16 ***
## host_identity_verifiedTRUE 1.023e-02 6.673e-03 1.533 0.12522
## host_response_rate -1.710e-01 2.724e-02 -6.279 3.49e-10 ***
## instant_bookableTRUE -3.560e-02 6.034e-03 -5.900 3.71e-09 ***
## longitude -1.091e+00 3.124e-02 -34.913 < 2e-16 ***
## review_scores_rating 1.541e-02 6.251e-04 24.654 < 2e-16 ***
## bedrooms 1.611e-01 5.164e-03 31.194 < 2e-16 ***
## beds -4.020e-02 3.796e-03 -10.590 < 2e-16 ***
## amenity_num 3.949e-03 4.208e-04 9.385 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3712 on 19062 degrees of freedom
## Multiple R-squared: 0.6509, Adjusted R-squared: 0.6504
## F-statistic: 1422 on 25 and 19062 DF, p-value: < 2.2e-16
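The model is significant overall (p < 2.2e-16), with R^2 = 0.6509 on the training data.
Checking normality and homoscedasticity
Normality and homoscedasticity are judged to be satisfied.
Predictive performance
The predictive performance of the regression model is evaluated using the predictive coefficient of determination (R^2), mean absolute error (MAE), mean absolute percentage error (MAPE), and root mean squared error (RMSE).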
pred<-predict(steplm3,newdata=test,type='response')
# predictive R^2 (squared correlation between observed and predicted values)
cor(test$log_price_ratio,pred)^2
## [1] 0.6365724
# MAE
mean(abs(test$log_price_ratio-pred))
## [1] 0.2896426
# MAPE
mean(abs(test$log_price_ratio-pred)/abs(test$log_price_ratio)*100)
## [1] 6.73777
# RMSE
sqrt(mean((test$log_price_ratio-pred)^2))
## [1] 0.3766409
Predictive performance is also adequate: the test-set R^2 of 0.637 is close to the training R^2 of 0.651.
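The four metrics can be collected in one helper so that they are computed consistently. The sketch below is a hypothetical helper (`eval_metrics` is not part of the original analysis) run on made-up values. It also reports 1 − SSE/SST alongside the squared correlation; the two definitions of predictive R^2 can differ when predictions are biased, and the former is never larger than the latter.

```r
# Hypothetical helper; `obs` and `pred` are numeric vectors of equal length.
eval_metrics <- function(obs, pred) {
  err <- obs - pred
  c(
    r2_cor = cor(obs, pred)^2,                           # squared correlation
    r2_sse = 1 - sum(err^2) / sum((obs - mean(obs))^2),  # 1 - SSE/SST
    mae    = mean(abs(err)),
    mape   = mean(abs(err) / abs(obs)) * 100,
    rmse   = sqrt(mean(err^2))
  )
}

# Made-up values on the log scale, for illustration only.
obs  <- c(4.2, 4.8, 5.1, 3.9, 4.6)
pred <- c(4.0, 4.9, 5.0, 4.2, 4.5)
eval_metrics(obs, pred)
```

With the objects from this report the call would be eval_metrics(test$log_price_ratio, pred). All of these values are on the log scale; to express errors in units of price_ratio, both vectors would first be passed through exp().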