Data source: Kaggle, googleplaystore

Description

1. Data Input

The file is downloaded directly from the website and then opened.  
# Read googleplaystore.csv
googleplay <- read.csv("googleplaystore.csv", header = TRUE, stringsAsFactors = FALSE)
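
After loading, it can help to check that the file parsed as expected before cleaning it. The sketch below uses a small inline sample in place of googleplaystore.csv so that it runs on its own. The rows are hypothetical (not taken from the real file), and `sample_csv` and `sample_df` are illustrative names.

```r
# Hypothetical three-row sample standing in for googleplaystore.csv
sample_csv <- 'App,Category,Rating,Reviews,Installs,Type,Price
"Demo App, Pro",TOOLS,4.2,1500,"10,000+",Free,0
Demo Game,GAME,NaN,12,100+,Free,0
Paid Demo,FAMILY,4.8,230000,"1,000,000+",Paid,$2.99'

# Parse the inline text the same way the real file is read
sample_df <- read.csv(text = sample_csv, header = TRUE, stringsAsFactors = FALSE)

dim(sample_df)             # number of rows and columns
str(sample_df)             # column types right after loading
sapply(sample_df, class)   # Installs and Price are still character at this point
```

Running the same three inspection calls on `googleplay` shows which columns still need conversion in the cleaning step.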

2. Data Cleaning

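Each column is converted to a data type suitable for analysis.  
# Data cleaning and type conversion
# Drop row 10473 (apparently a malformed record) before converting,
# so its values do not produce coercion warnings or stray factor levels
googleplay <- googleplay[-10473, ]
googleplay$Category <- as.factor(googleplay$Category)
googleplay$Reviews <- as.numeric(googleplay$Reviews)
# Strip the "$" sign from Price and the "+" and "," from Installs
googleplay$Price <- as.numeric(gsub("[$]", "", googleplay$Price))
googleplay$Installs <- as.numeric(gsub("[+,]", "", googleplay$Installs))
googleplay$Type <- as.factor(googleplay$Type)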
A small part of the cleaned dataset is shown below.

| App | Category | Rating | Reviews | Size | Installs | Type | Price | Content.Rating | Genres | Last.Updated | Current.Ver | Android.Ver |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19M | 1e+04 | Free | 0 | Everyone | Art & Design | January 7, 2018 | 1.0.0 | 4.0.3 and up |
| Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14M | 5e+05 | Free | 0 | Everyone | Art & Design;Pretend Play | January 15, 2018 | 2.0.0 | 4.0.3 and up |
| U Launcher Lite – FREE Live Cool Themes, Hide Apps | ART_AND_DESIGN | 4.7 | 87510 | 8.7M | 5e+06 | Free | 0 | Everyone | Art & Design | August 1, 2018 | 1.2.4 | 4.0.3 and up |
| Sketch - Draw & Paint | ART_AND_DESIGN | 4.5 | 215644 | 25M | 5e+07 | Free | 0 | Teen | Art & Design | June 8, 2018 | Varies with device | 4.2 and up |
| Pixel Draw - Number Art Coloring Book | ART_AND_DESIGN | 4.3 | 967 | 2.8M | 1e+05 | Free | 0 | Everyone | Art & Design;Creativity | June 20, 2018 | 1.1 | 4.4 and up |
| Paper flowers instructions | ART_AND_DESIGN | 4.4 | 167 | 5.6M | 5e+04 | Free | 0 | Everyone | Art & Design | March 26, 2017 | 1.0 | 2.3 and up |
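
The cleaning step relies on `gsub()` to strip formatting characters before `as.numeric()` is applied. As a minimal illustration, the sketch below runs the same two patterns on a few hypothetical raw values; `raw_installs` and `raw_price` are made-up inputs in the same format as the Installs and Price columns.

```r
# Hypothetical raw values in the same format as the Installs and Price columns
raw_installs <- c("10,000+", "500,000+", "1,000,000+", "0")
raw_price    <- c("0", "$2.99", "$399.99")

# "[+,]" is a character class matching either "+" or ","
installs <- as.numeric(gsub("[+,]", "", raw_installs))

# "[$]" matches a literal dollar sign (a bare "$" would mean end of string)
price <- as.numeric(gsub("[$]", "", raw_price))

installs
price

# A value that is still not numeric after stripping becomes NA (with a warning)
not_a_number <- suppressWarnings(as.numeric(gsub("[+,]", "", "Free")))
is.na(not_a_number)
```

The last two lines show why a misaligned row matters: text that ends up in a numeric column is silently turned into NA once the warning is ignored.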

3. Plotting

Now for the main part: plotting. First, load the ggplot2 package.
library(ggplot2)

a) App types:

A bar chart of the number of apps in each category.
# stat = "count" tallies the rows per Category, so no binwidth is needed
my.plot_a <- ggplot(googleplay, aes(x = Category)) + layer(
  geom = "bar", stat = "count", position = "identity",
  params = list(fill = "steelblue", na.rm = FALSE)
) + labs(title = "App Types") + theme(
  # linewidth replaces the deprecated size argument (ggplot2 >= 3.4.0)
  plot.background = element_rect(colour = "black", linewidth = 3, linetype = 4, fill = "lightblue"),
  plot.title = element_text(colour = "black", face = "bold", size = 30, vjust = 1, hjust = 0.5),
  plot.margin = unit(c(0.2, 0.2, 0.2, 0.2), "inches"),
  # Rotate the category labels so they do not overlap
  axis.text.x = element_text(angle = 90, family = "Calibri", hjust = 1, vjust = 0.5)
)
my.plot_a

The Family, Game, and Tools categories contain the most apps.  
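
The ranking read off the bar chart can also be checked numerically. The following is a minimal sketch on a hypothetical vector of category labels (`category` stands in for `googleplay$Category`), counting the apps per category and sorting the counts.

```r
# Hypothetical category labels standing in for googleplay$Category
category <- factor(c("FAMILY", "GAME", "TOOLS", "FAMILY", "GAME",
                     "FAMILY", "MEDICAL", "TOOLS", "FAMILY"))

# Count apps per category and sort from most to least common
category_counts <- sort(table(category), decreasing = TRUE)

head(category_counts, 3)   # the three largest categories
```

On the real data, the same two calls with `googleplay$Category` in place of `category` list the counts behind the bars.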

b) Rating Distribution:

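A density-style curve of the rating distribution (1 to 5) across all apps.
# stat = "bin" with geom = "density" draws a filled curve of counts per 0.2-wide bin,
# so the y-axis shows counts rather than a kernel density estimate
my.plot_b <- ggplot(googleplay, aes(x = Rating)) + xlim(0, 5) + layer(
  geom = "density", stat = "bin", position = "identity",
  params = list(
    fill = "steelblue",
    binwidth = 0.2,
    na.rm = TRUE  # apps without a rating are dropped silently
  )
) + labs(title = "Rating distribution") + theme(
  # linewidth replaces the deprecated size argument (ggplot2 >= 3.4.0)
  plot.background = element_rect(colour = "black", linewidth = 3, linetype = 4, fill = "lightblue"),
  plot.title = element_text(colour = "black", face = "bold", size = 30, vjust = 1, hjust = 0.5),
  plot.margin = unit(c(0.2, 0.2, 0.2, 0.2), "inches")
)
my.plot_b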

Most apps have an average Rating between 4 and 4.5.
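
One way to back this reading of the curve with a number is to compute the share of rated apps that fall in that range. The sketch below does so on hypothetical ratings; `rating` stands in for `googleplay$Rating`, and `share_mid` is an illustrative name.

```r
# Hypothetical ratings standing in for googleplay$Rating (NaN = unrated app)
rating <- c(4.1, 3.9, 4.7, 4.5, 4.3, 4.4, NaN, 2.8, 4.0, 5.0)

summary(rating)   # quartiles, mean, and the number of missing values

# TRUE where the rating lies between 4 and 4.5 (NA for unrated apps)
in_range <- rating >= 4 & rating <= 4.5

# Fraction of rated apps in the range; na.rm = TRUE skips unrated apps
share_mid <- mean(in_range, na.rm = TRUE)
share_mid
```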

c) Rating vs Installs:

Installs (download counts) are split into six levels from low to high, and a box plot of Rating is drawn for each level.
b_ins <- c(0, 5000, 50000, 500000, 5000000, 50000000, Inf)
lev_ins <- c("0\n~5,000", "5,000\n~50,000", "50,000\n~500,000", "500,000\n~5,000,000", "5,000,000\n~50,000,000", ">50,000,000")
# Keep only complete rows (this drops apps without a Rating)
googleplay_complete <- na.omit(googleplay)
# cut() assigns each Installs value to one of the six right-closed intervals
my.plot_c <- ggplot(
  googleplay_complete,
  aes(x = cut(Installs, breaks = b_ins, labels = lev_ins), y = Rating)
) + geom_boxplot() + labs(title = "Rating vs Installs", x = "Installs") + theme(
  # linewidth replaces the deprecated size argument (ggplot2 >= 3.4.0)
  plot.background = element_rect(colour = "black", linewidth = 3, linetype = 4, fill = "lightblue"),
  plot.title = element_text(colour = "black", face = "bold", size = 30, vjust = 1, hjust = 0.5),
  plot.margin = unit(c(0.2, 0.2, 0.2, 0.2), "inches")
)
my.plot_c

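We expected apps with more installs to have better ratings. The plot is consistent with this guess, although Rating changes only slightly across levels. Interestingly, the average Rating is actually higher among apps with fewer than 5,000 installs. In addition, the distribution of outliers shows a clear positive correlation with the number of installs, which is also worth investigating.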
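
To see how the six levels are assigned and to compare them numerically, the binning can be tried on a few values. This is a sketch on hypothetical data: `installs` and `rating` are made-up vectors, the labels are written without the line breaks used for the plot axis, and the median is used because it is the center line a box plot draws.

```r
# Same breaks as in the plot; labels simplified (no line breaks)
b_ins <- c(0, 5000, 50000, 500000, 5000000, 50000000, Inf)
lev_ins <- c("0~5,000", "5,000~50,000", "50,000~500,000",
             "500,000~5,000,000", "5,000,000~50,000,000", ">50,000,000")

# Hypothetical install counts and ratings
installs <- c(100, 5000, 10000, 50000, 1e6, 1e8)
rating   <- c(4.6, 4.4, 4.0, 4.2, 4.3, 4.5)

# Intervals are right-closed by default, so 5000 falls in "0~5,000"
level <- cut(installs, breaks = b_ins, labels = lev_ins)

table(level)                                    # apps per level
median_by_level <- tapply(rating, level, median)
median_by_level                                 # NA for levels with no apps
```

Because the intervals are right-closed and start at 0, an app with exactly 0 installs would not be assigned to any level by this call.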