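Download the file directly from the website, then open it. Setting stringsAsFactors = FALSE keeps text columns as character vectors, so the imported data types are not silently changed.
# Read googleplaystore.csv
googleplay <- read.delim("googleplaystore.csv", sep = ",", header = TRUE, stringsAsFactors = FALSE)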
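To see why this option matters, here is a minimal sketch on a made-up two-row CSV. The inline text and the names `csv_text`, `as_chr`, and `as_fct` are illustrative assumptions, not part of the dataset. When a column is read as a factor, `as.numeric()` returns the internal level codes rather than the numbers written in the file. (Since R 4.0.0, `stringsAsFactors = FALSE` is already the default.)

```r
# Toy CSV standing in for googleplaystore.csv (illustrative data only)
csv_text <- "App,Installs\nA,\"1,000+\"\nB,\"500+\""

as_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)
as_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_chr$Installs)  # character: safe to clean with gsub()
class(as_fct$Installs)  # factor: must be converted to character first

# On a factor, as.numeric() yields level codes, not the original values
codes <- as.numeric(as_fct$Installs)
codes
```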
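Convert each column to a type suited to analysis.
# Data cleaning and type conversion
# Drop row 10473 up front (presumably a malformed record) so that it does not
# leave stray factor levels or trigger coercion warnings below
googleplay <- googleplay[-10473, ]
googleplay$Category <- as.factor(googleplay$Category)
googleplay$Reviews <- as.numeric(googleplay$Reviews)
# Strip the "$" sign from prices, and "+" and "," from install counts
googleplay$Price <- as.numeric(gsub("[$]", "", googleplay$Price))
googleplay$Installs <- as.numeric(gsub("[+,]", "", googleplay$Installs))
googleplay$Type <- as.factor(googleplay$Type)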
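The string cleaning above can be tried on a few sample values first. This is a sketch under assumptions: the vectors `price_raw` and `installs_raw` are made-up examples in the same format as the raw columns, with one deliberately non-numeric entry each to show what a misaligned row would produce.

```r
# Sample values in the same format as the raw columns (illustrative only)
price_raw    <- c("0", "$4.99", "Everyone")
installs_raw <- c("10,000+", "5,000,000+", "Free")

# Strip "$", "+" and "," before converting; suppressWarnings() hides the
# "NAs introduced by coercion" warning caused by the non-numeric entries
price    <- suppressWarnings(as.numeric(gsub("[$]", "", price_raw)))
installs <- suppressWarnings(as.numeric(gsub("[+,]", "", installs_raw)))

# Entries that failed to convert become NA, so bad rows can be located
# with which() instead of hard-coding a row index
bad_rows <- which(is.na(installs))
bad_rows
```

On the real data, the same `which(is.na(...))` check is one way to confirm which rows need to be removed.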
Below is a small sample of the cleaned dataset.
App | Category | Rating | Reviews | Size | Installs | Type | Price | Content.Rating | Genres | Last.Updated | Current.Ver | Android.Ver |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19M | 1e+04 | Free | 0 | Everyone | Art & Design | January 7, 2018 | 1.0.0 | 4.0.3 and up |
Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14M | 5e+05 | Free | 0 | Everyone | Art & Design;Pretend Play | January 15, 2018 | 2.0.0 | 4.0.3 and up |
U Launcher Lite – FREE Live Cool Themes, Hide Apps | ART_AND_DESIGN | 4.7 | 87510 | 8.7M | 5e+06 | Free | 0 | Everyone | Art & Design | August 1, 2018 | 1.2.4 | 4.0.3 and up |
Sketch - Draw & Paint | ART_AND_DESIGN | 4.5 | 215644 | 25M | 5e+07 | Free | 0 | Teen | Art & Design | June 8, 2018 | Varies with device | 4.2 and up |
Pixel Draw - Number Art Coloring Book | ART_AND_DESIGN | 4.3 | 967 | 2.8M | 1e+05 | Free | 0 | Everyone | Art & Design;Creativity | June 20, 2018 | 1.1 | 4.4 and up |
Paper flowers instructions | ART_AND_DESIGN | 4.4 | 167 | 5.6M | 5e+04 | Free | 0 | Everyone | Art & Design | March 26, 2017 | 1.0 | 2.3 and up |
Now for the main part: plotting! First, load the ggplot2 package.
library(ggplot2)
# Bar chart of the number of apps in each category
my.plot_a <- ggplot(googleplay, aes(x = Category)) + layer(
  geom = "bar", stat = "count", position = "identity",
  # binwidth does not apply to stat = "count", so it is omitted here
  params = list(fill = "steelblue", na.rm = FALSE)
) + labs(title = "App Types") + theme(
  # linewidth replaces size for element_rect() in ggplot2 >= 3.4.0; use size on older versions
  plot.background = element_rect(colour = "black", linewidth = 3, linetype = 4, fill = "lightblue"),
  plot.title = element_text(colour = "black", face = "bold", size = 30, vjust = 1, hjust = 0.5),
  plot.margin = unit(c(0.2, 0.2, 0.2, 0.2), "inches"),
  # Rotate the category labels so they do not overlap
  axis.text.x = element_text(angle = 90, family = "Calibri", hjust = 1, vjust = 0.5)
)
my.plot_a
The chart shows that the Family, Game, and Tools categories contain the most apps.
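The ranking read off the bar chart can also be checked numerically. The sketch below uses a toy `category` vector (made-up values, not the real counts) to show the idea; on the real data the equivalent call would be `head(sort(table(googleplay$Category), decreasing = TRUE), 3)`.

```r
# Toy Category column (illustrative values, not the real counts)
category <- factor(c("FAMILY", "GAME", "FAMILY", "TOOLS", "FAMILY", "GAME"))

# table() counts each level; sort() puts the largest categories first
counts <- sort(table(category), decreasing = TRUE)

# Show the three most common categories
head(counts, 3)
```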
# Distribution of ratings: a density-style area drawn from binned counts,
# so the y axis shows the number of apps per 0.2-wide bin
my.plot_b <- ggplot(googleplay, aes(x = Rating)) + xlim(0, 5) + layer(
  geom = "density", stat = "bin", position = "identity",
  params = list(fill = "steelblue", binwidth = 0.2, na.rm = FALSE)
) + labs(title = "Rating distribution") + theme(
  # linewidth replaces size for element_rect() in ggplot2 >= 3.4.0; use size on older versions
  plot.background = element_rect(colour = "black", linewidth = 3, linetype = 4, fill = "lightblue"),
  plot.title = element_text(colour = "black", face = "bold", size = 30, vjust = 1, hjust = 0.5),
  plot.margin = unit(c(0.2, 0.2, 0.2, 0.2), "inches")
)
my.plot_b
Most apps have an average rating between 4 and 4.5.
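A quick way to put a number on this observation is to compute the share of ratings that fall in that range. The sketch below uses the six ratings from the sample table above plus one missing value; the vector name `rating` is an assumption for illustration. On the real data, `googleplay$Rating` would take its place.

```r
# Ratings from the six sample rows shown earlier, plus one missing value
rating <- c(4.1, 3.9, 4.7, 4.5, 4.3, 4.4, NaN)

# Comparisons give TRUE/FALSE (or NA for missing ratings); the mean of a
# logical vector is the proportion of TRUE values once NAs are dropped
share <- mean(rating >= 4 & rating <= 4.5, na.rm = TRUE)
share
```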
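# Bin boundaries and axis labels for the install counts
b_ins <- c(0, 5000, 50000, 500000, 5000000, 50000000, Inf)
lev_ins <- c("0\n~5,000", "5,000\n~50,000", "50,000\n~500,000", "500,000\n~5,000,000", "5,000,000\n~50,000,000", ">50,000,000")
# Box plot of Rating per install bin; rows with missing values are dropped first.
# Installs is referenced by column name inside aes() rather than through $,
# and include.lowest = TRUE keeps apps with 0 installs in the first bin
my.plot_c <- ggplot(na.omit(googleplay), aes(x = cut(Installs, breaks = b_ins, labels = lev_ins, include.lowest = TRUE), y = Rating)) +
  geom_boxplot() + labs(title = "Rating vs Installs", x = "Installs") + theme(
  # linewidth replaces size for element_rect() in ggplot2 >= 3.4.0; use size on older versions
  plot.background = element_rect(colour = "black", linewidth = 3, linetype = 4, fill = "lightblue"),
  plot.title = element_text(colour = "black", face = "bold", size = 30, vjust = 1, hjust = 0.5),
  plot.margin = unit(c(0.2, 0.2, 0.2, 0.2), "inches")
)
my.plot_c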
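I had expected apps with more downloads to have better ratings. The chart is consistent with this guess, although the ratings do not change much from one install bin to the next. Interestingly, apps with fewer than 5,000 installs actually show a higher typical rating. In addition, the position of the outliers shows a clear positive correlation with the number of installs, which is also worth exploring.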
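The per-bin comparison can be backed with numbers rather than read off the box plot. The following is a minimal sketch, assuming a toy data frame `toy` built from the six sample rows in the table above; it reuses the same breaks as `b_ins` and computes the median and mean rating in each bin. On the real data, `na.omit(googleplay)` would replace `toy`.

```r
# Same bin boundaries as used for the box plot
b_ins <- c(0, 5000, 50000, 500000, 5000000, 50000000, Inf)

# Installs and Rating from the six sample rows shown earlier
toy <- data.frame(
  Installs = c(1e4, 5e5, 5e6, 5e7, 1e5, 5e4),
  Rating   = c(4.1, 3.9, 4.7, 4.5, 4.3, 4.4)
)

# Assign each app to an install bin
toy$bin <- cut(toy$Installs, breaks = b_ins, include.lowest = TRUE)

# Median and mean rating per bin; bins with no apps come back as NA
med_rating  <- tapply(toy$Rating, toy$bin, median)
mean_rating <- tapply(toy$Rating, toy$bin, mean)
med_rating
mean_rating
```

A box plot draws the median, so comparing `med_rating` with `mean_rating` is one way to check whether a statement about the "average" rating holds for both measures.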