Recently, I have been joining tables that come from different data sources. I have an IMDb dataset and a Yahoo movie dataset, and I want to build a linear prediction model based on a user's rating history. Here is a small practice with the 'dplyr' package in R, a must-have tool for the data cleaning process. I followed the tutorial from this website:
http://genomicsclass.github.io/book/pages/dplyr_tutorial.html

#install.packages("dplyr")
library(dplyr)

data <- read.csv("msleep_ggplot2.csv", header = TRUE) # data link: here
head(data)
#dim(data)

# Selecting columns using select()
sleepdata <- select(data, c("name", "sleep_total"))
# select(data, name:order)
head(select(data, -name))
head(select(data, starts_with("sl")))
# ends_with()/contains()/matches()/one_of()

# Selecting rows using filter()
filter(data, sleep_total >= 16)
filter(data, sleep_total >= 16, bodywt >= 1)
filter(data, order %in% c("Perissodactyla", "Primates"))

# Pipe operator: %>%
head(select(data, name, sleep_total))
data %>% select(name, sleep_total) %>% head

# Arrange or re-order rows using arrange()
data %>% select(name, order, sleep_total) %>% arrange(order, sleep_total) %>% head

# The 8 rows below all have sleep_total >= 16, so they appear to come from
# the same pipeline ending in filter() instead of head:
data %>% select(name, order, sleep_total) %>% arrange(order, sleep_total) %>% filter(sleep_total >= 16)
##                     name           order sleep_total
## 1          Big brown bat      Chiroptera        19.7
## 2       Little brown bat      Chiroptera        19.9
## 3   Long-nosed armadillo       Cingulata        17.4
## 4        Giant armadillo       Cingulata        18.1
## 5 North American Opossum Didelphimorphia        18.0
## 6   Thick-tailed opposum Didelphimorphia        19.4
## 7             Owl monkey        Primates        17.0
## 8 Arctic ground squirrel        Rodentia        16.6

# Descending sleep_total within each order.
# arrange() must be closed with ")" before piping into filter().
datadesc <- data %>% select(name, order, sleep_total) %>% arrange(order, desc(sleep_total)) %>% filter(sleep_total >= 16)
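The arrange-then-filter pipeline is easy to get wrong by one parenthesis, so here is a minimal, self-contained sketch of it. It assumes a made-up `toy` data frame (the names and values are invented stand-ins, not rows from the msleep file) so that it runs without the CSV.

```r
library(dplyr)

# Hypothetical stand-in for the msleep data; values are invented for illustration
toy <- data.frame(
  name = c("a", "b", "c", "d", "e"),
  order = c("Primates", "Chiroptera", "Chiroptera", "Primates", "Rodentia"),
  sleep_total = c(17.0, 19.7, 19.9, 9.5, 16.6),
  stringsAsFactors = FALSE
)

# Sort by order, then by descending sleep_total within each order,
# and keep only the rows with at least 16 hours of sleep.
# Each verb is closed with ")" before the next %>%.
toy_desc <- toy %>%
  select(name, order, sleep_total) %>%
  arrange(order, desc(sleep_total)) %>%
  filter(sleep_total >= 16)

print(toy_desc)
```

Putting each verb on its own line makes an unbalanced parenthesis much easier to spot. filter() only drops rows, so the ordering produced by arrange() is preserved in the result.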
# Create new columns using mutate()
data %>% mutate(rem_proportion = sleep_rem / sleep_total) %>% head

# Create summaries of the data frame using summarise()
data %>% summarise(avg_sleep = mean(sleep_total), min_sleep = min(sleep_total), max_sleep = max(sleep_total), total = n())

# Group operations using group_by()
data %>% group_by(order) %>% summarise(avg_sleep = mean(sleep_total), min_sleep = min(sleep_total), max_sleep = max(sleep_total), total = n())
## # A tibble: 19 x 5
##    order        avg_sleep min_sleep max_sleep total
##    <fct>            <dbl>     <dbl>     <dbl> <int>
##  1 Afrosoricida     15.6      15.6      15.6      1
##  2 Artiodactyla      4.52      1.90      9.10     6
##  3 Carnivora        10.1       3.50     15.8     12
##  4 Cetacea           4.50      2.70      5.60     3
##  5 Chiroptera       19.8      19.7      19.9      2
##  6 Cingulata        17.8      17.4      18.1      2
##  ...
## 19 Soricomorpha     11.1       8.40     14.9      5
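Since the goal mentioned at the top is joining tables from different sources, here is a rough sketch of how dplyr's join verbs could be used for that. The `imdb` and `yahoo` data frames, their column names, and their values are all hypothetical; the real datasets will need their own key column and probably some cleaning of titles first.

```r
library(dplyr)

# Hypothetical tables; column names and values are invented for illustration
imdb <- data.frame(
  title = c("Movie A", "Movie B", "Movie C"),
  imdb_rating = c(7.1, 8.3, 6.4),
  stringsAsFactors = FALSE
)
yahoo <- data.frame(
  title = c("Movie B", "Movie C", "Movie D"),
  yahoo_rating = c(4.0, 3.5, 4.5),
  stringsAsFactors = FALSE
)

# inner_join() keeps only the titles present in both tables
both <- inner_join(imdb, yahoo, by = "title")

# left_join() keeps every row of imdb; titles missing from yahoo get NA
all_imdb <- left_join(imdb, yahoo, by = "title")

# anti_join() lists the imdb titles with no match in yahoo,
# which helps to check the join key before modelling
unmatched <- anti_join(imdb, yahoo, by = "title")

print(both)
print(all_imdb)
print(unmatched)
```

Checking the anti_join() result first is a cheap way to see how many rows would be lost in an inner join because of spelling or formatting differences in the key.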
Author: Lan Jiang is a data analyst with a background in the media industry. She is enthusiastically learning about the latest machine learning and data tools to understand audiences and customers thoroughly.
January 2019