dplyr in R practice

4/23/2018

Recently, I have been joining tables from different data sources. I have an IMDb dataset and a Yahoo movie dataset, and I am using them to build a linear prediction model based on a user's rating history.

Here is a short practice session with the 'dplyr' package in R, a must-have tool for the data cleaning process.

I followed the tutorial from this website:
http://genomicsclass.github.io/book/pages/dplyr_tutorial.html

#install.packages("dplyr")
library(dplyr)

data <- read.csv("msleep_ggplot2.csv",header = TRUE)
# data link: here
head(data)
#dim(data)

# Selecting columns using select()
sleepdata <- select(data, c("name", "sleep_total")) # a column range also works: select(data, name:order)
head(select(data, -name)) # a minus sign drops a column

head(select(data, starts_with("sl"))) # other helpers: ends_with(), contains(), matches(), one_of()
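
To see what each select helper mentioned above picks up, here is a minimal sketch on a small hand-made table. The `toy` data frame and its values are my own illustration, not the msleep file, and `any_of()` assumes dplyr 1.0 or later, where it is the recommended replacement for `one_of()`.

```r
library(dplyr)

# Hypothetical toy table that mimics a few msleep columns (illustrative values)
toy <- tibble(
  name        = c("Cow", "Owl monkey", "Big brown bat"),
  order       = c("Artiodactyla", "Primates", "Chiroptera"),
  sleep_total = c(4.0, 17.0, 19.7),
  sleep_rem   = c(0.7, 1.8, 3.9),
  bodywt      = c(600, 0.48, 0.023)
)

names(select(toy, starts_with("sl")))   # names beginning with "sl"
names(select(toy, ends_with("wt")))     # names ending with "wt"
names(select(toy, contains("eep")))     # names containing "eep"
names(select(toy, matches("^sleep_")))  # names matching a regular expression
# any_of() silently skips names that are not in the table
names(select(toy, any_of(c("name", "no_such_column"))))
```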

# Selecting rows using filter()
filter(data, sleep_total >= 16)
filter(data, sleep_total >= 16, bodywt >= 1)
filter(data, order %in% c("Perissodactyla", "Primates"))
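
In the calls above, conditions separated by commas must all hold. A small sketch of how that compares with `|` (at least one holds) and `%in%`, again on a made-up `toy` table with illustrative values:

```r
library(dplyr)

# Hypothetical toy table (illustrative values, not the msleep file)
toy <- tibble(
  name        = c("Cow", "Owl monkey", "Big brown bat", "Horse"),
  order       = c("Artiodactyla", "Primates", "Chiroptera", "Perissodactyla"),
  sleep_total = c(4.0, 17.0, 19.7, 2.9),
  bodywt      = c(600, 0.48, 0.023, 521)
)

# Comma-separated conditions are combined with AND
both <- filter(toy, sleep_total >= 16, bodywt >= 0.1)

# Use | when either condition is enough
either <- filter(toy, sleep_total >= 19 | bodywt >= 550)

# %in% tests membership in a set of values
in_set <- filter(toy, order %in% c("Perissodactyla", "Primates"))

both$name
either$name
in_set$name
```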

# Pipe operator: %>%
# The nested call and the piped version below do the same thing
head(select(data, name, sleep_total))
data %>%
  select(name, sleep_total) %>%
  head
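
A quick way to convince yourself that the pipe only changes how the code reads, not what it computes, is to compare both forms directly. This is a sketch on a hypothetical `toy` table, not the msleep file.

```r
library(dplyr)

# Hypothetical toy table (illustrative values)
toy <- tibble(
  name        = c("Cow", "Owl monkey", "Big brown bat"),
  sleep_total = c(4.0, 17.0, 19.7),
  bodywt      = c(600, 0.48, 0.023)
)

# Nested form: read from the inside out
nested <- head(select(toy, name, sleep_total), 2)

# Piped form: x %>% f(y) is evaluated as f(x, y), so it reads top to bottom
piped <- toy %>%
  select(name, sleep_total) %>%
  head(2)

identical(nested, piped)
```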

# Arrange or re-order rows using arrange()
data %>%
  select(name, order, sleep_total) %>%
  arrange(order, sleep_total) %>%
  head

# Same pipeline, keeping only animals that sleep at least 16 hours
data %>%
  select(name, order, sleep_total) %>%
  arrange(order, sleep_total) %>%
  filter(sleep_total >= 16)
##                     name           order sleep_total
## 1          Big brown bat      Chiroptera        19.7
## 2       Little brown bat      Chiroptera        19.9
## 3   Long-nosed armadillo       Cingulata        17.4
## 4        Giant armadillo       Cingulata        18.1
## 5 North American Opossum Didelphimorphia        18.0
## 6   Thick-tailed opposum Didelphimorphia        19.4
## 7             Owl monkey        Primates        17.0
## 8 Arctic ground squirrel        Rodentia        16.6

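# Sort by order, then by descending sleep_total within each order.
# Note the closing parenthesis after desc(sleep_total): without it,
# filter() ends up inside the arrange() call and the code fails to parse.
datadesc <- data %>%
  select(name, order, sleep_total) %>%
  arrange(order, desc(sleep_total)) %>%
  filter(sleep_total >= 16)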
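
Since `filter()` keeps the surviving rows in their existing order, it should not matter whether it comes before or after `arrange()`. Here is a minimal sketch that checks this on a hypothetical `toy` table (illustrative values); filtering first simply means fewer rows have to be sorted.

```r
library(dplyr)

# Hypothetical toy table, deliberately stored in no particular order
toy <- tibble(
  name        = c("Giant armadillo", "Cow", "Big brown bat",
                  "Long-nosed armadillo", "Little brown bat"),
  order       = c("Cingulata", "Artiodactyla", "Chiroptera",
                  "Cingulata", "Chiroptera"),
  sleep_total = c(18.1, 4.0, 19.7, 17.4, 19.9)
)

# Sort first, then filter
arranged_first <- toy %>%
  arrange(order, desc(sleep_total)) %>%
  filter(sleep_total >= 16)

# Filter first, then sort
filtered_first <- toy %>%
  filter(sleep_total >= 16) %>%
  arrange(order, desc(sleep_total))

arranged_first$name
identical(arranged_first$name, filtered_first$name)
```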

# Create new columns using mutate()
data %>%
  mutate(rem_proportion = sleep_rem / sleep_total) %>%
  head
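
Two details of `mutate()` are worth trying out: a column created in the call can be reused later in the same call, and a missing input gives a missing result. The sketch below uses a hypothetical `toy` table with made-up numbers; `rem_percent` is a name I introduce only for illustration.

```r
library(dplyr)

# Hypothetical toy table with one missing sleep_rem value
toy <- tibble(
  name        = c("A", "B", "C"),
  sleep_total = c(8, 10, 16),
  sleep_rem   = c(2, 1, NA)
)

result <- toy %>%
  mutate(
    rem_proportion = sleep_rem / sleep_total,  # NA wherever sleep_rem is NA
    rem_percent    = 100 * rem_proportion      # reuses the column created just above
  )

result
```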

# Create summaries of the data frame using summarise()
data %>%
  summarise(avg_sleep = mean(sleep_total),
            min_sleep = min(sleep_total),
            max_sleep = max(sleep_total),
            total = n())
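
One thing to watch with `summarise()`: if a column contains missing values, `mean()`, `min()` and `max()` return `NA` unless `na.rm = TRUE` is passed. A minimal sketch, assuming a hypothetical `toy` table with one missing `sleep_rem` value:

```r
library(dplyr)

# Hypothetical toy table with one missing sleep_rem value
toy <- tibble(
  name      = c("A", "B", "C"),
  sleep_rem = c(2, 1, NA)
)

rem_summary <- toy %>%
  summarise(
    avg_rem       = mean(sleep_rem),                # NA, because one value is missing
    avg_rem_known = mean(sleep_rem, na.rm = TRUE),  # mean of the non-missing values
    n_missing     = sum(is.na(sleep_rem)),          # how many values are missing
    total         = n()                             # number of rows
  )

rem_summary
```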

# Group operations using group_by()
data %>%
  group_by(order) %>%
  summarise(avg_sleep = mean(sleep_total),
            min_sleep = min(sleep_total),
            max_sleep = max(sleep_total),
            total = n())
## # A tibble: 19 x 5
##    order        avg_sleep min_sleep max_sleep total
##    <fct>            <dbl>     <dbl>     <dbl> <int>
##  1 Afrosoricida     15.6      15.6      15.6      1
##  2 Artiodactyla      4.52      1.90      9.10     6
##  3 Carnivora        10.1       3.50     15.8     12
##  4 Cetacea           4.50      2.70      5.60     3
##  5 Chiroptera       19.8      19.7      19.9      2
##  6 Cingulata        17.8      17.4      18.1      2
## ... ...
## 19 Soricomorpha     11.1       8.40     14.9      5
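
The grouped summary comes back sorted by the grouping column. To rank the groups by a summary value instead, the result can be piped into `arrange()`; and when only the group sizes are needed, `count()` is a shortcut for `group_by()` plus `summarise(n = n())`. A small sketch on a hypothetical `toy` table (illustrative values):

```r
library(dplyr)

# Hypothetical toy table (illustrative values, not the msleep file)
toy <- tibble(
  name        = c("Cow", "Big brown bat", "Little brown bat",
                  "Long-nosed armadillo", "Giant armadillo"),
  order       = c("Artiodactyla", "Chiroptera", "Chiroptera",
                  "Cingulata", "Cingulata"),
  sleep_total = c(4.0, 19.7, 19.9, 17.4, 18.1)
)

# One row per order, sorted from the sleepiest order down
by_order <- toy %>%
  group_by(order) %>%
  summarise(avg_sleep = mean(sleep_total),
            total = n()) %>%
  arrange(desc(avg_sleep))

# Group sizes only
sizes <- count(toy, order)

by_order
sizes
```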
