R is one of the very famous tools to handle the data science projects because it has all the capabilities right from the extracting the data from different sources, data modelling and transformation, data visualization and finally building machine learning models using the data.
This post explains how the data modelling can be done with R using “dplyr” package.
To make it easy, let me compare this dplyr function with T-SQL.
First, let us load the data into R studio.
fulldata<-read.csv(“D:\\Projects 2018\\DSKA\\DataWranglingDemo.csv”)
fulldata
When we have a large dataset with more than 1000 columns, if we need only certain columns then we can use “SELECT” option to choose the specific columns. Check the below example,
The same option is available in R with dplyr package.
#install.packages(“dplyr”)
require(“dplyr”)
#select
DF_select <- fulldata %>% select (State,Price)
DF_select
In T-SQL, we have where condition to filter any data with different conditions.
The same filter option is available with dplyr functions.
#filter
DF_select %>% filter(State==”Alabama”)
Same as T-SQL, we have group by, summarise functions are available in R
#Group by
DF_Group <- DF_select %>% group_by(State)
DF_Group
#Summarise
DF_Sum<- DF_select %>% group_by(State) %>%
summarise(Total = sum(Price))
DF_Sum
About the Author