Check the Linear regression introduction and how to use excel for linear regression.
In this post, let me explain how to use R programming for linear regression.
I am taking the same dataset in R to produce the linear regression.
Run the below codes.
Year<- c(2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010)
SalesAmount<-c(1450,1486,1560,1200,1205,1100,1150,1750,1650,1500,1900)
#linear regression
result<-lm(SalesAmount~Year)
#display coefficient
result
# new value for year 2011
year2011 <- data.frame(Year=2011)
year2011
#predict the sales amount.
predict (result,year2011)
Before starting with linear regression, let us try to understand the machine learning concepts.
As of today, machine learning is one of the trending technologies as the applications are started to consume real-time data and deliver machine learning services.
As part of machine learning, we have three different types of machine learning models available.
In Supervised Learning, the system tries to learn from the previous examples that are given.
Under supervised learning, we have classification and regression models.
One of the common and simple techniques for predicting a continuous variable is called linear regression.
It assumes a linear relationship between the outcome and the predictor variables.
The linear regression equation can be written as y = b0 + b*x + e, where:
This post covers the linear regression on excel and R studio.
Linear Regression on Excel
Follow the below steps to perform the linear regression in excel
Step 1: Consider you have year wise sales amount. You are in the situation to predict the sales amount for the year 2011.
Step 2: Create a scatter plot like below with Year and Sales Amount
Step 3: Right-click on the line and choose trendline.
Step 4: Display the equation on the trendline by selecting the checkbox on the Format trendline properties.
Step 5: Predict the value using the generated formula for existing values.
Step 6: Predict for the year 2011.
As you can see the values are not 100% correct, we need to adjust the con-efficient and reduce the r2 values to get the perfect result.
Following my previous post, we have another few functions on dplyr package which will be covered in this post.
As part of data modelling, we need to sort the data to analyze further.
In T-SQL we can easily perform the sort on the data as like below. It uses “Order by” clause to sort the data.
In R, we need to use arrange function.
#Arrange
DF_Arrange<- DF_select %>%
group_by(State) %>%
summarise(Total = sum(Price)) %>%
arrange(desc(Total))
DF_Arrange
The next, very common task is to build the calculated column to satisfy the business logic. This can be easily done with the help of “mutate” function.
#mutate
DF_mutate <- DF_select %>%
group_by(State) %>%
summarise(Total = sum(Price)) %>%
mutate(“10%of Total” = Total/10) %>%
arrange(desc(Total))
DF_mutate
R is one of the very famous tools to handle the data science projects because it has all the capabilities right from the extracting the data from different sources, data modelling and transformation, data visualization and finally building machine learning models using the data.
This post explains how the data modelling can be done with R using “dplyr” package.
To make it easy, let me compare this dplyr function with T-SQL.
First, let us load the data into R studio.
fulldata<-read.csv(“D:\\Projects 2018\\DSKA\\DataWranglingDemo.csv”)
fulldata
When we have a large dataset with more than 1000 columns, if we need only certain columns then we can use “SELECT” option to choose the specific columns. Check the below example,
The same option is available in R with dplyr package.
#install.packages(“dplyr”)
require(“dplyr”)
#select
DF_select <- fulldata %>% select (State,Price)
DF_select
In T-SQL, we have where condition to filter any data with different conditions.
The same filter option is available with dplyr functions.
#filter
DF_select %>% filter(State==”Alabama”)
Same as T-SQL, we have group by, summarise functions are available in R
#Group by
DF_Group <- DF_select %>% group_by(State)
DF_Group
#Summarise
DF_Sum<- DF_select %>% group_by(State) %>%
summarise(Total = sum(Price))
DF_Sum
Power BI service supporting many R packages as of now and it will increase in future. There is a misunderstanding that we can use these packages in Power BI Desktop R visual.
We can’t use all the packages in the Power BI site because it includes the packages which can be used for R custom visuals development.
Check the list of packages that Power BI Service supports, here.
Let us take an example,
There is a package called “timevis” which is part of Power BI supported packages.
We can use this package to build a visual to display a time range information. I used this package in R studio and got the below visual.
You can use the below R script and try it by yourself. This package requires many dependent packages which you can identify while installing the R scripts. If any dependent packages are missing then you should install those missing packages and finally run the R script.
library(timevis)
input<- data.frame(
id = 1:4,
content = c(“Apple”, “Orange”, “Mango”, “Banana”),
start = c(“2017-01-10”, “2017-01-11”, “2017-01-20”, “2017-02-14 15:00:00”),
end = c(“2017-02-13”, NA, “2017-02-04”, NA)
)
timevis(input)
It is easy to identify which packages are supported in Power BI from the list. Prepare your R script from any of the packages and make sure to check the supported package list and run the script in R studio.
If your package producing a visual and make sure that visual is displayed under plot section of R studio. If you see the visuals in plot section then those packages can be used in Power BI.
In our example, the visual is displayed under viewer section which means it is using HTML viewer which is not yet supported by R visual in Power BI.
Check out my previous post to see the streamgraph chart comparison.
In this post, I have taken Sankey chart custom visual. The power bi custom visual has less functionality compare than R visuals
Power BI Desktop
We have a Sankey chart custom visual available in Power BI. Download and import into Power BI Desktop.
https://store.office.com/powerbiaddininstallpage.aspx?rs=en-US&assetid=WA104380777
Here I used a dataset called Titanic. This is one of the datasets in R which I used for this post.
R Studio
To get the sankey visual, we need to install the alluvial package.
Run the below script in R Studio or R console application.
install.packages(‘alluvial’)
tit <- as.data.frame(Titanic)
library(alluvial)
alluvial( tit[,1:4], freq=tit$Freq, border=NA,
hide = tit$Freq < quantile(tit$Freq, .50),
col=ifelse( tit$Survived == “No”, “red”, “yellow”) )
Power BI Desktop | R Studio |
Import Sankey Chart visual | Install alluvial and related packages |
Less Functionality | More Functionality |
GUI Type | R Script Type |
Only 2 levels of visual by default for the simple dataset. To see multi-level visuals then need to prepare a dataset with multi-level of source and destinations | Multi-Levels of visual – Just add the number of fields. |
R language has many built in functions, in which statistical functions are mostly used on numeric vectors.
sum(x) – The sum of vector x
min(x) – The Minimum value of vector x
max(x) – The maximum value of vector x
mean(x) – The arithmetic mean of vector x
median(x) – The median of numeric vector x. 50% of data should be less than median and balance 50% data should be greater than median.
sd(x) – The standard deviation of vector x
var(x) – The Variance of numeric vector x
quantile(x,p) – The Pth sample quantile of numeric vector x. for example, quantile(x,.3) will tell us the value at which 30% of cases are less than value x.
summary(x) – It shows several statistics of vector x, including the above.
Practical Space,
X<-c(2,4,6,8,9,10,14,56,70)
sum(X)
min(X)
max(X)
mean(X)
median(X)
sd(X)
var(X)
quantile(X,0.5)
summary(X)
In a real time, we will be in a position to handle the logic using loops. Loop concept is available in all programming language and also in R.
For example, if we want to generate first 200 prime numbers, we have to use loop to get the result.
In R, “for loop” is used to iterate over a vector.
Syntax
for (val in sequence) {
statement
}
In the above syntax, sequence is a vector which we can define and assign into an object and can use the object or we can directly place the vector with the expression. “val” will take the value of sequence and process in a loop. The statement will be triggered until the length of the sequence is completed.
Example,
x <- c(1,2,3,4,5,6,7,8,9,10)
count <- 0
for (val in x) {
if(val %% 2 == 0) count = count+1
}
print(count)
Using Next
We can use the word next inside the for loop and it will change the flow of the process. Check the below example,
for (i in 1:10) {
if (!i %% 2){
}
print(i)
}
If you run above script, I will get the result as 1,2,3,4,5,6,7,8,9,10 (All the values from the 1:10 sequence).
If you want to display only the odd number for the same sequence input then I can just add a “next” keyword inside if loop.
for (i in 1:10) {
if (!i %% 2){
next
}
print(i)
}
First “i” value 1 will go to if condition once the value reaches the keyword “next”. It will redirect to for loop again and it will come again for the second number i.e. 2 and it will continue up to 10.