Following my first post on this same series on creating a work station for a Data Scientist, this will be the next one.
Next you need Revolution R Enterprise 7.5.0 Server (RRE-7.5.0) and you can download this from this link.
This will download about 81MB of “Revolution-R-Enterprise-Node-SQL-7.5.0-Windows.EXE”
Once downloaded you can then start installing this exe and below screenshots are provided for a successful installation.
In my previous article, I have explained about the table creation and primary keys.
In real time scenarios, a primary key field should generate values automatically whenever a new row added to the table.
If the primary key field is an integer then set up an identity
CREATE TABLE Users
(Id int primary key identity (1,1),
This table will generate a value for Id starting from 1 and increment by 1. The column values will be 1, 2, 3, .etc.
Primary key on unique identifier field,
CREATE TABLE Users
(Id uniqueidentifier default NEWID() primary key,
This will generate random new GUID value for each row.
Choosing a right primary key column is very challenging.
We can create a primary key on any of the columns but we need to understand the business requirement.
Consider, if you are creating a primary key on Integer field. It has both advantages and disadvantages. Most of the time people will consider its advantages and start using integer as primary key column for all the tables.
Points to remember,
Primary Key on unique identifier field,
Points to remember,
Primary key on varchar field, consider an emailid field
Point to remember,
Creating a table in MS SQL Server is very easy, this can be achieved by using T-SQL scripts or GUI. This article will use T-SQL Scripts to a create a table.
Basic Pre-Requisites to create a table,
Table is the storage unit in a database which stores the data in rows and columns as like excel sheet. The output of the table result is similar to the excel sheet data.
If you are decided to store some data in a database then you must use table.
The basic table structure will look like below.
CREATE TABLE Users
We can add more columns in the structure, each column should be define with the right data type. Here Id is the column name and int is the data type which is nothing but an integer.
Let us discuss, what is a primary key and how to create it.
Primary key is a constraint which we need to define in any one of the column in a table. It will not allow null values in that defined column and entry to the column should be unique i.e. it won’t allow duplicate values.
Choose an ID or Unique Identifier (GUID) column to create a primary key, this is the best way to define a table structure.
Create a table with primary Key on integer column,
CREATE TABLE Users
(Id int primary key,
In the above script, ID column is considered as a primary key. You can check this by clicking the columns folder under dbo.Users table.
Create a table with primary key on unique identifier column,
CREATE TABLE Users
( Id uniqueidentifier primary key,
In this post, I am going to cover how to create the work station that will make it suitable for a Data Scientist. Sounds interesting!? Go ahead reading my other parts on this.
First of all you need Microsoft R Open 3.2.2 for Revolution R Enterprise 7.5.0 which can be downloaded here.
Once you have downloaded this 56MB file, you can start installing and I have provided step by step screenshots below.
There are many methods to check the installed version of R.
version – Provides information about the version of R with additional details
R.version.string – is used to check only the version information
R.Version () – Provides information about the version of R by list
getRversion() – Provides information about the R version string number
You can download the latest Microsoft SQL Server 2016 CTP 3.2 from this link. Ensure to choose the ISO option in this link.
Once downloaded you can start executing the exe file and I have provided enough screenshots to support you with your own installation. My installation is not on my laptop but in the virtual machine. So once you start your Hyper-V Manager and shoot your virtual machine and start the windows. Once booted, you need to go to main menu Media -> DVD Drive -> Insert Disk… option. Once you choose this you will get the below screen and choose your already downloaded SQL Server 2016 CTP 3.2.
Once you have chosen the ISO image, click on Open and you will see the new DVD Drive D: like below screenshot. This is equivalent to inserting your DVD drive into your computer. So now you have the ISO mounted on to the virtual machine and then you are good to go with the installation by double clicking on the setup file, which can be seen below.
Then you will get the SQL Server Installation Center which will be the source for you to plan, install, maintain, advanced resources, etc., In here click on Installation and then choose the first option to create a brand new installation on your system.
Then you will see the below screen, which will then start the setup.
As we have downloaded the CTP 3.2 which is valid for 180 days, you can leave the option button checked with the Free Evaluation edition, from the below screenshot.
Once, you click on next then you will get the screen to accept the License Terms, like shown below. Read the license terms for Pre-Release software license terms and then check the box saying you accept the license terms, this will then enable you to click on Next and proceed further.
The system will then start checking the global rules, product updates and try to install setup files. If it already exists, it will be skipped and if not it will be installing them here automatically. In the below screen, “Download setup files” and “extract setup files” are skipped and “Scan for product updates” completed and “Install setup files” yet to start.
Then you will get the results on the installation rules and gives you the status of each rule. If it failed any one, you cannot proceed your installation and needs to be fixed. But you still can proceed further installation with warnings.
You can click on the Warning link and see the issues in detail. Below screenshot advises to enable appropriate ports to be opened for remote access.
I don’t want to choose all feature with default option as this will install everything and by choosing SQL Server Feature Installation, you have options to choose what do be installed in simple check/uncheck boxes.
Below image provides all the available features that you can choose to install.
I have simply copy pasted (below) the feature description from each of the features that you have from this installation box to give you one point of understanding on each individual features.
The configuration and operation of each instance feature of a SQL Server instance is isolated from other SQL Server instances. SQL Server instances can operate side-by-side on the same computer.
Database Engine Services
Includes the Database Engine, the core service for storing, processing and securing data. The Database Engine provides controlled access and rapid transaction processing and also provides rich support for sustaining high availability. The Database Engine also provides support for the utility control point in the SQL Server Utility. Only Database Engine Services and Analysis Services can be clustered.
SQL Server Replication
Includes a set of technologies for copying and distributing data and database objects from one database to another and synchronizing between the databases for consistency. You can use replication to distribute data to different locations, and to remote and mobile users over local and wide area networks, dial-up connections, wireless connections and the Internet.
Advanced Analytics Extensions
Includes Advanced Analytics Extensions that enables integration with R language using standard T-SQL statements.
Full-Text and Semantic Extractions for Search
Includes the Search engine that supports Full-Text Extraction for fast text search as well as Semantic Extraction for key phrases (likely tags) and similarity search on content stored in SQL Server.
Data Quality Services
Includes Data quality database objects.
PolyBase Query Service for External Data
Includes PolyBase technology that enables truly integrated querying across Hadoop non-relational data and SQL Server relational data using standard T-SQL statements.
Includes Analysis Services and tools used to support online analytical processing (OLAP) and data mining. Only Database Engine Services and Analysis Services can be clustered.
Reporting Services – Native
Includes Reporting Services, a server-based application for creating, managing, and delivering reports to email, multiple file formats, and interactive Web-based formats. The Native mode server provides all processing and management functionality through Reporting Services components. Reporting Services cannot be clustered.
Each shared feature is installed once within a defined scope and operates within that scope. The defined scope can span all SQL Server versions on a computer (e.g., SQL Server Browser), can be isolated to one major version of SQL Server (e.g., SQL Server Management Tools), or can be isolated to one or more minor versions.
Reporting Services – SharePoint
Includes Reporting Services, a server-based application for creating, managing, and delivering reports to email, multiple file formats, and interactive Web-based formats. SharePoint integrated mode integrates the report server with SharePoint products. The report viewing and report management experience are integrated with SharePoint sites and libraries. Reporting Services cannot be clustered.
Reporting Services Add-in for SharePoint Products
Includes management and user interface components to integrate a SharePoint product with an SSRS report server in SharePoint integrated mode. The add-in only needs to be installed on server running a SharePoint product.
Data Quality Client
Includes Data quality client objects.
Client Tools Connectivity
Includes components for communication between clients and servers.
Includes the designer, runtime, and utilities that enable Integration Services to move, integrate, and transform data between data stores.
Client Tools Backwards Compatibility
Contains Client Tools Backwards Compatibility
Client Tools SDK
Includes the software development kit containing resources for programmers.
Installs only the components that you use to view and manage the documentation for SQL Server 2016 CTP3.2. By default, the Help Viewer component uses the online library. After installing SQL Server, you can use the Help Library Manager component to download documentation to your local computer.
Management Tools – Basic
Includes Management Studio support for the Database Engine and SQL Server Express, SQL Server command-line utility (SQLCMD), SQL Server PowerShell provider, and Distributed Replay Administration Tool.
Management Tools – Complete
Adds the following components to the basic management tools installation: Management Studio support for Reporting Services, Analysis Services, and Integration Services technologies, SQL Server Profiler, Database Tuning Advisor, and SQL Server Utility management.
Distributed Replay Controller
Includes the Distributed Replay Controller which orchestrates the actions of the distributed replay clients.
Distributed Replay Client
Includes the Distributed Replay Client. Multiple Distributed Replay Clients work together to simulate a workload against an instance of SQL Server.
SQL Client Connectivity SDK
Includes SQL Server Native Client (ODBC / OLE DB) SDK for database application development.
Master Data Services
Includes Master Data Services, the platform for integrating data from disparate systems across an organization into a single source of master data for accuracy and auditing purposes. Installs the Master Data Services Configuration Manager, assemblies, PowerShell snap-in, and folders and files for Web applications and services.
SQL Server redistributable and shared features are installed when needed: Error and Usage Reporting, SQL Server Native Client, MSXML version 6.0, Sync Services for ADO.NET, and SQL Server Browser.
After reviewing every single feature, I have chosen what is required for me and you can see from below screenshot. I can visit anytime later to install the missing features, when required.
Once I have done the selection for the required features, the next step it will check for the rules for the selected features, like below and gives me the status of Passes/Failed/Warning. Again it is only only warning with post-installation steps are required for my Advanced Analytics Extensions, which I cover in separate posts.
In the next step of Instance configuration, I have options to choose the default or named instance. I have chosen the named instance and provided the Named Instance name. The instance ID will be picked up automatically from the Named Instance. This below screenshot also shows the already existing instances on the same system and I have SQL 2014 installed with the instance name of DAGEOPINS01.
Next in the server configuration screenshot (below), you will have options to choose whether the services should be started Manually everytime the system starts or automatically. Also you have options to choose the account names. Note: In the below screenshot, I have no option to change the startup type for “SQL Server Launchpad” & “SQL Full-text FIlter Daemon Launcher” as they are considered system services and the SQL Server directly controls them.
In the database engine configuration section, it is always good to choose the Mixed mode, so that you get the option to recover the instance in case if you loose your master AD account. By selecting mixed more I have given a password for the System administrator account. Also I have clicked on “Add Current User”, which will automatically select my current logged in account as part of the SQL Server administrator. You can choose more users to be part of admin, by clicking “Add” button.
In the Analysis services configuration, leave the default Multidimensional and data mining mode and choose Add current User and click next.
For the distributed replay controller, I have chosen the Add current user to grant permissions for this service.
For the distributed replay client I have not made any change and you can continue to click next if any changes required.
For the PolyBase Configuration, I have selected the PolyBase scale-out group which enables me to have the SQL Server instance either as a Head or Compute node. Also I have not made any changes to the port range.
After all the steps, finally you get to see the summary of what you have selected and what SQL Server will be installing from the below screenshot.
You can now sit back and relax for about 15 to 20 minutes and the installation will be done by then.
You will now have fully installed version as shown below with status of Succeeded and also you have option to see all the installation steps and log file from the link provided in the below screenshot.
You are now good to go and start using the SQL Server.
Step 1: Log into azure portal
Step 2: Click “Browse” and look for HDInsight Clusters which will have Hadoop icon.
Step 3: Click the HDInsight Clusters and “Add” a new cluster.
Step 4: Provide the cluster name and other details as like below and cluster type should be “Hadoop”.
Step 5: Cluster credentials are mandatory so define the credentials. In this section, you can enable remote desktop by selecting “YES”, need to provide remote desktop credentials.
Step 6: Datasource must be configured for cluster.
Step 7: Node pricing tiers is required to configure as like below.
Step 8: Click the create button when all the steps are filled and configured. HDInsight cluster will be created and deployed for your use.