Tuesday, 15 November 2016

Procedure to install Spark 2.0.1 on Linux in a virtual machine

Before installing Spark, let's first look at what Spark is and the features that make it special.

Spark was originally developed by Matei Zaharia at the AMPLab at UC Berkeley.
An interesting aspect of Spark is that it has quickly built the largest open-source community in big data, with over 1,000 contributors from 250+ organizations.
Spark has easy-to-use APIs for operating on large datasets.
Spark comes packaged with higher-level libraries, including support for SQL queries, streaming data, machine learning and graph processing.

·       How does it all begin? - Step 1

Let's now walk through getting started with Spark. The first step is to create a Linux virtual machine, using VMware Workstation downloaded from the following link: https://my.vmware.com/web/vmware/desktop_end_user_computing/vmware_workstation_pro/12_0

Create a new virtual machine by clicking "Create a New Virtual Machine".
Next, assign the downloaded ISO image of Ubuntu to the virtual machine, as shown in the image below.

After this, we allocate 25 GB of disk space to the virtual machine. The process is shown below.

The installation window looks like this.


Once installed, open the virtual machine. The Ubuntu desktop looks like this.




·       Downloading Anaconda 4.2.0 inside Ubuntu

Use the following link in the browser to download Anaconda 4.2.0 inside Ubuntu: https://www.continuum.io/downloads
Download the Linux installer for Anaconda3 4.2.0, and once the download is finished, open the terminal and run it with the following command after the $ symbol (sudo is only needed for a system-wide install):
$ sudo bash /home/Raashi/Downloads/Anaconda3-4.2.0-Linux-x86_64.sh
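If you prefer to fetch the installer from the terminal instead of the browser, here is a minimal sketch; the archive URL is an assumption based on Continuum's public installer archive, so the browser download above remains the documented route:
$ wget https://repo.continuum.io/archive/Anaconda3-4.2.0-Linux-x86_64.sh
$ bash Anaconda3-4.2.0-Linux-x86_64.sh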





·       Installing Java inside Ubuntu


By default, the Ubuntu VMware image does not come with a Java runtime environment, which is essential for running Spark.
Let's look at the steps to install a Java runtime environment inside Ubuntu.

Open the terminal window and first refresh the package index:
username@ubuntu:$ sudo apt-get update
Then install the default Java Runtime Environment:
username@ubuntu:$ sudo apt-get install default-jre
Alternatively, you can install Oracle Java 8, the latest stable version at the time of writing. On Ubuntu this package comes from the WebUpd8 PPA, so add it first with sudo add-apt-repository ppa:webupd8team/java and run sudo apt-get update again, then:
username@ubuntu:$ sudo apt-get install oracle-java8-installer

With this, the Java platform is ready to use.
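To confirm that Java is in place before moving on, check the version from the terminal (the exact output depends on which package you installed):
username@ubuntu:$ java -version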

·       Downloading and installing Spark in Ubuntu


Use the following link to download Apache Spark 2.0.1: http://spark.apache.org/downloads.html. Once it is downloaded, unzip the .tgz file and extract the contents of the package into the home directory.
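A minimal sketch of the extraction step, assuming the archive was saved to ~/Downloads and is named spark-2.0.1-bin-hadoop2.7.tgz (adjust both paths to match your system):
username@ubuntu:$ tar -xzf ~/Downloads/spark-2.0.1-bin-hadoop2.7.tgz -C ~/
With the package extracted, type the following environment settings in the terminal: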


export SPARK_HOME=/home/desktop/spark-2.0.1-bin-hadoop2.7
export PATH=$SPARK_HOME/bin:$PATH
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.1-src.zip:$PYTHONPATH
export SPARK_LOCAL_IP=localhost
(Set SPARK_HOME to the directory where you actually extracted Spark, and check the exact py4j filename under $SPARK_HOME/python/lib, as it varies between Spark releases.)
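These exports only last for the current terminal session. To make them permanent, append the same five lines to the end of ~/.bashrc and reload it, for example with the gedit editor that ships with the Ubuntu desktop:
username@ubuntu:$ gedit ~/.bashrc
username@ubuntu:$ source ~/.bashrc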


The above process looks like this.





Now paste the same five lines into the terminal and press Enter; this completes the Spark setup. To start Spark, type the following command in the terminal:
$ pyspark
Press Enter. To close Spark, type exit().
You should now see the following image.





If you can see this screen, it means you have successfully installed Spark 2.0.1 on Python 3.5 in your Ubuntu system.
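As a quick sanity check (any small computation will do; this is just a sketch), run a one-liner in the pyspark shell; it should print 4950, the sum of the numbers 0 through 99:
sc.parallelize(range(100)).sum()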
Let's run a sample program in Spark to check that everything is working well.


·       Word Count Program in Spark:

Open a terminal in Ubuntu, type $ pyspark, and then enter the following commands line by line, pressing Enter after each. This assumes a text file named hobbit.txt in the directory where you launched pyspark.
Program:
text = sc.textFile("hobbit.txt")      # load the input file as an RDD of lines
print(text)
from operator import add
def tokenize(text):
    return text.split()               # split a line into words on whitespace
words = text.flatMap(tokenize)        # flatten into one element per word
print(words)
wc = words.map(lambda x: (x, 1))      # pair each word with an initial count of 1
print(wc.toDebugString())             # show the RDD lineage
counts = wc.reduceByKey(add)          # sum the counts for each word
counts.saveAsTextFile("output-dir")   # write the results to a directory
The screenshot of the above program in Spark looks like this.

The output screenshot looks like this:
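To peek at a few results directly in the pyspark shell instead of opening the output files (an optional check), take a handful of (word, count) pairs from the same session:
counts.take(5)
Note that saveAsTextFile() will fail if output-dir already exists, so delete that directory before re-running the program.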

·       Word Count Program in IPython Notebook:

Inside Ubuntu, open a terminal and type the following command, which launches an IPython notebook in the browser with PySpark enabled:
PYSPARK_PYTHON=python3 PYSPARK_DRIVER_PYTHON=ipython3 PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark

In the IPython notebook, click New > Python (conda root), then paste the above code into cells one by one and run each cell.
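Before pasting the program, you can confirm in the first cell that the notebook is wired up to Spark; the pyspark launcher creates the SparkContext sc automatically, so this should print the Spark version (2.0.1 in this setup):
sc.version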
The screenshot of the above process looks like this.



The output will look like this: