Procedure to install Spark 2.0.1 in Linux over a virtual machine
Before installing Spark, let's first look at what Spark is and which features make it special.
Spark was developed at the AMP Lab at UC Berkeley by Matei Zaharia. An interesting fact about Spark is that it has quickly become the largest open-source community in big data, with over 1,000 contributors from 250+ organizations. Spark has easy-to-use APIs for operating on large datasets, and it comes packaged with higher-level libraries, including support for SQL queries, streaming data, machine learning, and graph processing.
· How does it all begin? - Step 1
Let's now start with the process of getting started with Spark. The first step is to create a Linux virtual machine. We use VMware Workstation, downloaded from the following link: https://my.vmware.com/web/vmware/desktop_end_user_computing/vmware_workstation_pro/12_0
Create a new virtual machine by clicking "Create a new virtual machine". After this, the next step is to assign the downloaded ISO image of the Ubuntu virtual machine, as shown in the image below.
Now, after this, we assign 25 GB of disk space for the virtual machine. The process is shown below.
The installation window looks like this.
Once installed, open the virtual machine. The Ubuntu desktop looks like this.
- · Downloading Anaconda 4.2.0 inside Ubuntu
Use the following link in the browser to download Anaconda 4.2.0 inside Ubuntu: https://www.continuum.io/downloads
From this page we download Anaconda 4.2.0 for Linux (note: the Linux installer is a .sh script, not a Windows .exe). Once the download is finished, open the terminal and type the following command after the $ symbol:
$ sudo bash /home/Raashi/Downloads/Anaconda3-4.2.0-Linux-x86_64.sh
- · Installing Java inside Ubuntu
By default, the Ubuntu VMware image does not come with a Java runtime environment, which is essential to run Spark. Let's see the steps to install the Java runtime environment inside Ubuntu.
Open the terminal window and type the following command:
username@ubuntu:$ sudo apt-get update
followed by the following line:
username@ubuntu:$ sudo apt-get install default-jre
This installs OpenJDK, the latest stable version of Java at the time of writing and the recommended version to install. Alternatively, Oracle Java 8 can be installed (after adding its PPA) with the following command:
sudo apt-get install oracle-java8-installer
Now the Java platform is ready to use.
- · Downloading and installing Spark in Ubuntu
Use the following link to download Apache Spark 2.0.1: http://spark.apache.org/downloads.html
> Once it is downloaded, unzip the .tgz file, extract the contents of the package into the home directory, and type the following lines in the terminal:
export SPARK_HOME=/home/desktop/spark-2.0.1-bin-hadoop2.7
export PATH=$SPARK_HOME/bin:$PATH
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.1-src.zip:$PYTHONPATH
export SPARK_LOCAL_IP=localhost
(Check the exact py4j file name under $SPARK_HOME/python/lib and adjust the path above to match your download.)
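These exports last only for the current terminal session. To make them permanent, a common convention is to append them to ~/.bashrc (the SPARK_HOME path below assumes the archive was extracted to the location used in this tutorial; adjust it to your own):

```shell
# Append to ~/.bashrc so the variables survive new terminal sessions.
# Adjust SPARK_HOME to wherever you extracted the Spark archive.
export SPARK_HOME=/home/desktop/spark-2.0.1-bin-hadoop2.7
export PATH=$SPARK_HOME/bin:$PATH
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.1-src.zip:$PYTHONPATH
export SPARK_LOCAL_IP=localhost
```

After editing the file, run source ~/.bashrc (or open a new terminal) for the changes to take effect.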
Here the above process looks like this.
> Now paste the same five export lines into the terminal and press Enter > this sets up Spark > after successful installation > to start Spark, type the following command in the terminal > $ pyspark > press Enter > and to close Spark, type exit().
Now you should see the following image.
If you can see this screen, it means you have successfully installed Spark 2.0.1 with Python 3.5 on your Ubuntu system.
Let's run a sample program in Spark to check that everything is working well.
- · Word Count Program in Spark:
Open the terminal in Ubuntu, type $ pyspark, and then type the following commands line by line, pressing Enter after each.
Program:
text = sc.textFile("hobbit.txt")
print(text)
from operator import add
def tokenize(text):
    return text.split()
words = text.flatMap(tokenize)
print(words)
wc = words.map(lambda x: (x, 1))
print(wc.toDebugString())
counts = wc.reduceByKey(add)
counts.saveAsTextFile("output-dir")
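Since these commands run inside the Spark shell, it can help to see what the pipeline actually computes. Below is a plain-Python sketch of the same flatMap / map / reduceByKey steps, with a small in-memory list standing in for hobbit.txt (the sample lines are made up for illustration):

```python
# Plain-Python sketch of the Spark word-count pipeline above.
# A small in-memory list stands in for sc.textFile("hobbit.txt").
lines = ["in a hole in the ground", "there lived a hobbit"]

def tokenize(text):
    return text.split()

# flatMap(tokenize): flatten all lines into one list of words
words = [w for line in lines for w in tokenize(line)]

# map(lambda x: (x, 1)): pair each word with an initial count of 1
wc = [(w, 1) for w in words]

# reduceByKey(add): sum the counts for each distinct word
counts = {}
for word, n in wc:
    counts[word] = counts.get(word, 0) + n

print(counts)  # {'in': 2, 'a': 2, 'hole': 1, 'the': 1, 'ground': 1, 'there': 1, 'lived': 1, 'hobbit': 1}
```

Spark does the same thing, but partitions the words across the cluster and merges the per-partition counts.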
The screenshot of the above program in Spark looks like this.
The output screenshot looks like this:
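One detail worth knowing when inspecting the output: saveAsTextFile("output-dir") writes a directory of part files (part-00000, part-00001, ...) rather than a single text file. A plain-Python sketch of merging them back into one list of lines, using a temporary directory with synthetic part files standing in for Spark's real output:

```python
# saveAsTextFile writes a directory of part files, not one file.
# Sketch: merge them back into a single list of lines. Synthetic
# part files in a temp directory stand in for Spark's output.
import os
import tempfile

outdir = tempfile.mkdtemp()
with open(os.path.join(outdir, "part-00000"), "w") as f:
    f.write("('in', 2)\n")
with open(os.path.join(outdir, "part-00001"), "w") as f:
    f.write("('hobbit', 1)\n")

lines = []
for name in sorted(os.listdir(outdir)):
    if name.startswith("part-"):
        with open(os.path.join(outdir, name)) as f:
            lines.extend(f.read().splitlines())

print(lines)  # ["('in', 2)", "('hobbit', 1)"]
```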
- · Word Count Program in IPython Notebook:
Inside Ubuntu, open the terminal and type the following command, which redirects you to an IPython notebook in the browser:
PYSPARK_PYTHON=python3 PYSPARK_DRIVER_PYTHON=ipython3 PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark
In the IPython notebook > click New > Python (conda root) > and here paste the above commands one by one and run the cells.
The screenshot of the above process looks like this.
The output will look like this: