Thursday, 8 December 2016

Classification of mushrooms and determining if they are edible or not

PROBLEM STATEMENT
We have to classify mushrooms on the basis of the given attributes, such as cap shape, cap surface and cap color, and determine whether they are edible or not.

DATASET DESCRIPTION
This data set includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family (pp. 500-525). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended; this latter class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like "leaflets three, let it be" for Poisonous Oak and Ivy.

Number of instances: 8124
Number of attributes: 22
Missing attribute values: 2480 (denoted by "?"), all for attribute #11 (stalk-root)
Class distribution:
    --    edible: 4208 (51.8%)
    -- poisonous: 3916 (48.2%)
    --     total: 8124 instances

ATTRIBUTE DESCRIPTION
1. cap-shape: bell=b,conical=c,convex=x,flat=f,knobbed=k,sunken=s
2. cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s 
3. cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r, pink=p,purple=u,red=e,white=w,yellow=y 
4. bruises?: bruises=t,no=f 
5. odor: almond=a,anise=l,creosote=c,fishy=y,foul=f, musty=m,none=n,pungent=p,spicy=s 
6. gill-attachment: attached=a,descending=d,free=f,notched=n 
7. gill-spacing: close=c,crowded=w,distant=d 
8. gill-size: broad=b,narrow=n 
9. gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g, green=r,orange=o,pink=p,purple=u,red=e, white=w,yellow=y 
10. stalk-shape: enlarging=e,tapering=t 
11. stalk-root: bulbous=b,club=c,cup=u,equal=e, rhizomorphs=z,rooted=r,missing=? 
12. stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s 
13. stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s 
14. stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y 
15. stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y 
16. veil-type: partial=p,universal=u 
17. veil-color: brown=n,orange=o,white=w,yellow=y 
18. ring-number: none=n,one=o,two=t 
19. ring-type: cobwebby=c,evanescent=e,flaring=f,large=l, none=n,pendant=p,sheathing=s,zone=z 
20. spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r, orange=o,purple=u,white=w,yellow=y 
21. population: abundant=a,clustered=c,numerous=n, scattered=s,several=v,solitary=y 
22. habitat: grasses=g,leaves=l,meadows=m,paths=p, urban=u,waste=w,woods=d



TECHNIQUE USED


 k-Nearest Neighbour (k-NN) classification technique


The k-Nearest Neighbours algorithm (k-NN for short) is a non-parametric method used for both classification and regression. In both cases, the input consists of the k closest training examples in the feature space; the output depends on whether k-NN is used for classification or regression (a toy sketch of both modes follows the two cases below):

·   In k-NN classification, the output is a class membership. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor.

·   In k-NN regression, the output is the property value for the object. This value is the average of the values of its k nearest neighbors.
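The toy sketch below (made-up data, not the mushroom dataset) illustrates both modes with scikit-learn's KNeighborsClassifier and KNeighborsRegressor:

    from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

    X = [[0], [1], [2], [3]]      # one feature, four training examples
    y_class = ['edible', 'edible', 'poisonous', 'poisonous']
    y_value = [0.0, 0.1, 0.9, 1.0]

    # Classification: majority vote among the 3 nearest neighbours.
    clf = KNeighborsClassifier(n_neighbors=3).fit(X, y_class)
    print(clf.predict([[1.1]]))   # -> ['edible'] (neighbours at 1, 2 and 0)

    # Regression: average of the 3 nearest neighbours' values.
    reg = KNeighborsRegressor(n_neighbors=3).fit(X, y_value)
    print(reg.predict([[1.1]]))   # -> mean of 0.1, 0.9 and 0.0, i.e. ~0.33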

DATA ANALYSIS USING PYTHON

  • Loading the relevant libraries

The first step in data analysis using Python is to import all the relevant libraries; a typical import block is sketched after the list below.
  1. NumPy stands for Numerical Python. Its most powerful feature is the n-dimensional array. The library also contains basic linear algebra functions, Fourier transforms, advanced random-number capabilities, and tools for integration with low-level languages like Fortran, C and C++.
  2. Matplotlib for plotting a wide variety of graphs, from histograms to line plots to heat maps.
  3. Pandas for structured data operations and manipulations. It is extensively used for data munging and preparation.
  4. Scikit-learn for machine learning. Built on NumPy, SciPy and matplotlib, this library contains many efficient tools for machine learning and statistical modelling, including classification, regression, clustering and dimensionality reduction.
  5. Statsmodels for statistical modeling. Statsmodels is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests.
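Assuming the conventional aliases, a typical import block for these libraries might look like the following (the exact imports in the linked code may differ):

    import numpy as np                  # n-dimensional arrays, linear algebra
    import matplotlib.pyplot as plt     # plotting
    import pandas as pd                 # structured data operations
    import statsmodels.api as sm        # statistical models and tests
    from sklearn.neighbors import KNeighborsClassifier   # k-NN classification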

  • Data exploration

Now we read the dataset using the read_csv() function. We then explore the data: df.head(10) shows the first ten rows, so that we can have a glance at the dataset, and df.describe() gives a summary of the attributes.
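A minimal sketch of this step, assuming the data is saved locally as 'mushrooms.csv' (the file name is a placeholder):

    import pandas as pd

    df = pd.read_csv('mushrooms.csv')   # placeholder path

    print(df.head(10))     # first ten rows, for a first glance at the data
    print(df.describe())   # summary of each attribute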



We removed all rows with null/NaN values, because NaN values in the features would cause an error when we try to fit the classification model.
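A sketch of the cleaning step; since the missing stalk-root values are coded as "?" (see the dataset description above), they are mapped to NaN first:

    import numpy as np

    # Map the "?" placeholders to NaN, then drop the affected rows;
    # df continues from the loading step above.
    df = df.replace('?', np.nan)
    df = df.dropna()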

  • k-NN classification technique

We divide our dataset into a train and a test set. After building the model on the training data using the k-NN classification technique, we test the model and check its accuracy; a sketch of this step follows below. The accuracy of the k-NN classifier is 1.00 out of 1 on the training data and 1.00 out of 1 on the test data.
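A hedged sketch of this step; the label column name 'class', the one-hot encoding and the choice k = 5 are assumptions, not necessarily what the linked code does:

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    # k-NN needs numeric inputs, so the categorical attributes are
    # one-hot encoded; df continues from the cleaning step above.
    X = pd.get_dummies(df.drop('class', axis=1))   # 'class' is the assumed label
    y = df['class']

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)

    knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
    print('Accuracy on training data: %.2f' % knn.score(X_train, y_train))
    print('Accuracy on test data: %.2f' % knn.score(X_test, y_test))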






CODES FOR THE ABOVE PROBLEM
Link: https://drive.google.com/open?id=0B1BplmDFYhwnRkp5ZlE2eU4yeHM



Regression technique applied to the Parkinson's disease dataset

PROBLEM STATEMENT 

To study correlation between different Parkinson's disease variables and to apply regression techniques.

DATASET DESCRIPTION


Parkinson's is a disease of the nervous system that mostly affects older people. It typically begins after the age of 50. The disease can be very hard to live with because it severely restricts mobility and as a result makes daily activities increasingly difficult.
This dataset is composed of a range of biomedical voice measurements from 42 people with early-stage Parkinson's disease recruited to a six-month trial of a telemonitoring device for remote symptom progression monitoring. The recordings were automatically captured in the patients' homes.
Columns in the table contain subject number, subject age, subject gender, time interval from baseline recruitment date, motor UPDRS, total UPDRS, and 16 biomedical voice measures. Each row corresponds to one of the 5,875 voice recordings from these individuals. The main aim of the data is to predict the motor and total UPDRS scores ('motor_UPDRS' and 'total_UPDRS') from the 16 voice measures.
Here our objective is to study the correlations between different Parkinson's disease variables and thereafter apply regression techniques.
Data Set Characteristics: Multivariate
Attribute Characteristics: Integer, Real
Associated Tasks: Regression
Number of Instances: 5875
Number of Attributes: 26
Area: Life

ATTRIBUTE DESCRIPTION

subject# - Integer that uniquely identifies each subject
age - Subject age
sex - Subject gender ('0' - male, '1' - female)
test_time - Time since recruitment into the trial. The integer part is the number of days since recruitment.
motor_UPDRS - Clinician's motor UPDRS score, linearly interpolated
total_UPDRS - Clinician's total UPDRS score, linearly interpolated
Jitter(%), Jitter(Abs), Jitter:RAP, Jitter:PPQ5, Jitter:DDP - Several measures of variation in fundamental frequency
Shimmer, Shimmer(dB), Shimmer:APQ3, Shimmer:APQ5, Shimmer:APQ11, Shimmer:DDA - Several measures of variation in amplitude
NHR, HNR - Two measures of the ratio of noise to tonal components in the voice
RPDE - A nonlinear dynamical complexity measure
DFA - Signal fractal scaling exponent
PPE - A nonlinear measure of fundamental frequency variation

TECHNIQUE USED

Regression

In statistical modelling, regression analysis is a statistical process for estimating the relationships among variables. It includes many techniques for modelling and analysing several variables when the focus is on the relationship between a dependent variable and one or more independent variables (or 'predictors'). More specifically, regression analysis helps one understand how the typical value of the dependent variable (or 'criterion variable') changes when any one of the independent variables is varied while the other independent variables are held fixed.
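As a toy illustration with made-up numbers, a least-squares fit of a straight line y ≈ a + b·x shows the idea:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # independent variable
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # dependent variable

    b, a = np.polyfit(x, y, 1)   # least-squares fit of y = a + b*x
    print('intercept a = %.2f, slope b = %.2f' % (a, b))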
DATA ANALYSIS USING PYTHON

  • Loading libraries
To perform data analysis using Python, we require the following libraries:

  1. NumPy stands for Numerical Python. Its most powerful feature is the n-dimensional array. The library also contains basic linear algebra functions, Fourier transforms, advanced random-number capabilities, and tools for integration with low-level languages like Fortran, C and C++.
  2. Matplotlib for plotting a wide variety of graphs, from histograms to line plots to heat maps.
  3. Pandas for structured data operations and manipulations. It is extensively used for data munging and preparation.
  4. Scikit-learn for machine learning. Built on NumPy, SciPy and matplotlib, this library contains many efficient tools for machine learning and statistical modelling, including classification, regression, clustering and dimensionality reduction.

  • Reading the dataset using read_csv()

After importing the libraries, we read the dataset using the read_csv() function; a sketch follows below.
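A minimal sketch, assuming the UCI file name parkinsons_updrs.data, which includes a header row (adjust the path to wherever the file is saved):

    import pandas as pd

    df = pd.read_csv('parkinsons_updrs.data')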




  • Data exploration using df.describe() and df.head()
After reading the dataset, we can start the process of data exploration. We use df.describe() to get a summary of the numerical fields in our dataset: it provides the count, min, max, mean and the quartiles Q1, Q2 and Q3. Both exploration calls are sketched in code after the next item.








The function df.head(10) prints the first ten rows of the dataset.
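In code, both exploration calls look like this, continuing from the loading step above:

    print(df.describe())   # count, mean, min, max and the quartiles per field
    print(df.head(10))     # first ten rows of the dataset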






  • Correlation matrix to study correlation between different Parkinson's variables
Next, we compute the correlation matrix to study the correlation between the different Parkinson's disease variables and to pick a dependent variable. After some simple calculation we find that Jitter(%) has the maximum correlation with all the other variables, so we take it as our dependent variable and the rest as our independent variables.
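A sketch of this step; the column label 'Jitter(%)' is taken from the attribute list above, though the exact label in the file may differ:

    # df continues from the loading step above.
    corr = df.corr()                        # pairwise correlation matrix
    print(corr['Jitter(%)'].sort_values())  # Jitter(%) against the rest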


  • OLS regression technique
Now we apply the OLS regression technique. As can be seen from the output, only some of the above variables truly contribute to the regression. In particular, the following do not seem to contribute to the model as much as the other variables, because their coefficients are less than 2 standard errors from zero (|t| < 2): x1, x2, x6, x8, x11, x14 and x18. In other words, for these variables we fail to reject the null hypothesis that their coefficients are zero. Thus, let us see how things change if we ignore them. Refitting without these variables, we find that some of the t-values have changed and the R-squared has decreased, but the new model seems to predict the Jitter percentage about as well as the old one. Thus, ignoring the variables with low t-scores did not do much damage.
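A hypothetical reconstruction of this procedure with statsmodels; the |t| < 2 cut-off follows the discussion above, while the column names are assumptions:

    import statsmodels.api as sm

    # df continues from the steps above; Jitter(%) as the dependent
    # variable follows the correlation analysis.
    y = df['Jitter(%)']
    X = sm.add_constant(df.drop('Jitter(%)', axis=1))

    model = sm.OLS(y, X).fit()
    print(model.summary())   # t-values, p-values, R-squared

    # Drop the predictors with |t| < 2 and refit, as discussed above.
    weak = [name for name, t in model.tvalues.items()
            if abs(t) < 2 and name != 'const']
    model2 = sm.OLS(y, X.drop(columns=weak)).fit()
    print(model2.summary())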


  • Ridge regression on this dataset
We also checked how ridge regression would perform on this dataset and observed the resulting coefficient plot; a sketch of this step follows below.
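A sketch of a standard ridge trace, keeping Jitter(%) as the dependent variable as above; the alpha grid is illustrative:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.linear_model import Ridge

    # Refit ridge regression over a range of regularisation strengths and
    # watch the coefficients shrink; df continues from the steps above.
    X = df.drop('Jitter(%)', axis=1)
    y = df['Jitter(%)']

    alphas = np.logspace(-3, 3, 50)
    coefs = [Ridge(alpha=a).fit(X, y).coef_ for a in alphas]

    plt.plot(alphas, coefs)
    plt.xscale('log')
    plt.xlabel('alpha (regularisation strength)')
    plt.ylabel('coefficient value')
    plt.show()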




CODES FOR SOLVING THE ABOVE PROBLEM

Link : https://drive.google.com/open?id=0B1BplmDFYhwnMnFSOTl0SWNfQ0U