Thursday, 8 December 2016

Classification of mushrooms and determining if they are edible or not

PROBLEM STATEMENT
We have to classify mushrooms on the basis of the given attributes, such as cap shape, cap surface and cap color, and determine whether they are edible or not.

DATASET DESCRIPTION
This data set includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family (pp. 500-525). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended; this latter class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like "leaflets three, let it be" for Poisonous Oak and Ivy.

Number of instances: 8124
Number of attributes: 22
Missing attribute values: 2480 (denoted by "?"), all for attribute #11 (stalk-root)
Class distribution:
    --    edible: 4208 (51.8%)
    -- poisonous: 3916 (48.2%)
    --     total: 8124 instances

ATTRIBUTE DESCRIPTION
1. cap-shape: bell=b,conical=c,convex=x,flat=f,knobbed=k,sunken=s
2. cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s 
3. cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r, pink=p,purple=u,red=e,white=w,yellow=y 
4. bruises?: bruises=t,no=f 
5. odor: almond=a,anise=l,creosote=c,fishy=y,foul=f, musty=m,none=n,pungent=p,spicy=s 
6. gill-attachment: attached=a,descending=d,free=f,notched=n 
7. gill-spacing: close=c,crowded=w,distant=d 
8. gill-size: broad=b,narrow=n 
9. gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g, green=r,orange=o,pink=p,purple=u,red=e, white=w,yellow=y 
10. stalk-shape: enlarging=e,tapering=t 
11. stalk-root: bulbous=b,club=c,cup=u,equal=e, rhizomorphs=z,rooted=r,missing=? 
12. stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s 
13. stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s 
14. stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y 
15. stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y 
16. veil-type: partial=p,universal=u 
17. veil-color: brown=n,orange=o,white=w,yellow=y 
18. ring-number: none=n,one=o,two=t 
19. ring-type: cobwebby=c,evanescent=e,flaring=f,large=l, none=n,pendant=p,sheathing=s,zone=z 
20. spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r, orange=o,purple=u,white=w,yellow=y 
21. population: abundant=a,clustered=c,numerous=n, scattered=s,several=v,solitary=y 
22. habitat: grasses=g,leaves=l,meadows=m,paths=p, urban=u,waste=w,woods=d



TECHNIQUE USED


 k-Nearest Neighbour (k-NN) classification technique


The k-Nearest Neighbours algorithm (k-NN for short) is a non-parametric method used for both classification and regression. In both cases, the input consists of the k closest training examples in the feature space; the output depends on whether k-NN is used for classification or regression (a toy sketch of both modes follows the two cases below):

·   In k-NN classification, the output is a class membership. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor.

·   In k-NN regression, the output is the property value for the object. This value is the average of the values of its k nearest neighbors.
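The toy sketch below (made-up data, not the mushroom dataset) illustrates both modes with scikit-learn's KNeighborsClassifier and KNeighborsRegressor:

    from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

    X = [[0], [1], [2], [3]]      # one feature, four training examples
    y_class = ['edible', 'edible', 'poisonous', 'poisonous']
    y_value = [0.0, 0.1, 0.9, 1.0]

    # Classification: majority vote among the 3 nearest neighbours.
    clf = KNeighborsClassifier(n_neighbors=3).fit(X, y_class)
    print(clf.predict([[1.1]]))   # -> ['edible'] (neighbours at 1, 2 and 0)

    # Regression: average of the 3 nearest neighbours' values.
    reg = KNeighborsRegressor(n_neighbors=3).fit(X, y_value)
    print(reg.predict([[1.1]]))   # -> mean of 0.1, 0.9 and 0.0, i.e. ~0.33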

DATA ANALYSIS USING PYTHON

  • Loading the relevant libraries

The first step in data analysis using Python is to import all the relevant libraries; a typical import block is sketched after the list below.
  1. NumPy stands for Numerical Python. Its most powerful feature is the n-dimensional array. The library also contains basic linear algebra functions, Fourier transforms, advanced random-number capabilities, and tools for integration with low-level languages like Fortran, C and C++.
  2. Matplotlib for plotting a wide variety of graphs, from histograms to line plots to heat maps.
  3. Pandas for structured data operations and manipulations. It is extensively used for data munging and preparation.
  4. Scikit-learn for machine learning. Built on NumPy, SciPy and matplotlib, this library contains many efficient tools for machine learning and statistical modelling, including classification, regression, clustering and dimensionality reduction.
  5. Statsmodels for statistical modeling. Statsmodels is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests.
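Assuming the conventional aliases, a typical import block for these libraries might look like the following (the exact imports in the linked code may differ):

    import numpy as np                  # n-dimensional arrays, linear algebra
    import matplotlib.pyplot as plt     # plotting
    import pandas as pd                 # structured data operations
    import statsmodels.api as sm        # statistical models and tests
    from sklearn.neighbors import KNeighborsClassifier   # k-NN classification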

  • Data exploration

Now we read the dataset using the read_csv() function. We then explore the data: df.head(10) shows the first ten rows, so that we can have a glance at the dataset, and df.describe() gives a summary of the attributes.
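A minimal sketch of this step, assuming the data is saved locally as 'mushrooms.csv' (the file name is a placeholder):

    import pandas as pd

    df = pd.read_csv('mushrooms.csv')   # placeholder path

    print(df.head(10))     # first ten rows, for a first glance at the data
    print(df.describe())   # summary of each attribute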



We removed all rows with null/NaN values, because NaN values in the features would cause an error when we try to fit the classification model.
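A sketch of the cleaning step; since the missing stalk-root values are coded as "?" (see the dataset description above), they are mapped to NaN first:

    import numpy as np

    # Map the "?" placeholders to NaN, then drop the affected rows;
    # df continues from the loading step above.
    df = df.replace('?', np.nan)
    df = df.dropna()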

  • k-NN classification technique

We divide our dataset into a train and a test set. After building the model on the training data using the k-NN classification technique, we test the model and check its accuracy; a sketch of this step follows below. The accuracy of the k-NN classifier is 1.00 out of 1 on the training data and 1.00 out of 1 on the test data.
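A hedged sketch of this step; the label column name 'class', the one-hot encoding and the choice k = 5 are assumptions, not necessarily what the linked code does:

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    # k-NN needs numeric inputs, so the categorical attributes are
    # one-hot encoded; df continues from the cleaning step above.
    X = pd.get_dummies(df.drop('class', axis=1))   # 'class' is the assumed label
    y = df['class']

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)

    knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
    print('Accuracy on training data: %.2f' % knn.score(X_train, y_train))
    print('Accuracy on test data: %.2f' % knn.score(X_test, y_test))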






CODES FOR THE ABOVE PROBLEM
Link: https://drive.google.com/open?id=0B1BplmDFYhwnRkp5ZlE2eU4yeHM



Regression technique applied to the Parkinson's disease dataset

PROBLEM STATEMENT 

To study correlation between different Parkinson's disease variables and to apply regression techniques.

DATASET DESCRIPTION


Parkinson's is a disease of the nervous system that mostly affects older people. It typically begins after the age of 50. The disease can be very hard to live with because it severely restricts mobility and as a result makes daily activities increasingly difficult.
This dataset is composed of a range of biomedical voice measurements from 42 people with early-stage Parkinson's disease recruited to a six-month trial of a telemonitoring device for remote symptom progression monitoring. The recordings were automatically captured in the patients' homes.
Columns in the table contain subject number, subject age, subject gender, time interval from baseline recruitment date, motor UPDRS, total UPDRS, and 16 biomedical voice measures. Each row corresponds to one of the 5,875 voice recordings from these individuals. The main aim of the data is to predict the motor and total UPDRS scores ('motor_UPDRS' and 'total_UPDRS') from the 16 voice measures.
Here our objective is to study the correlations between different Parkinson's disease variables and thereafter apply regression techniques.
Data Set Characteristics: Multivariate
Attribute Characteristics: Integer, Real
Associated Tasks: Regression
Number of Instances: 5875
Number of Attributes: 26
Area: Life

ATTRIBUTE DESCRIPTION

subject# - Integer that uniquely identifies each subject
age - Subject age
sex - Subject gender ('0' - male, '1' - female)
test_time - Time since recruitment into the trial. The integer part is the number of days since recruitment.
motor_UPDRS - Clinician's motor UPDRS score, linearly interpolated
total_UPDRS - Clinician's total UPDRS score, linearly interpolated
Jitter(%), Jitter(Abs), Jitter:RAP, Jitter:PPQ5, Jitter:DDP - Several measures of variation in fundamental frequency
Shimmer, Shimmer(dB), Shimmer:APQ3, Shimmer:APQ5, Shimmer:APQ11, Shimmer:DDA - Several measures of variation in amplitude
NHR, HNR - Two measures of the ratio of noise to tonal components in the voice
RPDE - A nonlinear dynamical complexity measure
DFA - Signal fractal scaling exponent
PPE - A nonlinear measure of fundamental frequency variation

TECHNIQUE USED

Regression

In statistical modelling, regression analysis is a statistical process for estimating the relationships among variables. It includes many techniques for modelling and analysing several variables when the focus is on the relationship between a dependent variable and one or more independent variables (or 'predictors'). More specifically, regression analysis helps one understand how the typical value of the dependent variable (or 'criterion variable') changes when any one of the independent variables is varied while the other independent variables are held fixed.
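As a toy illustration with made-up numbers, a least-squares fit of a straight line y ≈ a + b·x shows the idea:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # independent variable
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # dependent variable

    b, a = np.polyfit(x, y, 1)   # least-squares fit of y = a + b*x
    print('intercept a = %.2f, slope b = %.2f' % (a, b))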
DATA ANALYSIS USING PYTHON

  • Loading libraries
To perform data analysis using Python, we require the following libraries:

  1. NumPy stands for Numerical Python. Its most powerful feature is the n-dimensional array. The library also contains basic linear algebra functions, Fourier transforms, advanced random-number capabilities, and tools for integration with low-level languages like Fortran, C and C++.
  2. Matplotlib for plotting a wide variety of graphs, from histograms to line plots to heat maps.
  3. Pandas for structured data operations and manipulations. It is extensively used for data munging and preparation.
  4. Scikit-learn for machine learning. Built on NumPy, SciPy and matplotlib, this library contains many efficient tools for machine learning and statistical modelling, including classification, regression, clustering and dimensionality reduction.

  • Reading the dataset using read_csv()

After importing the libraries, we read the dataset using the read_csv() function; a sketch follows below.
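A minimal sketch, assuming the UCI file name parkinsons_updrs.data, which includes a header row (adjust the path to wherever the file is saved):

    import pandas as pd

    df = pd.read_csv('parkinsons_updrs.data')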




  • Data exploration using df.describe() and df.head()
After reading the dataset, we can start the process of data exploration. We use df.describe() to get a summary of the numerical fields in our dataset: it provides the count, min, max, mean and the quartiles Q1, Q2 and Q3. Both exploration calls are sketched in code after the next item.








The function df.head(10) prints the first ten rows of the dataset.
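In code, both exploration calls look like this, continuing from the loading step above:

    print(df.describe())   # count, mean, min, max and the quartiles per field
    print(df.head(10))     # first ten rows of the dataset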






  • Correlation matrix to study correlation between different Parkinson's variables
Next, we compute the correlation matrix to study the correlation between the different Parkinson's disease variables and to pick a dependent variable. After some simple calculation we find that Jitter(%) has the maximum correlation with all the other variables, so we take it as our dependent variable and the rest as our independent variables.
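A sketch of this step; the column label 'Jitter(%)' is taken from the attribute list above, though the exact label in the file may differ:

    # df continues from the loading step above.
    corr = df.corr()                        # pairwise correlation matrix
    print(corr['Jitter(%)'].sort_values())  # Jitter(%) against the rest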


  • OLS regression technique
Now we apply the OLS regression technique. As can be seen from the output, only some of the above variables truly contribute to the regression. In particular, the following do not seem to contribute to the model as much as the other variables, because their coefficients are less than 2 standard errors from zero (|t| < 2): x1, x2, x6, x8, x11, x14 and x18. In other words, for these variables we fail to reject the null hypothesis that their coefficients are zero. Thus, let us see how things change if we ignore them. Refitting without these variables, we find that some of the t-values have changed and the R-squared has decreased, but the new model seems to predict the Jitter percentage about as well as the old one. Thus, ignoring the variables with low t-scores did not do much damage.
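A hypothetical reconstruction of this procedure with statsmodels; the |t| < 2 cut-off follows the discussion above, while the column names are assumptions:

    import statsmodels.api as sm

    # df continues from the steps above; Jitter(%) as the dependent
    # variable follows the correlation analysis.
    y = df['Jitter(%)']
    X = sm.add_constant(df.drop('Jitter(%)', axis=1))

    model = sm.OLS(y, X).fit()
    print(model.summary())   # t-values, p-values, R-squared

    # Drop the predictors with |t| < 2 and refit, as discussed above.
    weak = [name for name, t in model.tvalues.items()
            if abs(t) < 2 and name != 'const']
    model2 = sm.OLS(y, X.drop(columns=weak)).fit()
    print(model2.summary())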


  • Ridge regression on this dataset
We also checked how ridge regression would perform on this dataset and observed the resulting coefficient plot; a sketch of this step follows below.
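A sketch of a standard ridge trace, keeping Jitter(%) as the dependent variable as above; the alpha grid is illustrative:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.linear_model import Ridge

    # Refit ridge regression over a range of regularisation strengths and
    # watch the coefficients shrink; df continues from the steps above.
    X = df.drop('Jitter(%)', axis=1)
    y = df['Jitter(%)']

    alphas = np.logspace(-3, 3, 50)
    coefs = [Ridge(alpha=a).fit(X, y).coef_ for a in alphas]

    plt.plot(alphas, coefs)
    plt.xscale('log')
    plt.xlabel('alpha (regularisation strength)')
    plt.ylabel('coefficient value')
    plt.show()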




CODES FOR SOLVING THE ABOVE PROBLEM

Link : https://drive.google.com/open?id=0B1BplmDFYhwnMnFSOTl0SWNfQ0U