Thursday, 8 December 2016

Classifying mushrooms and determining whether they are edible or not

PROBLEM STATEMENT
We have to classify mushrooms on the basis of the given attributes, such as cap shape, cap surface, cap color, etc., and determine whether they are edible or not.

DATASET DESCRIPTION
This data set includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota family, drawn from The Audubon Society Field Guide to North American Mushrooms (pp. 500-525). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This latter class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like "leaflets three, let it be" for Poison Oak and Ivy.

Number of instances- 8124
Number of attributes - 22 
Missing Attribute Values: 2480 of them (denoted by "?"), all for attribute #11 (stalk-root).
 Class Distribution:
    --    edible: 4208 (51.8%)
    -- poisonous: 3916 (48.2%)
    --     total: 8124 instances

ATTRIBUTE DESCRIPTION
1. cap-shape: bell=b,conical=c,convex=x,flat=f,knobbed=k,sunken=s 
2. cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s 
3. cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r, pink=p,purple=u,red=e,white=w,yellow=y 
4. bruises?: bruises=t,no=f 
5. odor: almond=a,anise=l,creosote=c,fishy=y,foul=f, musty=m,none=n,pungent=p,spicy=s 
6. gill-attachment: attached=a,descending=d,free=f,notched=n 
7. gill-spacing: close=c,crowded=w,distant=d 
8. gill-size: broad=b,narrow=n 
9. gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g, green=r,orange=o,pink=p,purple=u,red=e, white=w,yellow=y 
10. stalk-shape: enlarging=e,tapering=t 
11. stalk-root: bulbous=b,club=c,cup=u,equal=e, rhizomorphs=z,rooted=r,missing=? 
12. stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s 
13. stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s 
14. stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y 
15. stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y 
16. veil-type: partial=p,universal=u 
17. veil-color: brown=n,orange=o,white=w,yellow=y 
18. ring-number: none=n,one=o,two=t 
19. ring-type: cobwebby=c,evanescent=e,flaring=f,large=l, none=n,pendant=p,sheathing=s,zone=z 
20. spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r, orange=o,purple=u,white=w,yellow=y 
21. population: abundant=a,clustered=c,numerous=n, scattered=s,several=v,solitary=y 
22. habitat: grasses=g,leaves=l,meadows=m,paths=p, urban=u,waste=w,woods=d



TECHNIQUE USED


 K-Nearest Neighbour (k-NN) classification technique


The k-Nearest Neighbors algorithm (or k-NN for short) is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space. The output depends on whether k-NN is used for classification or regression:

·   In k-NN classification, the output is a class membership. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor.

·   In k-NN regression, the output is the property value for the object. This value is the average of the values of its k nearest neighbors.
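The majority-vote rule described above can be sketched in a few lines of Python. This is a toy illustration, not the code used in this post; the function name `knn_classify` and the 2-D points are made up for the example:

```python
from collections import Counter
import math

def knn_classify(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    dists = sorted(
        (math.dist(p, query), label)
        for p, label in zip(train_points, train_labels)
    )
    # Vote among the labels of the k closest points.
    k_labels = [label for _, label in dists[:k]]
    return Counter(k_labels).most_common(1)[0][0]

# Toy example: two well-separated clusters of 2-D points.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["e", "e", "e", "p", "p", "p"]
print(knn_classify(points, labels, (0.5, 0.5), k=3))  # → e
print(knn_classify(points, labels, (5.5, 5.5), k=3))  # → p
```

With k = 1 the same function reduces to assigning the class of the single nearest neighbor, as noted above.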

DATA ANALYSIS USING PYTHON

  •      Loading the relevant libraries 

The first step to data analysis using python was to get all the relevant libraries.
  1. NumPy stands for Numerical Python. The most powerful feature of NumPy is the n-dimensional array. This library also contains basic linear algebra functions, Fourier transforms, advanced random number capabilities and tools for integration with low-level languages like Fortran, C and C++.
  2. Matplotlib for plotting a wide variety of graphs, from histograms to line plots to heat maps.
  3. Pandas for structured data operations and manipulations. It is extensively used for data munging and preparation.
  4. Scikit-learn for machine learning. Built on NumPy, SciPy and matplotlib, this library contains many efficient tools for machine learning and statistical modeling, including classification, regression, clustering and dimensionality reduction.
  5. Statsmodels for statistical modeling. Statsmodels is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests.
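As a sketch, the imports for the libraries listed above might look like this (statsmodels is left commented out, since it is only needed for the optional statistical-modeling step):

```python
import numpy as np                  # numerical arrays and linear algebra
import pandas as pd                 # structured data operations
import matplotlib.pyplot as plt     # plotting
from sklearn.neighbors import KNeighborsClassifier    # k-NN classifier
from sklearn.model_selection import train_test_split  # train/test splitting
# import statsmodels.api as sm      # optional: statistical models and tests
```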

  • Data  exploration 

Now, we read the dataset using the read_csv() function. We then explore the data: df.head(10) shows the first ten rows so that we can have a glance at the dataset, and df.describe() gives a summary of each attribute.
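A minimal sketch of this exploration step, using an inline three-row stand-in for the real file (in practice one would pass the path of the downloaded CSV to read_csv):

```python
import io
import pandas as pd

# Tiny stand-in for the full file (the real one has 8124 rows and 23 columns).
sample = io.StringIO(
    "class,cap-shape,cap-surface,cap-color,stalk-root\n"
    "p,x,s,n,e\n"
    "e,x,s,y,c\n"
    "e,b,s,w,?\n"
)
df = pd.read_csv(sample)   # in practice: pd.read_csv("path/to/dataset.csv")

print(df.head(10))         # first rows of the dataset, for a quick glance
print(df.describe())       # per-column summary
```

Note that since every attribute in this dataset is categorical, describe() reports count, unique, top and freq for each column rather than numerical statistics like mean and standard deviation.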



We removed all the rows with null/NaN values. This step is needed because NaN values in the features would cause an error when we try to fit the classification model.
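Since the missing values are recorded as "?" (all in stalk-root), this cleaning step can be sketched as follows, again on a tiny made-up sample:

```python
import io
import pandas as pd

sample = io.StringIO(
    "class,cap-shape,stalk-root\n"
    "p,x,e\n"
    "e,x,?\n"
    "e,b,c\n"
)
# Tell pandas to treat "?" as a missing value when reading the file.
df = pd.read_csv(sample, na_values="?")
df = df.dropna()   # drop the rows whose stalk-root is missing
print(len(df))     # 2 rows remain
```

In the full dataset this removes the 2480 rows with a missing stalk-root value.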

  •  KNN Classification technique

We divide our dataset into a train set and a test set. After building the model on the training data using the kNN classification technique, we evaluate the model and check the accuracy of the classifier. The kNN classifier achieves an accuracy of 1.00 on the training data and 1.00 on the test data. Perfect accuracy is plausible on this dataset: a few attributes, notably odor, are almost perfectly predictive of edibility.
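A sketch of this final step with scikit-learn, using randomly generated stand-in features. In the real notebook, X would come from the letter-coded mushroom attributes after encoding them as numbers (e.g. with pd.get_dummies, since k-NN needs numeric distances), and y from the class column:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data: 200 samples with 22 integer-coded attributes and a
# made-up binary label; the real X/y come from the mushroom dataset.
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(200, 22))
y = (X.sum(axis=1) > X.sum(axis=1).mean()).astype(int)

# Hold out 25% of the rows as a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

print("train accuracy:", knn.score(X_train, y_train))
print("test accuracy:", knn.score(X_test, y_test))
```

On these random stand-in features the scores will be unremarkable; the 1.00/1.00 figures quoted above are for the actual mushroom data.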






CODE FOR THE ABOVE PROBLEM
Link: https://drive.google.com/open?id=0B1BplmDFYhwnRkp5ZlE2eU4yeHM


