PROBLEM STATEMENT
We have to classify mushrooms on the basis of the given attributes, like cap shape, cap surface, cap color, etc., and determine whether they are edible or not.
DATASET DESCRIPTION
This data set includes descriptions of hypothetical samples
corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota
Family (pp. 500-525). Each species is identified as definitely edible,
definitely poisonous, or of unknown edibility and not recommended. This latter
class was combined with the poisonous one. The Guide clearly states that there
is no simple rule for determining the edibility of a mushroom; no rule like
"leaflets three, let it be" for Poisonous Oak and Ivy.
Number of instances: 8124
Number of attributes: 22
Missing attribute values: 2480 (denoted by "?"), all for attribute #11 (stalk-root).
Class distribution:
-- edible: 4208 (51.8%)
-- poisonous: 3916 (48.2%)
-- total: 8124 instances
ATTRIBUTE DESCRIPTION
1. cap-shape: bell=b,conical=c,convex=x,flat=f,knobbed=k,sunken=s
2. cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s
3. cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r,pink=p,purple=u,red=e,white=w,yellow=y
4. bruises?: bruises=t,no=f
5. odor: almond=a,anise=l,creosote=c,fishy=y,foul=f,musty=m,none=n,pungent=p,spicy=s
6. gill-attachment: attached=a,descending=d,free=f,notched=n
7. gill-spacing: close=c,crowded=w,distant=d
8. gill-size: broad=b,narrow=n
9. gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g,green=r,orange=o,pink=p,purple=u,red=e,white=w,yellow=y
10. stalk-shape: enlarging=e,tapering=t
11. stalk-root: bulbous=b,club=c,cup=u,equal=e,rhizomorphs=z,rooted=r,missing=?
12. stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s
13. stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s
14. stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y
15. stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y
16. veil-type: partial=p,universal=u
17. veil-color: brown=n,orange=o,white=w,yellow=y
18. ring-number: none=n,one=o,two=t
19. ring-type: cobwebby=c,evanescent=e,flaring=f,large=l,none=n,pendant=p,sheathing=s,zone=z
20. spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r,orange=o,purple=u,white=w,yellow=y
21. population: abundant=a,clustered=c,numerous=n,scattered=s,several=v,solitary=y
22. habitat: grasses=g,leaves=l,meadows=m,paths=p,urban=u,waste=w,woods=d
TECHNIQUE USED
K-Nearest Neighbour Classification
The k-Nearest Neighbors algorithm (k-NN for short) is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space. The output depends on whether k-NN is used for classification or regression:
· In k-NN classification, the output is a class membership. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, the object is simply assigned to the class of its single nearest neighbor.
· In k-NN regression, the output is the property value for the object. This value is the average of the values of its k nearest neighbors.
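The majority-vote rule described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used in the report; the function name `knn_predict` and the toy points are made up for the example.

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query point to every training example
    dists = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    # Majority vote among the labels of the k closest examples
    top_k = [label for _, label in dists[:k]]
    return Counter(top_k).most_common(1)[0][0]

# Two edible ("e") points near the origin, two poisonous ("p") points far away
X = [(0, 0), (0, 1), (5, 5), (6, 5)]
y = ["e", "e", "p", "p"]
print(knn_predict(X, y, (1, 0), k=3))  # the two nearest neighbors are "e"
```

With k = 1 the same function reduces to nearest-neighbor assignment, matching the k = 1 special case noted above.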
DATA ANALYSIS USING PYTHON
- Loading the relevant libraries
The first step of the analysis was to import all the relevant libraries.
- NumPy stands for Numerical Python. Its most powerful feature is the n-dimensional array. The library also contains basic linear algebra functions, Fourier transforms, advanced random number capabilities and tools for integration with low-level languages like Fortran, C and C++.
- Matplotlib for plotting a wide variety of graphs, from histograms to line plots to heat maps.
- Pandas for structured data operations and manipulation. It is extensively used for data munging and preparation.
- Scikit-learn for machine learning. Built on NumPy, SciPy and matplotlib, this library contains many efficient tools for machine learning and statistical modeling, including classification, regression, clustering and dimensionality reduction.
- Statsmodels for statistical modeling. Statsmodels is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests.
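A typical import block for the libraries listed above might look as follows; the aliases `np`, `pd` and `plt` are conventional, and statsmodels is included only because the report mentions it.

```python
import numpy as np                              # numerical arrays and linear algebra
import pandas as pd                             # structured data handling
import matplotlib.pyplot as plt                 # plotting
from sklearn.neighbors import KNeighborsClassifier  # the k-NN classifier used later
```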
- Data exploration
We read the dataset using the read_csv() function. We then explore the data with df.head(10), which shows the first ten rows so we can have a glance at the dataset, and df.describe(), which gives a summary of the attributes.
We removed all rows with null/NaN values (the "?" entries in stalk-root). This step is necessary because missing values in the features cause an error when we try to fit the model.
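The exploration and cleaning steps can be sketched as follows. The two-row CSV snippet here is a made-up stand-in for the real file (commonly saved as mushrooms.csv); only the column names and the "?" convention come from the dataset description.

```python
import io
import pandas as pd

# In the actual analysis this would be pd.read_csv("mushrooms.csv");
# a tiny inline snippet stands in for the file here.
csv = io.StringIO(
    "class,cap-shape,stalk-root\n"
    "e,x,b\n"
    "p,x,?\n"
)
df = pd.read_csv(csv)

print(df.head(10))    # first rows of the dataset
print(df.describe())  # summary of each attribute

# "?" marks the missing stalk-root values; convert them to NaN and drop those rows
df = df.replace("?", pd.NA).dropna()
```

On the full dataset this drops the 2480 rows with missing stalk-root values noted earlier.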
- KNN Classification technique
We divide our dataset into training and test sets. After building the model on the training data using the KNN classification technique, we test the model and check the accuracy of the classifier. The KNN classifier achieves an accuracy of 1.00 on the training data and 1.00 on the test data.
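The train/test workflow described above can be sketched with scikit-learn. The small DataFrame below is hypothetical stand-in data (in the real run, df comes from read_csv as described earlier); since the mushroom attributes are categorical letters, they must be encoded numerically before k-NN can measure distances, here via one-hot encoding with pd.get_dummies.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical stand-in for the cleaned mushroom frame
df = pd.DataFrame({
    "class":     ["e", "e", "p", "p", "e", "p", "e", "p"],
    "odor":      ["a", "n", "f", "p", "l", "f", "n", "p"],
    "gill-size": ["b", "b", "n", "n", "b", "n", "b", "n"],
})

# k-NN needs numeric inputs, so one-hot encode the categorical attributes
X = pd.get_dummies(df.drop(columns="class"))
y = df["class"]

# Hold out a quarter of the rows as the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

print("train accuracy:", knn.score(X_train, y_train))
print("test accuracy:",  knn.score(X_test, y_test))
```

The perfect accuracies reported above are plausible for this dataset: the odor attribute alone separates the classes almost completely.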
CODES FOR THE ABOVE PROBLEM
Link: https://drive.google.com/open?id=0B1BplmDFYhwnRkp5ZlE2eU4yeHM