PROBLEM STATEMENT
To study correlation between different Parkinson's disease variables and to apply regression techniques.
DATASET DESCRIPTION
Parkinson's is a disease of the nervous system that mostly affects older people. It
typically begins after the age of 50. The disease can be very hard to live with
because it severely restricts mobility and as a result makes daily activities
increasingly difficult.
This dataset is composed of a range of biomedical voice measurements from 42
people with early-stage Parkinson's disease recruited to a six-month trial of a
telemonitoring device for remote symptom progression monitoring. The recordings
were automatically captured in the patients' homes.
Columns in the table contain subject number, subject age, subject gender, time
interval from baseline recruitment date, motor UPDRS, total UPDRS, and 16
biomedical voice measures (UPDRS is the Unified Parkinson's Disease Rating
Scale). Each row corresponds to one of 5,875 voice recordings from these
individuals. The main aim of the data is to predict the motor and total UPDRS
scores ('motor_UPDRS' and 'total_UPDRS') from the 16 voice measures.
Here our objective is to study the correlations between the different
Parkinson's disease variables and then to apply regression techniques.
Data Set Characteristics: Multivariate
Attribute Characteristics: Integer, Real
Associated Tasks: Regression
Number of Instances: 5875
Number of Attributes: 26
Area: Life
ATTRIBUTE DESCRIPTION
subject# - Integer that uniquely identifies each subject
age - Subject age
sex - Subject gender '0' - male, '1' - female
test_time - Time since recruitment into the trial. The integer part is the
number of days since recruitment.
motor_UPDRS - Clinician's motor UPDRS score, linearly interpolated
total_UPDRS - Clinician's total UPDRS score, linearly interpolated
Jitter(%), Jitter(Abs), Jitter:RAP, Jitter:PPQ5, Jitter:DDP - Several measures of
variation in fundamental frequency
Shimmer, Shimmer(dB), Shimmer:APQ3, Shimmer:APQ5, Shimmer:APQ11, Shimmer:DDA -
Several measures of variation in amplitude
NHR,HNR - Two measures of ratio of noise to tonal components in the voice
RPDE - A nonlinear dynamical complexity measure
DFA - Signal fractal scaling exponent
PPE - A nonlinear measure of fundamental frequency variation
TECHNIQUE USED
Regression
In statistical modelling, regression analysis is a statistical process for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables, when the
focus is on the relationship between a dependent variable and one or more independent variables
(or 'predictors'). More specifically, regression analysis helps one understand how the typical value of
the dependent variable (or 'criterion variable') changes when any one of the independent variables is varied, while the other independent variables are held fixed.
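The "held fixed" idea can be made concrete with a small least-squares fit. The sketch below is only an illustration on synthetic data, not the analysis from this post: the variables `x1`, `x2`, `y` and the true coefficients (1.0, 2.0, -0.5) are assumptions made up for the example. Each fitted slope estimates how `y` changes per unit change in one predictor while the other is held fixed.

```python
import numpy as np

# Synthetic data (an assumption for illustration, not the Parkinson's dataset).
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# True relationship: y = 1.0 + 2.0*x1 - 0.5*x2 + small noise
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.1, size=n)

# Design matrix with a column of ones for the intercept.
X = np.column_stack([np.ones(n), x1, x2])

# Ordinary least squares: minimise ||X @ beta - y||^2.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, b1, b2 = beta

print("intercept:", intercept)
print("change in y per unit x1 (x2 held fixed):", b1)
print("change in y per unit x2 (x1 held fixed):", b2)
```

With this little noise the estimated slopes should land close to the coefficients used to generate the data.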
DATA ANALYSIS USING PYTHON
- Loading libraries
- NumPy stands for Numerical Python. The most powerful feature of NumPy is its n-dimensional array. This library also contains basic linear algebra functions, Fourier transforms, advanced random number capabilities and tools for integration with other low-level languages like Fortran, C and C++.
- Matplotlib for plotting a wide variety of graphs, from histograms to line plots to heat maps.
- Pandas for structured data operations and manipulations. It is extensively used for data munging and preparation.
- Scikit Learn for machine learning. Built on NumPy, SciPy and matplotlib, this library contains a lot of efficient tools for machine learning and statistical modeling including classification, regression, clustering and dimensionality reduction.
- Reading the dataset using read_csv( )
- Data Exploration using df.describe( ) and df.head( )
- Correlation matrix to study correlation between different Parkinson's variables
Next we compute the correlation matrix to study the
correlation between the different Parkinson's disease variables and to choose
the dependent variable. After some simple calculation we find that
Jitter(%) has the highest overall correlation with the other variables. So,
we take this as our dependent variable and the rest as our independent
variables.
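The loading, exploration and correlation steps might look roughly like the sketch below. Several things here are assumptions: the file name in the commented-out `read_csv` call is hypothetical, the data frame is synthetic (only the column names are borrowed from the dataset), and "maximum correlation with all the other variables" is read as the highest mean absolute correlation, which may not be the exact calculation used in this post.

```python
import numpy as np
import pandas as pd

# With the real data one would load the file instead, for example:
# df = pd.read_csv("parkinsons_updrs.csv")  # hypothetical file name

# Synthetic stand-in so the example runs on its own (values are made up).
rng = np.random.default_rng(42)
n = 1000
latent = rng.normal(size=n)  # shared factor driving the voice measures
df = pd.DataFrame({
    "Jitter(%)": latent + 0.1 * rng.normal(size=n),
    "Shimmer": latent + 0.7 * rng.normal(size=n),
    "NHR": latent + 1.0 * rng.normal(size=n),
    "HNR": -latent + 0.8 * rng.normal(size=n),
    "motor_UPDRS": rng.normal(size=n),
})

# Data exploration.
print(df.head())
print(df.describe())

# Pairwise Pearson correlations between all columns.
corr = df.corr()
print(corr)

# Mean absolute correlation of each column with the others
# (subtract 1.0 to drop the column's correlation with itself).
mean_abs = (corr.abs().sum(axis=1) - 1.0) / (len(corr) - 1)
target = mean_abs.idxmax()
print("Most correlated column overall:", target)
```

Absolute values are used so that a strongly negative correlation (as HNR has here) counts as much as a strongly positive one.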
- OLS Regression Technique
not do much
damage.
- To check how Ridge regression would work on this dataset
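One way to check how Ridge regression behaves relative to OLS is sketched below with scikit-learn's `LinearRegression` and `Ridge`. This is a sketch on synthetic data, not the result from this post: the nearly collinear predictors, the noise level and `alpha=10.0` are all assumptions chosen for illustration. Correlated voice measures are the kind of setting where the Ridge penalty tends to shrink and stabilise the coefficients.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic data with two nearly identical predictors (made up for illustration).
rng = np.random.default_rng(0)
n = 300
base = rng.normal(size=n)
X = np.column_stack([
    base + 0.01 * rng.normal(size=n),  # predictor 1
    base + 0.01 * rng.normal(size=n),  # predictor 2, almost a copy of predictor 1
    rng.normal(size=n),                # predictor 3, independent of the others
])
y = 3.0 * base + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=n)

# Plain least squares versus least squares with an L2 penalty on the coefficients.
ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)  # alpha is an arbitrary illustrative choice

ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)

print("OLS coefficients:  ", ols.coef_)
print("Ridge coefficients:", ridge.coef_)
print("Coefficient norms (OLS, Ridge):", ols_norm, ridge_norm)
print("R^2 on training data (OLS, Ridge):", ols.score(X, y), ridge.score(X, y))
```

The Ridge coefficient vector has a smaller norm than the OLS one; in practice `alpha` would be chosen by cross-validation rather than fixed by hand.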
Link : https://drive.google.com/open?id=0B1BplmDFYhwnMnFSOTl0SWNfQ0U