Thursday, 8 December 2016

Regression technique applied on Parkinson’s disease dataset

PROBLEM STATEMENT 

To study the correlations between different Parkinson's disease variables and to apply regression techniques.

DATASET DESCRIPTION


Parkinson's is a disease of the nervous system that mostly affects older people. It typically begins after the age of 50. The disease can be very hard to live with because it severely restricts mobility and as a result makes daily activities increasingly difficult.
This dataset is composed of a range of biomedical voice measurements from 42 people with early-stage Parkinson's disease recruited to a six-month trial of a telemonitoring device for remote symptom progression monitoring. The recordings were automatically captured in the patients' homes.
Columns in the table contain subject number, subject age, subject gender, time interval from baseline recruitment date, motor UPDRS, total UPDRS, and 16 biomedical voice measures. Each row corresponds to one of 5,875 voice recordings from these individuals. The main aim of the data is to predict the motor and total UPDRS scores ('motor_UPDRS' and 'total_UPDRS') from the 16 voice measures.
Here our objective is to study the correlations between different Parkinson's disease variables and thereafter use regression techniques.
Data Set Characteristics: Multivariate
Attribute Characteristics: Integer, Real
Associated Tasks: Regression
Number of Instances: 5875
Number of Attributes: 26
Area: Life

ATTRIBUTE DESCRIPTION

subject# - Integer that uniquely identifies each subject
age - Subject age
sex - Subject gender '0' - male, '1' - female
test_time - Time since recruitment into the trial. The integer part is the 
number of days since recruitment.
motor_UPDRS - Clinician's motor UPDRS score, linearly interpolated
total_UPDRS - Clinician's total UPDRS score, linearly interpolated
Jitter(%),Jitter(Abs),Jitter:RAP,Jitter:PPQ5,Jitter:DDP - Several measures of 
variation in fundamental frequency
Shimmer,Shimmer(dB),Shimmer:APQ3,Shimmer:APQ5,Shimmer:APQ11,Shimmer:DDA - 
Several measures of variation in amplitude
NHR,HNR - Two measures of ratio of noise to tonal components in the voice
RPDE - A nonlinear dynamical complexity measure
DFA - Signal fractal scaling exponent
PPE - A nonlinear measure of fundamental frequency variation 

TECHNIQUE USED

Regression

In statistical modelling, regression analysis is a statistical process for estimating the relationships
among variables. It includes many techniques for modelling and analysing several variables, when the
focus is on the relationship between a dependent variable and one or more independent variables
(or 'predictors'). More specifically, regression analysis helps one understand how the typical value of
the dependent variable (or 'criterion variable') changes when any one of the independent
variables is varied, while the other independent variables are held fixed.
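
The "held fixed" interpretation can be illustrated with a minimal sketch. The data here are synthetic and noise-free (the relation y = 2 + 3*x1 - x2 is an assumption made purely for illustration and has nothing to do with the Parkinson's dataset), so the fitted coefficient of x1 equals the change in the prediction when x1 increases by one unit while x2 stays the same.

```python
import numpy as np

# Synthetic, noise-free data (illustrative assumption, not the Parkinson's data).
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)
y = 2.0 + 3.0 * x1 - 1.0 * x2

# Design matrix with an intercept column, fitted by least squares.
X = np.column_stack([np.ones_like(x1), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Increase x1 by one unit while holding x2 fixed at 0.5.
base = np.array([1.0, 0.5, 0.5])
shifted = np.array([1.0, 1.5, 0.5])
delta = shifted @ beta - base @ beta

print("coefficients:", np.round(beta, 3))
print("change in prediction:", round(float(delta), 3))
```
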
DATA ANALYSIS USING PYTHON

  • Loading libraries
To perform the data analysis in Python, we need the following libraries:

  1. NumPy stands for Numerical Python. Its most powerful feature is the n-dimensional array. The library also contains basic linear algebra functions, Fourier transforms, advanced random number capabilities and tools for integration with lower-level languages such as Fortran, C and C++.
  2. Matplotlib for plotting a wide variety of graphs, from histograms to line plots to heat plots.
  3. Pandas for structured data operations and manipulations. It is extensively used for data munging and preparation.
  4. Scikit Learn for machine learning. Built on NumPy, SciPy and matplotlib, this library contains many efficient tools for machine learning and statistical modelling, including classification, regression, clustering and dimensionality reduction.

  •  Reading the dataset using read_csv( )

After importing the libraries, we read the dataset using the read_csv() function.
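
A minimal sketch of this step is given here. The original screenshot is not available, so the details are assumptions: the file name in the comment is hypothetical, and the three rows are made-up placeholder values that only mimic a few of the column names from the attribute list.

```python
import io
import pandas as pd

# Placeholder rows with made-up values; in practice, pass the path of the
# downloaded CSV file to pd.read_csv instead of this in-memory text.
sample_csv = io.StringIO(
    "subject#,age,sex,test_time,motor_UPDRS,total_UPDRS,Jitter(%)\n"
    "1,70,0,5.0,20.0,30.0,0.005\n"
    "1,70,0,12.0,21.0,31.0,0.004\n"
    "2,60,1,6.0,15.0,22.0,0.007\n"
)

# Hypothetical real-world call: pd.read_csv("parkinsons_updrs.data")
df = pd.read_csv(sample_csv)

print(df.shape)   # (rows, columns)
print(df.dtypes)  # inferred type of each column
```
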




  • Data Exploration using df.describe( ) and df.head( )
After reading the dataset, we can start the process of data exploration. We use the function df.describe( ) to get a summary of the numerical fields in our dataset. It provides the count, mean, standard deviation, min, max and the quartiles Q1, Q2 and Q3 of each field.








The function df.head(10) prints the first ten rows of the dataset.
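
Both exploration calls can be sketched as follows. The DataFrame is a small synthetic stand-in with random values (an assumption for illustration, not the real recordings); only the column names are borrowed from the attribute list.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the dataset (random values, for illustration only).
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "age": rng.integers(50, 85, size=25),
    "motor_UPDRS": rng.uniform(5, 40, size=25),
    "Jitter(%)": rng.uniform(0.001, 0.02, size=25),
})

summary = df.describe()   # count, mean, std, min, 25%, 50%, 75%, max per column
first_rows = df.head(10)  # first ten rows as a new DataFrame

print(summary)
print(first_rows)
```
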






  • Correlation matrix to study correlation between different Parkinson's variables
Next we compute the correlation matrix to study the correlations between the different Parkinson's disease variables and to choose the dependent variable. After some simple calculation we find that Jitter(%) has the highest correlation with the other variables. So, we take it as our dependent variable and the rest as our independent variables.
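
The post does not say which "simple calculation" was used, so the following is only one plausible sketch: it assumes the dependent variable is picked as the column with the highest mean absolute correlation with all other columns. The data are synthetic, with column "a" constructed to drive "b" and "c".

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: "a" drives "b" and "c"; "d" is independent noise.
rng = np.random.default_rng(2)
a = rng.normal(size=1000)
df = pd.DataFrame({
    "a": a,
    "b": a + 0.5 * rng.normal(size=1000),
    "c": a + 0.5 * rng.normal(size=1000),
    "d": rng.normal(size=1000),
})

corr = df.corr()  # Pearson correlation matrix

# Mean absolute correlation of each variable with all the others
# (subtract the 1.0 on the diagonal before averaging).
mean_abs = (corr.abs().sum(axis=1) - 1.0) / (len(corr) - 1)
target = mean_abs.idxmax()

print(corr.round(2))
print("candidate dependent variable:", target)
```
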


  • OLS Regression Technique
Now we apply the OLS (ordinary least squares) regression technique. As can be seen from the output, only some of the variables truly contribute to the regression. In particular, the following do not seem to contribute as much as the others because their t-values are below 2 in absolute value, meaning their estimated coefficients lie within about 2 standard errors of zero: x1, x2, x6, x8, x11, x14 and x18. In other words, for these variables we fail to reject the null hypothesis that the coefficient is zero. Let us therefore see how things change if we drop them. In the results obtained after dropping these variables, some of the t-values have changed and the R-squared value has decreased, but the new model seems to predict the Jitter percentage as well as the old one. Thus, ignoring the variables with low t-scores did not do much damage.
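
The fit, filter and refit procedure can be sketched with NumPy alone. This is not the author's code (the x1, x2, ... labels suggest a library OLS summary was used); the helper ols_fit is a hypothetical name, and the data are synthetic, with the response depending only on the first two of four predictors.

```python
import numpy as np

def ols_fit(X, y):
    """Ordinary least squares with an intercept; returns coefficients,
    t-values and R-squared (the intercept is element 0)."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    dof = n - A.shape[1]
    sigma2 = resid @ resid / dof              # residual variance
    cov = sigma2 * np.linalg.inv(A.T @ A)     # covariance of the estimates
    t_values = beta / np.sqrt(np.diag(cov))   # coefficient / standard error
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return beta, t_values, r2

# Synthetic data: y depends on the first two columns only (illustrative assumption).
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4))
y = 1.0 + 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=200)

beta_full, t_full, r2_full = ols_fit(X, y)

# Keep only predictors with |t| >= 2 (t_full[0] is the intercept), then refit.
keep = np.abs(t_full[1:]) >= 2.0
beta_red, t_red, r2_red = ols_fit(X[:, keep], y)

print("t-values:", np.round(t_full, 2))
print("kept columns:", np.flatnonzero(keep))
print("R-squared full vs reduced:", round(r2_full, 4), round(r2_red, 4))
```

Because the reduced model is nested in the full one, its R-squared can never be higher; the interesting question, as in the text, is how small the drop is.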


  • To check how Ridge regression would work on this dataset
We also tried Ridge regression on this dataset and observed the following plot.
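
Since the plot itself is not available, here is a hedged sketch of how Ridge regression might be explored with Scikit Learn. It assumes the quantity of interest is how the coefficients change with the regularisation strength alpha; the data are synthetic, with two nearly collinear predictors, which is the situation Ridge is designed for.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data with two nearly collinear predictors (illustrative assumption).
rng = np.random.default_rng(4)
x1 = rng.normal(size=200)
x2 = x1 + 0.05 * rng.normal(size=200)
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])
y = 2.0 * x1 + 1.0 * x3 + rng.normal(size=200)

alphas = np.logspace(-3, 3, 7)  # regularisation strengths to try
coefs = np.array([Ridge(alpha=a).fit(X, y).coef_ for a in alphas])
norms = np.linalg.norm(coefs, axis=1)  # overall coefficient size per alpha

for a, c in zip(alphas, coefs):
    print(f"alpha={a:g}  coefficients={np.round(c, 3)}")
```

Plotting each column of coefs against alphas on a logarithmic x-axis with Matplotlib gives a coefficient-path plot. Whether that matches the plot in the original post is an assumption.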




CODE FOR SOLVING THE ABOVE PROBLEM

Link : https://drive.google.com/open?id=0B1BplmDFYhwnMnFSOTl0SWNfQ0U
