Data Storytelling and Visualization: Regression Techniques Applied to a Parkinson’s Disease Dataset

PROBLEM STATEMENT

To study the correlations between different Parkinson's disease variables and to apply regression techniques to them.

DATASET DESCRIPTION

Parkinson's is a disease of the nervous system that mostly affects older people. It typically begins after the age of 50. The disease can be very hard to live with because it severely restricts mobility and as a result makes daily activities increasingly difficult.

This dataset is composed of a range of biomedical voice measurements from 42 people with early-stage Parkinson's disease recruited to a six-month trial of a telemonitoring device for remote symptom progression monitoring. The recordings were automatically captured in the patients' homes.

Columns in the table contain the subject number, subject age, subject gender, time interval from baseline recruitment date, motor UPDRS, total UPDRS, and 16 biomedical voice measures. Each row corresponds to one of 5,875 voice recordings from these individuals. The main aim of the data is to predict the motor and total UPDRS scores ('motor_UPDRS' and 'total_UPDRS') from the 16 voice measures.

Here our objective is to study the correlations between the different Parkinson’s disease variables and then to apply regression techniques.

Data Set Characteristics: Multivariate

Attribute Characteristics: Integer, Real

Associated Tasks: Regression

Number of Instances: 5875

Number of Attributes: 26

Area: Life

ATTRIBUTE DESCRIPTION

subject# - Integer that uniquely identifies each subject

age - Subject age

sex - Subject gender ('0' = male, '1' = female)

test_time - Time since recruitment into the trial. The integer part is the number of days since recruitment.

motor_UPDRS - Clinician's motor UPDRS score, linearly interpolated

total_UPDRS - Clinician's total UPDRS score, linearly interpolated

Jitter(%), Jitter(Abs), Jitter:RAP, Jitter:PPQ5, Jitter:DDP - Several measures of variation in fundamental frequency

Shimmer, Shimmer(dB), Shimmer:APQ3, Shimmer:APQ5, Shimmer:APQ11, Shimmer:DDA - Several measures of variation in amplitude

NHR, HNR - Two measures of the ratio of noise to tonal components in the voice

RPDE - A nonlinear dynamical complexity measure

DFA - Signal fractal scaling exponent

PPE - A nonlinear measure of fundamental frequency variation

TECHNIQUE USED

Regression




In statistical modelling, regression analysis is a statistical process for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables (or 'predictors'). More specifically, regression analysis helps one understand how the typical value of the dependent variable (or 'criterion variable') changes when any one of the independent variables is varied, while the other independent variables are held fixed.
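For the linear case used later in this report, the idea can be written as a single equation. The notation below is our own assumption for illustration (y for the dependent variable, x for the predictors, beta for the coefficients, epsilon for the error term); it does not appear in the dataset documentation.

```latex
% Linear regression model with p predictors and n observations
y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \varepsilon_i,
\qquad i = 1, \dots, n
```

Under this model, the coefficient \beta_j is the change in the expected value of y for a one-unit change in x_j while the other predictors are held fixed.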

DATA ANALYSIS USING PYTHON

Loading libraries

To perform the data analysis in Python, we need the following libraries:

NumPy stands for Numerical Python. Its most powerful feature is the n-dimensional array. The library also contains basic linear algebra functions, Fourier transforms, advanced random number capabilities and tools for integration with lower-level languages such as Fortran, C and C++.

Matplotlib is used for plotting a wide variety of graphs, from histograms to line plots to heat maps.

Pandas is used for structured data operations and manipulations. It is extensively used for data munging and preparation.

Scikit-learn is used for machine learning. Built on NumPy, SciPy and Matplotlib, this library contains many efficient tools for machine learning and statistical modeling, including classification, regression, clustering and dimensionality reduction.

Reading the dataset using read_csv( )

After importing the libraries, we read the dataset into a DataFrame using the read_csv() function.
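A minimal sketch of this step is given here. To keep it runnable without the data file, it reads a three-row excerpt from an in-memory string; the numeric values in that excerpt are made up for illustration, only a subset of the columns is shown, and the helper name load_dataset is our own.

```python
import io

import pandas as pd

# Hypothetical excerpt in the column layout described above.
# The values are invented for illustration and are NOT real dataset rows.
SAMPLE_CSV = """subject#,age,sex,test_time,motor_UPDRS,total_UPDRS,Jitter(%),Shimmer,HNR
1,72,0,5.64,28.2,34.4,0.0066,0.0256,21.6
1,72,0,12.67,28.4,34.9,0.0030,0.0202,27.2
2,58,1,4.72,11.0,13.5,0.0048,0.0171,23.0
"""


def load_dataset(source):
    """Read a comma-separated file (a path or a file-like object) into a DataFrame."""
    return pd.read_csv(source)


df = load_dataset(io.StringIO(SAMPLE_CSV))
print(df.shape)          # (rows, columns)
print(list(df.columns))  # column names taken from the header line
```

With the real data, the same function would be called with the path to the downloaded file instead of the in-memory string.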

Data Exploration using df.describe( ) and df.head( )

After reading the dataset, we can start the process of data exploration. We use the function df.describe() to get a summary of the numerical fields in our dataset. It reports the count, mean, standard deviation, min and max, together with the quartiles Q1, Q2 and Q3.

The function df.head(10) prints the first ten rows of the dataset.
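The two calls can be sketched as follows. The DataFrame here is a synthetic stand-in generated with random numbers (an assumption made so the example runs on its own); only the column names are borrowed from the attribute list.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data, NOT the real measurements.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(50, 85, size=20),
    "motor_UPDRS": rng.uniform(5, 40, size=20),
    "Jitter(%)": rng.uniform(0.001, 0.02, size=20),
})

# One column of statistics per numerical field:
# count, mean, std, min, 25% (Q1), 50% (Q2), 75% (Q3), max
summary = df.describe()
print(summary)

# First ten rows of the table
first_rows = df.head(10)
print(first_rows)
```

The rows labelled 25%, 50% and 75% in the summary correspond to Q1, Q2 and Q3.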

Correlation matrix to study correlation between different Parkinson's variables

Next we compute the correlation matrix to study the correlations between the different Parkinson’s disease variables and to choose the dependent variable. After some simple calculation we find that Jitter(%) has the maximum correlation with all the other variables. So we take it as our dependent variable and the rest as our independent variables.
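One way this selection could be carried out is sketched here. The report does not spell out the "simple calculation", so the criterion used (the mean absolute correlation of each variable with all the others) is our assumption. The data are synthetic: one column is built as a shared driver of the other three, so the values say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data: "Jitter(%)" is constructed as a common driver of the others.
rng = np.random.default_rng(1)
n = 2000
base = rng.normal(size=n)
df = pd.DataFrame({
    "Jitter(%)": base,
    "Shimmer": 0.8 * base + 0.6 * rng.normal(size=n),
    "NHR": 0.7 * base + 0.7 * rng.normal(size=n),
    "HNR": -0.6 * base + 0.8 * rng.normal(size=n),
})

# Pearson correlation between every pair of columns
corr = df.corr()
print(corr.round(2))

# Assumed criterion: average |correlation| with the other variables.
# Subtracting 1 removes each variable's correlation with itself.
mean_abs_corr = (corr.abs().sum(axis=1) - 1.0) / (len(corr) - 1)
target = mean_abs_corr.idxmax()
predictors = [c for c in df.columns if c != target]
print("dependent variable:", target)
print("independent variables:", predictors)
```

Other criteria, such as the largest single pairwise correlation, would be equally easy to swap in at the mean_abs_corr line.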

OLS Regression Technique

Now we apply the OLS (ordinary least squares) regression technique. As the output shows, only some of the variables contribute significantly to the regression. In particular, the following do not seem to contribute to the model as much as the other variables because their t-values are less than 2 in absolute value, that is, their estimated coefficients lie less than 2 standard errors from zero: x1, x2, x6, x8, x11, x14 and x18. In other words, for these variables we fail to reject the null hypothesis that the coefficient is zero. Let us therefore see how things change if we drop them. After refitting without these variables, we find that some of the t-values have changed and the R-squared value has decreased, but the new model seems to predict the Jitter percentage as well as the old model. Thus, dropping the variables with low t-scores did not do much damage.
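The fit, drop and refit procedure can be sketched with NumPy alone. This is an illustration under stated assumptions, not the code that produced the results above: the report does not say which OLS routine was used, the function name ols_fit is our own, and the data are synthetic with only two of four predictors actually driving the response.

```python
import numpy as np


def ols_fit(X, y):
    """Ordinary least squares with an intercept.

    Returns the coefficients (intercept first), their t-values and R-squared.
    """
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])          # add the intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)  # least-squares solution
    resid = y - A @ beta
    dof = n - A.shape[1]                          # residual degrees of freedom
    sigma2 = resid @ resid / dof                  # residual variance estimate
    cov = sigma2 * np.linalg.inv(A.T @ A)         # covariance of the estimates
    t_values = beta / np.sqrt(np.diag(cov))       # estimate / standard error
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return beta, t_values, r2


# Synthetic data: only columns 0 and 2 influence y.
rng = np.random.default_rng(42)
n = 500
X = rng.normal(size=(n, 4))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.5, size=n)

beta_full, t_full, r2_full = ols_fit(X, y)

# Keep predictors whose |t| is at least 2 (index 0 of t_full is the intercept).
keep = np.abs(t_full[1:]) >= 2.0
beta_red, t_red, r2_red = ols_fit(X[:, keep], y)

print("t-values (full model):", np.round(t_full, 2))
print("predictors kept:", np.flatnonzero(keep))
print("R-squared full / reduced:", round(r2_full, 4), round(r2_red, 4))
```

Because the reduced model is nested in the full one, its R-squared can never be higher; a small drop, as reported above, indicates that the removed variables carried little additional information.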