Sept 28

An update on my progress, with particular emphasis on linear regression, as I continue to work on my project using data from the Centers for Disease Control and Prevention (CDC) on rates of diabetes, obesity, and physical inactivity in US counties.
Summary of Linear Regression

I’ve already started down the road of linear regression in my investigation of this large dataset. It’s encouraging to report that no meaningful issues have emerged in my review of the data after performing 5-fold cross-validation and thoroughly going over it. Cross-validation’s main principle is to divide my dataset into several subsets, or “folds.” Each fold functions as a distinct test set while the remaining folds are used for training. By averaging over these iterations, I get a more reliable assessment of my model’s performance.

Sept 27

Today I learned about 5-fold cross-validation. In the morning session we applied it to our dataset of 354 data points. We first split the dataset randomly into five roughly equal-sized parts, or “folds”: four folds have 71 data points each and the last fold has 70. We then perform five iterations; in each one, a different fold is held out as the test set while the other four folds are used to train the model. The process involves training a polynomial regression model on each split and measuring the model’s average performance across the five test folds.
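The procedure from the morning session can be sketched with scikit-learn. The data here is synthetic (354 made-up points standing in for the real dataset), so the fold sizes match the class example but the scores are illustrative only:

```python
# Sketch of the 5-fold split described above, on synthetic data
# (the CDC dataset itself is not reproduced here).
import numpy as np
from sklearn.model_selection import KFold
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(354, 1))                 # 354 data points, as in class
y = 0.5 * X.ravel() ** 2 - X.ravel() + rng.normal(0, 1, 354)

kf = KFold(n_splits=5, shuffle=True, random_state=0)  # random split into 5 folds
fold_sizes, scores = [], []
for train_idx, test_idx in kf.split(X):
    # train a polynomial regression model on 4 folds, test on the held-out fold
    model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
    model.fit(X[train_idx], y[train_idx])
    scores.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))
    fold_sizes.append(len(test_idx))

print(fold_sizes)       # four test folds of 71 points and one of 70
print(np.mean(scores))  # average test MSE across the five folds
```

Averaging the five test-fold scores is what gives the “more reliable assessment” mentioned above: no single lucky or unlucky split dominates the estimate.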

In Monday’s class update I will post about the progress we made on the dataset.

Sept-25

Today I went through the “Estimating Prediction Error and Validation Set Approach” video that was emailed to us, and I now understand test error, training error, and the validation set approach. In the validation set approach, we divide the given dataset into a training set and a validation set; the model is fit on the training set, while the validation set is used to estimate the model’s prediction error (test error).
For instance: suppose we have a dataset of 1,000 emails. We split it into a training set (70%) and a validation set (30%), so the training set has 700 emails and the validation set has 300. We use an ML algorithm to train on the training set, make predictions on the validation set, and compare those predictions to the actual labels to calculate performance. Let’s assume our model achieves 92% accuracy; if we are not satisfied with that, we try other algorithms until a satisfactory accuracy is reached. The held-out set thus provides an estimate of the model’s prediction error on new, unseen emails.
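A minimal sketch of this 70/30 workflow, using made-up numeric features and labels instead of a real email corpus:

```python
# Hypothetical validation-set-approach sketch: the "emails" are random
# feature vectors and the spam/ham labels are synthetic.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))            # 1,000 "emails", 5 features each
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # invented spam/ham labels

# 70% training set, 30% validation set
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.30, random_state=1)

clf = LogisticRegression().fit(X_train, y_train)
accuracy = clf.score(X_val, y_val)        # accuracy on held-out "emails"
print(len(X_train), len(X_val))           # 700 and 300, as in the example
```

If this accuracy were unsatisfactory, the same split could be reused to compare other algorithms before reporting a final error estimate.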

Sept-22

I have learned about what cross-validation is and its types.

Cross Validation : 

Cross-validation is a technique in machine learning.
It’s used to evaluate the performance of a predictive model.
The dataset is divided into subsets (folds).
The model is trained on some folds and tested on the other folds.
This process is repeated multiple times (k-fold) to assess performance robustly.
Cross-validation helps prevent overfitting and provides a better estimate of model generalization.

Types of Cross-validation:-

Leave-One-Out Cross-Validation (LOOCV):

The dataset is partitioned into N subsets, where N is the number of data points. In each cycle, a single data point is held out for testing while the remaining N−1 points are used for training.
Each data point serves as the test set exactly once during the model’s N training and evaluation cycles.
LOOCV is helpful when you have little data or require a detailed assessment of your model’s generality, but it becomes computationally expensive on a huge dataset, since the model must be refit N times.
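A minimal LOOCV sketch on a tiny synthetic dataset (ten points, so the model is refit ten times):

```python
# LOOCV sketch: N = 10 invented points with an exactly linear relationship,
# so every held-out prediction should be (numerically) perfect.
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.linear_model import LinearRegression

X = np.arange(10).reshape(-1, 1).astype(float)   # N = 10 data points
y = 2.0 * X.ravel() + 1.0                        # y = 2x + 1, no noise

errors = []
for train_idx, test_idx in LeaveOneOut().split(X):
    # train on 9 points, test on the single held-out point
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    errors.append((pred[0] - y[test_idx][0]) ** 2)

print(len(errors))       # the model was trained and tested N = 10 times
print(np.mean(errors))   # near zero here, since the data is exactly linear
```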

k-Fold Cross-Validation:

The dataset is partitioned into k equal-sized subsets for k-fold cross-validation.
Each fold is utilized as the test set exactly once while the other k-1 folds are used for training. The model is trained and tested k times.
Let’s use an example where we divide a dataset of 12 records into three equal portions.

Data points 1, 2, 3, and 4 in Fold 1
Data points 5, 6, 7, and 8 in Fold 2
Data points 9, 10, 11, and 12 in Fold 3
1st iteration:
Training set: Folds 2 and 3. Testing set: Fold 1. The model is trained on data points 5, 6, 7, 8, 9, 10, 11, and 12, and tested on data points 1, 2, 3, and 4.

2nd iteration:
Training set: Folds 1 and 3.
Testing set: Fold 2.
Data points 1, 2, 3, 4, 9, 10, 11, and 12 are used to train the model, and data points 5, 6, 7, and 8 are used to test it.

3rd iteration:
Training set: Folds 1 and 2.
Testing set: Fold 3.
Data points 1, 2, 3, 4, 5, 6, 7, and 8 are used to train the model, and data points 9, 10, 11, and 12 are used to test it.
You compute the average accuracy across the three iterations to get an overall assessment of the model’s performance.
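The walkthrough above maps directly onto scikit-learn’s KFold; with shuffle=False the three folds come out exactly as listed:

```python
# The 3-fold split of 12 records from the example above, reproduced with
# KFold (shuffle=False keeps the folds contiguous: 1-4, 5-8, 9-12).
import numpy as np
from sklearn.model_selection import KFold

data_points = np.arange(1, 13)           # records 1..12
kf = KFold(n_splits=3, shuffle=False)

splits = [(data_points[tr].tolist(), data_points[te].tolist())
          for tr, te in kf.split(data_points)]
for i, (train_pts, test_pts) in enumerate(splits, start=1):
    print(f"Iteration {i}: train on {train_pts}, test on {test_pts}")
```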

Stratified Cross-Validation:
This makes sure that the class distribution across all data folds is consistent.
It is advantageous for unbalanced datasets where one class has a disproportionately low number of examples.
To get more representative performance estimates for classification models, stratification is helpful.
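A small sketch of stratification on an imbalanced toy dataset (a 10% minority class, invented here for illustration):

```python
# StratifiedKFold keeps the class ratio consistent across folds: with 10
# minority examples out of 100 and 5 folds, each test fold of 20 gets 2.
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.zeros((100, 1))                    # features are irrelevant here
y = np.array([1] * 10 + [0] * 90)         # imbalanced labels: 10% minority

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
minority_counts = [int(y[test_idx].sum()) for _, test_idx in skf.split(X, y)]
print(minority_counts)                    # 2 minority examples per fold
```

With a plain (unstratified) split, some folds could end up with no minority examples at all, which would make per-fold performance estimates unrepresentative.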
Time Series Cross-Validation:
Time series data is a record of observations gathered over time that lets us track how a phenomenon changes. We split the data into training and testing segments on the basis of time: we begin by training on a small initial portion of the data and predicting the next segment, then fold those observations into the training data and forecast further ahead. This process continues until all the data has been consumed.
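This expanding-window idea can be sketched with scikit-learn’s TimeSeriesSplit; the ten observations are just placeholder time indices:

```python
# TimeSeriesSplit: the training window grows forward in time, and the test
# segment is always *after* the training data, never before it.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

t = np.arange(10)                         # 10 observations ordered in time
tscv = TimeSeriesSplit(n_splits=3)

windows = [(tr.tolist(), te.tolist()) for tr, te in tscv.split(t)]
for train_idx, test_idx in windows:
    print(f"train on {train_idx} -> predict {test_idx}")
```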

20-Sept

Underfitting

Underfitting is a common problem in machine learning where a model is too simplistic to capture the underlying patterns in the data. Because the model oversimplifies relationships that are actually more complex, it suffers from high bias and delivers poor predictive performance.

In today’s class, we talked about some data related to crabs shedding their old shells. This shedding process is called molting, and it helps crabs grow. Our goal was to figure out how big a crab was before it molted based on how big it was after molting.

At first, we used a linear model, which gave us an R^2 of 0.98. We also looked at summary statistics describing the data, and they showed the data was a bit unusual. When we graphed the two size variables, their distributions looked similar, just with a small shift in the center. To check whether they were really similar, we ran a T-test, which helps us see whether there is a significant difference between the two sets of data.

The T-test told us that there is a significant difference between the sizes before and after molting. We used a method called ‘Monte-Carlo’ to estimate how big this difference is.
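Since the class crab data isn’t reproduced here, this is a hedged sketch of the same T-test-plus-Monte-Carlo workflow on simulated pre- and post-molt sizes (the means and spreads are made up):

```python
# Simulated molt data: post-molt sizes are larger than pre-molt sizes by a
# known (invented) shift of about 15 units.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
post_molt = rng.normal(loc=150.0, scale=10.0, size=200)           # after molting
pre_molt = post_molt - rng.normal(loc=15.0, scale=2.0, size=200)  # smaller before

# T-test: is there a significant difference between the two groups?
t_stat, p_value = stats.ttest_ind(post_molt, pre_molt)
print(p_value < 0.05)            # small p-value -> significant difference

# Monte-Carlo (bootstrap resampling) estimate of how big the difference is:
diffs = [rng.choice(post_molt, 200).mean() - rng.choice(pre_molt, 200).mean()
         for _ in range(1000)]
print(np.mean(diffs))            # clusters around the simulated shift of 15
```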

18-Sept (Quadratic model)

Quadratic model

If the dependent variable has a parabolic relationship with the independent variable, then a quadratic model is used.

Quadratic Model: In mathematics, a quadratic model or equation is a second-degree polynomial equation, typically written in the form of:

y = ax² + bx + c

In this equation:

  • y represents the dependent variable.
  • x represents the independent variable.
  • a, b, and c are constants, with a not equal to zero.

Quadratic models often describe parabolic relationships, where the data or phenomenon being studied follows a U-shaped or inverted U-shaped curve.
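A quadratic model can be fit in one line with numpy.polyfit; the data below is an invented, noise-free parabola, so the three coefficients come back essentially exactly:

```python
# Fitting y = ax^2 + bx + c by least squares; the true coefficients
# (a=2, b=-3, c=1) are made up for this example.
import numpy as np

x = np.linspace(-5, 5, 50)
y = 2.0 * x**2 - 3.0 * x + 1.0       # a U-shaped (parabolic) relationship

a, b, c = np.polyfit(x, y, deg=2)    # recover the three coefficients
print(a, b, c)                       # close to 2, -3, 1
```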

Overfitting

When using a quadratic model or any type of machine learning or statistical model, there’s a risk of overfitting, which is a common issue that can affect the model’s performance. Overfitting occurs when a model learns the training data too well, capturing not only the underlying patterns but also noise and random fluctuations in the data. This can lead to poor generalization to new, unseen data.
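A small sketch of overfitting on synthetic data: the underlying relationship is a straight line plus noise, and a high-degree polynomial’s extra flexibility chases that noise:

```python
# Overfitting demo on invented data: compare a degree-1 and a degree-8
# polynomial on the same noisy linear relationship.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
X_train = rng.uniform(0, 1, (20, 1))
y_train = 3 * X_train.ravel() + rng.normal(0, 0.3, 20)
X_test = rng.uniform(0, 1, (200, 1))
y_test = 3 * X_test.ravel() + rng.normal(0, 0.3, 200)

def fit_mse(degree):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    m = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    m.fit(X_train, y_train)
    return (mean_squared_error(y_train, m.predict(X_train)),
            mean_squared_error(y_test, m.predict(X_test)))

train_lo, test_lo = fit_mse(1)
train_hi, test_hi = fit_mse(8)
print(f"degree 1: train MSE {train_lo:.3f}, test MSE {test_lo:.3f}")
print(f"degree 8: train MSE {train_hi:.3f}, test MSE {test_hi:.3f}")
# Typically the degree-8 fit shows a lower training error but a worse
# test error than the straight line: it has started memorizing the noise.
```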

In my next blog, I’ll post about underfitting and cross-validation.

Kurtosis

Kurtosis is a statistical measure that describes the shape of a data distribution. It indicates whether the data contain more or fewer extreme values (heavier or lighter tails) than a normal distribution.

There are typically three common types of kurtosis:

  1. Mesokurtic (Normal Kurtosis): If the kurtosis is near zero, the distribution has tails similar to a normal distribution; it is neither too peaked nor too flat.
  2. Leptokurtic (Positive Kurtosis): When the kurtosis is greater than zero, the distribution has heavier tails and a more pronounced peak compared to a normal distribution.
  3. Platykurtic (Negative Kurtosis): If the kurtosis is less than zero, the distribution has lighter tails and is flatter compared to a normal distribution; the data have fewer extreme values or outliers.

Kurtosis is calculated using the fourth central moment of the data distribution and is typically defined as:

Kurtosis = (1/n) Σ ((xᵢ − x̄) / s)⁴ − 3

Where:

  • n is the number of data points.
  • xᵢ represents each data point.
  • x̄ is the mean (average) of the data.
  • s is the standard deviation of the data.

(Subtracting 3 gives the “excess” kurtosis, so a normal distribution scores zero, matching the three cases above.)
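For a quick check, scipy.stats.kurtosis computes excess kurtosis directly (zero for a normal distribution); the three simulated samples below land in the three regimes:

```python
# Three invented samples illustrating mesokurtic, leptokurtic, and
# platykurtic distributions via scipy's (excess) kurtosis.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal = rng.normal(size=100_000)            # mesokurtic: kurtosis near 0
heavy = rng.standard_t(df=5, size=100_000)   # leptokurtic: heavy tails
flat = rng.uniform(-1, 1, size=100_000)      # platykurtic: light tails

print(abs(stats.kurtosis(normal)) < 0.5)     # near zero -> mesokurtic
print(stats.kurtosis(heavy) > 0)             # positive  -> leptokurtic
print(stats.kurtosis(flat) < 0)              # negative  -> platykurtic
```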

Heteroscedasticity

Heteroscedasticity is a term used in regression analysis and statistics to describe a situation where the variability of the errors (residuals) in a regression model is not constant across all levels of the independent variables.

The Breusch-Pagan test is a statistical test used in regression analysis to check for the presence of heteroscedasticity in a regression model.
Specifically, you estimate an auxiliary regression where the dependent variable is the squared residuals from your original regression and the independent variables are the variables you suspect might be related to heteroscedasticity. You then perform a hypothesis test on the coefficients of this auxiliary regression: the null hypothesis states that there is no heteroscedasticity (i.e., the coefficients are all zero), and the alternative hypothesis is that heteroscedasticity is present. If the coefficients are statistically significant, this suggests heteroscedasticity.

 

Calculate the p-value: The test produces a p-value associated with the null hypothesis. If the p-value is smaller than a predetermined significance level (e.g., 0.05), you would reject the null hypothesis and conclude that heteroscedasticity is present.

If the Breusch-Pagan test indicates the presence of heteroscedasticity, you may need to address it through model adjustments, variable transformations, or robust regression techniques.

9/11 Linear regression

Linear regression is a statistical technique for finding the relationship between a dependent variable (Y) and an independent variable (X).

It aims to find the best-fitting line.

The general form of a simple linear regression equation:
Y=b0+b1X+ε

Where:

  • Y is the dependent variable (the one we want to predict).
  • X is the independent variable (the one we use to make predictions).
  • b0 is the intercept, which represents the predicted value of Y when X is zero.
  • b1 is the slope, which represents the change in Y for a one-unit change in X.
  • ε represents the error term.
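A minimal sketch of fitting this equation on synthetic data (the true values b0 = 4.0 and b1 = 2.5 are invented for the example):

```python
# Fitting Y = b0 + b1*X + e by least squares on made-up data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (100, 1))
y = 4.0 + 2.5 * X.ravel() + rng.normal(0, 0.5, 100)   # b0=4, b1=2.5 plus noise

model = LinearRegression().fit(X, y)                  # best-fitting line
print(model.intercept_, model.coef_[0])               # close to 4.0 and 2.5
```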

Observation from the Dataset

  • The dataset provides information on diabetes, obesity, and inactivity for US counties for the year 2018.
  • Diabetes has 3142 samples.
  • Obesity has 363 samples.
  • Inactivity has 1370 samples.