To limit the spread of the coronavirus, social distancing and hygiene measures such as the compulsory wearing of masks, the use of hand gloves, face shields, and sanitizer are very important.

Many organizations are making social distancing and the wearing of face masks compulsory. This article explains how to monitor face-mask use with OpenCV and Python.

I am assuming that you have basic knowledge of OpenCV and Python.

**There are two main steps:**

1. Identify the human face and mouth in each frame of the input video.
2. Identify whether the person is wearing a mask or not.

**Step 1: Identify Face and Mouth**

The steps for face prediction using Python and OpenCV are:

1. Create a HAAR Cascade object using the **CascadeClassifier** function and haarcascade_frontalface_default.xml.
2. Read the image using the **imread** function (or **read** for video/camera input).
3. Convert it to grayscale using the **cvtColor** function.
4. Detect the face using the **detectMultiScale** function.

For details on how OpenCV detects faces, refer to the Face Recognition with OpenCV section of the OpenCV 2.4.13.7 documentation.

It has been observed that, for a person wearing a white mask, OpenCV most of the time cannot identify the face correctly. To overcome this difficulty, convert the image to black and white using the **threshold** function and then send this image to the **detectMultiScale** function.

Note: It is important to adjust the threshold value (bw_threshold) in the range 80 to 105 based on the camera and surrounding light.

**The code that detects the face and mouth of a person in an image is available in the GitHub repository linked at the end of this post.**

**Step 2: Identify Whether the Person Is Wearing a Mask**

As shown in the above code, there are three rectangle objects:

- **Gray** image face rectangle
- **Black & White** image face rectangle
- **Gray** image mouth rectangle

After the **Detect Mouth** code, add logic to validate mask/no-mask.

**=> When the person is wearing a mask, it will show "Mask detected".**

**=> When the person is not wearing a mask, it will show "Mask not detected".**

Based on the number of rectangles and the positions of the mouth and face rectangles, we can create a rule to detect a mask. The following truth table gives the correct with-mask/without-mask condition for each combination.
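The truth-table rule described above can be sketched as a small pure-Python helper. This is a minimal sketch, not the post's actual code: the function name, flags, and message strings are hypothetical.

```python
# Hypothetical sketch of the truth-table rule described above.
# Inputs: whether a face was found in the gray frame, whether a face was
# found in the black-and-white (thresholded) frame, and whether a mouth
# rectangle was found inside the face region.

def mask_status(face_in_gray: bool, face_in_bw: bool, mouth_found: bool) -> str:
    """Classify a frame as wearing a mask or not."""
    if not face_in_gray and not face_in_bw:
        # No face in either image: nothing to classify.
        return "No face found"
    if mouth_found:
        # A visible mouth rectangle means the mask is missing or worn incorrectly.
        return "Mask not detected"
    return "Mask detected"

print(mask_status(True, True, False))   # face visible, mouth covered
print(mask_status(True, True, True))    # mouth visible, so no mask
```

In the real pipeline these flags would come from the lengths of the rectangle lists returned by **detectMultiScale**.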

You can download complete code and HAAR Cascade XML Files from GitHub.

That is all, I hope you liked the post. Feel Free to follow me.

NumPy is a Python library used for working with arrays. It also has functions for working in the domains of linear algebra, Fourier transforms, and matrices. It is an open-source project and you can use it freely. NumPy stands for Numerical Python.

The most important object defined in NumPy is an N-dimensional array type called **ndarray.** It describes a collection of items of the same type. Items in the collection can be accessed using a zero-based index. Every item in an ndarray takes the same size of block in memory.

Each element in an ndarray is described by a data-type object (called **dtype**). Any item extracted from an ndarray object (by slicing) is represented by a Python object of one of the array scalar types.

The following diagram shows the relationship between ndarray, the data type object (dtype), and the array scalar type.
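That relationship can also be seen directly in code (a quick illustration, assuming NumPy is installed):

```python
import numpy as np

a = np.array([1.5, 2.5, 3.5])   # an ndarray holding items of one dtype
print(type(a))                  # <class 'numpy.ndarray'>
print(a.dtype)                  # float64 -- the data type object (dtype)
print(type(a[0]))               # <class 'numpy.float64'> -- an array scalar type
```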

The **numpy.array** constructor creates an ndarray from any object exposing the array interface, or from any method that returns an array.

```
numpy.array(object, dtype = None, copy = True, order = None, subok = False, ndmin = 0)
```

The above constructor takes the following parameters:

**- object :-** Any object exposing the array interface method, or any (nested) sequence.

**- dtype :-** Desired data type of the array, optional.

**- copy :-** Optional. By default (True), the object is copied.

**- order :-** C (row major) or F (column major) or A (any) (default).

**- subok :-** By default, the returned array is forced to be a base-class array. If True, subclasses are passed through.

**- ndmin :-** Specifies the minimum dimensions of the resultant array.
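The effect of a couple of these parameters can be seen in a short example:

```python
import numpy as np

# dtype forces float items; ndmin=2 promotes the result to two dimensions
a = np.array([1, 2, 3], dtype=float, ndmin=2)
print(a)          # [[1. 2. 3.]]
print(a.shape)    # (1, 3)
print(a.dtype)    # float64
```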

In this blog, we'll walk through using NumPy to analyze data on wine quality. The data contains information on various attributes of wines, such as pH and fixed acidity, along with a quality score between 0 and 10 for each wine. The quality score is the average of at least 3 human taste testers. As we learn how to work with NumPy, we'll try to figure out more about the perceived quality of wine.

The data comes from the winequality-red.csv file, which we'll be using throughout this tutorial.

**Lists of Lists for CSV Data**

Before using NumPy, we'll first try to work with the data using Python and the csv package. We can read in the file using the csv.reader object, which will allow us to read in and split up all the content from the csv file.

In the below code, we:

1. Import the **csv** library.
2. Open the winequality-red.csv file.
3. With the file open, create a new csv.reader object.
4. Pass in the keyword argument **delimiter=";"** to make sure that the records are split up on the semicolon character instead of the default comma character.
5. Call the list type to get all the rows from the file.
6. Assign the result to **wines.**

```
import csv

with open(r'winequality-red.csv') as f:
    wines = list(csv.reader(f, delimiter=';'))

# Once we've read in the data, we can print out the first 3 rows:
print(wines[:3])
```

*We can check the number of rows and columns in our data using the shape property of NumPy arrays:*

```
import numpy as np

wines = np.genfromtxt(r'winequality-red.csv', delimiter=";", skip_header=1)
print(wines)
print("\n")
# shape property
x = wines.shape
print(x)
```

Let's select the element at **row 3 and column 4**. In the below code, we pass in the index **2 as the row index** and the index **3 as the column index.** This retrieves the value from the fourth column of the third row:

```
import numpy as np

wines = np.genfromtxt(r'winequality-red.csv', delimiter=";", skip_header=1)
print(wines)
print("\n")
# indexing the NumPy array: third row, fourth column
y = wines[2, 3]
print(y)
```

So far, we've worked with 2-dimensional arrays, such as wines. However, NumPy is a package for working with multidimensional arrays. One of the most common types of multidimensional arrays is the 1-dimensional array, or vector.

Just like a list of lists is analogous to a 2-dimensional array, a single list is analogous to a 1-dimensional array. If we slice wines and only retrieve the third row, we get a 1-dimensional array:

We can retrieve individual elements from **third_wine** using a single index. The below code will display the second item in **third_wine.**

Most NumPy functions that we've worked with, such as *numpy.random.rand*, can be used with multidimensional arrays. Here's how we'd use *numpy.random.rand* to generate a random vector:

```
import numpy as np

wines = np.genfromtxt(r'winequality-red.csv', delimiter=";", skip_header=1)
print(wines)
print("\n")
# slicing wines to retrieve only the third row (zero-based index 2)
third_wine = wines[2, :]
print('1. retrieve third row', third_wine)
print("\n")
# second item in third_wine
x = third_wine[1]
print('2. second item in third row = ', x)
print("\n")
# use numpy.random.rand to generate a random vector
y = np.random.rand(3)
print('3. random vector generated is = ', y)
```

**After successfully reading our dataset and learning about lists, indexing, and 1-dimensional arrays in NumPy, we can start performing operations on it.**

The first element of each row is the **fixed acidity,** the second is the **volatile acidity,** and so on. We can find the average **quality** of the wines. The below code will:

- Extract the last element from each row after the header row.
- Convert each extracted element to a float.
- Assign all the extracted elements to the list **qualities.**
- Divide the sum of all the elements in **qualities** by the total number of elements in **qualities** to get the mean.

```
import csv

with open(r'winequality-red.csv') as f:
    wines = list(csv.reader(f, delimiter=';'))
print(wines[:3])
print("\n")
qualities = [float(item[-1]) for item in wines[1:]]
avg = sum(qualities) / len(qualities)
print('average = ', avg)
```

In addition to the common mathematical operations, NumPy also has several methods that you can use for more complex calculations on arrays. An example of this is the numpy.ndarray.sum method. This finds the sum of all the elements in an array by default:

**=> 2. Sum of alcohol content in all sample red wines**

```
import numpy as np

wines = np.genfromtxt(r'winequality-red.csv', delimiter=";", skip_header=1)
print(wines)
# task 2: sum of alcohol content in all sample red wines
total_alcohol = wines[:, 10].sum()   # alcohol is the 11th column (index 10)
print('sum = ', total_alcohol)
```

We get a Boolean array that tells us which of the wines have a quality rating greater than **5.** We can do something similar with the other operators. For instance, we can see if any wines have a quality rating equal to **10:**
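Because the comparison operators work element-wise on any NumPy array, the idea can be shown on a small stand-in array (the values below are made up for illustration; the real tutorial applies this to the last column of wines):

```python
import numpy as np

quality = np.array([5.0, 6.0, 10.0, 4.0])   # stand-in quality scores
high_quality = quality > 5                  # element-wise comparison -> Boolean array
print(high_quality)                         # [False  True  True False]
print(quality == 10)                        # [False False  True False]
print(quality[high_quality])                # Boolean indexing keeps only the True rows
```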

**=> 3. Select wines having pH > 5**

```
import numpy as np

wines = np.genfromtxt(r'winequality-red.csv', delimiter=";", skip_header=1)
print(wines)
# task 3: select wines having pH > 5
ph_levels = wines[:, 8] > 5   # pH is the 9th column (index 8)
print(ph_levels)
print("\n")
```

We select only the rows where **high_Quality** contains a **True** value, and all of the columns. This subsetting makes it simple to filter arrays for certain criteria. For example, we can look for wines with a lot of alcohol and high quality. In order to specify multiple conditions, we have to place each condition in parentheses, and separate conditions with an ampersand (&):

**=> 4. Select only wines where sulphates > 10 and alcohol > 7**

```
import numpy as np

wines = np.genfromtxt(r'winequality-red.csv', delimiter=";", skip_header=1)
print(wines)
# task 4: select only wines where sulphates > 10 and alcohol > 7
sulalc = (wines[:, 9] > 10) & (wines[:, 10] > 7)   # sulphates: index 9, alcohol: index 10
print(wines[sulalc, 9:])
print("\n")
```

**=> 5. Select wines having pH greater than the mean pH**

```
import numpy as np

wines = np.genfromtxt(r'winequality-red.csv', delimiter=";", skip_header=1)
print(wines)
# task 5: select wines having pH greater than the mean pH
meanPH = wines[:, 8] > wines[:, 8].mean()   # pH is index 8
print(wines[meanPH, 8:])
```

** We have seen what NumPy is, and some of its most basic uses.** In the following posts we will see more complex functionalities and dig deeper into the workings of this fantastic library!

That is all, I hope you liked the post. Feel Free to follow me.

*Pandas is an open-source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.*

The data structures provided by Pandas are of two distinct types:

**1. Pandas DataFrames**
**2. Pandas Series**

We'll look at **Pandas DataFrames** in this post.

A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns. A Pandas DataFrame consists of three principal components: the **data, rows, and columns.**

**Features of DataFrame**

- Potentially columns are of different types.
- Size Mutable.
- Labeled axes (rows and columns).
- Can Perform Arithmetic operations on rows and columns.

**Difference between Series and DataFrame**

You can think of it as an SQL table or a spreadsheet data representation.

A Pandas DataFrame can be created using the following constructor:

```
pandas.DataFrame( data, index, columns, dtype, copy)
```

The parameters of the constructor are as follows:

**data :** data takes various forms like ndarray, series, map, lists, dict, constants, and also another DataFrame.

**index :** For the row labels. The index to be used for the resulting frame is optional; the default is np.arange(n) if no index is passed.

**columns :** For column labels, the optional default is np.arange(n). This is only true if no index is passed.

**dtype :** Data type of each column.

**copy :** Whether to copy data from the inputs. Default False.

```
import pandas as pd

# create an empty DataFrame
df = pd.DataFrame()
print(df)
```
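Passing one of the supported data forms, for example a dict of lists, produces a populated frame. This is a minimal sketch; the column names and values here are made up for illustration, not taken from the tutorial's dataset:

```python
import pandas as pd

# a dict of lists: keys become column labels
data = {'Country': ['Afghanistan', 'Albania'],
        'Year': [2015, 2015]}

# an explicit index supplies the row labels
df = pd.DataFrame(data, index=['row1', 'row2'])
print(df)
print(df.dtypes)   # one dtype per column
```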

The first thing we should do, once we have downloaded or collected some data, is to **read such data into a Pandas DataFrame.** This is one of the main Pandas objects, along with the Series, and as I mentioned before, it resembles a table with columns and rows. As always, we should first import the library.

**Pandas also has functions for reading from Excel sheets, HTML documents, or SQL databases** (although there are other tools that are better for reading from databases).

```
import pandas as pd
df = pd.read_csv("Life Expectancy Data.csv")
print(df)
df.head()
```

We can check out the first n rows of our dataframe using the **head** method. There is also a **tail** method to look at the last n. By default, if no n is given, these methods return the first 5 or last 5 rows.


Using the head method without a parameter returns the first five rows of the dataframe.

After successfully reading our data and creating our dataframe, we can start getting some information out of it with two simple methods:

**1. info :** the info method returns the number of rows in the dataframe, the number of columns, the name of each column along with its number of non-null values, and the data type of each column.

```
import pandas as pd
df = pd.read_csv("Life Expectancy Data.csv")
print(df)
df.info()
```

**2. describe :** the describe method returns some useful statistics about the numeric data in the dataframe, like the mean, standard deviation, maximum and minimum values, and some percentiles.

```
import pandas as pd
df = pd.read_csv("Life Expectancy Data.csv")
print(df)
df.describe()
```

The next step after getting this global view of our data is **learning how to access specific records of our dataframe.** Like a Python list, Pandas dataframes can be sliced, using exactly the same notation as for lists.

So if we want to select the first 10 rows of our dataframe, we could do something like:
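For example (shown here on a small stand-in frame, since the original snippet was not preserved; with the real data you would slice `df` the same way):

```python
import pandas as pd

# stand-in frame with 16 rows; the real tutorial uses the life-expectancy data
df = pd.DataFrame({'Year': range(2000, 2016)})

first_ten = df[:10]     # list-style slicing selects the first 10 rows
print(first_ten)
print(len(first_ten))   # 10
```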

With **loc and iloc** you can do practically any data selection operation on DataFrames you can think of:

**loc** is label-based, which means that you have to specify rows and columns based on their row and column labels.

**iloc** is integer-index based, so you have to specify rows and columns by their integer index.

After understanding the theory behind **loc and iloc,** let's get started implementing them.
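A minimal sketch of the difference, on a tiny made-up frame (the row labels and values are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({'Country': ['Afghanistan', 'Albania', 'Algeria'],
                   'Year': [2015, 2015, 2015]},
                  index=['af', 'al', 'dz'])

print(df.loc['al', 'Country'])   # label-based lookup -> Albania
print(df.iloc[1, 0])             # integer-based lookup of the same cell -> Albania
```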

**1. Average Life Expectancy** over 15 Years in **Afghanistan.**

```
import pandas as pd

df = pd.read_csv("Life Expectancy.csv")
print(df)
df = df.loc[df['Country'] == 'Afghanistan']
country = df['Country'].iloc[0]
life_expectancy = df['Life expectancy '].mean()  # column over which the average is taken
final_answer = [country, life_expectancy]
print(final_answer)
```

**2. Highest Life Expectancy** in **Developed Country** over 15 years

```
import pandas as pd
df = pd.read_csv("Life Expectancy.csv")
print(df)
ck = df.loc[df['Status'] == 'Developed']
finalAnswer = ck.loc[ck['Life expectancy '].max() == ck['Life expectancy ']]
print(finalAnswer)
```

**3. Maximum polio** in the **countries** over **15 years.**

```
import pandas as pd

df = pd.read_csv(r'C:\Users\USer\Desktop\program\python\EXP5\Life Expectancy Data.csv',
                 usecols=['Country', 'Year', 'Status', 'Polio'])
print(df)
final_answer = df[df.Polio == df.Polio.max()]
print(final_answer)
```

**4. Maximum percentage expenditure** of the **Developing country** over **15years.**

```
import pandas as pd
import numpy as np
df = pd.read_csv(r'C:\Users\USer\Desktop\program\python\EXP5\Life Expectancy Data.csv',usecols = ['Country','Year','Status','percentage expenditure'])
print(df)
ck = df.loc[df['Status'] == 'Developing']
finalAnswer = ck.loc[ck['percentage expenditure'].max() == ck['percentage expenditure']]
print(finalAnswer)
```

**5. Lowest Adult Mortality** in a particular country over **15 years.**

```
import pandas as pd
import numpy as np
df = pd.read_csv(r'C:\Users\USer\Desktop\program\python\EXP5\Life Expectancy Data.csv',usecols = ['Country','Year','Status','Adult Mortality'])
print(df)
finalAnswer= df.loc[df['Adult Mortality'].min() == df['Adult Mortality']]
print(finalAnswer)
```

Any **groupby** operation involves one of the following operations on the original object:

- Splitting the object
- Applying a function
- Combining the results
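The three phases can be seen on a toy frame (the column values below are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({'Status': ['Developed', 'Developing', 'Developing'],
                   'Polio': [99.0, 60.0, 70.0]})

# split on Status, apply mean to each group, combine into one result frame
result = df.groupby('Status').agg({'Polio': 'mean'})
print(result)
```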

**6.** Showing, for every country, the record count, the sum of polio, the average total expenditure, and the standard deviation of life expectancy with the **groupby** method.

```
import pandas as pd
df = pd.read_csv("Life Expectancy.csv")
print(df)
ans=df.groupby('Country').agg({'percentage expenditure':'size','Polio':'sum','Total expenditure':'mean','Life expectancy ':'std'})
print(ans)
```

**We have seen what Pandas is, and some of its most basic uses.** In the following posts we will see more complex functionalities and dig deeper into the workings of this fantastic library!

That is all, I hope you liked the post. Feel Free to follow me.

**Correlation** is a statistical measure that indicates the extent to which two or more variables fluctuate together. **Positive correlation** indicates the extent to which those variables increase or decrease in parallel; **negative correlation** indicates the extent to which one variable increases as the other decreases.

- A correlation coefficient is a way to put a value to the relationship.
- The correlation coefficient has a value between **-1 and 1.**
- A value of **0** means there is no relationship between the variables at all.
- A value of **-1 or 1** means that there is a perfect negative or positive correlation.

*It is a Statistical method To determine whether there is a relationship between two variables.*

The response variable and the explanatory variable are continuous variables (i.e. real numbers with decimal places: things like heights, weights, volumes, or temperatures).

To predict the value of an outcome variable Y based on one or more input predictor variables X.

*To establish a linear relationship (a mathematical formula) between the predictor variable(s) and the response variable, so that we can use this formula to estimate the value of the response Y when only the predictors (Xs) values are known.*

The model takes the form Y = β1 + β2·X + ε, where β1 is the intercept and β2 is the slope.

Collectively, they are called regression coefficients.

ε is the error term, the part of Y the regression model is unable to explain.

The assumptions for linear regression are:

**Linearity:** the relationship between X and Y is linear;

**Homoscedasticity:** the variance of residuals is the same for any given value of X;

**Independence:** all observations are independent of each other;

**Normality:** Y is normally distributed for any given value of X.

Focusing only on blood characteristics (leaving out body size), the goal is building a simple regression model that can be used to predict **hematocrit (hc)** by establishing a statistically significant linear relationship with **hemaglobin (hg).**

For the current analysis, the AIS dataset from the CRAN package DAAG has been used: it's a data frame with **202 observations and 13 variables.** It represents a study on a group of Australian athletes, used here to predict **hematocrit (hc)** by establishing a statistically significant linear relationship with **hemaglobin (hg).**

Below, I'm writing the code for displaying the dataset using R libraries.

To predict hematocrit (hc) by establishing a statistically significant linear relationship with hemaglobin (hg).

**1. Scatter plot :** Visualize the linear relationship between the predictor and the response.

```
qplot(HEMAGLOBIN, HEMATOCRIT, data = newdata,
main = "HEMAGLOBIN and HEMATOCRIT relationship") +
theme(plot.title = element_text(hjust = 0.5)) +
geom_point(colour = "blue", size = 1.5) +
scale_y_continuous(breaks = c(30:65), minor_breaks = NULL) +
scale_x_continuous(breaks = c(10:25), minor_breaks = NULL)
```

The plot shows there is a strong relationship between the hemaglobin level and the hematocrit percentage.

**2. Box Plot:** To spot any outlier observations in the variable. Having outliers in your predictor can drastically affect the predictions as they can easily affect the direction/slope of the line of best fit.

**Box plot (check for outliers):** Generally, any data point that lies outside 1.5 times the interquartile range (1.5 × IQR) is considered an outlier, where IQR is calculated as the distance between the 25th percentile and 75th percentile values for that variable.

```
par(mfrow=c(1, 2)) # it divides graph area in two parts
boxplot(newdata$HEMAGLOBIN, col = "yellow", border="blue",
main = "HEMAGLOBIN boxplot",
ylab = "g per decaliter")
boxplot(newdata$HEMATOCRIT, col = "orange", border="blue",
main = "HEMATROCRIT boxplot",
ylab = "percent values")
```

**3. Histogram : ** It displays the probability distribution of the numerical data

```
# Histogram of HEMAGLOBIN
qplot(HEMAGLOBIN, data = newdata, geom="histogram", binwidth=0.5,
fill=I("azure4"), col=I("azure3")) +
labs(title = "HEMAGLOBIN") +
theme(plot.title = element_text(hjust = 0.5)) +
labs(x ="Concentration (in g per decaliter)") +
labs(y = "Frequency") +
scale_y_continuous(breaks = c(0,5,10,15,20,25,30,35,40,45,50), minor_breaks = NULL) +
scale_x_continuous(breaks = c(10:25), minor_breaks = NULL) +
geom_vline(xintercept = mean(newdata$HEMAGLOBIN), show_guide=TRUE, color
="red", labels="Average") +
geom_vline(xintercept = median(newdata$HEMAGLOBIN), show_guide=TRUE, color
="blue", labels="Median")
```

```
# Histogram of HEMATOCRIT
qplot(HEMATOCRIT, data = newdata, geom="histogram", binwidth=1,
fill=I("azure4"), col=I("azure3")) +
labs(title = "HEMATOCRIT") +
theme(plot.title = element_text(hjust = 0.5)) +
labs(x ="percent values") +
labs(y = "Frequency") +
scale_y_continuous(breaks = c(0,5,10,15,20,25), minor_breaks = NULL) +
scale_x_continuous(breaks = c(30:65), minor_breaks = NULL) +
geom_vline(xintercept = mean(newdata$HEMATOCRIT), show_guide=TRUE, color
="red", labels="Average") +
geom_vline(xintercept = median(newdata$HEMATOCRIT), show_guide=TRUE, color
="blue", labels="Median")
```

** 4. Density Plot :** To see the distribution of the predictor variable. Ideally, a close to normal distribution *(a bell shaped curve)*, without being skewed to the left or right is preferred. Let us see how to make each one of them.

```
par(mfrow=c(1, 2)) # it divides graph area in two parts
plot(density(newdata$HEMAGLOBIN), main="Density: HEMAGLOBIN", ylab="Frequency",
sub=paste("Skewness:", round(e1071::skewness(newdata$HEMAGLOBIN), 2)))
polygon(density(newdata$HEMAGLOBIN), col="yellow")
plot(density(newdata$HEMATOCRIT), main="Density: HEMATOCRIT", ylab="Frequency",
sub=paste("Skewness:", round(e1071::skewness(newdata$HEMATOCRIT), 2)))
polygon(density(newdata$HEMATOCRIT), col="orange")
```

The **objective** now is building a linear model and seeing how well this model fits the observed data. In simplistic form, the equation to solve is: Hematocrit = β0 + β1·Hemaglobin.

The function used for building linear models is lm().

The lm() function takes in two main arguments, namely:

- Formula
- Data.

The data is typically a data.frame and the formula is an object of class formula. The most common convention is to write out the formula directly in place of the argument.

**Hematocrit = β0 + β1·Hemaglobin**

So, the intercept is the expected hematocrit value when the hemaglobin level is zero, and the slope is the increase in hematocrit associated with a one-unit increase in the hemaglobin level.

```
# Show the relationship creating a regression line
qplot(HEMAGLOBIN, HEMATOCRIT, data = newdata,
main = "HEMAGLOBIN and HEMATOCRIT relationship") +
theme(plot.title = element_text(hjust = 0.5)) +
stat_smooth(method="lm", col="red", size=1) +
geom_point(colour = "blue", size = 1.5) +
scale_y_continuous(breaks = c(30:65), minor_breaks = NULL) +
scale_x_continuous(breaks = c(10:25), minor_breaks = NULL)
```

** Note:** Ideally, the regression line should be as close as possible to all data points observed. Smoothing is set to a confidence level of **0.95** (by default).

An additional and interesting possibility is to create a new variable named HEMAGLOBIN_CENT , that **Centers** the value of the variable HEMAGLOBIN on its mean: this is useful to give a meaningful interpretation of its intercept estimate (the average HEMAGLOBIN level is centered on value 0.0 on X-axis).

```
set.seed(123) # setting seed to reproduce results of random sampling
HEMAGLOBIN_CENT = scale(newdata$HEMAGLOBIN, center=TRUE, scale=FALSE) # center the variable
# Show the relationship with new variable centered, creating a regression line
qplot(HEMAGLOBIN_CENT, HEMATOCRIT, data = newdata,
main = "HEMAGLOBIN_CENT and HEMATOCRIT relationship") +
theme(plot.title = element_text(hjust = 0.5)) +
stat_smooth(method="lm", col="red", size=1) +
geom_point(colour = "blue", size = 1.5) +
scale_y_continuous(breaks = c(30:65), minor_breaks = NULL) +
scale_x_continuous(breaks = c(-2,-1.5,-1,-0.5,0,0.5,1,1.5,2,2.5,3,3.5,4), minor_breaks = NULL)
```

Summary statistics are very useful to interpret the key components of the linear model output.

```
mod1 = lm(HEMATOCRIT ~ HEMAGLOBIN_CENT, data = newdata)
summary(mod1)
```

Look at the model p-value (bottom last line of the summary) and the p-values of the individual predictor variables (extreme right column under Coefficients).

A linear model is considered statistically significant only when both of these p-values are less than the pre-determined statistical significance level, which is ideally **0.05**.

This is visually interpreted by the significance stars at the end of each row. The more stars beside a variable's p-value, the more significant the variable.

In linear regression, the null hypothesis is that the coefficient associated with a variable is equal to zero. The alternative hypothesis is that the coefficient is not equal to zero (i.e. there exists a relationship between the independent variable in question and the dependent variable).

A larger t-value indicates that it is less likely that the coefficient differs from zero purely by chance. So, the higher the t-value, the better. The t-statistic is used in a t-test in order to decide whether to support or reject the null hypothesis.

**Note:** the t-statistic is the estimated value of the parameter (coefficient/slope) divided by its standard error.

This statistic is therefore a measure of the likelihood that the actual value of the parameter is not zero.

When the p-value is less than the significance level **(< 0.05)**, we can safely reject the null hypothesis that the coefficient of the predictor is zero.

In our case (mod1), both of these p-values are well below the **0.05** threshold, so we can conclude our model is indeed statistically significant.

```
modSummary <- summary(mod1) # capture model summary as an object
modCoeff <- modSummary$coefficients # model coefficients
beta.estimate <- modCoeff["HEMAGLOBIN_CENT", "Estimate"] # get beta coefficient estimate
std.error <- modCoeff["HEMAGLOBIN_CENT", "Std. Error"] # get standard error
t_value <- beta.estimate/std.error # calculate t statistic
print(t_value) # print t-value
```

The actual information in the data is the total variation it contains. R-squared tells us the proportion of variation in the dependent (response) variable that is explained by the model. Note: We don't necessarily discard a model based on a low R-squared value. It's better practice to look at the AIC and prediction accuracy on a validation sample when deciding on the efficacy of a model.

When variables are added, the R-squared value of the new, bigger model will always be greater than that of the smaller subset. This is because, since all the variables in the original model are also present, their contribution to explaining the dependent variable will be present in the super-set as well; therefore, whatever new variable we add can only add (if not significantly) to the variation already explained.

Note: R-squared value tends to increase as more variables are included in the model. So, adjusted R-squared is the preferred measure as it adjusts for the number of variables considered.

Basically, the F-test compares the model with the zero-predictor (intercept-only) model and suggests whether the added coefficients improve the model. If a significant result is obtained, then the coefficients included in the model improve the model's fit.
So, the F-statistic defines the collective effect of all predictor variables on the response variable. In this model, **F = 102.3 is far greater than 1.**

```
f_statistic <- mod1$fstatistic[1] # calculate F statistic
f <- summary(mod1)$fstatistic # parameters for model p-value calculation
print(f) # print F value
```

The model summary as well as the diagnostic plots have given important information that allows us to improve the model fit. Together with mod1, it is possible to explore mod2, which omits the noticed outliers.

**Note:** *of course, different models could be considered, e.g. including a quadratic term or adding one or more variables (or considering a new transformation of variables), but that's out of the scope of the current document (it becomes a multiple regression problem).*

Blue points represent the three outliers identified.

```
newdata2 <- subset(newdata1, OBS != 159 & OBS != 166 & OBS != 169,
select=c(HEMAGLOBIN, HEMATOCRIT))
HEMAGLOBIN_CENT = scale(newdata2$HEMAGLOBIN, center=TRUE, scale=FALSE) # center the variable
```

A new model is so given, and shows the following results:

```
mod2 = lm(HEMATOCRIT ~ HEMAGLOBIN_CENT, data = newdata2)
summary(mod2)
```

Diagnostic plots are summarized in the graph below:

```
par(mfrow = c(2,2)) # display a unique layout for all graphs
plot(mod2)
```

The Akaike information criterion, AIC (Akaike, 1974), and the Bayesian information criterion, BIC (Schwarz, 1978), are measures of the goodness of fit of an estimated statistical model and can also be used for model selection. Both criteria depend on the maximized value of the likelihood function L for the estimated model.

Note: For model comparison, the model with the lowest AIC and BIC score is preferred.

```
AIC(linearMod)  # AIC => 419.1569
BIC(linearMod)  # BIC => 424.8929
```

So far we have seen how to build a linear regression model using the whole dataset. If we build it that way, there is no way to tell how the model will perform with new data. The preferred practice is therefore to split the dataset into training and test samples, build the model on the training sample, and then use the model thus built to predict the dependent variable on the test data.

Doing it this way, we will have the model's predicted values for the test data as well as the actuals (from the original dataset). By calculating accuracy measures (like min-max accuracy) and error rates (MAPE or MSE), we can find out the prediction accuracy of the model. Now, let's see how to actually do this.

Follow the steps below to build and validate the predictive model.

**Step 1:** Create the training and test data.

```
set.seed(123) # setting seed to reproduce results of random sampling
trainingRowIndex <- sample(1:nrow(newdata2), 0.7*nrow(newdata2)) # training and testing: 70/30 split
trainingData <- newdata2[trainingRowIndex, ] # training data
testData <- newdata2[-trainingRowIndex, ] # test data
```

**Step 2:** Develop the model on the training data and use it to predict hematocrit on the test data.

```
modTrain <- lm(HEMATOCRIT ~ HEMAGLOBIN, data=trainingData) # build the model
predict <- predict(modTrain, testData) # predicted values
summary(modTrain)
```

**Step 3:** Calculate prediction accuracy and error rates.

```
act_pred <- data.frame(cbind(actuals=testData$HEMATOCRIT, predicteds=predict)) # actuals_predicteds
cor(act_pred) # correlation_accuracy
head(act_pred, n=10)
# Actual values and predicted ones seem very close to each other. A good metric to see how much they are close is the min-max accuracy, that considers the average between the minimum and the maximum prediction.
min_max <- mean(apply(act_pred, 1, min) / apply(act_pred, 1, max))
print(min_max) # show the result
mape <- mean(abs((act_pred$predicteds - act_pred$actuals))/act_pred$actuals)
print(mape) # show the result
```

Suppose the model predicts satisfactorily on the held-out test data; is that enough to believe that the model will perform equally well all the time? It is important to rigorously test the model's performance as much as possible. One way is to ensure that the model equation will perform well when it is built on a different subset of training data and used to predict the remaining data.

```
# K-Cross Validation
kfold <- CVlm(data = newdata2, form.lm = formula(HEMATOCRIT ~ HEMAGLOBIN), m=5,
dots = FALSE, seed=123, legend.pos="topleft",
main="Cross Validation; k=5",
plotit=TRUE, printit=FALSE)
# The mean squared error measures how a regression line is close to a set of points
attr(kfold, 'ms')
```

The value of **0.13** is low, and it represents a good accuracy result. (Ideally, the smaller the mean squared error, the closer the line of best fit.)

Reference: James G., Witten D., Hastie T., Tibshirani R., *An Introduction to Statistical Learning with Applications in R* (Springer, 2013).
