In the field of machine learning and specifically the problem of statistical classification, a confusion matrix, also known as an error matrix, is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning one (in unsupervised learning it is usually called a matching matrix). Each row of the matrix represents the instances in a predicted class, while each column represents the instances in an actual class (or vice versa). The name stems from the fact that it makes it easy to see whether the system is confusing two classes (i.e. …
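As a minimal sketch of the layout described above, the snippet below counts (predicted, actual) pairs using only the Python standard library. The `confusion_matrix` helper and the cat/dog labels are hypothetical, chosen only for illustration; rows are predicted classes and columns are actual classes, matching the convention in the excerpt.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """Build a matrix with one row per predicted class and one column per actual class."""
    counts = Counter(zip(predicted, actual))
    return [[counts[(p, a)] for a in labels] for p in labels]


# Hypothetical labels, for illustration only.
actual = ["cat", "cat", "dog", "dog", "dog", "cat"]
predicted = ["cat", "dog", "dog", "dog", "cat", "cat"]
class_names = ["cat", "dog"]

matrix = confusion_matrix(actual, predicted, class_names)
for name, row in zip(class_names, matrix):
    print(f"predicted {name}: {row}")
```

The off-diagonal cells are the ones where the classifier confuses one class for the other.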

You’ve built your machine learning model — so what’s next? You need to evaluate it and validate how good (or bad) it is, so you can then decide on whether to implement it. That’s where the AUC-ROC curve comes in.

For now, just know that the AUC-ROC curve helps us visualize how well our machine learning classifier is performing. Although in its basic form it applies only to binary classification problems, we will see towards the end how we can extend it to evaluate multi-class classification problems too.
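To make the AUC number concrete, here is a rough sketch that computes it directly from its pairwise interpretation: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The `roc_auc` helper and the sample scores are assumptions made for illustration, not code from the article.

```python
def roc_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and classifier scores.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(y_true, y_score)
print(auc)  # 0.75: three of the four positive/negative pairs are ordered correctly
```

This pairwise loop is quadratic in the number of examples, so it is meant for understanding rather than for large datasets.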

We’ll cover topics like sensitivity and specificity as well since these are key topics behind the…

**Definition** HR analytics is the process of collecting and analyzing Human Resource (HR) data in order to improve an organization’s workforce performance. The process can also be referred to as talent analytics, people analytics, or even workforce analytics. This method of data analysis takes data that is routinely collected by HR and correlates it to HR and organizational objectives. Doing so provides measured evidence of how HR initiatives are contributing to the organization’s goals and strategies.

You have just been hired as a Data Scientist at a hospital with an alarming number of patients coming in reporting various cardiac symptoms. A cardiologist measures vitals and hands you this data to analyze and to predict whether certain patients have Heart Disease. We would like to build a Machine Learning model that learns and improves from experience. Thus, we want to classify patients as either positive or negative for Heart Disease.

Predict whether a patient should be diagnosed with Heart Disease. This is a binary outcome.

Positive (+) = 1…

Most parametric machine learning models, such as LDA, Linear Regression and many more, assume that the data (or, in the case of Linear Regression, the error term) is normally distributed. If this assumption fails, the model may fail to give accurate predictions.

A normal (Gaussian) distribution with mean 0 and standard deviation 1 is known as the standard normal distribution. A normal distribution is **symmetric** about the mean and follows a **bell-shaped curve**. About 99.7% of its values lie within 3 standard deviations of the mean. The mean, median and mode of a normal distribution are equal.
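The 99.7% figure can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8+); the `mass_within` helper is a hypothetical name introduced here for illustration.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)


def mass_within(k):
    """Probability that a value falls within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {mass_within(k):.4f}")

# Symmetry about the mean: half of the probability mass lies below it.
print(standard_normal.cdf(0.0))
```

The three printed shares are roughly 68%, 95% and 99.7%, which is why this is often called the 68-95-99.7 rule.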

Skewness of a distribution is defined as the lack of…

Oftentimes in practical machine learning problems there will be significant differences in the rarity of different classes of data being predicted. For example, when detecting cancer we can expect to have datasets with a large number of negative outcomes and a relatively small number of positive outcomes.

The overall performance of any model trained on such data will be constrained by its ability to predict rare points. In problems where these rare points are only equally important or perhaps less important than non-rare points, this constraint may only become significant in the later “tuning” stages of building the model. …
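A small sketch of why the rare class constrains performance: on a hypothetical dataset with 10 positive cases out of 1,000, a model that always predicts the majority class looks accurate overall while finding none of the rare points. The data and the `accuracy` and `recall` helpers are assumptions made for illustration.

```python
def accuracy(actual, predicted):
    """Share of all predictions that match the true label."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


def recall(actual, predicted, positive=1):
    """Share of true positive-class cases that the model identified."""
    true_positives = sum(1 for a, p in zip(actual, predicted) if a == positive and p == positive)
    actual_positives = sum(1 for a in actual if a == positive)
    return true_positives / actual_positives


# Hypothetical imbalanced dataset: 10 rare positives, 990 negatives.
actual = [1] * 10 + [0] * 990
always_negative = [0] * len(actual)  # a "model" that never predicts the rare class

print(f"accuracy: {accuracy(actual, always_negative):.2f}")
print(f"recall on rare class: {recall(actual, always_negative):.2f}")
```

Here accuracy is 0.99 while recall on the rare class is 0.00, so a metric that focuses on the rare class is usually more informative in this setting.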

An outlier is an observation that is numerically distant from the rest of the data or, put simply, a value that lies outside the usual range. Let’s take an example to see what happens to a data set with and without outliers.

As you can see, the data set with outliers has a significantly different mean and standard deviation. In the first scenario, we would say that the average is 3.14. But with the outlier, the average soars to 59.71. This would change the estimate completely.
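The data behind this comparison is not reproduced in the text, so the sketch below uses hypothetical values chosen so that the two means match the 3.14 and 59.71 quoted above. It relies only on the `statistics` module from the Python standard library.

```python
from statistics import mean, stdev

# Hypothetical values, picked so the means match the figures quoted in the text.
without_outlier = [1, 2, 3, 3, 4, 5, 4]
with_outlier = [1, 2, 3, 3, 4, 5, 400]  # the last value is replaced by an outlier

for name, data in (("without outlier", without_outlier), ("with outlier", with_outlier)):
    print(f"{name}: mean = {mean(data):.2f}, standard deviation = {stdev(data):.2f}")
```

A single extreme value is enough to move the mean from about 3.14 to about 59.71 and to inflate the sample standard deviation sharply.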

Let’s take a real-world example. In a company of 50 employees, 45 people…

A passionate data scientist with knowledge of predictive modelling, data processing, and data mining algorithms for solving challenging business problems.