Mathematics in Data Science (2)
Wow! It’s been quite an interesting learning path as a beginner in Data Science. In my previous blog post, I made a brief note on probability, combinatorics and Bayes’ theorem. The focus here is strictly on probability distributions.
Probability Distribution
A probability distribution shows all the possible values a variable can take and how frequently each of them occurs.
Two (2) vital characteristics of a probability distribution are:
- Mean: the average value the variable takes.
- Variance: a measure of dispersion, that is, how spread out the values are around the mean.
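To make these two characteristics concrete, here is a minimal sketch that computes them for a fair six-sided die. The die is just an illustrative choice, and `Fraction` is used only to keep the arithmetic exact.

```python
from fractions import Fraction

# Fair six-sided die (illustrative example): each face has probability 1/6.
outcomes = [1, 2, 3, 4, 5, 6]
prob = Fraction(1, 6)

# Mean: each value weighted by its probability, then summed.
mean = sum(x * prob for x in outcomes)

# Variance: probability-weighted average of the squared distance from the mean.
variance = sum((x - mean) ** 2 * prob for x in outcomes)

print(mean, variance)  # 7/2 35/12
```

So the average roll is 3.5, and the variance (about 2.92) tells us how far rolls typically stray from that average.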
Types of Probability Distribution
Based on the type of data we have, probability distributions are grouped into two types:
- Discrete distribution
- Continuous distribution
A discrete distribution has a countable (often finite) number of outcomes [for example, rolling a die or picking a card out of a pile of cards], while a continuous distribution has an infinite (limitless) number of outcomes within a range [for example, recording the blood pressure of adult male athletes in West Africa].
Discrete Distribution
As mentioned earlier, a discrete distribution has a countable (often finite) number of outcomes. Four common types of discrete distribution are the uniform, Bernoulli, binomial, and Poisson distributions.
Uniform distribution: Here, all outcomes are equiprobable (they have an equal likelihood of occurring). An example of such an event is rolling a standard 6-sided die. The probability of getting a 1 is the same as that of getting a 2, 3, 4, 5 or 6; that is, P(1)=P(2)=P(3)=P(4)=P(5)=P(6). Another example is choosing one out of 5 identical grey T-shirts. The probability of picking any particular shirt is always the same.
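A quick way to see equiprobable outcomes is to simulate many die rolls and check that each face turns up about 1/6 of the time. This is only a sketch: the number of rolls and the seed are arbitrary choices.

```python
import random
from collections import Counter

random.seed(0)      # fixed seed so the run is repeatable (arbitrary choice)
n_rolls = 60_000    # arbitrary number of simulated rolls

# Roll a fair die n_rolls times and count how often each face appears.
counts = Counter(random.randint(1, 6) for _ in range(n_rolls))

# Relative frequency of each face; each should be close to 1/6 (about 0.167).
freqs = {face: counts[face] / n_rolls for face in range(1, 7)}

for face, freq in freqs.items():
    print(face, round(freq, 3))
```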
Bernoulli distribution: Here, an event has only 2 possible outcomes and a single trial (or iteration). A Bernoulli distribution is denoted as Bern(p), where p is the probability of one of the two outcomes. Examples of such events include a coin flip (you’d get either a head or a tail) and a quiz question with a True/False answer. The two outcomes are assigned the values 0 and 1.
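As a small sketch, a Bernoulli distribution can be written as a function of the outcome (0 or 1) and the probability p of the outcome labelled 1. The function name `bernoulli_pmf` and the value p = 0.5 (a fair coin) are my own illustrative choices.

```python
def bernoulli_pmf(k, p):
    """Probability of outcome k (0 or 1) when outcome 1 has probability p."""
    if k not in (0, 1):
        raise ValueError("a Bernoulli outcome must be 0 or 1")
    return p if k == 1 else 1 - p

p = 0.5                   # fair coin: head = 1, tail = 0 (illustrative)
mean = p                  # the mean of Bern(p) is p
variance = p * (1 - p)    # the variance of Bern(p) is p(1 - p)

print(bernoulli_pmf(1, p), bernoulli_pmf(0, p), mean, variance)
```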
Binomial distribution: This describes a sequence of identical, independent Bernoulli events. Hence, a Bernoulli distribution can be said to be a binomial distribution with a single trial. For example, guessing every question in a True/False quiz would follow a binomial distribution, while guessing just 1 question in the quiz would follow a Bernoulli distribution.
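The quiz example can be sketched with the standard binomial formula, P(k successes in n trials) = C(n, k) p^k (1 - p)^(n - k). The quiz length (10 questions) and the guessing probability (0.5) below are assumptions made up for illustration.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent Bernoulli(p) trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

n_questions = 10   # hypothetical quiz length
p_correct = 0.5    # chance of guessing a True/False question correctly

# Chance of guessing exactly 7 of the 10 questions correctly.
p_seven = binomial_pmf(7, n_questions, p_correct)
print(p_seven)

# With a single trial the binomial reduces to the Bernoulli distribution.
print(binomial_pmf(1, 1, p_correct), binomial_pmf(0, 1, p_correct))
```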
Poisson distribution: This shows how likely it is for an event to occur a given number of times within a specific interval (of time or space), based on how often it occurs on average in that interval.
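For a feel of how this works, here is a sketch of the Poisson formula, P(k events) = λ^k e^(-λ) / k!, where λ is the average number of events per interval. The value λ = 4 (say, 4 messages per hour) is a made-up example.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k events in an interval that averages lam events."""
    return lam**k * math.exp(-lam) / math.factorial(k)

lam = 4  # hypothetical average: 4 events per interval

# Probability of seeing 0, 1, 2, ... events in one interval.
for k in range(9):
    print(k, round(poisson_pmf(k, lam), 4))
```

Unlike a die roll, there is no upper limit on k here; the probabilities simply become tiny as k grows.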
Continuous Distribution
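A continuous distribution deals with “continuous” outcomes, which can take any value within a range (so there are infinitely many of them). These outcomes can’t be listed in a table (there are too many possible values), so they are represented on graphs instead. The graph is a smooth curve (which could be called a Probability Distribution Curve or PDC; more formally, it is described by a probability density function), rather than bars (as in a discrete distribution).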
Five (5) common types of continuous distribution are the Normal distribution, Student’s t distribution, Chi-Squared distribution, Exponential distribution and Logistic distribution.
Normal Distribution: This distribution is often observed in nature, hence the name. An example of events that follow a normal distribution is this: according to health research, the normal blood pressure of a healthy adult is 120/80 mmHg. However, some individuals might still have values as high as 125/85 mmHg or as low as 110/75 mmHg. When plotted, these other values fall towards the extremities of the graph, and the most extreme of them are called outliers. They make up just a small percentage of the dataset.
The graph of a normal distribution is bell-shaped and symmetric about the mean. The outliers in the graph are represented by thin tails.
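The bell shape and the thin tails can be checked numerically with Python's built-in `statistics.NormalDist`. The mean of 120 and standard deviation of 5 below are hypothetical numbers picked only for illustration; they are not taken from any health dataset.

```python
from statistics import NormalDist

# Hypothetical parameters, for illustration only.
dist = NormalDist(mu=120, sigma=5)

# Symmetry: the curve is equally high at the same distance on either side of the mean.
print(dist.pdf(115), dist.pdf(125))

# Share of values within 1, 2 and 3 standard deviations of the mean.
within_1sd = dist.cdf(125) - dist.cdf(115)
within_2sd = dist.cdf(130) - dist.cdf(110)
within_3sd = dist.cdf(135) - dist.cdf(105)
print(round(within_1sd, 4), round(within_2sd, 4), round(within_3sd, 4))
```

Roughly 68% of values lie within one standard deviation of the mean and about 99.7% within three, which is why the tails are so thin.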
Student’s t distribution: This is a small-sample-size approximation of a normal distribution. Its notation is t(k), where k represents the “degrees of freedom”. The t distribution is applied in statistical analysis for hypothesis testing.
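One way to see the link between the t and normal distributions is to compare their curves directly. The sketch below writes out the standard t density by hand (the function name `t_pdf` is my own) and compares it with the standard normal density at a point far from the centre.

```python
import math
from statistics import NormalDist

def t_pdf(x, k):
    """Density of Student's t distribution with k degrees of freedom."""
    # Work with logarithms (lgamma) so large values of k do not overflow.
    log_norm = (math.lgamma((k + 1) / 2) - math.lgamma(k / 2)
                - 0.5 * math.log(k * math.pi))
    return math.exp(log_norm - (k + 1) / 2 * math.log1p(x * x / k))

z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Height of each curve at x = 3, far out in the tail.
for k in (2, 5, 30):
    print(k, t_pdf(3.0, k), z.pdf(3.0))
```

For small k the t curve sits higher in the tails (extreme values are more likely), and as k grows it moves closer to the normal curve.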
Chi-Squared distribution: This is an asymmetric distribution, and its curve is typically skewed to the right. It doesn’t often occur in real-life events. It features more in statistical analysis, where it is used in hypothesis testing to determine the “goodness of fit” of categorical variables and in computing confidence intervals.
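To illustrate the “goodness of fit” idea, here is a sketch that computes the chi-squared statistic, the sum of (observed - expected)² / expected, for a die rolled 60 times. The observed counts are invented for this example.

```python
# Invented counts for faces 1..6 after 60 rolls of a die (illustration only).
observed = [8, 12, 9, 11, 10, 10]

# If the die is fair, each face is expected 60 / 6 = 10 times.
total = sum(observed)
expected = [total / len(observed)] * len(observed)

# Chi-squared statistic: sum of (observed - expected)^2 / expected.
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Degrees of freedom: number of categories minus 1.
dof = len(observed) - 1

print(chi_sq, dof)
```

The statistic is then compared against the chi-squared distribution with 5 degrees of freedom. A value this small is well below the usual 5% critical value of about 11.07, so these counts give no reason to doubt that the die is fair.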
Exponential distribution: As the name implies, its variables are those whose probability starts off high and then decreases quickly. A typical real-life example would be this: when there is breaking news or a trend on Twitter, there is great interest and huge engagement at the start of the trend. Thereafter (maybe within a few days), the engagement begins to decrease. As time goes on, engagement becomes sporadic as the topic loses relevance.
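The “starts high, then decreases” shape can be sketched with the exponential density, rate × e^(-rate × x), for x ≥ 0. The rate of 0.5 below is an arbitrary choice, and the function names are my own.

```python
import math
import random

def exp_pdf(x, rate):
    """Exponential density: highest at x = 0, then decaying."""
    return rate * math.exp(-rate * x) if x >= 0 else 0.0

def exp_cdf(x, rate):
    """Probability that the value is at most x."""
    return 1 - math.exp(-rate * x) if x >= 0 else 0.0

rate = 0.5  # arbitrary rate, for illustration only

# The curve keeps dropping as x increases.
for x in (0, 1, 2, 4, 8):
    print(x, round(exp_pdf(x, rate), 4), round(exp_cdf(x, rate), 4))

# Simulated values should average close to 1 / rate.
random.seed(1)  # fixed seed so the run is repeatable
samples = [random.expovariate(rate) for _ in range(100_000)]
sample_mean = sum(samples) / len(samples)
print(sample_mean)
```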
Logistic distribution: This is often used in mathematical modelling, where it helps determine how continuous input variables affect the probability of a binary output. The center of the logistic distribution curve is the mean.
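A short sketch of why the logistic distribution suits binary outputs: its cumulative distribution function is the S-shaped curve 1 / (1 + e^(-(x - μ)/s)), which turns any continuous input into a value between 0 and 1. The defaults μ = 0 and s = 1 are simply the standard case, and the function name is my own.

```python
import math

def logistic_cdf(x, mu=0.0, s=1.0):
    """Cumulative probability of a logistic distribution with mean mu and scale s."""
    return 1 / (1 + math.exp(-(x - mu) / s))

# Continuous inputs mapped to probabilities between 0 and 1.
for x in (-4, -2, 0, 2, 4):
    print(x, round(logistic_cdf(x), 4))
```

At the mean the value is exactly 0.5, and the curve is symmetric on either side of it.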