The history of probability theory is a fascinating journey that spans centuries and involves contributions from various cultures and thinkers. Here is a brief overview of key developments in the history of probability theory:
Ancient Roots (circa 300 BCE): The earliest roots of probability can be traced back to ancient civilizations. For instance, the ancient Greeks engaged in games of chance and rudimentary probability calculations. However, formalized probability theory did not emerge until much later.
Pascal and Fermat (17th Century): In the 17th century, two French mathematicians, Blaise Pascal and Pierre de Fermat, exchanged letters discussing problems related to gambling. This correspondence laid the foundation for the theory of probability. Pascal’s work, “Traité du Triangle Arithmétique,” and Fermat’s work on probability problems played a crucial role in shaping early probability concepts.
Huygens and the Mathematical Treatment (17th Century): Christiaan Huygens, a Dutch mathematician, expanded on the ideas of Pascal and Fermat. In his book “De Ratiociniis in Ludo Aleae” (On Reasoning in Games of Chance), Huygens introduced the concept of probability as a branch of mathematics. He developed the classical definition of probability, treating it mathematically and introducing the concept of expected value.
Jacob Bernoulli and Law of Large Numbers (18th Century): Swiss mathematician Jacob Bernoulli made significant contributions to probability theory in the early 18th century. His work, “Ars Conjectandi,” included the famous law of large numbers, which describes the convergence of sample averages to expected values as the number of trials increases.
Laplace and Bayesian Probability (18th-19th Century): Pierre-Simon Laplace, a French mathematician, extended probability theory in the late 18th and early 19th centuries. Laplace developed and popularized the approach now called Bayesian probability, named after Thomas Bayes, whose theorem was published posthumously in 1763; it emphasizes the use of prior knowledge in probability calculations. His work, “Théorie analytique des probabilités,” provided a systematic framework for probability theory.
Frequentist vs. Bayesian Debate (20th Century): The 20th century saw the development of two competing schools of thought in probability theory: frequentist and Bayesian. Frequentist probability, championed by statisticians like Ronald A. Fisher, focused on the frequency of events in the long run. Bayesian probability, influenced by the works of statisticians like Harold Jeffreys, integrated prior knowledge and subjective beliefs into probability calculations.
Modern Developments (20th Century Onward): Probability theory continued to evolve with contributions from various fields such as statistics, information theory, and machine learning. The development of Markov chains, stochastic processes, and the emergence of computational methods have further enriched the field.
Today, probability theory plays a crucial role in diverse areas such as statistics, finance, physics, and artificial intelligence, making it a fundamental aspect of modern mathematics and science.
A Markov chain is a stochastic model in which the future depends only on the present and not on the past, that is,
\[P(X^{t+1}|X^t,X^{t-1},...X^2,X^1) = P(X^{t+1}| X^t)\]Let us denote
\[p_{ij} = P(X^{n+1} = i | X^n = j)\]where $p_{ij}$ denotes the probability of going from state “j” to state “i” in one step. Similarly, we can define $p_{ij}^{(n)}$ as the probability of going from state “j” to state “i” in n steps. We can collect the one-step probabilities in the Transition Probability Matrix
\[TPM = \begin{bmatrix} p_{11} & p_{12} & p_{13} & \cdots \\ p_{21} & p_{22} & p_{23} & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{bmatrix}\]MCMC is widely used in statistics, mathematics and computer science. Here we discuss a simple problem in Bayesian computation and assess why convergence of the Markov chain is important.
Let us assume that we want to estimate a certain parameter $t(\theta)$, and the model is such that $g(\theta)$ is the prior density for $\theta$ and $f(y | \theta)$ is the likelihood of $y = \{y_1,y_2, \ldots, y_n\}$ given the value of $\theta$. Then the posterior can be written as
\[g(\theta | y ) \propto g(\theta)f(y|\theta)\]which has to be normalized. The posterior density is then given by
\[g(\theta | y ) = \frac{g(\theta)f(y|\theta)}{\int g(\theta)f(y|\theta)d\theta}\]For the sake of simplicity, let us assume $t(\theta) = \theta$ and let $\hat{\theta}$ be an estimate. Take the squared error loss function
\[L(\theta , \hat{\theta}) = (\theta - \hat\theta)^2\]Then the classical risk is given by $R_{\hat{\theta}}(\theta) = E_{\theta}(L(\theta,\hat{\theta}))$ and the Bayes risk is given by
\[r(\hat{\theta}) = \int R_{\hat{\theta}}(\theta)g(\theta)d\theta\]Now our target is to minimize the Bayes risk to get the Bayes estimate
\[\begin{align*} r(\hat\theta) &= \int R_{\hat{\theta}}(\theta)g(\theta)d\theta \\ &= \int E_{\theta}(L(\theta,\hat{\theta})) g(\theta)d\theta \\ &= \int \left( \int (\theta - \hat\theta)^2f(y|\theta)dy\right)g(\theta)d\theta \\ &= \int \int (\theta - \hat\theta)^2f(y|\theta)g(\theta)dyd\theta \\ &= \int \left( \int (\theta - \hat\theta)^2g(\theta|y)d\theta\right)f(y)dy \tag{1} \end{align*}\]Equation $({1})$ is minimized when the inner integral is minimized for each $y$, which happens when
\[\hat{\theta} = E(\theta |y)\]Now we may not always be able to calculate the mean of the posterior density, that is,
\[\hat\theta = \int\theta g(\theta|y)d\theta\]This happens when the posterior is known only up to its kernel and the integral is intractable. In that case we rely on the law of large numbers: we take random samples from the posterior $g(\theta | y)$ and calculate the mean of the samples, which can be written as
\[X^1,X^2, \ldots, X^n \ are \ samples \ from\ g(\theta|y) \ now \\ \frac{\sum X^i}{n} \to \hat\theta \ as \ n \to \infty\]Now let us write $g(\theta | y) = \pi(\theta)$, which is allowed because y is realized and $g(\theta | y)$ is a function of $\theta$ only. Now comes MCMC: if we can create a chain whose stationary distribution is $\pi(\theta)$, then we can treat the chain as a (dependent) sample whose distribution converges to $\pi(\theta)$, and that is the reason we need the Markov chain to converge. Before we move forward, let us state some definitions.
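Before the definitions, the sample-average step above can be made concrete with a minimal sketch. It uses independent draws (not yet a Markov chain) from a stand-in "posterior", a Gamma distribution whose mean is known exactly, so that the estimate can be compared with the truth. The distribution, its parameters and the helper name `monte_carlo_mean` are assumptions chosen for illustration only.

```python
import random


def monte_carlo_mean(sampler, n):
    """Average n draws produced by calling sampler()."""
    return sum(sampler() for _ in range(n)) / n


random.seed(0)  # fixed seed so the run is reproducible

# Stand-in posterior (an assumption): Gamma(shape=3, scale=2), exact mean 3 * 2 = 6.
shape, scale = 3.0, 2.0
exact_mean = shape * scale

for n in (100, 10_000, 200_000):
    estimate = monte_carlo_mean(lambda: random.gammavariate(shape, scale), n)
    print(f"n={n:>7}  sample mean={estimate:.4f}  exact mean={exact_mean}")
```

The sample mean should drift toward the exact mean as n grows. MCMC replaces the independent sampler with a Markov chain, which is why the convergence results below are needed.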
Let $\pi$ denote a probability measure on $(\mathcal{X},\mathcal{B})$ and let $\Phi = \{X^0,X^1, \ldots\}$ be a discrete-time Markov chain on $(\mathcal{X},\mathcal{B})$. Let P be its transition kernel and k its transition density, so that
\[P(x,A) = Pr(X^{i+1} \in A | X^i = x ) = \int_A k(x,y)dy\]that is, $P(x,A)$ gives the one-step transition probability from state x to any state in A. The transition kernel acts as two linear operators,
where
\[\lambda P(A) = \int_{\mathcal{X}}\lambda(x)P(x,A)dx\]so if $X^i \sim \lambda$ then $\lambda P$ is the marginal distribution of $X^{i+1}$, and
\[Pf(x) = \int_{\mathcal{X}}P(x,dy)f(y) = E_P[f(X^{i+1})|X^i = x]\]and the m-step transition probability is given by
\[P^m(x,A) = \int_A k^m(x,y)dy\]Invariant Density - $\pi$
\[\pi = \pi P \\ \Rightarrow \pi(x) = \int_{\mathcal{X}}\pi(y)k(y,x)dy\]
There are several ways to ensure that $\pi$ is an invariant (or stationary) distribution. One of them is to satisfy the balance condition, i.e.
\[\pi(x)k(x,y) = \pi(y)k(y,x) \ \ \ \ \ \ \ \ \ \ \ for \ all \ x,y \in \mathcal{X}\]Proof
Suppose $\pi$ satisfies the balance condition; then
\[\begin{align*} \pi(x)k(x,y) &= \pi(y)k(y,x) \ \ \ \ \ \ \ \ \ \ \ for \ all \ x,y \in \mathcal{X} \\ \Rightarrow \int_{\mathcal{X}}\pi(y)k(y,x)dy &= \int_{\mathcal{X}}\pi(x)k(x,y)dy = \pi(x) \end{align*}\]However, the balance condition is only sufficient, not necessary: reversibility is not required for $\pi$ to be invariant. If $X^i \sim \pi$ and the chain preserves this distribution over any number of transitions, we say that the Markov chain is stationary. MCMC additionally needs the chain to converge to $\pi$ from an arbitrary starting value, which requires the further conditions defined below.
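The two claims above (balance implies invariance, but invariance does not need balance) can be checked on a finite state space, where the integrals become sums. This is a minimal sketch under assumptions: the three-state target, the Metropolis-style kernel with a uniform proposal, and the cyclic counterexample are all chosen for illustration, and `K[x][y]` plays the role of $k(x,y)$.

```python
def metropolis_kernel(pi):
    """Row-stochastic kernel K[x][y]: propose a different state uniformly,
    accept with probability min(1, pi[y] / pi[x])."""
    n = len(pi)
    K = [[0.0] * n for _ in range(n)]
    for x in range(n):
        for y in range(n):
            if x != y:
                K[x][y] = min(1.0, pi[y] / pi[x]) / (n - 1)
        K[x][x] = 1.0 - sum(K[x])  # rejected proposals stay at x
    return K


def satisfies_balance(pi, K, tol=1e-12):
    """Check pi(x) k(x, y) == pi(y) k(y, x) for all pairs."""
    n = len(pi)
    return all(abs(pi[x] * K[x][y] - pi[y] * K[y][x]) <= tol
               for x in range(n) for y in range(n))


def is_invariant(pi, K, tol=1e-12):
    """Check pi(x) == sum_y pi(y) k(y, x) for all x."""
    n = len(pi)
    return all(abs(sum(pi[y] * K[y][x] for y in range(n)) - pi[x]) <= tol
               for x in range(n))


# Reversible example: balance holds, hence invariance holds.
pi = [0.2, 0.3, 0.5]
K = metropolis_kernel(pi)
print(satisfies_balance(pi, K), is_invariant(pi, K))

# Non-reversible example: a mostly cyclic chain keeps the uniform
# distribution invariant although balance fails.
uniform = [1.0 / 3.0] * 3
K_cyclic = [[0.1, 0.9, 0.0],
            [0.0, 0.1, 0.9],
            [0.9, 0.0, 0.1]]
print(satisfies_balance(uniform, K_cyclic), is_invariant(uniform, K_cyclic))
```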
Let us define
$\phi$-irreducible A Markov chain is $\phi$-irreducible for some measure $\phi$ on $(\mathcal{X},\mathcal{B})$ if for all $x \in \mathcal{X}$ and $A \in \mathcal{B}$ for which $\phi(A) > 0$, there exists n for which $P^n(x,A)>0$
Aperiodic A chain is aperiodic if its period is 1
Harris Recurrent A $\phi$-irreducible Markov chain is Harris recurrent if, for every $\phi$-positive set A and every starting value, the chain reaches A with probability 1
Harris Ergodic A Markov chain is said to be Harris ergodic if it is $\phi$-irreducible, aperiodic, Harris recurrent and possesses an invariant distribution $\pi$, for some measures $\phi$ and $\pi$
Total Variation Distance The total variation distance between two measures $\mu(.)$ and $v(.)$ is defined by
\[|| \mu(.) - v(.)|| = \sup_{A \in \mathcal{B}}|\mu(A)-v(A)|\]The following two theorems are very important for MCMC
Ergodic Theorem Suppose a Markov chain $\Phi$ is Harris ergodic with invariant distribution $\pi$ and $E_{\pi} | g(X) | < \infty$ for some function $g : \mathcal{X} \to \Bbb{R}$. Then for any starting value $x \in \mathcal{X}$,
\[\bar{g}_n = \frac{1}{n}\sum_{i=0}^{n-1}g(X^i) \to E_{\pi}g(X) \ almost \ surely \ as \ n \ \to \infty\]and that is the main property we generally rely on in MCMC
Birkhoff, George D. “Proof of the Ergodic Theorem.” Proceedings of the National Academy of Sciences of the United States of America, vol. 17, no. 12, 1931, pp. 656–660. JSTOR, www.jstor.org/stable/86016. Accessed 9 Apr. 2021.
The other theorem is as follows
Suppose the Markov chain $\Phi$ is Harris ergodic with invariant distribution $\pi$. Then for any starting value $x \in \mathcal{X}$, $\Phi$ will converge to $\pi$ in total variation distance, i.e.
\[||P^n(x,.) - \pi(.)|| \to 0 \ as \ n \to \infty\]Further, $||P^n(x,.) - \pi(.)||$ is monotonically non-increasing in n
The ergodic theorem tells us that the Markov chain converges; however, it says nothing about the rate of convergence. We call a Markov chain that converges at a geometric rate geometrically ergodic, i.e. there exist $M:\mathcal{X} \to \Bbb{R}$ and some constant $t \in (0,1)$ that satisfy
\[||P^n(x,.)-\pi|| \leq M(x)t^n \ \ \ \ \ for \ any \ x \in \mathcal{X}\]If M is bounded, the Markov chain is uniformly ergodic
A Type 1 drift condition holds if there exist some non-negative function $V:\mathcal{X} \to \Bbb{R}_{\geq 0}$ and constants $0 < \gamma <1$ and $L < \infty$ for which
\[PV(x) \leq \gamma V(x) + L \ \ \ \ \ \ \ \ \ \ \ \ for \ any \ x \in \mathcal{X}\]We call V a drift function and $\gamma$ a drift rate
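A minorization condition holds on a set $C \in \mathcal{B}$ if there exist some positive integer m, some $\epsilon > 0$ and a probability measure Q on $(\mathcal{X},\mathcal{B})$ for which
\[P^m(x,A) \geq \epsilon Q(A) \ \ \ \ \ for \ all \ x \in C \ and \ A \in \mathcal{B}\]This is also called an m-step minorization condition, and C is then called a small set. It holds in particular when, with q denoting the density of Q, the m-step transition density satisfies
\[k^m(x,y) \geq \epsilon q(y) \ \ \ \ \ for \ all \ x \in C \ and \ y \in \mathcal{X}\]Proposition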
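Suppose the Markov chain $\Phi$ is $\phi$-irreducible and aperiodic with invariant distribution $\pi$. Then $\Phi$ is geometrically ergodic if the following two conditions are met:
1. A Type 1 drift condition holds with drift function V, drift rate $\gamma$ and constant L.
2. A minorization condition holds on the set $C = \{x : V(x) \leq d\}$ for some $d > 2L/(1-\gamma)$.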
This Proposition is a Corollary of Rosenthal(1995a)
Let $\Phi$ be an aperiodic and irreducible Markov chain with invariant distribution $\pi$
Suppose Conditions 1 and 2 of the Proposition hold, let $X^0 = x_0$ be the starting value, and define
\[\alpha = \frac{1+d}{1+2L+\gamma d} \ \ \ \ \ \ and \ \ \ \ \ \ \ U = 1+2(\gamma d+L)\]Then for any $r \in (0,1)$
\[||P^n(x_0 ,.) - \pi(.)|| \leq (1-\epsilon)^{rn} +\left(\frac{U^r}{\alpha^{1-r}} \right)^n\left(1 + \frac{L}{1-\gamma} + V(x_0)\right)\]Choosing r small enough that $U^r/\alpha^{1-r} < 1$, we can rearrange this to see that it satisfies the geometric ergodicity condition
Type II Drift Condition: A Type II drift condition holds if there exist some function W : $\mathcal{X} \to [1,\infty)$ finite at some x $\in \mathcal{X}$, some set $D \in \mathcal{B}$, and constants $0 < \rho < 1$ and $b < \infty$ for which
\[PW(x) \leq \rho W(x) + bI_D(x) \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ for \ all \ x \in \mathcal{X}\]It can be shown that Type I Drift Condition $\Leftrightarrow$ Type II Drift Condition
Finally we can say that
Suppose the Markov chain $\Phi$ is aperiodic and $\phi$-irreducible with invariant distribution $\pi$. Then $\Phi$ is geometrically ergodic if there exist some small set D, a drift function $W: \mathcal{X} \to [1,\infty)$ and some constants $0 < \rho < 1$ and $b < \infty$ for which a Type II drift condition holds
Now let me restate the earlier theorem
Suppose the Markov chain $\Phi$ is Harris ergodic with invariant distribution $\pi$. Then for any starting value $x \in \mathcal{X}$, $\Phi$ will converge to $\pi$ in total variation distance, i.e.
\[||P^n(x,.) - \pi(.)|| \to 0 \ as \ n \to \infty\]Further, $||P^n(x,.) - \pi(.)||$ is monotonically non-increasing in n
Jain and Jamison (1967) have shown that for every $\phi$-irreducible Markov chain on $(\mathcal{X},\mathcal{B})$ there exists some small set $C \in \mathcal{B}$ for which $\phi(C) > 0$. Furthermore, the corresponding minorization measure Q(.) can be defined so that Q(C) > 0
The result of Jain and Jamison allows us to assume a set $C \in \mathcal{B}$ such that
\[P(x , A) \geq \epsilon Q(A) \ \ \ \ \ for \ all \ x \in C\]That is a one-step minorization condition. Now we can write
\[P(x,A) = \epsilon Q(A) + (1-\epsilon)R(x,A) \ \ \ \ \ \ \ for \ all \ x \in C \ and \ A \in \mathcal{B}\]Here $R(x,.)$ is a probability measure on $(\mathcal{X},\mathcal{B})$. This allows us to construct two separate chains which couple with probability 1
\[\Phi(X) = \{X^0,X^1 ...........\} \\ \Phi(Y) = \{Y^0,Y^1 ............\}\]Now $(X^{n},Y^n) \to (X^{n+1},Y^{n+1})$ with the following algorithm
Now define the coupling time T as the first n for which $(X^{n-1},Y^{n-1}) \in C \times C$ and $\delta_{n-1}=1$. Once the chains couple, they remain equal
Now let us assume
\[X^0 = x \ and \ Y^0 \sim \pi\]and let $Pr_x$ denote probability with respect to the starting point x. Then $\Phi(Y)$ is stationary and
\[\begin{align*} |P^n(x,A) - \pi(A)| &= |Pr_x(X^n \in A) - Pr_x(Y^n \in A)| \\ &= |Pr_x(X^n \in A,X^n = Y^n) +Pr_x(X^n \in A,X^n \neq Y^n)- Pr_x(Y^n \in A,X^n \neq Y^n)- Pr_x(Y^n \in A,X^n = Y^n)| \\ &= |Pr_x(X^n \in A,X^n \neq Y^n)- Pr_x(Y^n \in A,X^n \neq Y^n)| \\ &\leq \max\{Pr_x(X^n \in A,X^n \neq Y^n),\ Pr_x(Y^n \in A,X^n \neq Y^n)\} \\ &\leq Pr_x(X^n \neq Y^n) \\ &\leq Pr_x(T > n) \end{align*}\]Thus
\[||P^n(x,.) - \pi(.)|| \leq Pr_x(T>n)\]Now let us suppose the minorization condition holds over the entire space, i.e. $C = \mathcal{X}$. In this case every pair generated belongs to $C \times C$ for all n, so
\[T \sim Geo(\epsilon) \\ P(T>n) = (1-\epsilon)^n\]so
\[||P^n(x,.) - \pi(.)|| \leq (1-\epsilon)^n\]so when $C = \mathcal{X}$, $||P^n(x,.) - \pi(.)|| \to 0 \ as \ n \to \infty$,
and when $C \neq \mathcal{X}$, the distribution of the coupling time T is more complicated and beyond the scope of this presentation
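For the whole-space case $C = \mathcal{X}$, the bound $(1-\epsilon)^n$ can be checked numerically on a small finite chain. The sketch below rests on assumptions: the three-state matrix is made up for illustration, `P[x][y]` stands for the one-step probability of moving from x to y, and $\epsilon$ is taken as the sum over y of the column minima, which gives a valid one-step minorization with $Q(y) \propto \min_x P(x,y)$. On a finite space the total variation distance equals half the sum of absolute differences.

```python
def step(dist, P):
    """Push a distribution one step forward: (dist P)(y) = sum_x dist[x] P[x][y]."""
    n = len(P)
    return [sum(dist[x] * P[x][y] for x in range(n)) for y in range(n)]


def tv_distance(mu, nu):
    """Total variation distance between two distributions on a finite set."""
    return 0.5 * sum(abs(a - b) for a, b in zip(mu, nu))


def minorization_constant(P):
    """epsilon = sum_y min_x P[x][y], so that P[x][y] >= epsilon * Q(y) for all x."""
    n = len(P)
    return sum(min(P[x][y] for x in range(n)) for y in range(n))


def stationary(P, iters=1000):
    """Approximate the invariant distribution by repeated application of P."""
    dist = [1.0 / len(P)] * len(P)
    for _ in range(iters):
        dist = step(dist, P)
    return dist


# Hypothetical transition matrix; every row sums to 1.
P = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.3, 0.3, 0.4]]

eps = minorization_constant(P)
pi = stationary(P)

dist = [1.0, 0.0, 0.0]  # start the chain in state 0
for n in range(1, 11):
    dist = step(dist, P)
    print(f"n={n:2d}  TV={tv_distance(dist, pi):.3e}  bound={(1 - eps) ** n:.3e}")
```

The printed distances should stay below the bound at every step, and should also never increase with n, in line with the theorem restated above.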
Let us assume our target distribution is $\pi(\theta)$ such that $\theta = (\theta_1,\theta_2, \ldots, \theta_d)$
Notation : $\theta_{-i}$ is the vector of all parameters except $\theta_i$
Initialization : $\theta^0 = (\theta_1^0,\theta_2^0……\theta_d^0)$
Iteration: For $i \geq 1$
The transition kernel for two parameters is given by
\[k((\theta_1,\theta_2),(\tilde{\theta}_1,\tilde{\theta}_2)) = \pi(\tilde\theta_1|\theta_2)\cdot \pi(\tilde\theta_2|\tilde\theta_1)\]Let us check stationarity for two parameters
\[\begin{align*} \int\int \pi(\theta_1,\theta_2)k((\theta_1,\theta_2),(\tilde{\theta}_1,\tilde{\theta}_2))d\theta_1d\theta_2 &= \int\int \pi(\theta_1,\theta_2)\pi(\tilde\theta_1|\theta_2)\cdot \pi(\tilde\theta_2|\tilde\theta_1)d\theta_1d\theta_2 \\ &= \int \pi(\theta_2)\pi(\tilde\theta_1|\theta_2)\cdot \pi(\tilde\theta_2|\tilde\theta_1)d\theta_2 \\ &= \int \pi(\tilde\theta_1,\theta_2)\cdot \pi(\tilde\theta_2|\tilde\theta_1)d\theta_2 \\ &= \pi(\tilde\theta_1)\cdot \pi(\tilde\theta_2|\tilde\theta_1) \\ &= \pi(\tilde\theta_2,\tilde\theta_1) \\ \end{align*}\]However, stationarity alone does not suffice for convergence. Aperiodicity ensures that the chain does not cycle through the space, and irreducibility ensures that it does not get stuck in one part of it. We can also verify the balance condition for the chain of the first coordinate. Let $\Phi_1=\{\theta_1^0,\theta_1^1, \ldots\}$ and let $k_1(\theta_1,\tilde\theta_1)$ be the transition density of $\Phi_1$. Given $\theta_1$, the sampler draws $\theta_2 \sim \pi(\cdot|\theta_1)$ and then $\tilde\theta_1 \sim \pi(\cdot|\theta_2)$, so
\[\begin{align*} \pi({\theta_1}) k_1(\theta_1,\tilde\theta_1) &= \pi({\theta_1})\int \pi(\theta_2|\theta_1)\cdot \pi(\tilde\theta_1|\theta_2)d\theta_2 \\ &= \int \frac{\pi(\theta_1,\theta_2)\cdot \pi(\tilde\theta_1,\theta_2)}{\pi(\theta_2)} d\theta_2\\ &=\pi({\tilde\theta_1}) \int \pi(\theta_2|\tilde\theta_1)\cdot \pi(\theta_1|\theta_2)d\theta_2\\ &= \pi({\tilde\theta_1}) k_1(\tilde\theta_1,\theta_1) \end{align*}\]Let us suppose
\[Y_1 , Y_2, \ldots, Y_m \sim^{iid} N(\mu, \theta)\]where $m \geq 5$. Let us assume the joint prior density
\[g(\mu,\theta) \propto \frac{1}{\sqrt{\theta}}\]Let y = $(y_1,y_2, \ldots, y_m)$ be the sample data with mean $\bar y$ and sum of squared deviations $s^2 = \sum(y_i - \bar y)^2$. Then the posterior is given by
\[g(\mu , \theta | y) \propto \theta^{-\frac{m+1}{2}}exp \bigg( -\frac{1}{2\theta} \sum_{j=1}^m (y_j - \mu)^2\bigg)\]and
\[\theta | \mu,y \sim IG\left(\frac{m-1}{2}, \frac{s^2+m(\mu -\bar{y})^2}{2}\right) \\ \mu | \theta ,y \sim N(\bar y,\frac{\theta}{m})\]Recall that the Inverse Gamma distribution with parameters (a,b) has kernel $x^{-(a+1)}e^{-b/x}$
Let us use the DUGS (deterministic-update Gibbs) sampler with the following update scheme
\[(\theta^{'},\mu^{'}) \to (\theta,\mu^{'}) \to (\theta,\mu)\]so the kernel density will be given by
\[k((\mu^{'},\theta^{'}),(\mu,\theta)) = \pi(\theta|\mu^{'},y)\pi(\mu|\theta,y)\]Type 1 Drift Condition
Let us define $V(\mu , \theta) = (\mu - \bar{y})^2$
\[E[V(\mu,\theta)|\mu^{'},\theta^{'}] = E[V(\mu,\theta)|\mu^{'}] =E[E[V(\mu,\theta)|\theta]|\mu^{'}]\]where
\[E[V(\mu,\theta)|\theta] = E[(\mu-\bar{y})^2|\theta] = Var[\mu|\theta] = \frac{\theta}{m}\]Then
\[\begin{align*} E[V(\mu,\theta)|\mu^{'},\theta^{'}] &= E\left[\frac\theta m \Big| \mu^{'}\right] \\ &= \frac{1}{m} \cdot \frac{s^2+m(\mu^{'}-\bar{y})^2}{m-3} \\ &= \frac{(\mu^{'}-\bar{y})^2}{m-3} + \frac{s^2}{m(m-3)} \\ &= \frac{1}{m-3}V(\mu^{'},\theta^{'}) + \frac{s^2}{m(m-3)} \end{align*}\]Now $m \geq 5$ guarantees that $\frac{1}{m-3} < 1$, hence
\[PV(\mu^{'},\theta^{'}) =E[V(\mu,\theta)|\mu^{'},\theta^{'}] \leq \frac{1}{m-3}V(\mu^{'},\theta^{'}) + \frac{s^2}{m(m-3)}\]So it satisfies the drift condition with any $\gamma \in [1/(m-3),1)$ and $L =s^2/(m(m-3))$
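The drift calculation above can be checked by simulation. The sketch below performs many independent one-step Gibbs updates from a fixed $\mu'$ and compares the average of $V(\mu,\theta) = (\mu - \bar y)^2$ with the closed form $\frac{1}{m-3}V(\mu',\theta') + \frac{s^2}{m(m-3)}$. The data values, the starting point and the function names are assumptions made for illustration; the Inverse Gamma draw is obtained as the reciprocal of a Gamma draw with the same shape and with scale $1/b$.

```python
import random


def gibbs_step(mu_prev, ybar, s2, m, rng):
    """One update (theta', mu') -> (theta, mu') -> (theta, mu)."""
    a = (m - 1) / 2.0
    b = (s2 + m * (mu_prev - ybar) ** 2) / 2.0
    theta = 1.0 / rng.gammavariate(a, 1.0 / b)  # theta | mu', y ~ IG(a, b)
    mu = rng.gauss(ybar, (theta / m) ** 0.5)    # mu | theta, y ~ N(ybar, theta / m)
    return mu, theta


def drift_value(mu_prev, ybar, s2, m):
    """Closed form of E[V(mu, theta) | mu'] derived in the text."""
    return (mu_prev - ybar) ** 2 / (m - 3) + s2 / (m * (m - 3))


def estimate_pv(mu_prev, ybar, s2, m, draws, rng):
    """Monte Carlo estimate of PV(mu', theta') from independent one-step updates."""
    total = 0.0
    for _ in range(draws):
        mu, _theta = gibbs_step(mu_prev, ybar, s2, m, rng)
        total += (mu - ybar) ** 2
    return total / draws


# Hypothetical data set with m = 12 observations.
y_obs = [4.1, 5.3, 6.0, 4.8, 5.5, 5.9, 4.4, 6.2, 5.1, 4.9, 5.7, 5.0]
m = len(y_obs)
ybar = sum(y_obs) / m
s2 = sum((v - ybar) ** 2 for v in y_obs)

rng = random.Random(0)
mu_start = ybar + 2.0  # arbitrary starting value for mu'
print("simulated PV  :", estimate_pv(mu_start, ybar, s2, m, 100_000, rng))
print("closed form   :", drift_value(mu_start, ybar, s2, m))
```

The two printed numbers should agree up to Monte Carlo error, which supports the drift rate $1/(m-3)$ and the constant $L = s^2/(m(m-3))$.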
Minorization Condition
Let $C = \{(\mu,\theta) : V(\mu,\theta) \leq d \}$ for some $d > 2L/(1-\gamma)$. We need a density q and $\epsilon > 0$ for which
\[k((\mu^{'},\theta^{'}),(\mu, \theta)) \geq \epsilon q(\mu,\theta)\ for \ all \ (\mu^{'},\theta^{'}) \in C \ and \ (\mu, \theta) \in \Bbb{R} \times \Bbb{R}_+\]We have \[k((\mu^{'},\theta^{'}),(\mu, \theta)) = \pi(\mu|\theta,y)\pi(\theta | \mu^{'},y) \geq \pi(\mu|\theta,y) \inf_{(\mu^{'},\theta^{'}) \in C} \pi(\theta | \mu^{'},y)\]Let $IG(a,b ; x)$ denote the Inverse Gamma density at $ x>0$
\[\begin{align*} g(\theta) &=\inf_{(\mu^{'},\theta^{'}) \in C} \pi(\theta | \mu^{'},y) \\ &= \inf_{(\mu^{'},\theta^{'}) \in C} IG\left(\frac{m-1}{2},\frac{s^2}{2}+\frac{m}{2}(\mu^{'}-\bar{y})^2;\theta\right) \\ &= \left\{ \begin{array}{ll} IG(\frac{m-1}{2},\frac{s^2}{2}+\frac{md}{2} ; \theta ) & if \ \theta < \theta^* \\ IG(\frac{m-1}{2},\frac{s^2}{2} ; \theta ) & if \ \theta \geq \theta^* \end{array} \right. \end{align*}\]where $\theta^{*} = md[(m-1)\log(1+md/s^2)]^{-1}$
\[k((\mu^{'},\theta^{'}),(\mu, \theta)) \geq \pi(\mu | \theta,y)g(\theta) = \epsilon q(\mu,\theta)\]where $\epsilon = \int_{\Bbb{R}_+} g(\theta)d\theta$ and $q(\mu , \theta) = \epsilon^{-1}\pi(\mu | \theta,y)g(\theta)$
Hence the Minorization conditions hold
The highest posterior density (HPD) interval can be thought of as a horizontal line drawn across the posterior density: the line intersects the density at two points such that the area under the density between them equals 1-alpha.
The following is a class assignment from my master's program in the fall of 2019.
Let us consider the following dataset, which follows an exponential distribution with scale parameter ${\theta}$. Choose a prior for ${\theta}$ and obtain the posterior distribution, the Bayes estimator, and the 0.95 HPD interval for the parameter.
3.29, 7.53, 0.48, 2.03, 0.36, 0.07, 4.49, 1.05, 9.15,3.67, 2.22, 2.16, 4.06, 11.62, 8.26, 1.96, 9.13, 1.78, 3.81, 17.02
The density of the data model will be given by
\[f(x|\theta) = \frac{1}{\theta}e^{\frac{-x}{\theta}}\]Let us denote $\sum_{i=1}^n x_i =S_n$. The likelihood is then given by
\[L(x|\theta) = \left(\frac{1}{\theta}\right)^ne^{\frac{-S_n}{\theta}}\]Since we do not have any information about $\theta$, let us assume the non-informative prior
\[\pi{(\theta)} = \frac{1}{\theta}\]Then the posterior is given by
\[\pi{(\theta|x)} = \frac{\frac{1}{\theta} \cdot \left(\frac{1}{\theta}\right)^ne^{\frac{-S_n}{\theta}}}{\int_0^{\infty}\frac{1}{\theta} \cdot \left(\frac{1}{\theta}\right)^ne^{\frac{-S_n}{\theta}}d\theta}\] \[\pi{(\theta|x)} = \frac{S_{n}^n}{\Gamma(n)}{ \cdot \left(\frac{1}{\theta}\right)^{n+1}e^{\frac{-S_n}{\theta}}}\]This is the density of the Inverse Gamma distribution, so
\[\theta | x \sim Inv-Gamma(n,S_n)\]So the Bayes estimate, the posterior mean under squared error loss, is given by $\frac{S_n}{n-1}$
xobs <- c(3.29, 7.53, 0.48, 2.03, 0.36, 0.07, 4.49, 1.05, 9.15,3.67, 2.22,
2.16, 4.06, 11.62, 8.26, 1.96, 9.13, 1.78, 3.81, 17.02)
Bayes_Estimate = sum(xobs)/(length(xobs)-1) # Bayes Estimate
cat("Bayes Estimate of scale parameter is given by ",Bayes_Estimate)
## Bayes Estimate of scale parameter is given by 4.954737
Now the HPD interval is given by
\[\int_{\theta : \pi(\theta|X) \geq k} \pi(\theta|X)d\theta = 1-\alpha\]where $1- \alpha = 0.95$. It can be thought of as a horizontal line drawn across the posterior density: the area under the density between the two points where the density intersects this line is 0.95
Let us take a look at posterior density function
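library(invgamma)  # assumed source of dinvgamma()/pinvgamma(); the original does not name the package
s <- sum(xobs)     # S_n, the sum of the observations
l <- length(xobs)  # n, the number of observations
curve(dinvgamma(x, rate = s, shape = l), from = 0, to = 10)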
Now let us find the HPD interval. The posterior here is given by
\[\pi{(\theta|x)} = \frac{S_{n}^n}{\Gamma(n)}{ \cdot \left(\frac{1}{\theta}\right)^{n+1}e^{\frac{-S_n}{\theta}}}\]
ruler1 <- seq(2, s/(l+1), length = 3500)  # s/(l+1) is the mode of the posterior
ruler2 <- seq(s/(l+1), 8 ,length = 5000)
target = 0.95
tolerance = 0.0005
done<- FALSE
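for (i in ruler1)
{
  for (j in ruler2)
  {
    # the two endpoints must have (approximately) equal posterior density
    if (round(dinvgamma(i, rate = s, shape = l), 3) == round(dinvgamma(j, rate = s, shape = l), 3))
    {
      #print(paste(i,"and",j))
      L <- pinvgamma(i, rate = s, shape = l)
      H <- pinvgamma(j, rate = s, shape = l)
      # accept when the enclosed posterior probability is within tolerance of the target
      if (((H - L) < (target + tolerance)) & ((H - L) > (target - tolerance)))
      {
        done <- TRUE
        break
      }
    }
  }
  if (done) {break}
}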
HPD.L <- i; HPD.U <- j
print(paste(target*100, "% HPD interval:", HPD.L, "to", HPD.U))
## [1] "95 % HPD interval: 2.94588413015964 to 7.2851736061498"
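The grid search above can also be replaced by root finding. The following is a minimal Python sketch under stated assumptions, not a translation of the R code: it uses only the standard library, it relies on the fact that for integer shape n the Inverse Gamma CDF can be written as a Poisson sum, $P(\theta \leq t) = P(\mathrm{Poisson}(S_n/t) \leq n-1)$, and it finds the density level k by bisection. The function names `log_density`, `cdf`, `bisect` and `hpd_interval` are hypothetical.

```python
import math


def log_density(theta, n, s):
    """Log of the Inv-Gamma(n, s) density at theta > 0."""
    return n * math.log(s) - math.lgamma(n) - (n + 1) * math.log(theta) - s / theta


def cdf(theta, n, s):
    """P(Theta <= theta) for Inv-Gamma(n, s) with integer shape n,
    computed as P(Poisson(s / theta) <= n - 1)."""
    lam = s / theta
    term = math.exp(-lam)
    total = term
    for k in range(1, n):
        term *= lam / k
        total += term
    return total


def bisect(f, a, b, iters=100):
    """Root of f on [a, b] by bisection; f(a) and f(b) must differ in sign."""
    fa = f(a)
    for _ in range(iters):
        mid = 0.5 * (a + b)
        fm = f(mid)
        if (fm > 0) == (fa > 0):
            a, fa = mid, fm
        else:
            b = mid
    return 0.5 * (a + b)


def hpd_interval(n, s, level=0.95):
    """Endpoints of {theta : density >= k} with posterior probability `level`."""
    mode = s / (n + 1)
    peak = log_density(mode, n, s)

    def endpoints(log_k):
        # Points left and right of the mode where the log-density equals log_k.
        g = lambda t: log_density(t, n, s) - log_k
        return bisect(g, 1e-6 * mode, mode), bisect(g, mode, 1e6 * mode)

    def excess(log_k):
        lo, hi = endpoints(log_k)
        return cdf(hi, n, s) - cdf(lo, n, s) - level

    # A lower level k gives a wider interval, so excess changes sign on this range.
    log_k = bisect(excess, peak - 50.0, peak - 1e-9)
    return endpoints(log_k)


xobs = [3.29, 7.53, 0.48, 2.03, 0.36, 0.07, 4.49, 1.05, 9.15, 3.67, 2.22,
        2.16, 4.06, 11.62, 8.26, 1.96, 9.13, 1.78, 3.81, 17.02]
lo, hi = hpd_interval(len(xobs), sum(xobs))
print(f"95% HPD interval: {lo:.4f} to {hi:.4f}")
```

The result should land close to the interval found by the grid search, with small differences expected because the grid search rounds densities to three decimals and accepts coverage within a tolerance.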
When the dependent variable y is qualitative, we can represent it by an indicator variable such as
\[y = 0\ \ \ if\ female \\ y = 1 \ \ \ if \ male\]So
\[y_i = \beta_0 + \beta_1x_{i1}+ \beta_2x_{i2}+.....+ \beta_px_{ip} + \epsilon_i \ \ \ \ \ \ i = 1,2,3,........,n\]or in the matrix form we can write
\[Y = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{bmatrix} \ \ X = \begin{bmatrix} 1 & x_{1,1} & x_{1,2} & \cdots & x_{1,p}\\ 1 & x_{2,1} & x_{2,2} & \cdots & x_{2,p}\\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n,1} & x_{n,2} & \cdots & x_{n,p} \end{bmatrix} \ \ \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{bmatrix} \ \ \epsilon = \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \\ \vdots \\ \epsilon_n \end{bmatrix}\]that is
\[Y = X\beta + \epsilon\]Remember that the first column of the independent variable matrix X is $\underline{1}$, for the constant $\beta_0$
Our dependent variable y, which we have to predict, is an indicator that takes two values, so assume y follows a Bernoulli distribution
\[y_i = 1 \ with \ P(y_i = 1 ) = \pi_i \\ y_i = 0 \ with \ P(y_i = 0 ) = 1-\pi_i\]Assuming $E(\epsilon_i) = 0$,
\[E(y_i) = 1 \cdot \pi_i + 0 \cdot(1 - \pi_i) = \pi_i \\ E(Y) = X\beta = \pi\]where
\[\pi = \begin{bmatrix} \pi_{1} & \pi_{2} & \pi_{3}& \cdots & \pi_{n}\\ \end{bmatrix}^{T}\]In linear regression $\epsilon$ is assumed to follow a normal distribution, whereas here we cannot assume that, because for a given $x_i$ the error $\epsilon_i = y_i - \pi_i$ takes only two discrete values
So we have $E(y_i) =\pi_{i} = \beta_0 + \beta_1x_{i1}+ \beta_2x_{i2}+\ldots+ \beta_px_{ip}$ where $E(y_i) \in [0,1]$, which puts a bound on the expected value of y that a linear function of x cannot respect in general
In logistic regression we use the standard logistic function, which some people call the sigmoid function. It is given by
\[E(y_i) = \pi_i = \frac{1}{1+e^{-(\beta_0 + \beta_1x_{i1}+ \beta_2x_{i2}+.....+ \beta_px_{ip})}} \tag{1}\]Our main aim in logistic regression is to predict $\pi$, the Bernoulli parameter for $Y$, and we generally decide the class by whether $\pi_i$ is greater or less than 0.5
Every generalized linear model has a link function which relates the linear predictor $ \eta_i $ to the mean response $ \mu_i $. The linear predictor is the systematic component $ \eta_i = \beta_0 + \beta_1x_{i1}+ \beta_2x_{i2}+\ldots+ \beta_px_{ip} $. So if $g( . )$ is a link function then
\[g(\mu_i ) = \eta_i \ \ or \ \ \mu_i =g^{-1}(\eta_i)\]In linear regression this link is the identity link, whereas in logistic regression $ \mu_i = E(y_i) =\pi_{i} $, and the relation between $\pi_i$ and $\eta_i$ is the logistic function, so
\[g(\pi) = X\beta\]Equation $(1)$ gives $\pi$ in terms of $X\beta$, and we can invert it to get the link function
\[\pi = \frac{\exp(X\beta)}{1+\exp(X\beta)} \\ X\beta=\eta = \ln\left(\frac{\pi}{1-\pi}\right)\]where $\frac{\pi}{1-\pi}$ is the odds and its log is known as the log-odds; this transformation is the logit transformation.
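The logistic function and the logit link are inverses of each other, which a few lines of code make concrete. This is a minimal sketch; the coefficient and feature values are assumptions chosen only to produce a number.

```python
import math


def sigmoid(eta):
    """Standard logistic function: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def logit(p):
    """Log-odds: the link function, inverse of the logistic function."""
    return math.log(p / (1.0 - p))


# Hypothetical coefficients (beta_0, beta_1, beta_2) and one observation (x_1, x_2).
beta = [-1.0, 0.8, 0.5]
x = [2.0, 1.0]

eta = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))  # linear predictor
pi_i = sigmoid(eta)                                         # mean response
print("eta =", eta, " pi =", pi_i, " logit(pi) =", logit(pi_i))
```

Applying `logit` to the probability recovers the linear predictor, which is exactly the statement $g(\pi) = X\beta$.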
The maximum likelihood estimate of $\beta$ has no closed form, so we use an iterative method such as the gradient-descent algorithm to calculate the parameters
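A minimal sketch of gradient descent for logistic regression follows. It minimizes the average negative log-likelihood (log-loss), whose gradient with respect to $\beta_j$ is the average of $(\pi_i - y_i)x_{ij}$. The toy data, the learning rate and the number of steps are assumptions chosen for illustration, not tuned values.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(beta, X, y):
    """Average negative Bernoulli log-likelihood."""
    total = 0.0
    for row, yi in zip(X, y):
        p = sigmoid(sum(b * x for b, x in zip(beta, row)))
        total -= yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total / len(y)


def fit_logistic(X, y, lr=0.1, steps=2000):
    """Plain batch gradient descent, starting from beta = 0."""
    beta = [0.0] * len(X[0])
    for _ in range(steps):
        grad = [0.0] * len(beta)
        for row, yi in zip(X, y):
            p = sigmoid(sum(b * x for b, x in zip(beta, row)))
            for j, xj in enumerate(row):
                grad[j] += (p - yi) * xj
        beta = [b - lr * g / len(y) for b, g in zip(beta, grad)]
    return beta


# Toy data with one feature; the first column of X is the constant 1 for beta_0.
xs = [-2.0, -1.5, -1.0, -0.5, 0.3, -0.3, 0.5, 1.0, 1.5, 2.0]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
X = [[1.0, v] for v in xs]

beta = fit_logistic(X, y)
print("beta =", beta)
print("loss at zero =", log_loss([0.0, 0.0], X, y), " loss at fit =", log_loss(beta, X, y))
print("P(y=1 | x=2) =", sigmoid(beta[0] + 2.0 * beta[1]))
```

The classes overlap slightly on purpose: with perfectly separable data the likelihood has no finite maximizer and the coefficients would keep growing.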
There are many more examples of machine learning; here we are going to discuss supervised machine learning. The data has two parts, features and labels. Features are the input to the model: for example, if we give the model the size of a tumor, it will tell us whether the tumor is malignant or benign. The prediction, malignant or benign, is the label. Some types of data do not contain labels; grouping related news articles, for example, does not require any. In supervised learning, however, we are concerned with labeled data, so loosely we can say that machine learning with labels is known as supervised learning.
For further understanding, we are going to use the iris dataset, which has 4 features, Sepal.Length, Sepal.Width, Petal.Length and Petal.Width, and one target variable, Species.
This is a long dataset with labels virginica, setosa and versicolor; however, we are showing only part of the data, so in the target column we see only setosa.
The realizations of the target variable are known as labels; however, most data scientists use the two terms interchangeably. Predictor variable and feature mean the same thing and are also known as the independent variable, while the target variable is known as the dependent variable.
Classification is the machine learning task of assigning things to classes, such as deciding whether a mail is spam or not, or, in the iris data, whether the plant is virginica, setosa or versicolor.
First of all, we are going to load our dataset using the following code, which also imports pandas and numpy under their usual aliases.
from sklearn import datasets
import pandas as pd
import numpy as np
iris = datasets.load_iris()
type(iris)
sklearn.utils.Bunch
We can see that the iris dataset is a Bunch. A Bunch is a data type that holds key-value pairs, and we can look at the keys using the following code
print(iris.keys())
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])
type(iris.data),type(iris.target),type(iris.target_names),type(iris.DESCR),type(iris.feature_names)
(numpy.ndarray, numpy.ndarray, numpy.ndarray, str, list)
We can see that iris.data, iris.target and iris.target_names are NumPy arrays, DESCR is a string and feature_names is a list. If we check iris.data.shape and iris.target.shape, we can see that the data has 150 rows and 4 columns, which are our features; we can take a look at the data with the command print(iris.data). Similarly, the target variable has shape (150,), that is one label per row as we expected, and we can look at it using print(iris.target).
However, our target variable is encoded, where 0 stands for setosa, 1 for versicolor and 2 for virginica.
The names can be seen using iris.target_names and the encoding is also described in iris.DESCR. Let us store iris.data in variable X and iris.target in y
X = iris.data
y = iris.target
Let us construct a dataframe from X, with iris.feature_names as the header, and show what our dataframe actually looks like using the head() method
df = pd.DataFrame(X , columns = iris.feature_names)
df.head()
| sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 |
1 | 4.9 | 3.0 | 1.4 | 0.2 |
2 | 4.7 | 3.2 | 1.3 | 0.2 |
3 | 4.6 | 3.1 | 1.5 | 0.2 |
4 | 5.0 | 3.6 | 1.4 | 0.2 |
Now let us train our first model using k-Nearest Neighbors (kNN). The idea is quite simple. Suppose first that there are only two features in our dataset; then we can plot each observation (that is, a single row of the dataset) as a point on the 2D plane, with the first feature on the x-axis and the second feature on the y-axis, and let the color of the point, red or blue, be its label. Now suppose we get a new observation with the same two features but an unknown label. We can plot it on the same plane, but we cannot determine its color since it is not labeled, so we have to predict the label. If we take the 3 nearest observations on the plane, this is kNN with k=3, and we take the majority vote of those 3 nearest neighbors: if 2 of them are blue, our prediction is blue. The prediction may change with k: suppose k=5 and out of the 5 nearest neighbors 3 are red and 2 are blue; then we predict red.
This algorithm extends to n features, where n is greater than 2, by treating the observations as points in n-dimensional Euclidean space and then computing the nearest neighbors.
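The majority-vote rule described above can be written from scratch in a few lines. This is a minimal sketch for illustration, not how scikit-learn implements it: the function name `knn_predict` and the toy points are assumptions, with the points placed so that the prediction flips between k=3 and k=5 exactly as in the description.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k):
    """Predict the label of `query` by majority vote among its k nearest
    training points (Euclidean distance, any number of features)."""
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Toy 2D data: distances from the origin are 1.0, 1.1, 1.2, 1.3, 1.4 and about 4.24.
points = [(1.0, 0.0), (0.0, 1.1), (-1.2, 0.0), (0.0, -1.3), (1.4, 0.0), (3.0, 3.0)]
labels = ["blue", "blue", "red", "red", "red", "blue"]
query = (0.0, 0.0)

print(knn_predict(points, labels, query, k=3))  # blue: neighbors are blue, blue, red
print(knn_predict(points, labels, query, k=5))  # red: neighbors are 2 blue, 3 red
```

Because `math.dist` accepts points of any dimension, the same function works unchanged for the four iris features.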
In scikit-learn there are two important methods: .fit, which trains the model, and .predict, which predicts labels using a trained model. To use kNN we import the classifier with from sklearn.neighbors import KNeighborsClassifier, then initialize it and set the value of k, say to 5, using KNeighborsClassifier(n_neighbors=5). Then we fit the data using the .fit method.
from sklearn.neighbors import KNeighborsClassifier
knn_model = KNeighborsClassifier(n_neighbors=5) # Storing the model in variable knn_model
knn_model.fit(X,y) # Fitting, i.e. training, the model
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=None, n_neighbors=5, p=2,
weights='uniform')
Now we have trained our model and stored it in the variable knn_model. Given the sepal length, sepal width, petal length and petal width of a flower, we can predict its species. Let us predict for 4.4, 3.8, 3.7, 0.9 as sepal length, sepal width, petal length and petal width using the .predict method
knn_model.predict([[4.4,3.8,3.7,0.9]])
array([1])
Hence we can see that we have predicted 1, which represents versicolor. Similarly, we can make many predictions at once by creating a NumPy array and passing it as an argument to knn_model.predict(). We must take care that the number of columns equals the number of features used to train the model. Let us see an example
array = np.array([[4.4,3.8,3.7,0.9],
[3.2,5.7,2.0,1.3],
[5.5,1.9,2.8,4.7],
[3.2,9.7,6.2,1.0]])
prediction=knn_model.predict(array)
print(prediction)
[1 0 2 2]
Now we can get the decoded species names by passing the prediction to iris.target_names as an index
iris.target_names[prediction]
array(['versicolor', 'setosa', 'virginica', 'virginica'], dtype='<U10')
Now we have trained our model, and we must measure its performance to get an idea of how good or bad it is. There are various metrics for this, such as accuracy, precision and F-measure. One question is which data to use for calculating performance. The data used for training will give too optimistic a metric, since the model may be good only on the data it was trained on, whereas our main goal in machine learning is to train a model that predicts the labels of new data. So we need to calculate our metric on new data, but that is not possible directly since new data is not labeled. The typical procedure for a data scientist is therefore to split the data into train and test sets, where the train set is used for training and the test set is used for calculating metrics such as accuracy. We are going to use accuracy here, which is equal to the number of correct predictions divided by the total number of predictions. Suppose we have 100 observations in the test set and our model predicts 75 of them correctly; then 75 predictions are right and 25 are wrong, so the accuracy is 75/100 = 0.75.
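As a small illustration of the accuracy calculation, here is a sketch in plain Python. The helper name accuracy and the label lists are made up for this example; they simply reproduce the 100-observation case from the text.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# 100 test observations, all truly of class 1 (made-up data)
y_true = [1] * 100
# The model gets the first 75 right and the last 25 wrong
y_pred = [1] * 75 + [0] * 25

print(accuracy(y_true, y_pred))  # 75 correct out of 100
```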
To split the dataset, first of all, we import train_test_split from sklearn.model_selection. The train_test_split() function takes the feature data as its first argument and the labels as its second, that is train_test_split(X,y). This works fine on its own, but the function accepts more arguments for finer control, such as test_size = 0.2, which sets the fraction of the data held out for the test set, and stratify, which keeps the class proportions the same in both sets.
Let us talk about the output of train_test_split. It returns four arrays: the features of the train set, the features of the test set, the labels of the train set and the labels of the test set. Let us split our dataset and train our model on the training set
from sklearn.model_selection import train_test_split
X_train , X_test , y_train , y_test = train_test_split(X,y,test_size = 0.25 , stratify=y)
knn_model.fit(X_train,y_train)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=None, n_neighbors=5, p=2,
weights='uniform')
Now we will use the trained model to predict the labels of test set X_test
X_test_prediction = knn_model.predict(X_test)
X_test_prediction
array([2, 1, 0, 2, 0, 1, 0, 0, 1, 2, 1, 0, 1, 2, 1, 0, 2, 0, 1, 0, 1, 2,
2, 1, 0, 1, 0, 1, 0, 2, 1, 1, 0, 2, 1, 2, 0, 2])
Further, we use the .score method to calculate accuracy. It takes the test set features and the test set labels as arguments
knn_model.score(X_test,y_test)
0.9736842105263158
Hence we can see that we have about 97% accuracy. Here we have used k=5, but what happens if we increase k? A kNN model creates a decision boundary, which divides the whole Euclidean space into regions, one set of regions per class. In our example kNN divides the 4-dimensional Euclidean space (4-dimensional because there are four features) among the 3 classes, and the label of any new observation is decided by the region it falls in. As we increase k, the decision boundary becomes smoother.
Photo credit: An Introduction to Statistical Learning with Applications in R (Available for FREE!!! )
As we can see, for k=1 the decision boundary, drawn in black, follows the training data too closely, while for k=100 it is smoothed too much. If k is large the decision boundary is smoother, hence a less complex model. For small k the decision boundary is less smooth and the model is more complex and more sensitive to noise in the data; it may give good predictions on the training data but fail on new data, which is known as overfitting. If we increase k too much, the decision boundary is smoothed too much (it tends toward a straight line) and the model may perform poorly on both the train and the test set, as can be seen in the figure for k=100; this is commonly known as underfitting. So we must choose k such that the model is neither underfitted nor overfitted, that is, k neither too large nor too small. For k=10 we get the following
Photo credit: An Introduction to Statistical Learning with Applications in R (Available for FREE!!! )
Accuracy is not always a good metric for measuring performance in classification problems. Suppose we have transaction data from a bank and have to create a model which classifies whether a transaction is fraudulent or not. Usually most transactions are non-fraudulent, let us say 95%. Data like this, where one class is far more frequent than the other, is known as imbalanced data, and accuracy does not work well on it: a model that labels every transaction as non-fraudulent already reaches 95% accuracy without detecting a single fraud. So there are other metrics to measure the performance of a model, and they can be obtained from a well-known matrix called the confusion matrix
In binary classification there are two classes, positive and negative. We call the class we are interested in the positive class: if we want to model transaction fraud, we are interested in the fraudulent transactions, so fraud is the positive class and non-fraud is the negative class. Various metrics can be calculated from the confusion matrix by the following formulas
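The standard definitions, written in terms of the four confusion matrix counts (true positives TP, false positives FP, false negatives FN and true negatives TN), can be sketched as follows. The function name binary_metrics and the toy labels are made up for this illustration; the zero-division guards are a convenience choice of this sketch.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 computed from the confusion matrix counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0  # of the predicted positives, how many are real
    recall = tp / (tp + fn) if tp + fn else 0.0     # of the real positives, how many were found
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# 100 transactions, 5 of them fraudulent (label 1); made-up data
y_true = [1] * 5 + [0] * 95

# A model that never predicts fraud: high accuracy, but recall is zero
always_negative = binary_metrics(y_true, [0] * 100)
print(always_negative)

# A model that finds 4 of the 5 frauds and raises 2 false alarms
y_pred = [1, 1, 1, 1, 0] + [1, 1] + [0] * 93
detector = binary_metrics(y_true, y_pred)
print(detector)
```

The first model shows why accuracy alone is misleading on imbalanced data, while precision and recall expose the difference between the two models.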
The F1 score can also be interpreted as the harmonic mean of precision and recall, and is given by
\[F1 \ Score = 2 \cdot \frac{precision \cdot recall}{precision + recall}\]
The confusion matrix can be calculated as follows
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test,X_test_prediction))
[[13 0 0]
[ 0 13 0]
[ 0 1 11]]
Here we get a $3\times 3$ matrix because we have 3 classes. We are not limited to only two classes, positive and negative; here we have three label classes, i.e. ‘setosa’, ‘versicolor’ and ‘virginica’. To get the performance metrics we run the following code
from sklearn.metrics import classification_report
print(classification_report(y_test,X_test_prediction))
precision recall f1-score support
0 1.00 1.00 1.00 13
1 0.93 1.00 0.96 13
2 1.00 0.92 0.96 12
accuracy 0.97 38
macro avg 0.98 0.97 0.97 38
weighted avg 0.98 0.97 0.97 38
In regression the target variable is continuous, such as the price of a mobile phone or a temperature. To get started let us take the Boston housing dataset, which ships with older versions of the sklearn module (load_boston was deprecated in scikit-learn 1.0 and removed in 1.2)
from sklearn import datasets
import pandas as pd
import numpy as np
boston = datasets.load_boston()
Let us take a look at what we have imported into the boston variable using the .keys() method
boston.keys()
dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])
Now we have the data and the feature names, so we can create a dataframe from them and take a look at it with the head method, as we did in classification
X = boston.data
y = boston.target
df = pd.DataFrame(X , columns = boston.feature_names)
df.head()
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 |
Before training the model let us split our data. We cannot use the stratify argument here because our target variable is not categorical.
X_train , X_test , y_train , y_test = train_test_split(X,y,test_size = 0.25 )
When we assume the target variable y is a linear function of the columns of X, that is, a linear function of the features, the model is known as linear regression. It can be represented as
\[\hat{y} = \sum_{i=0}^p a_{i}x^{i}\]
Here $x^i$ denotes the i-th feature (a superscript index, not a power), with $x^0 = 1$ so that $a_0$ is the constant term. With a single feature this is the equation of a line. The $a_i$ are known as the parameters of the linear regression.
Now our main aim is to set the $a_i$ such that the predicted value of y, generally represented by $\hat{y}$, is as near as possible to the actual value of y. To measure the difference between predicted and actual values we use loss functions. These are functions which give 0 when the predicted value equals the actual value and grow as the two move apart. One of the most common loss functions is the squared error loss, given by
\[Loss(\hat{y} ;y)= \sum_{i=1}^n(y_i - \hat{y}_i)^{2}\]
where $\hat{y}_i$ is the prediction for the i-th observation. So our problem is to choose the parameters that minimize the loss. For the squared error loss, the resulting estimates of the $a_i$ are known as least squares estimates. Different loss functions give different estimates, but least squares estimates are the most widely used, so these are the ones we discuss
p is the number of features, hence there are (p+1) parameters, where the extra 1 comes from the constant term $a_0$, which also has to be estimated, and n is the number of observations, that is, the number of rows in the dataset
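For a single feature (p = 1) the least squares estimates have a simple closed form, which makes a handy sketch of what minimizing the squared error loss means. The function names and the toy data below are made up for this illustration; Scikit Learn's LinearRegression handles the general case with many features.

```python
def least_squares_line(x, y):
    """Least squares estimates (a0, a1) for the model y_hat = a0 + a1 * x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    a1 = s_xy / s_xx             # slope
    a0 = y_mean - a1 * x_mean    # constant term
    return a0, a1

def squared_error_loss(x, y, a0, a1):
    """Sum over all observations of (y_i - y_hat_i)^2."""
    return sum((yi - (a0 + a1 * xi)) ** 2 for xi, yi in zip(x, y))

# Toy data lying exactly on the line y = 1 + 2x (made up)
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]

a0, a1 = least_squares_line(x, y)
print(a0, a1)
print(squared_error_loss(x, y, a0, a1))      # loss at the least squares estimates
print(squared_error_loss(x, y, 0.0, 2.0))    # any other parameters give a larger loss
```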
Now to fit the model , we will run the following code
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train,y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
Now we can predict using the .predict method as follows
prediction=model.predict(X_test)
As we saw in the classification section, accuracy is the metric we used to measure the performance of a classifier. For regression we cannot use accuracy. One of the most used metrics for regression is $R^2$, which is defined as the proportion of variability in Y that can be explained using X. It can be calculated with the following code
model.score(X_test,y_test)
0.736994702163782
Generally $R^2$ ranges from 0 to 1. When $R^2$ is near 1 the model fits well, and when it is near 0 the fitted model is not good. On a test set $R^2$ can even be negative, which means the model predicts worse than always predicting the mean of y
Cross-validation is a method that reduces our dependency on how the data happens to be split into train and test sets. With a single split, our performance metrics may make the model look good only by chance, because we do not use all the data to calculate them. To remove the dependency on a single train test split we use cross-validation, or more precisely k-fold cross-validation, where k is a parameter and a positive integer. Suppose k=5, which gives 5-fold cross-validation. It divides the observations in our dataset into 5 groups, commonly known as folds. We hold out the first fold as a test set, merge all the other folds to create the train set, and calculate the performance metric we are interested in. Then we do the same again, holding out the second fold as the test set and using the remaining folds as the train set; this gives the performance metric for the second split. In this way k-fold cross-validation calculates the performance metric k times, once for each of the k splits. After that we can calculate any statistic of interest over these k values, such as the mean or the median
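The fold-by-fold procedure can be sketched in plain Python. This is a simplified illustration, not Scikit Learn's implementation: the helper k_fold_indices, the toy targets and the "model" (which just predicts the mean of the training targets) are all made up for this example, and the folds are taken in order without shuffling.

```python
def k_fold_indices(n_samples, k):
    """Split the indices 0..n_samples-1 into k (train, test) pairs, one per fold."""
    indices = list(range(n_samples))
    # The first (n_samples % k) folds get one extra observation
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    splits = []
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]                  # the held-out fold
        train_idx = indices[:start] + indices[start + size:]    # all other folds merged
        splits.append((train_idx, test_idx))
        start += size
    return splits

# Toy targets (made up)
y = [3.0, 5.0, 4.0, 6.0, 8.0, 7.0, 9.0, 10.0, 12.0, 11.0]

scores = []
for train_idx, test_idx in k_fold_indices(len(y), k=5):
    # "Train": the stand-in model predicts the mean of the training targets
    train_mean = sum(y[i] for i in train_idx) / len(train_idx)
    # "Test": mean squared error on the held-out fold
    fold_mse = sum((y[i] - train_mean) ** 2 for i in test_idx) / len(test_idx)
    scores.append(fold_mse)

print(scores)                       # one metric per split
print(sum(scores) / len(scores))    # mean over the 5 splits
```

Every observation is used for testing exactly once, which is what makes the averaged metric less dependent on any single split.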
k-fold cross-validation is computationally expensive, since we have to do the whole process of training, prediction and metric calculation k times. It is done as follows: import cross_val_score from sklearn.model_selection; call cross_val_score with the model, the features array, the labels and the number of folds (cv=5 for 5 folds) as arguments, and store the result in a variable; then use np.mean() to get the mean
from sklearn.model_selection import cross_val_score #Importing the function
cross_validation_result = cross_val_score(model,X,y,cv=5) #Running 5-fold cross-validation
np.mean(cross_validation_result)
0.3532759243958772
Shrinkage is also known as regularization. In general we estimate the parameters from the data, but sometimes the estimates are too large and lead to higher variance, so it is advisable to shrink the parameters toward 0. This can be done in various ways; two well-known ones are Ridge Regression and Lasso
For Ridge Regression we just modify our general loss function as follows
\[Loss(y \ ; \hat{y}) = \sum_{i=1}^n (y_i-\hat{y}_i)^2 + \alpha \sum_{i=1}^p a_i^2\]
where $\alpha \geq 0$ is a tuning parameter and $\alpha \sum_{i=1}^p a_i^2$ is known as the shrinkage penalty. Note that there is no term for $a_0$ in the shrinkage penalty. Unlike the least squares estimate, here we get a different set of parameters for each value of the tuning parameter. A tuning parameter equal to zero gives back the least squares estimate, which may have a greater chance of overfitting, while a very large tuning parameter penalizes the parameters too much, which can lead to underfitting. So we have to choose the tuning parameter such that it optimizes our model
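With a single feature the ridge estimate has a closed form, which gives a quick way to see the shrinkage. The sketch below assumes the loss above with the constant term left unpenalized; the function name and the toy data are made up for this illustration.

```python
def ridge_line(x, y, alpha):
    """Ridge estimates (a0, a1) for y_hat = a0 + a1 * x, with the penalty alpha * a1^2."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    a1 = s_xy / (s_xx + alpha)   # alpha in the denominator pulls the slope toward 0
    a0 = y_mean - a1 * x_mean    # the constant term is not penalized
    return a0, a1

# Toy data lying exactly on the line y = 1 + 2x (made up)
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]

for alpha in [0.0, 5.0, 95.0]:
    a0, a1 = ridge_line(x, y, alpha)
    print(alpha, a0, a1)
```

With alpha = 0 the slope is the least squares estimate, and it moves toward 0 as alpha grows, without ever reaching it exactly.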
To do Ridge Regression, import Ridge from the module sklearn.linear_model and initialize the Ridge() class, passing the tuning parameter to the alpha argument
from sklearn.linear_model import Ridge
ridge_model = Ridge(alpha= 0.9 )
ridge_model.fit(X_train,y_train)
ridge_model.score(X_test,y_test)
0.7345197081669743
Ridge Regression has the demerit that it shrinks the parameters toward 0 but never sets them exactly equal to 0. There may be some features which do not explain any variance in the label, and their coefficients should be set to zero to make the model easier to interpret. For Lasso, we replace the squares of the parameters in the Ridge loss with their absolute values
\[Loss(y \ ; \hat{y}) = \sum_{i=1}^n (y_i-\hat{y}_i)^2 + \alpha \sum_{i=1}^p |{a_i}|\]
Lasso shrinks the coefficients of the less important features all the way to 0
Lasso Regression uses code very similar to Ridge Regression
from sklearn.linear_model import Lasso
lasso_model = Lasso(alpha= 10)
lasso_model.fit(X_train,y_train)
lasso_model.score(X_test,y_test)
0.7257977554026047
Logistic Regression, despite its name, is mostly used for classification problems. It estimates the probability that a given observation belongs to a particular class; if that probability is greater than 0.5, or 50%, the model predicts that the observation belongs to that class. It estimates the probability using the following function, where the p on the left is the probability and the p in the sum is the number of features
\[p = \sigma\left(\sum_{i=0}^p a_ix^i\right)= \frac{1}{1+ e^{-\sum_{i=0}^p a_ix^i}}\]
But we will not go into the theory too much, and will focus on practical use. Using Logistic Regression is similar to the work we have done earlier: import the class, import the data, split the data, train the model and then test it using performance metrics. Let us do this on the breast cancer data, which is already available in the sklearn module
from sklearn import datasets
import pandas as pd
import numpy as np
bcancer = datasets.load_breast_cancer() #Loading Data
from sklearn.linear_model import LogisticRegression #Importing class for logistic regression
LogReg_MODEL = LogisticRegression() #Initializing Logistic Regression class
X = bcancer.data
y = bcancer.target
X_train , X_test , y_train , y_test = train_test_split(X,y,test_size = 0.25 , stratify=y) #Splitting Data
LogReg_MODEL.fit(X_train,y_train) #Training the model
AccuracyLogReg = LogReg_MODEL.score(X_test,y_test)
print(AccuracyLogReg)
0.958041958041958
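The logistic function that turns the linear score into a probability, together with the 0.5 rule for choosing the class, can be sketched as follows. The coefficients and feature values are made up for this illustration and have nothing to do with the model fitted above.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(coefficients, features):
    """Probability of the positive class; coefficients[0] is the constant term a_0."""
    linear_score = coefficients[0] + sum(
        a * x for a, x in zip(coefficients[1:], features)
    )
    return sigmoid(linear_score)

def predict_label(coefficients, features, threshold=0.5):
    """Predict 1 when the probability exceeds the threshold, otherwise 0."""
    return 1 if predict_probability(coefficients, features) > threshold else 0

# Hypothetical coefficients (a_0, a_1, a_2) and two hypothetical observations
coefficients = [-1.0, 2.0, 0.5]
print(predict_probability(coefficients, [1.0, 2.0]), predict_label(coefficients, [1.0, 2.0]))
print(predict_probability(coefficients, [0.0, 0.0]), predict_label(coefficients, [0.0, 0.0]))
```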
ROC Curve is short form for receiver operating characteristic curve.
Threshold
We generally take the threshold to be 0.5. In kNN this means that we predict a class when more than 0.5 of the neighbors carry that class label. Suppose we fit kNN with k=100 and have two class labels, red and blue; then we predict red for an observation if more than 50 of its neighbors are red. That 50, the number of red neighbors needed to classify the observation as red, is 0.5$\times$100, so here the threshold is 0.5. Similarly, in logistic regression p=0.5 is the usual threshold
True Positive Rate and False Positive Rate (TPR and FPR)
True Positive Rate is also known as Recall, and False Positive Rate is given by
\[FPR = \frac{FP}{FP+TN}\]
A model does not always perform best when the threshold is 0.5; sometimes it performs better with a different threshold. To study this we use the ROC curve, which is a graph of TPR against FPR: each threshold gives one (FPR, TPR) point, and sweeping the threshold traces out the curve
To know how good our model is we use the area under the ROC curve (AUC) as a performance metric. Let us say we have a perfectly classifying model: then there is a threshold at which TPR equals 1 and FPR equals 0, and in that case the area under the curve equals 1. So we can use the ROC AUC as a performance metric
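To make the threshold sweep concrete, here is a small sketch in plain Python that computes the (FPR, TPR) points and the area under them with the trapezoid rule. The function names and the toy scores are made up for this illustration, and it assumes binary labels 0 and 1 with at least one observation of each class.

```python
def roc_points(y_true, y_score):
    """(FPR, TPR) points obtained by sweeping the threshold from high to low."""
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing is predicted positive
    for threshold in sorted(set(y_score), reverse=True):
        tp = sum(1 for t, s in zip(y_true, y_score) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, y_score) if s >= threshold and t == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def area_under_curve(points):
    """Trapezoid rule over consecutive (FPR, TPR) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Toy labels and scores (made up)
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
points = roc_points(y_true, y_score)
print(points)
print(area_under_curve(points))

# A perfect ranking: every positive scores higher than every negative
perfect = roc_points([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])
print(area_under_curve(perfect))
```

In the first example one negative is ranked above one positive, so the area is below 1; in the second the ranking is perfect and the area is exactly 1.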
Now to create the ROC curve we import roc_curve from sklearn.metrics and call the roc_curve() function with the following two arguments
y_true array, shape = [n_samples]
True binary labels. If labels are not either {-1, 1} or {0, 1}, then pos_label should be explicitly given.
y_score array, shape = [n_samples]
Target scores, can either be probability estimates of the positive class, confidence values, or non-thresholded measure of decisions (as returned by “decision_function” on some classifiers).
To calculate y_score we use probability estimates, which we can get by using the .predict_proba() method on the test set. It outputs an array with two columns: the first column is the estimated probability of class 0 and the second column is the estimated probability of class 1, the positive class. The second column is our y_score, so we take only that column with [:,1]
roc_curve returns three outputs, which we store in the variables fpr, tpr and thresholds. We then use plt.plot to plot the ROC curve
from sklearn.metrics import roc_curve
y_score = LogReg_MODEL.predict_proba(X_test)
y_score = y_score[:,1] #Subsetting only the second column
fpr, tpr, thresholds = roc_curve(y_test, y_score)
# Now to plot the ROC curve
import matplotlib.pyplot as plt
plt.plot(fpr, tpr, linewidth=1)
plt.plot([0, 1], [0, 1], 'k--') # to plot the dashed diagonal
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate or Recall')
plt.show() # to show the plot
Now we want a performance metric for the model, and that is the AUC. To calculate it we just need to import roc_auc_score and pass it the same arguments we passed to roc_curve
from sklearn.metrics import roc_auc_score
print(roc_auc_score(y_test, y_score))
0.9903563941299791
Hyperparameters are the parameters of the learning algorithm itself, such as the value k in the k-Nearest Neighbors model or the tuning parameter in ridge and lasso regression. For a better model we have to tune the hyperparameters to their best setting. There is no clear-cut rule for hyperparameter tuning. One philosophy is to pick hyperparameter values at random, train and test with each, and choose the one that performs best. Manually adjusting hyperparameters and then redoing all the training and testing is a tedious job, so Scikit Learn has GridSearchCV to help us.
GridSearchCV uses cross-validation so that the hyperparameter selection is not affected by a single train test split. The class GridSearchCV takes the following arguments
param_grid: a dictionary or a list of dictionaries, holding the values of the hyperparameters we want to try
cv: the number of folds for cross-validation
Let us tune k for the kNN model, using the iris dataset
from sklearn.model_selection import GridSearchCV
knn_model = KNeighborsClassifier()
param_grid = {'n_neighbors' : [1,2,3,4,5,6,7,8,9,10,20,30,40,50,60,70,80,90,100]}
X = iris.data
y = iris.target
knn_modelGridSearch=GridSearchCV(knn_model,param_grid,cv=10)
knn_modelGridSearch.fit(X,y)
print(knn_modelGridSearch.best_score_ , knn_modelGridSearch.best_params_)
0.9800000000000001 {'n_neighbors': 6}
Here we can see, we get best results with 6 nearest neighbors.
Now here it ends, Happy Learning
To get started we have to download the Anaconda distribution of Python, which automatically installs Jupyter Notebook, and then install TensorFlow; to do this, read our article here
You have to press Shift+Enter to run a cell in Jupyter Notebook
We will use the MNIST dataset, which was developed by Yann LeCun (Courant Institute, NYU), Corinna Cortes (Google Labs, New York) and Christopher J.C. Burges (Microsoft Research, Redmond). The dataset contains 60,000 labeled images of handwritten digits for training and 10,000 images for testing. MNIST is already divided into train and test sets, so we do not have to take care of that.
First of all we have to import tensorflow, with the alias tf
import tensorflow as tf
Now we will import the dataset and store it in four variables
x_train: the training images
y_train: the labels of the training images
x_test: the testing images
y_test: the labels of the testing images
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
Now let us take a look at first image of the handwritten digit
from matplotlib import pyplot
pyplot.imshow(x_train[0])
pyplot.show()
Now we will divide x_train and x_test by 255. Our images are 8-bit grayscale, so each pixel can take any value from 0 to 255, and neural networks work better with inputs in the range 0 to 1. So to normalize our dataset to the range 0 to 1 we divide both the train and test datasets by 255
x_train, x_test = x_train / 255.0, x_test / 255.0
Now we need to create a model. We will use tf.keras.models.Sequential to create it, with five layers from the module tf.keras.layers. The layers are as follows
tf.keras.layers.Flatten: it flattens our data. Our image is 28 $\times$ 28 pixels, and this layer converts it into a vector of 784 values. It takes the argument input_shape, a tuple that defines the shape of our input data
tf.keras.layers.Dense: a fully connected layer with a number of units, here 128, and an activation function, here relu
tf.keras.layers.Dropout: during training this layer drops each input with a probability of rate (here 0.2) and multiplies each non-dropped input by $\frac{1}{1-rate}$
tf.keras.layers.Dense: similar to the second layer, here with 10 units, one per digit, and no activation
tf.keras.layers.Softmax: it is used because the output of the dense layer is a vector of logits (unnormalized log-probabilities), and the softmax function maps these to probabilities
model = tf.keras.models.Sequential([
tf.keras.layers.Flatten(input_shape=(28, 28)),
tf.keras.layers.Dense(128, activation='relu'),
tf.keras.layers.Dropout(0.2),
tf.keras.layers.Dense(10),
tf.keras.layers.Softmax()
])
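To see what the final softmax step does to the 10 numbers coming out of the last dense layer, here is a minimal sketch in plain Python. The logits are made up for this illustration; subtracting the maximum before exponentiating is a common numerical-stability choice of this sketch and does not change the result.

```python
import math

def softmax(logits):
    """Map a list of real-valued scores to probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]  # shift by the max to avoid overflow
    total = sum(exps)
    return [e / total for e in exps]

# Ten hypothetical logits, one per digit; digit 3 has the largest score
logits = [0.2, -1.0, 0.5, 3.0, 0.0, -0.5, 1.0, 0.1, -2.0, 0.3]
probabilities = softmax(logits)
print(probabilities)
print(sum(probabilities))

# Equal logits give equal probabilities, 0.1 for each of the ten digits
uniform = softmax([0.0] * 10)
print(uniform)
```

The largest logit always receives the largest probability, so the predicted digit is the same before and after the softmax.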
So for an input, i.e. a 28 $\times$ 28 image, the model gives us an array of 10 floating point numbers: the probabilities of the ten digits, computed by the softmax layer from the output of the last dense layer. Let us get an output without training the model
predictions = model(x_train[:1]).numpy()
predictions
££ array([[0.09967916, 0.09987953, 0.09993076, 0.10024416, 0.10007039,
££ 0.10004147, 0.10008495, 0.09998867, 0.10009976, 0.09998112]],
££ dtype=float32)
Now we can check our model using model.summary()
model.summary()
££ Model: "sequential_5"
££ _________________________________________________________________
££ Layer (type) Output Shape Param #
££ =================================================================
££ flatten_7 (Flatten) (None, 784) 0
££ _________________________________________________________________
££ dense_13 (Dense) (None, 128) 100480
££ _________________________________________________________________
££ dropout_7 (Dropout) (None, 128) 0
££ _________________________________________________________________
££ dense_14 (Dense) (None, 10) 1290
££ _________________________________________________________________
££ softmax_1 (Softmax) (None, 10) 0
££ =================================================================
££ Total params: 101,770
££ Trainable params: 101,770
££ Non-trainable params: 0
££ _________________________________________________________________
Now we will define a loss function. We should choose the loss function such that our loss is large if the model predicts the wrong label. TensorFlow has a lot of built-in loss functions; here we will use sparse categorical cross-entropy. For a detailed description check here
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
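For a single image, sparse categorical cross-entropy is the negative logarithm of the probability the model assigns to the true digit. The sketch below shows only this per-example quantity in plain Python, with made-up probability vectors; the Keras loss additionally averages it over the examples in a batch.

```python
import math

def sparse_categorical_crossentropy(true_label, probabilities):
    """Negative log of the probability assigned to the true class (an integer label)."""
    return -math.log(probabilities[true_label])

# An untrained model spreads its probability almost evenly over the 10 digits
untrained = [0.1] * 10
print(sparse_categorical_crossentropy(5, untrained))

# A confident and correct prediction gives a loss close to 0
confident = [0.001] * 10
confident[5] = 0.991
print(sparse_categorical_crossentropy(5, confident))

# A confident but wrong prediction (true digit is 3) gives a large loss
print(sparse_categorical_crossentropy(3, confident))
```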
Now our main target is to set the trainable params such that we get minimum loss. We use accuracy to measure the performance; it is calculated by dividing the number of correct class predictions by the total number of predictions. We will use adam as the optimizer
model.compile(optimizer='adam',
loss=loss_fn,
metrics=['accuracy'])
Now let us train our model
model.fit(x_train, y_train, epochs=5)
££ Epoch 1/5
££ 1875/1875 [==============================] - 5s 3ms/step - loss: 0.1909 - accuracy: 0.9445
££ Epoch 2/5
££ 1875/1875 [==============================] - 5s 3ms/step - loss: 0.1854 - accuracy: 0.9465
££ Epoch 3/5
££ 1875/1875 [==============================] - 5s 3ms/step - loss: 0.1821 - accuracy: 0.9471
££ Epoch 4/5
££ 1875/1875 [==============================] - 5s 3ms/step - loss: 0.1773 - accuracy: 0.9493
££ Epoch 5/5
££ 1875/1875 [==============================] - 5s 3ms/step - loss: 0.1741 - accuracy: 0.9498
Now let us evaluate our model on test set
model.evaluate(x_test, y_test, verbose=2)
££ 313/313 - 1s - loss: 0.1480 - accuracy: 0.9569
££ [0.14800186455249786, 0.9569000005722046]
As we can see, we have approximately 96% accuracy
Here we have created a model, trained it, and made predictions with it.
Happy Learning
Prior density is denoted by $g(.)$ in this article
Noninformative priors are the priors we assume when we do not have any belief about the parameter, say $\theta$. A noninformative prior therefore does not favor any value of $\theta$: it gives equal weight to every value that belongs to $\Theta$. For example, if we have three hypotheses, the prior which attaches a weight of $\frac{1}{3}$ to each hypothesis is the noninformative prior.
Note: most noninformative priors are improper.
Now let us take a simple example. Assume our parameter space $\Theta$ is a finite set containing n elements,
\[\Theta = \{\theta_1,\theta_2,\theta_3,\theta_4,\ldots,\theta_n\}\]
The obvious weight to give each $\theta_i$ when we have no prior beliefs is $\frac{1}{n}$. Since $\frac{1}{n}$ is a constant, say $\frac{1}{n}=c$, the prior is constant:
\[g(\theta) = c\]
Now consider the transformation $\eta=e^{\theta}$, that is $\theta = \log(\eta)$. If $g(\theta)$ is the density of $\theta$, then the density of $\eta$ is
\[g^*(\eta)=g(\theta)\frac{d\theta}{d\eta} \\ g^*(\eta)=g(\log \eta)\frac{d \log \eta }{d\eta} \\ g^*(\eta)=\frac{g(\log \eta)}{\eta} \\ g^*(\eta) \propto \frac{1}{\eta}\]
where the last step uses the fact that $g$ is constant. Thus if we choose a constant prior for $\theta$, we have to take the prior for $\eta$ proportional to $\eta^{-1}$ to arrive at the same answer whether we work with $\theta$ or with $\eta$. So we cannot maintain consistency and take both priors proportional to a constant. This leads to the search for noninformative priors which are invariant under transformations.
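A quick numerical check of this change of variables, sketched in plain Python under the assumption $g(\theta)=1$ on the interval [0, 1] (so that $\eta$ ranges over [1, e]): the prior mass of any $\theta$ interval should equal the mass that $g^*$ gives to the corresponding $\eta$ interval. The function names and the simple midpoint integration are made up for this illustration.

```python
import math

def prior_theta(theta):
    """Constant prior g(theta) = 1."""
    return 1.0

def prior_eta(eta):
    """Transformed prior g*(eta) = g(log eta) * d(theta)/d(eta) = g(log eta) / eta."""
    return prior_theta(math.log(eta)) / eta

def integrate(f, a, b, steps=100_000):
    """Midpoint rule approximation of the integral of f over [a, b]."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width

# Mass of theta in [0.2, 0.7] under g, and of eta in [e^0.2, e^0.7] under g*
mass_theta = integrate(prior_theta, 0.2, 0.7)
mass_eta = integrate(prior_eta, math.exp(0.2), math.exp(0.7))
print(mass_theta, mass_eta)

# A constant prior on eta would instead give a different mass to the same event
mass_eta_constant = integrate(lambda eta: 1.0, math.exp(0.2), math.exp(0.7))
print(mass_eta_constant)
```

The first two numbers agree, while the third does not, which is the inconsistency the text describes.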
A parameter is said to be a location parameter if the density $f(x ; \theta)$ can be written as a function of $(x - \theta)$
Let X be a random variable with location parameter $\theta$; then its density can be written as $h(x- \theta)$. Suppose that instead of observing X we observe Y = X+c, and let $\eta=\theta+c$; then the density of Y is $h(y - \eta)$. Now $(X,\theta)$ and $(Y,\eta)$ have the same sample space and parameter space, which suggests that they must have the same noninformative prior
Let $g$ and $g^*$ be the noninformative priors for $(X,\theta)$ and $(Y,\eta)$ respectively. According to our argument both are the same, so for any subset A of the real line
\[P^g(\theta \ \in \ A ) = P^{g^*}(\eta \ \in \ A )\]
Now we have assumed $\eta=\theta+c$, so
\[P^{g^*}(\eta \ \in \ A )=P^{g}(\theta +c \ \ \in \ A )=P^{g}(\theta \ \in \ A-c )\]
which leads us to
\[P^{g}(\theta \ \in \ A)=P^{g}(\theta \ \in \ A-c ) \tag{*}\\ \int_Ag(\theta)d\theta=\int_{A-c}g(\theta)d\theta=\int_Ag(\theta-c)d\theta\]
This holds for any set A on the real line and any c on the real line, so it leads us to
\[g(\theta)=g(\theta-c)\]
Now if we take $\theta=c$ we get $g(c)=g(0)$, and since this is true for all c, the prior for a location parameter is a constant function. For simplicity most statisticians take it equal to 1, $g(.) = 1$
A parameter is said to be a scale parameter if the density $f(x ; \theta)$ can be written as $\frac{1}{\theta}h(\frac{x}{\theta})$ where $\theta>0$
For example, in the normal distribution $N(0,\sigma^2)$, $\sigma$ is a scale parameter.
To get the noninformative prior for a scale parameter $\theta$ of a random variable X, suppose that instead of observing X we observe $Y = cX$ for any $c > 0$, and define $\eta = c\theta$; then the density of $Y$ is given by $\frac{1}{\eta}h(\frac{y}{\eta})$.
As in the previous part, $(X,\theta)$ and $(Y,\eta)$ have the same sample and parameter space, so both have the same noninformative prior. Let $g$ and $g^*$ be the noninformative priors for $(X,\theta)$ and $(Y,\eta)$ respectively. According to our argument,
\[P^g(\theta \in A)= P^{g^*}(\eta \in A)\]
Here A is a subset of the positive real line, i.e. $A \subset R^+$. Now putting $\eta = c\theta$,
\[P^{g^*}(\eta \in A) = P^g(\theta \in \frac{A}{c}) \\ P^g(\theta \in A) = P^g(\theta \in \frac{A}{c}) \\ \int_Ag(\theta)d\theta=\int_{\frac{A}{c}}g(\theta)d\theta=\int_A\frac{1}{c}g(\frac{\theta}{c})d\theta\]
so
\[g(\theta)=\frac{1}{c}g(\frac{\theta}{c})\]
Now taking $\theta=c$, we get
\[g(c)=\frac{1}{c}g(1)\]
This equation is true for any value $c>0$, so, taking $g(1)=1$ for convenience, it gives us the noninformative prior $g(\theta)= \frac{1}{\theta}$
Note: it is an improper prior, $\int_0^{\infty}\frac{1}{\theta}d\theta = \infty $
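The scale invariance of $g(\theta)=\frac{1}{\theta}$ can be checked numerically with a short sketch in plain Python. The function names, the interval and the values of c below are made up for this illustration; the interval mass uses the closed-form integral $\int_a^b \frac{1}{\theta}d\theta = \log(b/a)$.

```python
import math

def g(theta):
    """Noninformative prior for a scale parameter, g(theta) = 1 / theta."""
    return 1.0 / theta

def interval_mass(a, b):
    """Closed-form integral of 1/theta over [a, b], for 0 < a < b."""
    return math.log(b / a)

theta = 2.5
a, b = 1.0, 4.0  # the set A = [1, 4]

for c in [0.5, 3.0, 10.0]:
    # Pointwise condition g(theta) = (1/c) * g(theta / c)
    lhs = g(theta)
    rhs = g(theta / c) / c
    # Set condition: A and A / c receive the same prior mass
    mass_A = interval_mass(a, b)
    mass_A_over_c = interval_mass(a / c, b / c)
    print(c, lhs, rhs, mass_A, mass_A_over_c)

# The prior is improper: the mass of [epsilon, 1] grows without bound as epsilon shrinks
for epsilon in [1e-2, 1e-4, 1e-8]:
    print(epsilon, interval_mass(epsilon, 1.0))
```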
Now we know the noninformative priors for both scale and location parameters, but there is a flaw. The priors we obtained in the previous parts are improper. We argued that if two problems have identical form, then they have the same noninformative prior; the problem is that improper noninformative priors are not unique. Say we have an improper prior g: if we multiply g by any constant k, the resulting prior kg gives the same Bayesian decisions as g.
In the previous parts we required the two priors $g$ and $g^*$ to be equal, but we do not need that: it is enough that $g^*$ can be obtained by multiplying $g$ by a constant, and vice versa.
Now equation $(*)$ can be written as
\[P^g(A)=l(c)P^{g}(A-c)\]
where $l(c)$ is some positive function of c,
\[\int_Ag(\theta)d\theta=l(c)\int_{A-c}g(\theta)d\theta=l(c)\int_Ag(\theta-c)d\theta\]
It holds for all A, so $g(\theta)=l(c)g(\theta-c)$, and taking $\theta=c$ gives us $l(c)=\frac{g(c)}{g(0)}$. Putting this value back gives us
\[g(\theta-c)=\frac{g(0)g(\theta)}{g(c)} \tag{**}\]
Now there are many priors other than the constant prior which satisfy equation (**), for example $g(\theta)=e^{a\theta}$ for any fixed real a, and any prior of this form is known as relatively location invariant