<p><em>Convergence of Markov Chain, by Rahul Goswami (Banaras Hindu University), 2021-04-11</em></p>
<h2 id="what-is-markov-chain-">What is a Markov Chain?</h2>
<p>A Markov chain is a stochastic model in which the future depends only on the present, not on the past. Formally,</p>
\[P(X^{t+1}|X^t,X^{t-1},...X^2,X^1) = P(X^{t+1}| X^t)\]
<h4 id="transition-probability-matrix">Transition Probability Matrix</h4>
<p>Let us denote</p>
\[p_{ij} = P(X^{n+1} = i | X^n = j)\]
<p>where $p_{ij}$ denotes the probability of moving from state $j$ to state $i$ in one step. Similarly, $p_{ij}^n$ denotes the probability of moving from state $j$ to state $i$ in $n$ steps. With this convention each column sums to one, and the transition probability matrix is</p>
\[TPM = \begin{bmatrix}
p_{11} & p_{12} & p_{13} & \cdots \\
p_{21} & p_{22} & p_{23} & \cdots \\
\vdots & \vdots & \vdots & \ddots \\
\end{bmatrix}\]
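<p>As a concrete sketch, here is a hypothetical two-state chain in Python, following the column convention above ($p_{ij}$ is the probability of moving from state $j$ to state $i$, so each column sums to one); $n$-step transition probabilities are simply matrix powers:</p>

```python
import numpy as np

# Toy two-state chain, following the column convention used above:
# P[i, j] = p_ij = P(X^{n+1} = i | X^n = j), so each COLUMN sums to 1.
P = np.array([[0.9, 0.2],
              [0.1, 0.8]])

# n-step transition probabilities are just matrix powers: p_ij^n = (P^n)_ij.
P50 = np.linalg.matrix_power(P, 50)

# As n grows, every column approaches the stationary distribution (2/3, 1/3).
print(P50[:, 0])  # ~ [0.6667, 0.3333]
print(P50[:, 1])  # ~ [0.6667, 0.3333]
```

<p>Every column of $P^{50}$ is (numerically) the stationary distribution $(2/3, 1/3)$, a first hint of the convergence discussed below.</p>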
<h2 id="why-convergence-of-markov-chain-important-">Why is Convergence of a Markov Chain Important?</h2>
<h3 id="revisit-mcmc">Revisit MCMC</h3>
<p>MCMC has vast usage in statistics, mathematics and computer science. Here we will discuss a simple problem in Bayesian computation and assess why convergence of the Markov chain is important.</p>
<p>Suppose we want to estimate a parameter $t(\theta)$, where $g(\theta)$ is the prior density for $\theta$ and $f(y | \theta)$ is the likelihood of $y = (y_1,y_2, \dots, y_n)$ given the value of $\theta$. Then the posterior can be written as</p>
\[g(\theta | y ) \propto g(\theta)f(y|\theta)\]
<p>which has to be normalized, so the posterior density is given by</p>
\[g(\theta | y ) = \frac{g(\theta)f(y|\theta)}{\int g(\theta)f(y|\theta)d\theta}\]
<p>For simplicity assume $t(\theta) = \theta$, let $\hat{\theta}$ be an estimate, and take the squared error loss function</p>
\[L(\theta , \hat{\theta}) = (\theta - \hat\theta)^2\]
<p>Then the classical risk is $R_{\hat{\theta}}(\theta) = E_{\theta}(L(\theta,\hat{\theta}))$ and the Bayes risk is given by</p>
\[r(\hat{\theta}) = \int R_{\hat{\theta}}(\theta)g(\theta)d\theta\]
<p>Our target is to minimize the Bayes risk to get the Bayes estimate:</p>
\[\begin{align*}
r(\hat{\theta}) &= \int R_{\hat{\theta}}(\theta)g(\theta)d\theta \\
&= \int E_{\theta}(L(\theta,\hat{\theta})) g(\theta)d\theta \\
&= \int \left( \int (\theta - \hat\theta)^2f(y|\theta)dy\right)g(\theta)d\theta \\
&= \int \left( \int (\theta - \hat\theta)^2f(y|\theta)g(\theta)dyd\theta\right) \\
&= \int \left( \int (\theta - \hat\theta)^2g(\theta|y)d\theta\right)f(y)dy \tag{1}
\end{align*}\]
<p>Equation $(1)$ is minimized when the inner integral is minimized for each $y$, which happens at</p>
\[\hat{\theta} = E(\theta |y)\]
<p>However, we may not always be able to calculate the mean of the posterior density in closed form, that is,</p>
\[\hat\theta = \int\theta g(\theta|y)d\theta\]
<p>When we only know the posterior up to its kernel and the integral is intractable, we instead appeal to the Law of Large Numbers: we take random samples from the posterior kernel $g(\theta | y)$ and calculate their mean. Mathematically,</p>
\[X^1,X^2,\dots,X^n \ \text{are samples from} \ g(\theta|y); \ \text{then} \\
\frac{1}{n}\sum_{i=1}^n X^i \to \hat\theta \ \text{as} \ n \to \infty\]
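<p>A minimal Python sketch of this idea, assuming (hypothetically) that the posterior is a Beta(3, 2) distribution, so the true posterior mean $3/5$ is known and the sample average can be checked against it:</p>

```python
import numpy as np

# Sketch: if we can sample from the posterior g(theta | y), the sample mean
# estimates E(theta | y).  Assume (hypothetically) a Beta(3, 2) posterior,
# whose true mean is 3 / (3 + 2) = 0.6.
rng = np.random.default_rng(0)
samples = rng.beta(3, 2, size=100_000)   # X^1, ..., X^n ~ g(theta | y)
theta_hat = samples.mean()               # -> E(theta | y) as n -> infinity
print(theta_hat)  # close to 0.6
```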
<h2 id="when-does-markov-chain-converge-">When Does a Markov Chain Converge?</h2>
<p>Now write $g(\theta | y) = \pi(\theta)$; this is justified because $y$ is realized, so $g(\theta | y)$ is a function of $\theta$ alone. Here MCMC enters: if we can construct a chain whose stationary distribution is $\pi(\theta)$, then we can treat that chain as a stream of samples whose distribution converges to $\pi(\theta)$, and that is the reason we need the Markov chain to converge. Before moving forward, let us set up some definitions.</p>
<p>Let $\pi$ denote a probability measure on $(\mathcal{X},\mathcal{B})$ and let $\Phi = \{X^0,X^1, \dots\}$ be a discrete-time Markov chain on $(\mathcal{X},\mathcal{B})$ with transition kernel $P$ and transition density $k$, related by</p>
\[P(x,A) = Pr(X^{i+1} \in A | X^i = x ) = \int_A k(x,y)dy\]
<p>That is, $P(x,A)$ gives the probability of a one-step transition from state $x$ to any state in $A$. The transition kernel induces two linear operators:</p>
<ol>
<li>$\lambda P$, where $\lambda$ is a probability distribution on $(\mathcal{X},\mathcal{B})$</li>
<li>$Pf$, where $f$ is a non-negative measurable function on $(\mathcal{X},\mathcal{B})$</li>
</ol>
<p>where</p>
\[\lambda P(A) = \int_{\mathcal{X}}\lambda(x)P(x,A)dx\]
<p>so if $X^i \sim \lambda$ then $\lambda P(A)$ is the marginal probability that $X^{i+1}\in A$, and</p>
\[Pf(x) = \int_{\mathcal{X}}P(x,dy)f(y) = E[f(X^{i+1})|X^i = x]\]
<p>and the $m$-step transition probability is given by</p>
\[P^m(x,A) = \int_A k^m(x,y)dy\]
<blockquote>
<p>Invariant Density - $\pi$</p>
</blockquote>
<blockquote>
\[\pi = \pi P \\
\Rightarrow \pi(x) = \int_{\mathcal{X}}\pi(y)k(y,x)dy\]
</blockquote>
<p>There are several ways to ensure $\pi$ is the invariant (or stationary) distribution; one of them is to satisfy the (detailed) balance condition, i.e.</p>
\[\pi(x)k(x,y) = \pi(y)k(y,x) \ \ \ \ \ \ \ \ \ \ \ for \ all \ x,y \in \mathcal{X}\]
<p><strong>Proof</strong></p>
<p>Suppose $\pi$ satisfies the balance condition. Then, integrating both sides over $y$ and using $\int_{\mathcal{X}}k(x,y)dy = 1$,</p>
\[\begin{align*}
\pi(x)k(x,y) &= \pi(y)k(y,x) \ \ \ \ \ \ \ \ \ \ \ for \ all \ x,y \in \mathcal{X} \\
\Rightarrow \int_{\mathcal{X}}\pi(y)k(y,x)dy &= \int_{\mathcal{X}}\pi(x)k(x,y)dy = \pi(x)\int_{\mathcal{X}}k(x,y)dy = \pi(x)
\end{align*}\]
<p>However, the balance condition is sufficient but not necessary, which means reversibility is not required for $\pi$ to be invariant. If $X^i \sim \pi$ and the chain preserves this distribution over any number of transitions, we say the Markov chain is stationary with stationary distribution $\pi$, which is exactly what MCMC requires.</p>
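<p>A small numerical sketch may help here: the discrete analogue of the balance condition is $\pi(x)K[x,y] = \pi(y)K[y,x]$, where $K[x,y]$ plays the role of $k(x,y)$. The 3-state chain below is a hypothetical example constructed to be reversible; the check also confirms that balance implies invariance:</p>

```python
import numpy as np

# Discrete sketch of the balance condition pi(x) k(x, y) = pi(y) k(y, x).
# K[x, y] plays the role of the transition density k(x, y) (move x -> y),
# so each ROW of K sums to 1.  This 3-state chain is a hypothetical example
# constructed to be reversible with respect to pi = (0.2, 0.3, 0.5).
pi = np.array([0.2, 0.3, 0.5])
K = np.array([[0.7, 0.3, 0.0],
              [0.2, 0.3, 0.5],
              [0.0, 0.3, 0.7]])

# Detailed balance: pi(x) K[x, y] == pi(y) K[y, x] for every pair x, y.
balance = np.allclose(pi[:, None] * K, (pi[:, None] * K).T)

# Balance implies invariance: summing the balance identity over x
# gives pi @ K == pi.
invariant = np.allclose(pi @ K, pi)
print(balance, invariant)  # True True
```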
<p>Let us Define</p>
<p><strong>$\phi$-irreducible</strong>: A Markov chain is $\phi$-irreducible for some measure $\phi$ on $(\mathcal{X},\mathcal{B})$ if for all $x \in \mathcal{X}$ and $A \in \mathcal{B}$ with $\phi(A) > 0$, there exists $n$ for which $P^n(x,A)>0$.</p>
<p><em>A Chain is Aperiodic if Period is 1</em></p>
<p><strong>Harris Recurrent</strong>: A $\phi$-irreducible Markov chain is Harris recurrent if, for every set $A$ with $\phi(A) > 0$, the chain reaches $A$ with probability 1.</p>
<p><strong>Harris Ergodic</strong>: A Markov chain is said to be Harris ergodic if it is $\phi$-irreducible, aperiodic, Harris recurrent, and possesses an invariant distribution $\pi$, for some measures $\phi$ and $\pi$.</p>
<p><strong>Total Variation Distance</strong>: The total variation distance between two measures $\mu(\cdot)$ and $\nu(\cdot)$ is defined by</p>
\[|| \mu(\cdot) - \nu(\cdot)|| = \sup_{A \in \mathcal{B}}|\mu(A)-\nu(A)|\]
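<p>For discrete distributions the supremum over events is attained at $A = \{x : \mu(x) > \nu(x)\}$, which gives the identity $||\mu - \nu|| = \frac{1}{2}\sum_x|\mu(x)-\nu(x)|$. A small Python helper (assuming discrete distributions given as probability vectors):</p>

```python
import numpy as np

def tv_distance(mu, nu):
    """Total variation distance between two discrete distributions.

    The supremum over events A of |mu(A) - nu(A)| is attained at
    A = {x : mu(x) > nu(x)}, giving ||mu - nu|| = 0.5 * sum |mu(x) - nu(x)|.
    """
    mu, nu = np.asarray(mu, float), np.asarray(nu, float)
    return 0.5 * np.abs(mu - nu).sum()

print(tv_distance([0.5, 0.5], [1.0, 0.0]))  # 0.5
```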
<h4 id="what-does-harris-ergodicity-guarantees-">What Does Harris Ergodicity Guarantee?</h4>
<ul>
<li>The chain is guaranteed to explore the entire space without getting stuck</li>
<li>Strong consistency of Markov chain averages</li>
<li>Convergence of the Markov chain to the stationary distribution in total variation distance</li>
</ul>
<p>The following two theorems are very important for MCMC</p>
<p><strong>Ergodic Theorem</strong>: Suppose a Markov chain $\Phi$ is Harris ergodic with invariant distribution $\pi$ and $E_{\pi} | g(X) | < \infty$ for some function $g : \mathcal{X} \to \Bbb{R}$. Then for any starting value $x \in \mathcal{X}$,</p>
\[\bar{g}_n = \frac{1}{n}\sum_{i=0}^{n-1}g(X^i) \to E_{\pi}g(X) \ \text{almost surely as} \ n \to \infty\]
<p>and this is the main property that MCMC relies on in practice.</p>
<blockquote>
<p>Birkhoff, George D. “Proof of the Ergodic Theorem.” <em>Proceedings of the National Academy of Sciences of the United States of America</em>, vol. 17, no. 12, 1931, pp. 656–660. <em>JSTOR</em>, www.jstor.org/stable/86016. Accessed 9 Apr. 2021.</p>
</blockquote>
<p>The other Theorem is as follows</p>
<p><em>Suppose a Markov chain $\Phi$ is Harris ergodic with invariant distribution $\pi$. Then for any starting value $x \in \mathcal{X}$, $\Phi$ converges to $\pi$ in total variation distance, i.e.</em></p>
\[||P^n(x,.) - \pi(.)|| \to 0 \ as \ n \to \infty\]
<p><em>further, $\|P^n(x,\cdot) - \pi(\cdot)\|$ is monotonically non-increasing in $n$.</em></p>
<h2 id="rate-of-convergence">Rate of Convergence</h2>
<p>The Ergodic Theorem tells us that the Markov chain converges, but it says nothing about the rate of convergence. We call a Markov chain converging at a geometric rate <strong>geometrically ergodic</strong>: there exist $M:\mathcal{X} \to \Bbb{R}$ and some constant $t \in (0,1)$ that satisfy</p>
\[||P^n(x,.)-\pi|| \leq M(x)t^n \ \ \ \ \ for \ any \ x \in \mathcal{X}\]
<p>If $M$ is bounded, the Markov chain is uniformly ergodic.</p>
<ul>
<li>As long as the starting value $x$ is such that $M(x)$ is not large, geometric ergodicity guarantees quick convergence of the Markov chain</li>
<li>Geometric ergodicity holds for every irreducible and aperiodic Markov chain on a finite state space</li>
</ul>
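<p>For a small finite chain we can watch this geometric decay directly. The two-state chain below is a hypothetical example (rows are "from" states here, $K[x,y] = P(x \to y)$); its second eigenvalue $0.7$ is exactly the geometric rate $t$:</p>

```python
import numpy as np

# For a finite chain, ||P^n(x, .) - pi(.)|| really does shrink geometrically.
# Hypothetical two-state chain (row convention: K[x, y] = P(x -> y)):
K = np.array([[0.9, 0.1],
              [0.2, 0.8]])
pi = np.array([2/3, 1/3])   # stationary: pi @ K == pi
t = 0.7                     # second eigenvalue of K: the geometric rate

def tv(n, x=0):
    """Total variation distance between P^n(x, .) and pi."""
    Kn = np.linalg.matrix_power(K, n)
    return 0.5 * np.abs(Kn[x] - pi).sum()

# Successive ratios settle at the rate t = 0.7:
print([round(tv(n + 1) / tv(n), 4) for n in range(1, 5)])  # [0.7, 0.7, 0.7, 0.7]
```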
<h3 id="what-is-needed-for-geometric-ergodicity">What is Needed for Geometric Ergodicity</h3>
<h4 id="drift-and-minorization--condition">Drift and Minorization Condition</h4>
<p>A Type 1 drift condition holds if there exist a non-negative function $V:\mathcal{X} \to \Bbb{R}_{\geq 0}$ and constants $0 < \gamma <1$ and $L < \infty$ such that</p>
\[PV(x) \leq \gamma V(x) + L \ \ \ \ \ \ \ \ \ \ \ \ for \ any \ x \in \mathcal{X}\]
<p>We call $V$ a drift function and $\gamma$ a drift rate.</p>
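<p>A standard illustration (not from the discussion above) is the AR(1) chain $X^{n+1} = \rho X^n + \varepsilon$ with $\varepsilon \sim N(0,\sigma^2)$: with $V(x) = x^2$ one computes $PV(x) = \rho^2 x^2 + \sigma^2$, so a Type 1 drift condition holds with $\gamma = \rho^2$ and $L = \sigma^2$. A Monte Carlo sanity check:</p>

```python
import numpy as np

# Hypothetical AR(1) chain X_{n+1} = rho * X_n + eps, eps ~ N(0, sigma^2).
# With drift function V(x) = x^2 we can compute PV exactly:
#   PV(x) = E[V(X_{n+1}) | X_n = x] = rho^2 x^2 + sigma^2,
# so the Type 1 drift condition holds with gamma = rho^2 and L = sigma^2.
rho, sigma = 0.5, 1.0
gamma, L = rho**2, sigma**2

rng = np.random.default_rng(0)
x = 3.0
draws = rho * x + sigma * rng.standard_normal(500_000)  # one step from x
pv_mc = np.mean(draws**2)                               # Monte Carlo PV(x)
print(pv_mc, gamma * x**2 + L)  # both close to 3.25
```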
<p>A minorization condition holds on a set $C \in \mathcal{B}$ if there exist a positive integer $m$, a constant $\epsilon > 0$, and a probability measure $Q$ on $(\mathcal{X},\mathcal{B})$ for which</p>
\[P^m(x,A) \geq \epsilon Q(A) \ \ \ \ \ for \ all \ x \in C \ and \ A \in \mathcal{B}\]
<p>We can also call this an $m$-step minorization condition, and $C$ is then called a small set. In terms of the transition density $k^m$ and the density $q$ of $Q$, it implies</p>
\[k^m(x,y) \geq \epsilon q(y) \ \ \ \ \ for \ all \ x \in C\]
<p><strong>Proposition</strong></p>
<p>Suppose a Markov chain $\Phi$ is irreducible and aperiodic with invariant distribution $\pi$. Then $\Phi$ is geometrically ergodic if the following two conditions are met:</p>
<ol>
<li>A Type I drift condition holds</li>
<li>There exists some constant $d > 2L/(1-\gamma)$ for which a one-step minorization condition holds on the set $C= \{x:V(x)\leq d\}$</li>
</ol>
<p>This proposition is a corollary of Rosenthal (1995a):</p>
<p><em>Let $\Phi$ be an aperiodic and irreducible Markov chain with invariant distribution $\pi$.</em></p>
<p>Suppose Conditions 1 and 2 of the Proposition hold, let $X^0 = x_0$ be the starting value, and define</p>
\[\alpha = \frac{1+d}{1+2L+\gamma d} \ \ \ \ \ \ and \ \ \ \ \ \ \ U = 1+2(\gamma d+L)\]
<p>Then for any $r \in (0,1)$</p>
\[||P^n(x_0 ,.) - \pi(.)|| \leq (1-\epsilon)^{rn} +\left(\frac{U^r}{\alpha^{1-r}} \right)^n\left(1 + \frac{L}{1-\gamma} + V(x_0)\right)\]
<p>We can rearrange this to see that it satisfies the geometric ergodicity condition:</p>
<ol>
<li>$V(x) + 1$ is proportional to $M(x)$, hence the starting point should be chosen to make $V(x)$ small</li>
</ol>
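<p>The bound is easy to evaluate numerically. The constants below ($\gamma$, $L$, $\epsilon$, $d$, $r$, $V(x_0)$) are hypothetical choices made only so that the geometric factor $U^r/\alpha^{1-r}$ falls below one:</p>

```python
# Plugging hypothetical drift/minorization constants into the bound above.
gamma, L, eps, r = 0.1, 0.1, 0.5, 0.5
d = 1.0     # needs d > 2L / (1 - gamma) ~ 0.222
V0 = 0.0    # V(x_0) at the starting point

alpha = (1 + d) / (1 + 2 * L + gamma * d)
U = 1 + 2 * (gamma * d + L)

def bound(n):
    """Right-hand side of the total variation bound at step n."""
    return ((1 - eps) ** (r * n)
            + (U ** r / alpha ** (1 - r)) ** n * (1 + L / (1 - gamma) + V0))

# The bound decays geometrically once U^r / alpha^(1 - r) < 1:
print(U ** r / alpha ** (1 - r))  # ~0.95, so the second term shrinks too
print(bound(1), bound(50))
```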
<p><strong>Type II Drift Condition</strong>: There exist some function $W: \mathcal{X} \to [1,\infty)$ finite at some $x \in \mathcal{X}$, some set $D \in \mathcal{B}$, and constants $0 < \rho < 1$ and $b < \infty$ for which</p>
\[PW(x) \leq \rho W(x) + bI_D(x) \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ for \ all \ x \in \mathcal{X}\]
<p>It is easy to show that <em>Type I Drift Condition $\Leftrightarrow$ Type II Drift Condition</em>.</p>
<p>Finally we can say that</p>
<p><strong>Suppose a Markov chain $\Phi$ is aperiodic and $\phi$-irreducible with invariant distribution $\pi$. Then $\Phi$ is geometrically ergodic if there exist some small set $D$, a drift function $W: \mathcal{X} \to [1,\infty)$, and some constants $0 < \rho < 1$ and $b < \infty$ for which a Type II drift condition holds.</strong></p>
<p>Now let me restate the earlier theorem:</p>
<p><em>Suppose a Markov chain $\Phi$ is Harris ergodic with invariant distribution $\pi$. Then for any starting value $x \in \mathcal{X}$, $\Phi$ converges to $\pi$ in total variation distance, i.e.</em></p>
\[||P^n(x,.) - \pi(.)|| \to 0 \ as \ n \to \infty\]
<p><em>further, $\|P^n(x,\cdot) - \pi(\cdot)\|$ is monotonically non-increasing in $n$.</em></p>
<p><strong>Jain and Jamison (1967) have shown that for every $\phi$-irreducible Markov chain on $(\mathcal{X},\mathcal{B})$ there exists some small set $C \in \mathcal{B}$ for which $\phi(C) > 0$. Furthermore, the corresponding minorization measure $Q(\cdot)$ can be defined so that $Q(C) > 0$.</strong></p>
<p>The Jain and Jamison result allows us to pick $C \in \mathcal{B}$ such that</p>
\[P(x , A) \geq \epsilon Q(A) \ \ \ \ \ for \ all \ x \in C\]
<p>That is a one-step minorization condition. Now we can write</p>
\[P(x,A) = \epsilon Q(A) + (1-\epsilon)R(x,A) \ \ \ \ \ \ \ for \ all \ x \in C \ and \ A \in \mathcal{B}\]
<p>Here $R(x,.)$ is a probability measure on $(\mathcal{X},\mathcal{B})$. This split allows us to construct two separate chains which couple with probability 1:</p>
\[\Phi(X) = \{X^0,X^1, \dots\} \\
\Phi(Y) = \{Y^0,Y^1, \dots\}\]
<p>Now $(X^{n},Y^n) \to (X^{n+1},Y^{n+1})$ with the following algorithm</p>
<ol>
<li>While $X^n \neq Y^n$
<ol>
<li>If $(X^n,Y^n) \not\in C \times C$
<ol>
<li>Draw $X^{n+1} \sim P(X^n,.)$ and $Y^{n+1} \sim P(Y^n,.)$ independently</li>
</ol>
</li>
<li>If $(X^n,Y^n) \in C \times C$
<ol>
<li>Draw $\delta_n \sim Bern(\epsilon)$</li>
<li>If $\delta_{n} = 0$, draw $X^{n+1} \sim R(X^n,.)$ and $Y^{n+1} \sim R(Y^n,.)$ independently</li>
<li>Otherwise, draw a single point from $Q(.)$ and set $X^{n+1} = Y^{n+1}$ equal to it</li>
</ol>
</li>
</ol>
</li>
<li>Once $X^n = x = Y^n$, draw $X^{n+1} = Y^{n+1} \sim P(x,.)$</li>
</ol>
<p>Now define the coupling time $T$ as the first $n$ for which $(X^{n-1},Y^{n-1}) \in C \times C$ and $\delta_{n-1}=1$; once the chains couple they remain equal.</p>
<p>Now let us assume</p>
\[X^0 = x \ and \ Y^0 \sim \pi\]
<p>Let $Pr_x$ denote probability with respect to the starting point $x$; then $\Phi(Y)$ is stationary, and</p>
\[\begin{align*}
|P^n(x,A) - \pi(A)| &= |Pr_x(X^n \in A) - Pr_x(Y^n \in A)| \\
&= |Pr_x(X^n \in A,X^n = Y^n) +Pr_x(X^n \in A,X^n \neq Y^n)- Pr_x(Y^n \in A,X^n \neq Y^n)- Pr_x(Y^n \in A,X^n = Y^n)| \\
&= |Pr_x(X^n \in A,X^n \neq Y^n)- Pr_x(Y^n \in A,X^n \neq Y^n)| \\
&\leq \max\{Pr_x(X^n \in A,X^n \neq Y^n),\ Pr_x(Y^n \in A,X^n \neq Y^n)\} \\
&\leq Pr_x(X^n \neq Y^n) \\
&= Pr_x(T > n)
\end{align*}\]
<p>Thus</p>
\[||P^n(x,.) - \pi(.)|| \leq Pr_x(T>n)\]
<p>Now suppose the minorization condition holds over the entire space, i.e. $C = \mathcal{X}$. In this case every pair $(X^n,Y^n)$ belongs to $C \times C$ for all $n$, so</p>
\[T \sim Geo(\epsilon) \\
P(T>n) = (1-\epsilon)^n\]
<p>so</p>
\[||P^n(x,.) - \pi(.)|| \leq (1-\epsilon)^n\]
<p>so when $C = \mathcal{X}$, $\|P^n(x,\cdot) - \pi(\cdot)\| \to 0$ as $n \to \infty$.</p>
<p>When $C \neq \mathcal{X}$, the tail probability $Pr_x(T>n)$ is complicated and beyond the scope of this presentation.</p>
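<p>When $C = \mathcal{X}$ we can actually simulate the coupling construction above. The kernel below is a hypothetical one chosen so that the split $P(x,\cdot) = \epsilon Q(\cdot) + (1-\epsilon)R(x,\cdot)$ is explicit, with $Q = N(0,1)$, $R(x,\cdot) = N(x/2, 1)$ and $\epsilon = 0.3$; the observed coupling times should behave like a $Geo(\epsilon)$ variable with mean $1/\epsilon$:</p>

```python
import numpy as np

# Simulation sketch of the coupling argument when minorization holds on the
# whole space (C = X).  Hypothetical kernel: P(x, .) = eps*Q + (1 - eps)*R(x, .)
# with Q = N(0, 1) and R(x, .) = N(x/2, 1).  At every step the two chains flip
# a Bern(eps) coin; on success both draw the SAME point from Q and stay
# coupled, so the coupling time T is Geometric(eps).
eps = 0.3
rng = np.random.default_rng(1)

def coupling_time(x0, y0, max_steps=10_000):
    x, y, n = x0, y0, 0
    while x != y and n < max_steps:
        n += 1
        if rng.random() < eps:            # delta_n = 1: couple via Q
            x = y = rng.standard_normal()
        else:                             # delta_n = 0: independent R draws
            x = x / 2 + rng.standard_normal()
            y = y / 2 + rng.standard_normal()
    return n

times = [coupling_time(-5.0, 5.0) for _ in range(20_000)]
print(sum(times) / len(times))  # close to 1 / eps = 3.33
```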
<h4 id="deterministic-update-gibbs-sampler-dugs">Deterministic Update Gibbs Sampler (DUGS)</h4>
<p>Let us assume our target distribution is $\pi(\theta)$ with $\theta = (\theta_1,\theta_2,\dots,\theta_d)$.</p>
<p><strong>Notation: $\theta_{-i}$ is the vector of parameters excluding $\theta_i$</strong></p>
<p><strong>Initialization:</strong> $\theta^0 = (\theta_1^0,\theta_2^0,\dots,\theta_d^0)$</p>
<p><strong>Iteration:</strong> For $i \geq 1$</p>
<ul>
<li>Sample $\theta_1^i \sim \pi(\theta_1 | \theta^{i-1}_{-1})$</li>
<li>Sample $\theta_2^i \sim \pi(\theta_2 | \theta_1^i , \theta^{i-1}_{-(1,2)})$</li>
<li>$\vdots$</li>
<li>Sample $\theta_d^i \sim \pi(\theta_d | \theta^{i}_{-d})$</li>
</ul>
<p>For two parameters the transition kernel is given by</p>
\[k((\theta_1,\theta_2),(\tilde{\theta}_1,\tilde{\theta}_2)) = \pi(\tilde\theta_1|\theta_2)\cdot \pi(\tilde\theta_2|\tilde\theta_1)\]
<p>Let us check stationarity for two parameters:</p>
\[\begin{align*}
\int\int \pi(\theta_1,\theta_2)k((\theta_1,\theta_2),(\tilde{\theta}_1,\tilde{\theta}_2))d\theta_1d\theta_2 &= \int\int \pi(\theta_1,\theta_2)\pi(\tilde\theta_1|\theta_2)\cdot \pi(\tilde\theta_2|\tilde\theta_1)d\theta_1d\theta_2 \\
&= \int \pi(\theta_2)\pi(\tilde\theta_1|\theta_2)\cdot \pi(\tilde\theta_2|\tilde\theta_1)d\theta_2 \\
&= \int \pi(\tilde\theta_1,\theta_2)\cdot \pi(\tilde\theta_2|\tilde\theta_1)d\theta_2 \\
&= \pi(\tilde\theta_1)\cdot \pi(\tilde\theta_2|\tilde\theta_1) \\
&= \pi(\tilde\theta_2,\tilde\theta_1) \\
\end{align*}\]
<p>However, this does not suffice for convergence. <strong>Aperiodicity is needed to ensure the sampler does not cycle, so the whole space gets explored, and irreducibility confirms that it will not get stuck.</strong> If we can additionally prove the balance condition for the marginal chain, we are assured it converges. Let $\Phi_1=\{\theta_1^0,\theta_1^1,\dots\}$ and let $k_1(\tilde\theta_1,\theta_1)$ be the transition density of $\Phi_1$ for the move $\theta_1 \to \tilde\theta_1$; then</p>
\[\begin{align*}
\pi({\theta_1}) k_1(\tilde\theta_1,\theta_1) &= \pi({\theta_1})\int \pi(\theta_2|\theta_1)\cdot \pi(\tilde\theta_1|\theta_2)d\theta_2 \\
&= \int \pi(\theta_1,\theta_2)\cdot \pi(\tilde\theta_1|\theta_2)d\theta_2 \\
&= \int \pi(\theta_1|\theta_2)\cdot \pi(\tilde\theta_1,\theta_2)d\theta_2 \\
&= \pi({\tilde\theta_1}) \int {\pi(\theta_1|\theta_2)}\cdot {\pi(\theta_2|\tilde\theta_1)}d\theta_2 \\
&= \pi({\tilde\theta_1}) k_1(\theta_1,\tilde\theta_1)
\end{align*}\]
<h4 id="example">Example</h4>
<p>Let us suppose</p>
\[Y_1 , Y_2, \dots, Y_m \overset{iid}{\sim} N(\mu, \theta)\]
<p>where $m \geq 5$. Assume the joint prior density</p>
\[g(\mu,\theta) \propto \frac{1}{\sqrt{\theta}}\]
<p>Let $y = (y_1,y_2, \dots, y_m)$ be the sample data with mean $\bar y$ and sum of squared deviations $s^2 = \sum(y_i - \bar y)^2$. Then the posterior is given by</p>
\[g(\mu , \theta | y) \propto \theta^{-\frac{m+1}{2}}\exp \bigg( -\frac{1}{2\theta} \sum_{j=1}^m (y_j - \mu)^2\bigg)\]
<p>and</p>
\[\theta | \mu,y \sim IG\left(\frac{m-1}{2}, \frac{s^2+m(\mu -\bar{y})^2}{2}\right) \\
\mu | \theta ,y \sim N(\bar y,\frac{\theta}{m})\]
<p>Recall that the Inverse Gamma distribution with parameters $(a,b)$ has kernel $x^{-(a+1)}e^{-b/x}$.</p>
<p>Let us use the DUGS sampler with the following update scheme (update $\theta$ first given $\mu^{'}$, then $\mu$ given the new $\theta$):</p>
\[(\mu^{'},\theta^{'}) \to (\mu^{'},\theta) \to (\mu,\theta)\]
<p>so the kernel density will be given by</p>
\[k((\mu^{'},\theta^{'}),(\mu,\theta)) = \pi(\theta|\mu^{'},y)\pi(\mu|\theta,y)\]
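<p>This two-block sampler is short enough to write out. The sketch below simulates a hypothetical data set and runs the DUGS updates above; since the posterior of $\mu$ is centered at $\bar y$, the chain's average of the $\mu$ draws should settle near $\bar y$:</p>

```python
import numpy as np

# Sketch of the DUGS sampler above for the normal example: alternate
#   theta | mu, y ~ IG((m-1)/2, (s^2 + m (mu - ybar)^2) / 2)
#   mu    | theta, y ~ N(ybar, theta / m)
# Data below are simulated, so all numbers here are illustrative.
rng = np.random.default_rng(0)
y = rng.normal(loc=1.0, scale=2.0, size=20)   # m = 20 observations
m, ybar = len(y), y.mean()
s2 = np.sum((y - ybar) ** 2)

def inv_gamma(a, b):
    # If G ~ Gamma(a, scale = 1/b) then 1/G ~ IG(a, b).
    return 1.0 / rng.gamma(a, 1.0 / b)

mu, draws = 0.0, []
for _ in range(20_000):
    theta = inv_gamma((m - 1) / 2, (s2 + m * (mu - ybar) ** 2) / 2)
    mu = rng.normal(ybar, np.sqrt(theta / m))
    draws.append((mu, theta))

mus = np.array([d[0] for d in draws])
print(mus.mean(), ybar)   # the chain's mu-average settles near ybar
```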
<p><strong>Type 1 Drift Condition</strong></p>
<p>Let us define $V(\mu , \theta) = (\mu - \bar{y})^2$. Then</p>
\[E[V(\mu,\theta)|\mu^{'},\theta^{'}] = E[V(\mu,\theta)|\mu^{'}] =E[E[V(\mu,\theta)|\theta]|\mu^{'}]\]
<p>where</p>
\[E[V(\mu,\theta)|\theta] = E[(\mu-\bar{y})^2|\theta] = Var[\mu|\theta] = \frac{\theta}{m}\]
<p>Then</p>
\[E[V(\mu,\theta)|\mu^{'},\theta^{'}] = E\left[\frac\theta m \Big| \mu^{'}\right] \\
= \frac{1}{m}\cdot \frac{s^2+m(\mu^{'}-\bar{y})^2}{m-3} \\
= \frac{(\mu^{'}-\bar{y})^2}{m-3} + \frac{s^2}{m(m-3)} \\
= \frac{1}{m-3}V(\mu^{'},\theta^{'}) + \frac{s^2}{m(m-3)}\]
<p>now $m \geq 5$ guarantees that $\frac{1}{m-3} < 1$ hence</p>
\[PV(\mu^{'},\theta^{'}) =E[V(\mu,\theta)|\mu^{'},\theta^{'}] \leq \frac{1}{m-3}V(\mu^{'},\theta^{'}) + \frac{s^2}{m(m-3)}\]
<p>So it satisfies the drift condition with any $\gamma \in [1/(m-3),1)$ and $L = s^2/(m(m-3))$.</p>
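<p>The drift computation above can be checked numerically: starting from a fixed $\mu^{'}$, one DUGS step should give $E[V(\mu,\theta)|\mu^{'}] = (\mu^{'}-\bar y)^2/(m-3) + s^2/(m(m-3))$. The numbers below ($m$, $\bar y$, $s^2$, $\mu^{'}$) are hypothetical:</p>

```python
import numpy as np

# Numerical check of the drift computation above: from mu', one DUGS step gives
#   E[V(mu, theta) | mu'] = (mu' - ybar)^2/(m-3) + s^2/(m(m-3)).
# The data summaries here are hypothetical; the identity is what we verify.
rng = np.random.default_rng(2)
m, ybar, s2 = 10, 0.0, 9.0
mu_prime = 2.0

b = (s2 + m * (mu_prime - ybar) ** 2) / 2
theta = 1.0 / rng.gamma((m - 1) / 2, 1.0 / b, size=500_000)  # IG(.,b) draws
mu = rng.normal(ybar, np.sqrt(theta / m))                    # mu | theta, y
pv_mc = np.mean((mu - ybar) ** 2)

pv_exact = (mu_prime - ybar) ** 2 / (m - 3) + s2 / (m * (m - 3))
print(pv_mc, pv_exact)   # both close to 0.7
```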
<p><strong>Minorization Condition</strong></p>
<p>Let $C = \{(\mu,\theta) : V(\mu,\theta) \leq d \}$ for some $d \geq 2L/(1-\gamma)$. The minorization condition holds if there exist a density $q$ and $\epsilon > 0$ for which</p>
\[k((\mu^{'},\theta^{'}),(\mu, \theta)) \geq \epsilon q(\mu,\theta)\ for \ all \ (\mu^{'},\theta^{'}) \in C \ and \ (\mu, \theta) \in \Bbb{R} \times \Bbb{R}_+\]
\[k((\mu^{'},\theta^{'}),(\mu, \theta)) = \pi(\mu|\theta,y)\pi(\theta | \mu^{'},y) \geq \pi(\mu|\theta,y) \inf_{(\mu{'},\theta^{'}) \in C} \pi(\theta | \mu^{'},y)\]
<p>Let $IG(a,b\,; x)$ denote the $IG(a,b)$ density evaluated at $x>0$. Then</p>
\[g(\theta) =\inf_{(\mu^{'},\theta^{'}) \in C} \pi(\theta | \mu^{'},y) \\
= \inf_{(\mu^{'},\theta^{'}) \in C} IG\left(\frac{m-1}{2},\frac{s^2}{2}+\frac{m}{2}(\mu^{'}-\bar{y})^2;\theta\right) \\
=
\left\{
\begin{array}{c}
IG(\frac{m-1}{2},\frac{s^2}{2}+\frac{md}{2} ; \theta ) \ \ if \ \theta < \theta^* \\
IG(\frac{m-1}{2},\frac{s^2}{2} ; \theta ) \ \ if \ \theta \geq \theta^*\\
\end{array}
\right.\]
<p>where $\theta^{*} = md[(m-1)\log(1+md/s^2)]^{-1}$.</p>
\[k((\mu^{'},\theta^{'}),(\mu, \theta)) \geq \pi(\mu | \theta,y)g(\theta) = \epsilon q(\mu,\theta)\]
<p>where $q(\mu , \theta) = \epsilon^{-1}\pi(\mu | \theta,y)g(\theta)$ and $\epsilon = \int g(\theta)d\theta$ is the normalizing constant that makes $q$ a density.</p>
<p>Hence the minorization condition holds.</p>
Ym \sim^{iid} N(\mu, \theta)\] where $m \geq 5$ , Let us assume the joint prior density as \[g(\mu,\theta) \propto \frac{1}{\sqrt{\theta}}\] Let y = $(y_1,y_1 ……y_m)$ as a sample data with mean $\bar y$ and variance $s^2 = \sum(y_i - \bar y)^2$ the the posterior will be given by \[g(\mu , \theta | y) \propto \theta^{-\frac{m+1}{2}}exp \bigg( -\frac{1}{2\theta} \sum_{j=1}^m (y_j - \mu)^2\bigg)\] and \[\theta | \mu,y \sim IG\left(\frac{m-1}{2}, \frac{s^2+m(\mu -\bar{y})^2}{2}\right) \\ \mu | \theta ,y \sim N(\bar y,\frac{\theta}{m})\] We know Inverse Gamma have kernel $x^{-(a+1)}e^{-bx}$ with parameter (a,b) Let us use DUGS Sampler in the following update scheme \[(\theta^{'},\mu{'}) \to (\theta^{},\mu{'}) \to (\theta^{},\mu{})\] so the kernel density will be given by \[k((\mu^{'},\theta^{'}),(\mu,\theta)) = \pi(\theta|\mu^{'},y)\pi(\mu|\theta,y)\] Type 1 Drift Condition Let us define $V(\mu , \theta) = (\mu - \bar{y})^2$ \[E[V(\mu,\theta)|\mu^{'},\theta^{'}] = E[V(\mu,\theta)|\mu^{'}] =E[E[V(\mu,\theta)|\theta]|\mu^{'}]\] where \[E[V(\mu,\theta)|\theta] = E[(\mu-\bar{y})^2|\theta] = Var[\mu|\theta] = \frac{\theta}{m}\] Then \[E[V(\mu,\theta)|\mu^{'},\theta^{'}] = E\left[\frac\theta m | \mu^{'}\right] \\ \Rightarrow \frac{1}{m} \frac{s^2+m(\mu^{'}-\bar{y})^2}{m-3} \\ \Rightarrow \frac{(\mu^{'}-\bar{y})^2}{m-3} \frac{s^2}{m(m-3)} \\ \Rightarrow \frac{1}{m-3}V(\mu^{'},\theta{'}) + \frac{s^2}{m(m-3)}\] now $m \geq 5$ guarantees that $\frac{1}{m-3} < 1$ hence \[PV(\mu^{'},\theta^{'}) =E[V(\mu,\theta)|\mu^{'},\theta^{'}] \leq \frac{1}{m-3}V(\mu^{'},\theta{'}) + \frac{s^2}{m(m-3)}\] So its satisfy drift condition with $\gamma \in (1/(m-3),1) $ and $L^2 =s^2/(m(m-3))$ Minorization Condition Let us assume $C = {(\mu,\theta) : V(\mu,\theta) \leq d }$ for $d \geq 2L/(1-\gamma)$ if there exist density q and $\epsilon > 0$ for which \[k((\mu^{'},\theta^{'}),(\mu, \theta)) \geq \epsilon q(\mu,\theta)\ for \ all \ (\mu^{'},\theta^{'}) \in C \ and \ (\mu, \theta) \in \Bbb{R} \times 
\Bbb{R}_+\] \[k((\mu^{'},\theta^{'}),(\mu, \theta)) = \pi(\mu|\theta,y)\pi(\theta | \mu^{'},y) \geq \pi(\mu|\theta,y) \inf_{(\mu{'},\theta^{'}) \in C} \pi(\theta | \mu^{'},y)\] Let us assume $IG(a,b ; x)$ denote the density at $ x>0$ \[g(\theta) =\inf_{(\mu{'},\theta^{'}) \in C} \pi(\theta | \mu^{'},y) \\ \Rightarrow IG\left(\frac{m-1}{2},\frac{s^2}{2}+\frac{m}{2}(\mu^{'}-\bar{y})^2;\theta\right) \\ \Rightarrow \left\{ \begin{array}{c} IG(\frac{m-1}{2},\frac{s^2}{2}+\frac{md}{2} ; \theta ) \ \ if \ \theta < \theta^* \\IG(\frac{m-1}{2},\frac{s^2}{2} ; \theta ) \ \ if \ \theta \geq \theta^*\\ \end{array} \right.\] where $\theta^{*} = md[(m-1)log(1+md/s^2)]^{-1}$ \[k((\mu^{'},\theta^{'}),(\mu, \theta)) \geq \pi(\mu | \theta,y)g(\theta) = \epsilon q(\mu,\theta)\] Where $q(\mu , \theta) = \epsilon^{-1}\pi(\mu | \theta,y)g(\theta)$ Hence the Minorization conditions holdHighest Posterior Density Interval2020-11-01T00:00:00+00:002020-11-01T00:00:00+00:00https://www.iroblack.com/HPD-for-Scale-Parameter-of-Exponential-Distribution<p>Highest Posterior Density Interval is interval of the parmeter in which the posterir value are high when compared to any other point outside the interval (i.e. the posterior value is high in the interval). It can be defined as a 100(1-alpha)% HPD for a parameter $\theta$ is $\mathcal{C} = { \theta : \pi(\theta \vert x) \geq k }$, where k is the largest number such that</p>
\[\int_{\theta : \pi(\theta | x) \geq k } \pi(\theta | x) \mathrm{d} \theta = 1 - \alpha\]
<p>We can picture a horizontal line drawn across the posterior density: the line is lowered until the area under the density, between the points where the line intersects it, equals $1-\alpha$.</p>
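<p>To make the horizontal-line picture concrete, here is a small numerical sketch (in Python) that bisects on the height $k$ of the line for a standard normal density; the choice of density and the 95% level are illustrative only, not part of the assignment below.</p>

```python
import math

def phi(x):
    # standard normal density
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def mass_above(k):
    # For the standard normal, {x : phi(x) >= k} is the interval [-x_k, x_k]
    x_k = math.sqrt(-2 * math.log(k * math.sqrt(2 * math.pi)))
    return math.erf(x_k / math.sqrt(2))  # P(-x_k <= X <= x_k)

# Raise/lower the horizontal line k until the region {x : phi(x) >= k}
# holds exactly 95% of the probability mass.
lo, hi = 1e-9, phi(0.0)
for _ in range(100):
    k = (lo + hi) / 2
    lo, hi = (k, hi) if mass_above(k) > 0.95 else (lo, k)

x_k = math.sqrt(-2 * math.log(k * math.sqrt(2 * math.pi)))
print(round(x_k, 2))  # half-width of the 95% HPD interval: 1.96
```

<p>Because the normal density is symmetric, the HPD interval coincides with the usual central interval; for a skewed posterior, as in the example below, the two differ.</p>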
<h3 id="example">Example</h3>
<blockquote>
<p>The following is a class assignment from my masters, in the fall of 2019.</p>
</blockquote>
<p><strong>Assume the following dataset follows an exponential distribution with scale parameter ${\theta}$. Choose a prior for ${\theta}$, then obtain the posterior distribution, the Bayes estimator, and the 0.95 HPD interval for the parameter.</strong></p>
<p><strong>3.29, 7.53, 0.48, 2.03, 0.36, 0.07, 4.49, 1.05, 9.15,3.67, 2.22, 2.16, 4.06, 11.62, 8.26, 1.96, 9.13, 1.78, 3.81, 17.02</strong></p>
<p>The density of the data model will be given by</p>
\[f(x|\theta) = \frac{1}{\theta}e^{\frac{-x}{\theta}}\]
<p>Let us denote $\sum_{i=1}^n x_i =S_n$; the likelihood is then given by</p>
\[L(x|\theta) = \left(\frac{1}{\theta}\right)^ne^{\frac{-S_n}{\theta}}\]
<p>Now Since we do not have any info about $\theta$ let us assume non-informative prior</p>
\[\pi{(\theta)} = \frac{1}{\theta}\]
<p>Then the posterior will be given by</p>
\[\pi{(\theta|x)} = \frac{\frac{1}{\theta} \cdot \left(\frac{1}{\theta}\right)^ne^{\frac{-S_n}{\theta}}}{\int_0^{\infty}\frac{1}{\theta} \cdot \left(\frac{1}{\theta}\right)^ne^{\frac{-S_n}{\theta}}\,d\theta}\]
\[\pi{(\theta|x)} = \frac{S_{n}^n}{\Gamma(n)}{ \cdot \left(\frac{1}{\theta}\right)^{n+1}e^{\frac{-S_n}{\theta}}}\]
<p>This is the kernel of an Inverse-Gamma density, so</p>
\[\theta | x \sim Inv\text{-}Gamma(n,S_n)\]
<p>So the Bayes estimate (the posterior mean, under squared-error loss) is given by $\frac{S_n}{n-1}$</p>
<h3 id="code">Code</h3>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">xobs</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">3.29</span><span class="p">,</span><span class="w"> </span><span class="m">7.53</span><span class="p">,</span><span class="w"> </span><span class="m">0.48</span><span class="p">,</span><span class="w"> </span><span class="m">2.03</span><span class="p">,</span><span class="w"> </span><span class="m">0.36</span><span class="p">,</span><span class="w"> </span><span class="m">0.07</span><span class="p">,</span><span class="w"> </span><span class="m">4.49</span><span class="p">,</span><span class="w"> </span><span class="m">1.05</span><span class="p">,</span><span class="w"> </span><span class="m">9.15</span><span class="p">,</span><span class="m">3.67</span><span class="p">,</span><span class="w"> </span><span class="m">2.22</span><span class="p">,</span><span class="w">
</span><span class="m">2.16</span><span class="p">,</span><span class="w"> </span><span class="m">4.06</span><span class="p">,</span><span class="w"> </span><span class="m">11.62</span><span class="p">,</span><span class="w"> </span><span class="m">8.26</span><span class="p">,</span><span class="w"> </span><span class="m">1.96</span><span class="p">,</span><span class="w"> </span><span class="m">9.13</span><span class="p">,</span><span class="w"> </span><span class="m">1.78</span><span class="p">,</span><span class="w"> </span><span class="m">3.81</span><span class="p">,</span><span class="w"> </span><span class="m">17.02</span><span class="p">)</span><span class="w">
</span><span class="n">Bayes_Estimate</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">xobs</span><span class="p">)</span><span class="o">/</span><span class="p">(</span><span class="nf">length</span><span class="p">(</span><span class="n">xobs</span><span class="p">)</span><span class="m">-1</span><span class="p">)</span><span class="w"> </span><span class="c1"># Bayes Estimate</span><span class="w">
</span><span class="n">cat</span><span class="p">(</span><span class="s2">"Bayes Estimate of scale parameter is given by "</span><span class="p">,</span><span class="n">Bayes_Estimate</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## Bayes Estimate of scale parameter is given by 4.954737
</code></pre></div></div>
<p>Now the <strong>HPDI</strong> will be given by</p>
\[\int_{\theta : \pi(\theta|X) \geq k} \pi(\theta|X)d\theta = 1-\alpha\]
<p>where $1- \alpha = 0.95$. Again, think of a horizontal line on the posterior density: the interval between the points where the density crosses the line must carry probability 0.95.</p>
<p>Let us take a look at posterior density function</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">s</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">xobs</span><span class="p">)</span><span class="w">
</span><span class="n">l</span><span class="w"> </span><span class="o">=</span><span class="nf">length</span><span class="p">(</span><span class="n">xobs</span><span class="p">)</span><span class="w">
</span><span class="n">curve</span><span class="p">(</span><span class="n">dinvgamma</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="p">,</span><span class="w"> </span><span class="n">rate</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">s</span><span class="w"> </span><span class="p">,</span><span class="w"> </span><span class="n">shape</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">l</span><span class="p">),</span><span class="n">from</span><span class="o">=</span><span class="m">0</span><span class="p">,</span><span class="n">to</span><span class="o">=</span><span class="m">10</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p><img src="/assets/images/unnamed-chunk-2-1.png" alt="" /><!-- --></p>
<p>Now let us find the HPD interval; recall the posterior is given by</p>
\[\pi{(\theta|x)} = \frac{S_{n}^n}{\Gamma(n)}{ \cdot \left(\frac{1}{\theta}\right)^{n+1}e^{\frac{-S_n}{\theta}}}\]
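<p>The same "lower the horizontal line" idea can also be run in plain Python as a cross-check of the R grid search below: evaluate the Inverse-Gamma posterior on a fine grid, then accumulate grid cells in order of decreasing density until 95% of the mass is covered. The grid range $[0, 30]$ and step size are pragmatic choices for this dataset, not part of the original assignment.</p>

```python
import math

xobs = [3.29, 7.53, 0.48, 2.03, 0.36, 0.07, 4.49, 1.05, 9.15, 3.67,
        2.22, 2.16, 4.06, 11.62, 8.26, 1.96, 9.13, 1.78, 3.81, 17.02]
n, s = len(xobs), sum(xobs)

def posterior(theta):
    # Inverse-Gamma(n, S_n): S_n^n / Gamma(n) * theta^{-(n+1)} * exp(-S_n/theta)
    return math.exp(n * math.log(s) - math.lgamma(n)
                    - (n + 1) * math.log(theta) - s / theta)

N = 200_000
dt = 30.0 / N  # posterior mass beyond theta = 30 is negligible here
grid = [(i + 0.5) * dt for i in range(N)]
dens = [posterior(t) for t in grid]

# Add grid cells from highest density downward until 95% mass is covered;
# the cells collected form the region {theta : pi(theta|x) >= k}.
order = sorted(range(N), key=lambda i: -dens[i])
acc, region = 0.0, []
for i in order:
    acc += dens[i] * dt
    region.append(grid[i])
    if acc >= 0.95:
        break

print(round(min(region), 2), round(max(region), 2))  # should land close to the R result below
```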
<h3 id="code-for-hpdi">Code for HPDI</h3>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ruler1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">s</span><span class="o">/</span><span class="p">(</span><span class="n">l</span><span class="m">+1</span><span class="p">),</span><span class="n">length</span><span class="o">=</span><span class="m">3500</span><span class="w"> </span><span class="p">)</span><span class="w"> </span><span class="c1">#s\(l+1) is mode of posterior</span><span class="w">
</span><span class="n">ruler2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">s</span><span class="o">/</span><span class="p">(</span><span class="n">l</span><span class="m">+1</span><span class="p">),</span><span class="w"> </span><span class="m">8</span><span class="w"> </span><span class="p">,</span><span class="n">length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">5000</span><span class="p">)</span><span class="w">
</span><span class="n">target</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.95</span><span class="w">
</span><span class="n">tolerance</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.0005</span><span class="w">
</span><span class="n">done</span><span class="o"><-</span><span class="w"> </span><span class="kc">FALSE</span><span class="w">
</span><span class="k">for</span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">ruler1</span><span class="p">)</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="k">for</span><span class="p">(</span><span class="n">j</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">ruler2</span><span class="p">)</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="k">if</span><span class="p">(</span><span class="nf">round</span><span class="p">(</span><span class="n">dinvgamma</span><span class="p">(</span><span class="n">i</span><span class="p">,</span><span class="n">rate</span><span class="o">=</span><span class="n">s</span><span class="p">,</span><span class="n">shape</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">l</span><span class="p">),</span><span class="m">3</span><span class="p">)</span><span class="o">==</span><span class="nf">round</span><span class="p">(</span><span class="n">dinvgamma</span><span class="p">(</span><span class="n">j</span><span class="p">,</span><span class="n">rate</span><span class="o">=</span><span class="n">s</span><span class="p">,</span><span class="n">shape</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">l</span><span class="p">),</span><span class="m">3</span><span class="p">))</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="c1">#print(paste(i,"and",j))</span><span class="w">
</span><span class="n">L</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">pinvgamma</span><span class="p">(</span><span class="n">i</span><span class="p">,</span><span class="n">rate</span><span class="o">=</span><span class="n">s</span><span class="p">,</span><span class="n">shape</span><span class="o">=</span><span class="n">l</span><span class="p">)</span><span class="w">
</span><span class="n">H</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">pinvgamma</span><span class="p">(</span><span class="n">j</span><span class="p">,</span><span class="n">rate</span><span class="o">=</span><span class="n">s</span><span class="p">,</span><span class="n">shape</span><span class="o">=</span><span class="n">l</span><span class="p">)</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(((</span><span class="n">H</span><span class="o">-</span><span class="n">L</span><span class="p">)</span><span class="o"><</span><span class="p">(</span><span class="n">target</span><span class="o">+</span><span class="n">tolerance</span><span class="p">))</span><span class="w"> </span><span class="o">&</span><span class="w"> </span><span class="p">((</span><span class="n">H</span><span class="o">-</span><span class="n">L</span><span class="p">)</span><span class="o">></span><span class="p">(</span><span class="n">target</span><span class="o">-</span><span class="n">tolerance</span><span class="p">)))</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="n">done</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="kc">TRUE</span><span class="w">
</span><span class="k">break</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">done</span><span class="p">){</span><span class="k">break</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">HPD.L</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">i</span><span class="p">;</span><span class="w"> </span><span class="n">HPD.U</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">j</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">paste</span><span class="p">(</span><span class="n">target</span><span class="o">*</span><span class="m">100</span><span class="p">,</span><span class="w"> </span><span class="s2">"% HPD interval:"</span><span class="p">,</span><span class="w"> </span><span class="n">HPD.L</span><span class="p">,</span><span class="w"> </span><span class="s2">"to"</span><span class="p">,</span><span class="w"> </span><span class="n">HPD.U</span><span class="p">))</span><span class="w">
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] "95 % HPD interval: 2.94588413015964 to 7.2851736061498"
</code></pre></div></div>Rahul GoswamiIntroduction to Logistic Regression2020-10-12T00:00:00+00:002020-10-12T00:00:00+00:00https://www.iroblack.com/Logistic-Regression<p>Usually in linear regression we consider $X$ an explanatory-variable matrix whose columns $X_1 , X_2, ..., X_{p}$ are the variables used to predict the dependent variable $y$; these values are measured on a continuous scale. Sometimes, however, the dependent variable $y$ is dichotomous, such as Male or Female, Pass or Fail, Malignant or Benign.</p>
<p>When the dependent variable $y$ is qualitative, we can encode it with an indicator variable, for example</p>
\[y = 0\ \ \ if\ female \\
y = 1 \ \ \ if \ male\]
<p>So</p>
\[y_i = \beta_0 + \beta_1x_{i1}+ \beta_2x_{i2}+.....+ \beta_px_{ip} + \epsilon_i \ \ \ \ \ \ i = 1,2,3,........,n\]
<p>or in the matrix form we can write</p>
\[Y = \begin{bmatrix}
y_1 \\
y_2 \\
y_3 \\
. \\
. \\
y_n \\
\end{bmatrix} \ \ X = \begin{bmatrix}
1 & x_{1,1} & x_{1,2} & x_{1,3} & . &. & x_{1,p}\\
1 & x_{2,1} & x_{2,2} & x_{2,3} & . &. & x_{2,p}\\
. & . & . & . & . & . & x_{3,p} \\
. & . & . & . & . & . & .\\
. & . & . & . & . & . & .\\
1 & x_{n,1} & x_{n,2} & x_{n,3} & . & . & x_{n,p}\\
\end{bmatrix}
\ \
\beta = \begin{bmatrix}
\beta_0 \\
\beta_1 \\
\beta_2 \\
. \\
. \\
\beta_p \\
\end{bmatrix}
\epsilon = \begin{bmatrix}
\epsilon_1 \\
\epsilon_2 \\
\epsilon_3 \\
. \\
. \\
. \\
\epsilon_n \\
\end{bmatrix}\]
<p>that is</p>
\[Y = X\beta + \epsilon\]
<p>Remember first column of independent variable matrix X is $\underline{1}$ , for the constant $\beta_0$</p>
<p>Suppose the dependent variable $y$ that we have to predict is an indicator taking two values, and assume $y$ follows a Bernoulli distribution</p>
\[y_i = 1 \ with \ P(y_i = 1 ) = \pi_i \\
y_i = 0 \ with \ P(y_i = 0 ) = 1-\pi_i\]
<p>Assuming $E(\epsilon_i) = 0$,</p>
\[E(y_i) = 1 \cdot \pi_i + 0 \cdot(1 - \pi_i) = \pi_i \\
E(Y) = X\beta = \pi\]
<p>where</p>
\[\pi = \begin{bmatrix}
\pi_{1} & \pi_{2} & \pi_{3}& . & . \pi_{n}\\
\end{bmatrix}^{T}\]
<p>In linear regression $\epsilon$ is assumed to follow a normal distribution; here we cannot make that assumption, because for a given $x_i$ the error $\epsilon_i$ takes only two discrete values</p>
<p>So we have $E(y_i) =\pi_{i} = \beta_0 + \beta_1x_{i1}+ \beta_2x_{i2}+ \cdots + \beta_px_{ip}$, where $E(y_i) \in [0,1]$ puts a bound on the expected value of $y$</p>
<p>In logistic regression we use <strong>Standard logistic function</strong> , some people call it a <strong>Sigmoid function</strong>. It can be given by</p>
\[E(y_i) = \pi_i = \frac{1}{1+e^{-(\beta_0 + \beta_1x_{i1}+ \beta_2x_{i2}+.....+ \beta_px_{ip})}} \tag{1}\]
<p>In logistic regression our main aim is to estimate $\pi$, the Bernoulli parameter for $Y$; generally we make the decision according to whether $\pi_i$ is greater than or less than 0.5</p>
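<p>A quick numerical illustration of the logistic function and the 0.5 decision rule; the coefficients here are made up for illustration.</p>

```python
import math

def sigmoid(eta):
    # maps any real-valued linear predictor into (0, 1)
    return 1.0 / (1.0 + math.exp(-eta))

b0, b1 = -1.5, 0.8  # hypothetical beta_0, beta_1
for x in [0.0, 1.875, 4.0]:
    eta = b0 + b1 * x           # linear predictor
    pi = sigmoid(eta)           # estimated P(y = 1 | x)
    label = 1 if pi > 0.5 else 0
    print(f"x = {x:5.3f}   pi = {pi:.3f}   predicted y = {label}")
```

<p>Note that $\eta = 0$ corresponds exactly to $\pi = 0.5$, so the decision boundary is the set of $x$ where the linear predictor vanishes.</p>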
<h4 id="link-function">Link Function</h4>
<p>Every generalized linear model has a link function which relates the linear predictor $ \eta_i $ to the mean response $ \mu_i $. First of all we have to understand the linear predictor: it is the <strong>systematic component</strong> $ \eta_i = \beta_0 + \beta_1x_{i1}+ \cdots + \beta_px_{ip} $. So if $g( . )$ is a link function then</p>
\[g(\mu_i ) = \eta_i \ \ \ or \ \ \ \mu_i = g^{-1}(\eta_i)\]
<p>In linear regression this link is the identity link, whereas in logistic regression $ \mu_i = E(y_i) =\pi_{i} $, and the relation between $\pi_i$ and $\eta_i = \beta_0 + \beta_1x_{i1}+ \cdots + \beta_px_{ip} $ is the logistic function, so</p>
\[\pi = g^{-1}(X\beta)\]
<p>We can use equation (1) above to get the link function</p>
\[\pi = \frac{exp(X\beta)}{1+exp(X\beta)} \\
X\beta=\eta = ln(\frac{\pi}{1-\pi})\]
<p>where $\frac{\pi}{1-\pi}$ is the odds and its log is known as the <strong><em>log</em>-odds</strong>; this transformation is the logit transformation.</p>
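<p>Since $\beta$ has no closed-form estimate, it is found numerically; below is a minimal gradient-descent sketch for a single feature. The toy data, learning rate, and iteration count are all invented for illustration and are not part of the original post.</p>

```python
import math

# Toy data: one feature x, binary response y (made up for illustration)
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0,   0,   0,   0,   1,   1,   1,   1]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

b0, b1 = 0.0, 0.0   # beta_0 (intercept), beta_1 (slope)
lr = 0.5            # learning rate
for _ in range(5000):
    g0 = g1 = 0.0
    for x, y in zip(xs, ys):
        err = sigmoid(b0 + b1 * x) - y  # gradient of the log-loss w.r.t. eta
        g0 += err
        g1 += err * x
    b0 -= lr * g0 / len(xs)
    b1 -= lr * g1 / len(xs)

print(f"P(y=1 | x=1.0) = {sigmoid(b0 + b1 * 1.0):.2f}")
print(f"P(y=1 | x=3.5) = {sigmoid(b0 + b1 * 3.5):.2f}")
```

<p>After training, small $x$ receives a probability below 0.5 and large $x$ above it, matching the labels.</p>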
<p>It is very hard to estimate $\beta$ theoretically, so we choose a gradient-descent algorithm to compute the parameters</p>Rahul GoswamiSupervised Learning with Scikit Learn2020-07-07T00:00:00+00:002020-07-07T00:00:00+00:00https://www.iroblack.com/Supervised%20Learning%20with%20Scikit%20Learn<p>Machine Learning is the art of giving computers the ability to learn from data and make decisions on their own without being explicitly programmed, for example:</p>
<ul>
<li>Determining whether a tumor is benign or malignant according to its size</li>
<li>Google News selecting similar news stories and clustering related news</li>
<li>Classifying emails as spam or not spam</li>
<li>Predicting house prices according to the number of rooms, furnishing, age, etc.</li>
<li>Detecting whether a bank transaction is fraudulent or not</li>
</ul>
<p>There are many more examples of machine learning; here we are going to discuss <strong>Supervised Machine Learning</strong>. The data has two parts, <em>features</em> and <em>labels</em>. Features are the inputs to the model: if we put in the size of a tumor, the model tells us whether it is malignant or benign, and those predictions, malignant or benign, are the labels. Some problems do not involve labels, such as grouping related news stories, but supervised learning is concerned with labeled data. So, loosely, machine-learning modeling with labels is known as supervised learning.</p>
<p>For further understanding, we are going to use the iris dataset, which has 4 features (Sepal.Length, Sepal.Width, Petal.Length and Petal.Width) and one target variable, Species.</p>
<p><img src="/assets/images/Presentation2.png" alt="" /></p>
<p>This is a long dataset with labels <strong>setosa, versicolor and virginica</strong>; however, we are showing only part of the data, so in the target column we can see only setosa.</p>
<p>A realization of the target variable is known as a label, though most data scientists use the two terms interchangeably. Predictor variable and feature are the same thing, also known as the independent variable, while the target variable is known as the dependent variable.</p>
<h1 id="classification">Classification</h1>
<p>Classification refers to machine learning models that classify things, such as classifying whether a mail is spam or not, or, in the iris data, classifying whether a plant is setosa, versicolor or virginica.</p>
<p>First of all, we will load our dataset using the following code, which also imports pandas and numpy under their usual aliases.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn</span> <span class="kn">import</span> <span class="n">datasets</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="n">iris</span> <span class="o">=</span> <span class="n">datasets</span><span class="p">.</span><span class="n">load_iris</span><span class="p">()</span>
<span class="nb">type</span><span class="p">(</span><span class="n">iris</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sklearn.utils.Bunch
</code></pre></div></div>
<p>We can see that the iris dataset is a Bunch; a Bunch is a datatype that holds key-value pairs. We can look at the keys using the following code</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="n">iris</span><span class="p">.</span><span class="n">keys</span><span class="p">())</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">type</span><span class="p">(</span><span class="n">iris</span><span class="p">.</span><span class="n">data</span><span class="p">),</span><span class="nb">type</span><span class="p">(</span><span class="n">iris</span><span class="p">.</span><span class="n">target</span><span class="p">),</span><span class="nb">type</span><span class="p">(</span><span class="n">iris</span><span class="p">.</span><span class="n">target_names</span><span class="p">),</span><span class="nb">type</span><span class="p">(</span><span class="n">iris</span><span class="p">.</span><span class="n">DESCR</span><span class="p">),</span><span class="nb">type</span><span class="p">(</span><span class="n">iris</span><span class="p">.</span><span class="n">feature_names</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(numpy.ndarray, numpy.ndarray, numpy.ndarray, str, list)
</code></pre></div></div>
<p>We can see that iris.data and iris.target are NumPy arrays, target_names is also an array, DESCR is a string and feature_names is a list. If we check <code class="language-plaintext highlighter-rouge">iris.data.shape</code> and <code class="language-plaintext highlighter-rouge">iris.target.shape</code>, we see that the data has 150 rows and 4 columns — these are our features — and we can take a look at them with the command <code class="language-plaintext highlighter-rouge">print(iris.data)</code>. Similarly, the target variable has 150 entries, one per row, as we expected, and we can look at it using <code class="language-plaintext highlighter-rouge">print(iris.target)</code>. However, our target variable is encoded, where</p>
<ul>
<li>0 represents setosa</li>
<li>1 represents versicolor</li>
<li>2 represents virginica</li>
</ul>
<p>The mapping can be seen using <code class="language-plaintext highlighter-rouge">iris.target_names</code> and is also described in <code class="language-plaintext highlighter-rouge">iris.DESCR</code>. Let us store <strong>iris.data</strong> in the variable X and <strong>iris.target</strong> in y</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">X</span> <span class="o">=</span> <span class="n">iris</span><span class="p">.</span><span class="n">data</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">iris</span><span class="p">.</span><span class="n">target</span>
</code></pre></div></div>
<p>Let us construct a dataframe from <strong>X</strong> with <strong>iris.feature_names</strong> as the header, and show what our dataframe actually looks like using the <code class="language-plaintext highlighter-rouge">head()</code> method</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">X</span> <span class="p">,</span> <span class="n">columns</span> <span class="o">=</span> <span class="n">iris</span><span class="p">.</span><span class="n">feature_names</span><span class="p">)</span>
<span class="n">df</span><span class="p">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>
<table style="width:100%">
<thead>
<tr>
<th></th>
<th>sepal length (cm)</th>
<th>sepal width (cm)</th>
<th>petal length (cm)</th>
<th>petal width (cm)</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>5.1</td>
<td>3.5</td>
<td>1.4</td>
<td>0.2</td>
</tr>
<tr>
<th>1</th>
<td>4.9</td>
<td>3.0</td>
<td>1.4</td>
<td>0.2</td>
</tr>
<tr>
<th>2</th>
<td>4.7</td>
<td>3.2</td>
<td>1.3</td>
<td>0.2</td>
</tr>
<tr>
<th>3</th>
<td>4.6</td>
<td>3.1</td>
<td>1.5</td>
<td>0.2</td>
</tr>
<tr>
<th>4</th>
<td>5.0</td>
<td>3.6</td>
<td>1.4</td>
<td>0.2</td>
</tr>
</tbody>
</table>
<h2 id="k-nearest-neighbours">k-Nearest Neighbours</h2>
<p>Now let us train our first model using k-Nearest Neighbors (kNN). The idea is quite simple. First, suppose there are only two features in our dataset; then we can plot each observation (a single row in the dataset) as a point on the 2D plane, with the first feature on the x-axis and the second on the y-axis, and let the color of the point, red or blue, be its label. Now suppose we get a new observation with an unknown label, again with the same two features. We can plot it on the same plane, but we cannot determine its color since it is not labeled, so we have to predict its label. If we take the 3 nearest observations on the plane, that is kNN with k=3: we take the majority vote of the 3 nearest neighbors, and if 2 of them are blue our prediction is blue. The prediction may change as k changes: with k=5, if 3 of the 5 nearest neighbors are red and 2 are blue, we predict red.</p>
<p><img src="/assets/images/Slide2.png" alt="image" /></p>
<p>This algorithm extends to n features, where n is greater than 2, by plotting the points in n-dimensional Euclidean space and then computing the nearest neighbors.</p>
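<p>The majority-vote procedure described above can be sketched from scratch. This is only a simplified illustration with plain NumPy on made-up toy points, not scikit-learn's implementation:</p>

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest neighbors."""
    distances = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distance to each training point
    nearest = np.argsort(distances)[:k]                  # indices of the k closest points
    votes = Counter(y_train[nearest])                    # count labels among them
    return votes.most_common(1)[0][0]

# Toy 2D data: label 0 points near the origin, label 1 points near (5, 5)
X_train = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.3],
                    [5.0, 5.1], [4.9, 5.0], [5.2, 4.8]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([0.1, 0.2]), k=3))  # 0
print(knn_predict(X_train, y_train, np.array([5.0, 5.0]), k=3))  # 1
```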
<h3 id="training-and-prediction">Training and Prediction</h3>
<p>In Scikit Learn there are two important methods: <code class="language-plaintext highlighter-rouge">.fit</code>, used for training the model, and <code class="language-plaintext highlighter-rouge">.predict</code>, used to predict labels with a trained model. Now, to use kNN we have to import <strong>KNeighborsClassifier</strong> from the sklearn.neighbors module using <code class="language-plaintext highlighter-rouge">from sklearn.neighbors import KNeighborsClassifier</code>, then initialize it and set the value of k, say 5, using <code class="language-plaintext highlighter-rouge">KNeighborsClassifier(n_neighbors=5)</code>, and finally fit the data using the <code class="language-plaintext highlighter-rouge">.fit</code> method</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.neighbors</span> <span class="kn">import</span> <span class="n">KNeighborsClassifier</span>
<span class="n">knn_model</span> <span class="o">=</span> <span class="n">KNeighborsClassifier</span><span class="p">(</span><span class="n">n_neighbors</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span> <span class="c1">#Storing the model in varible knn_model
</span><span class="n">knn_model</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">,</span><span class="n">y</span><span class="p">)</span> <span class="c1">#Fitting ot training the model
</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=None, n_neighbors=5, p=2,
weights='uniform')
</code></pre></div></div>
<p>Now we have trained our model and stored it in the variable <strong>knn_model</strong>. If we are given the sepal length, sepal width, petal length and petal width, we can predict the species; let us predict for 4.4, 3.8, 3.7 and 0.9 respectively, using the .predict method</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">knn_model</span><span class="p">.</span><span class="n">predict</span><span class="p">([[</span><span class="mf">4.4</span><span class="p">,</span><span class="mf">3.8</span><span class="p">,</span><span class="mf">3.7</span><span class="p">,</span><span class="mf">0.9</span><span class="p">]])</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>array([1])
</code></pre></div></div>
<p>Hence we can see we have predicted 1, which represents <strong>versicolor</strong>. Similarly, we can make many predictions at once by creating a NumPy array and passing it as an argument to knn_model.predict(); we must take care that the number of columns equals the number of features used to train the model. Let us see an example</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">array</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([[</span><span class="mf">4.4</span><span class="p">,</span><span class="mf">3.8</span><span class="p">,</span><span class="mf">3.7</span><span class="p">,</span><span class="mf">0.9</span><span class="p">],</span>
<span class="p">[</span><span class="mf">3.2</span><span class="p">,</span><span class="mf">5.7</span><span class="p">,</span><span class="mf">2.0</span><span class="p">,</span><span class="mf">1.3</span><span class="p">],</span>
<span class="p">[</span><span class="mf">5.5</span><span class="p">,</span><span class="mf">1.9</span><span class="p">,</span><span class="mf">2.8</span><span class="p">,</span><span class="mf">4.7</span><span class="p">],</span>
<span class="p">[</span><span class="mf">3.2</span><span class="p">,</span><span class="mf">9.7</span><span class="p">,</span><span class="mf">6.2</span><span class="p">,</span><span class="mf">1.0</span><span class="p">]])</span>
<span class="n">prediction</span><span class="o">=</span><span class="n">knn_model</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">array</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="n">prediction</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[1 0 2 2]
</code></pre></div></div>
<p>Now we can get the decoded species names by passing the predictions to iris.target_names as an index</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">iris</span><span class="p">.</span><span class="n">target_names</span><span class="p">[</span><span class="n">prediction</span><span class="p">]</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>array(['versicolor', 'setosa', 'virginica', 'virginica'], dtype='<U10')
</code></pre></div></div>
<h2 id="measuring-the-performance">Measuring the performance</h2>
<p>Now that we have trained our model, we must measure its performance to get an idea of how good or bad it is. There are various metrics to measure performance, such as Accuracy, Precision, F-Measure, etc. One question is which data to use for computing a metric: the data used for training will give an overly optimistic value, because a model may be good only on the data it was trained on, while our main goal in machine learning is to train a model that predicts labels for new data. Ideally we would compute the metric on new data, but new data will not be labeled, so the standard procedure for a data scientist is to split the data into train and test sets, use the train set for training and the test set for computing the metric. We will use Accuracy here, which equals the number of correct predictions divided by the total number of predictions: if there are 100 observations in the test set and our model predicts 75 of them correctly, then 75 predictions are right and 25 are wrong, so the accuracy is 75%.</p>
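<p>As a quick sketch of the accuracy formula above, on made-up label arrays, computed both by hand and with scikit-learn's accuracy_score helper:</p>

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions for 8 test observations
y_true = np.array([0, 1, 2, 1, 0, 2, 1, 0])
y_pred = np.array([0, 1, 2, 0, 0, 2, 1, 1])  # 6 correct, 2 wrong

manual = np.mean(y_true == y_pred)       # correct predictions / total predictions
print(manual)                            # 0.75
print(accuracy_score(y_true, y_pred))    # same value
```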
<h3 id="splitting-the-dataset-into-train-and-test-sets">Splitting the dataset into Train and Test sets</h3>
<p>To split the dataset, first of all, we will import <code class="language-plaintext highlighter-rouge">train_test_split</code> from <code class="language-plaintext highlighter-rouge">sklearn.model_selection</code>, now the method train_test_split() will take some arguments, the first argument will be feature data and the second will be labels and that will be train_test_split(X,y) however this method will work fine, but to increase the usability of method it can take more arguments such as</p>
<ul>
<li><strong>test_size</strong> is the proportion of the test set; the default is 0.25, which means 25% of the data becomes the test set and 75% the train set. If someone wants a 20% test set and an 80% train set, they can use <code class="language-plaintext highlighter-rouge">test_size = 0.2</code></li>
<li><strong>random_state</strong> is the seed for random number generation. The train_test_split method splits the dataset randomly: it does not simply take the first 25% of the data for the test set, it selects observations at random. If we want to reproduce the same train and test sets in the future, we can do so by passing the same random_state</li>
<li><strong>stratify</strong> is set to "y" if we want the test set to have the same proportion of labels as the full dataset; this argument stratifies the split according to the labels. In our iris dataset there are three labels, <strong>setosa, versicolor and virginica</strong>, so the dataset is first divided into three groups, one per label; 25% of each group is then sampled at random and merged to create the test set. This way the proportions of setosa, versicolor and virginica are the same in the test set as in the iris dataset</li>
</ul>
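<p>As a quick check of what the stratify argument does, the following sketch (using a fixed random_state for reproducibility) compares the class counts of the full iris dataset with those of a stratified test set:</p>

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data, iris.target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)

print(np.bincount(y))     # [50 50 50] -> each class is 1/3 of the full dataset
print(np.bincount(y_te))  # [10 10 10] -> the same 1/3 proportion in the test set
```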
<p>Let us talk about the output of the <em>train_test_split</em> method: it gives four arrays, the features of the train set, the features of the test set, the labels of the train set and the labels of the test set. Let us split our dataset and train our model on the training set</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">train_test_split</span>
<span class="n">X_train</span> <span class="p">,</span> <span class="n">X_test</span> <span class="p">,</span> <span class="n">y_train</span> <span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">X</span><span class="p">,</span><span class="n">y</span><span class="p">,</span><span class="n">test_size</span> <span class="o">=</span> <span class="mf">0.25</span> <span class="p">,</span> <span class="n">stratify</span><span class="o">=</span><span class="n">y</span><span class="p">)</span>
<span class="n">knn_model</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span><span class="n">y_train</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=None, n_neighbors=5, p=2,
weights='uniform')
</code></pre></div></div>
<p>Now we will use the trained model to predict the labels of test set X_test</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">X_test_prediction</span> <span class="o">=</span> <span class="n">knn_model</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_test</span><span class="p">)</span>
<span class="n">X_test_prediction</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>array([2, 1, 0, 2, 0, 1, 0, 0, 1, 2, 1, 0, 1, 2, 1, 0, 2, 0, 1, 0, 1, 2,
2, 1, 0, 1, 0, 1, 0, 2, 1, 1, 0, 2, 1, 2, 0, 2])
</code></pre></div></div>
<p>Further we use <code class="language-plaintext highlighter-rouge">.score</code> method to calculate Accuracy , this method will take arguments the test set and labels of the test sets</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">knn_model</span><span class="p">.</span><span class="n">score</span><span class="p">(</span><span class="n">X_test</span><span class="p">,</span><span class="n">y_test</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0.9736842105263158
</code></pre></div></div>
<p>Hence we can see that we have about 97% Accuracy. Here we have used k=5, but what happens if we increase <strong>k</strong>? kNN models create a decision boundary, which divides the whole Euclidean space into different regions, where the number of regions equals the number of classes. In our example, kNN divides the 4-dimensional Euclidean space (4-dimensional because there are four features) into 3 regions, and the label of any new data point is decided by the region it falls in. As we increase k, the decision boundary becomes smoother.</p>
<p><img src="/assets/images/alpha.png" alt="Phoflso" />
<span>Photo credit: <a href="http://faculty.marshall.usc.edu/gareth-james/ISL/"><strong>An Introduction to Statistical Learning with Applications in R</strong> (Available for FREE!!! <i class="far fa-laugh-beam"></i> )</a>
</span></p>
<p>As we can see, for k=1 our decision boundary, shown in black, fits the training data too closely, while for k=100 it is too smooth. If k is large the decision boundary is smoother, hence a less complex model; for small k the decision boundary is less smooth and gives a more complex model, which is more sensitive to noise in the data and may predict well on the training data but fail on new data. This is known as <strong>overfitting</strong>. If we increase k too much, the decision boundary becomes too smooth (tending toward a straight line) and may perform poorly on both the test and train sets, as we can see in the figure for k=100; this is commonly known as <strong>underfitting</strong>. So we must choose k such that the model is neither underfitted nor overfitted, that is, k neither too large nor too small. For k=10 we get the following</p>
<p><img src="/assets/images/beta.png" alt="Phoflso" />
<span>Photo credit: <a href="http://faculty.marshall.usc.edu/gareth-james/ISL/"><strong>An Introduction to Statistical Learning with Applications in R</strong> (Available for FREE!!! <i class="far fa-laugh-beam"></i> )</a>
</span></p>
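<p>One common way to choose k is to sweep over several values and compare train and test accuracy; the following minimal sketch does this on a fresh iris split (the particular k values are arbitrary):</p>

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(iris.data, iris.target,
                                          test_size=0.25, stratify=iris.target,
                                          random_state=0)

for k in (1, 5, 25, 100):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    # Very small k tends to overfit (perfect train accuracy, lower test accuracy);
    # very large k tends to underfit (both accuracies drop).
    print(k, model.score(X_tr, y_tr), model.score(X_te, y_te))
```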
<h1 id="confusion-matrix">Confusion Matrix</h1>
<p>Accuracy is not always a good metric for measuring the performance of classification models. Suppose we have transaction data from a bank and must build a model that classifies whether a transaction is fraudulent or not; usually the vast majority of transactions, say 95%, are non-fraudulent. Data where one class is far more frequent than the others is known as imbalanced data, and on imbalanced data the accuracy metric does not perform well: a model that always predicts "not fraud" would already be 95% accurate. So there are other metrics to measure the performance of a model, and they can be obtained from a very famous matrix known as the Confusion Matrix.</p>
<p>In binary classification there are two classes, <em>Positive</em> and <em>Negative</em>; we call the class we are interested in positive. If we want to model transaction fraud, we are interested in the fraudulent transactions, so the fraud class is positive and the non-fraud class is negative. Various metrics can be calculated by the following formulas</p>
<p><img src="/assets/images/Slide4(1).png" alt="alp" /></p>
<p>F1-Score can also be interpreted as the harmonic mean of Precision and Recall, and is given by</p>
\[F1 \ Score = 2 \cdot \frac{precision \cdot recall}{precision + recall}\]
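<p>For a quick numeric check of the formulas above, take made-up counts of 40 true positives, 10 false positives and 40 false negatives:</p>

```python
tp, fp, fn = 40, 10, 40   # hypothetical counts from a binary confusion matrix

precision = tp / (tp + fp)                          # 40 / 50 = 0.8
recall = tp / (tp + fn)                             # 40 / 80 = 0.5
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(precision, recall, round(f1, 4))  # 0.8 0.5 0.6154
```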
<p>Confusion Matrix can be calculated</p>
<ol>
<li>import the confusion_matrix method from sklearn.metrics</li>
<li>call confusion_matrix with the actual test labels as the first argument and the predicted labels as the second</li>
</ol>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.metrics</span> <span class="kn">import</span> <span class="n">confusion_matrix</span>
<span class="k">print</span><span class="p">(</span><span class="n">confusion_matrix</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span><span class="n">X_test_prediction</span><span class="p">))</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[[13 0 0]
[ 0 13 0]
[ 0 1 11]]
</code></pre></div></div>
<p>Here we get a $3\times 3$ matrix because we have 3 classes; we are not limited to only the two classes positive and negative. Here we have three label classes, i.e. 'setosa', 'versicolor' and 'virginica'. Now, to get the performance metrics we run the following code</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.metrics</span> <span class="kn">import</span> <span class="n">classification_report</span>
<span class="k">print</span><span class="p">(</span><span class="n">classification_report</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span><span class="n">X_test_prediction</span><span class="p">))</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>              precision    recall  f1-score   support

           0       1.00      1.00      1.00        13
           1       0.93      1.00      0.96        13
           2       1.00      0.92      0.96        12

    accuracy                           0.97        38
   macro avg       0.98      0.97      0.97        38
weighted avg       0.98      0.97      0.97        38
</code></pre></div></div>
<h1 id="regression">Regression</h1>
<p>In regression the target variable is a continuous variable, such as the price of a mobile phone, temperature, etc. To get started, let us take the Boston housing dataset, which is already present in the sklearn module</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn</span> <span class="kn">import</span> <span class="n">datasets</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="n">boston</span> <span class="o">=</span> <span class="n">datasets</span><span class="p">.</span><span class="n">load_boston</span><span class="p">()</span>
</code></pre></div></div>
<p>Let us take a look at what we have imported into the boston variable using the <code class="language-plaintext highlighter-rouge">.keys()</code> method</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">boston</span><span class="p">.</span><span class="n">keys</span><span class="p">()</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])
</code></pre></div></div>
<p>Now we have the data and the feature names, so we can create a dataframe from <code class="language-plaintext highlighter-rouge">data</code> and the feature names and take a look at it with the <code class="language-plaintext highlighter-rouge">head</code> method, as we did in Classification</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">X</span> <span class="o">=</span> <span class="n">boston</span><span class="p">.</span><span class="n">data</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">boston</span><span class="p">.</span><span class="n">target</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">X</span> <span class="p">,</span> <span class="n">columns</span> <span class="o">=</span> <span class="n">boston</span><span class="p">.</span><span class="n">feature_names</span><span class="p">)</span>
<span class="n">df</span><span class="p">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>
<table>
<thead>
<tr style="text-align: right;">
<th></th>
<th>CRIM</th>
<th>ZN</th>
<th>INDUS</th>
<th>CHAS</th>
<th>NOX</th>
<th>RM</th>
<th>AGE</th>
<th>DIS</th>
<th>RAD</th>
<th>TAX</th>
<th>PTRATIO</th>
<th>B</th>
<th>LSTAT</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>0.00632</td>
<td>18.0</td>
<td>2.31</td>
<td>0.0</td>
<td>0.538</td>
<td>6.575</td>
<td>65.2</td>
<td>4.0900</td>
<td>1.0</td>
<td>296.0</td>
<td>15.3</td>
<td>396.90</td>
<td>4.98</td>
</tr>
<tr>
<th>1</th>
<td>0.02731</td>
<td>0.0</td>
<td>7.07</td>
<td>0.0</td>
<td>0.469</td>
<td>6.421</td>
<td>78.9</td>
<td>4.9671</td>
<td>2.0</td>
<td>242.0</td>
<td>17.8</td>
<td>396.90</td>
<td>9.14</td>
</tr>
<tr>
<th>2</th>
<td>0.02729</td>
<td>0.0</td>
<td>7.07</td>
<td>0.0</td>
<td>0.469</td>
<td>7.185</td>
<td>61.1</td>
<td>4.9671</td>
<td>2.0</td>
<td>242.0</td>
<td>17.8</td>
<td>392.83</td>
<td>4.03</td>
</tr>
<tr>
<th>3</th>
<td>0.03237</td>
<td>0.0</td>
<td>2.18</td>
<td>0.0</td>
<td>0.458</td>
<td>6.998</td>
<td>45.8</td>
<td>6.0622</td>
<td>3.0</td>
<td>222.0</td>
<td>18.7</td>
<td>394.63</td>
<td>2.94</td>
</tr>
<tr>
<th>4</th>
<td>0.06905</td>
<td>0.0</td>
<td>2.18</td>
<td>0.0</td>
<td>0.458</td>
<td>7.147</td>
<td>54.2</td>
<td>6.0622</td>
<td>3.0</td>
<td>222.0</td>
<td>18.7</td>
<td>396.90</td>
<td>5.33</td>
</tr>
</tbody>
</table>
<p>Before training the model, let us split our data; we cannot use the stratify argument here because our target variable is not categorical.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">X_train</span> <span class="p">,</span> <span class="n">X_test</span> <span class="p">,</span> <span class="n">y_train</span> <span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">X</span><span class="p">,</span><span class="n">y</span><span class="p">,</span><span class="n">test_size</span> <span class="o">=</span> <span class="mf">0.25</span> <span class="p">)</span>
</code></pre></div></div>
<h2 id="linear-regression">Linear Regression</h2>
<p>When we assume the target variable y is a linear function of the columns of X, or we can say a linear function of the features, the model is known as linear regression. It can be represented as</p>
\[\hat{y}_i = \sum_{j=0}^{p} a_{j}x_{ij}\]
<p>where $x_{i0} = 1$, so $a_0$ is the intercept. With a single feature this is the equation of a line, and the $a_j$ are known as the parameters of the linear regression.</p>
<p>Now our main aim is to set the $a_j$ such that the predicted value of y, generally represented by $\hat{y}$, is nearest to the actual value of y. To measure the difference between predicted and actual values we use loss functions; these are special functions which give 0 when the predicted label equals the actual label. One of the most common is the <strong>squared error loss function</strong>, given by</p>
\[Loss(\hat{y} ;y)= \sum_{i=1}^{n}(y_i - \hat{y}_i)^{2}\]
<p>So our problem is to reduce the loss, and to do so we have to find the parameter values that minimize it. For this loss function the estimates of the parameters $a_j$ are known as the least squares estimates; for different loss functions we get different estimates, but least squares estimates are the most used, so we will discuss them here</p>
<blockquote>
<p>p is the number of features, hence there are (p+1) parameters, where we added 1 because we must also estimate the constant term $a_0$; "n" is the number of observations, or we can say the number of rows in the dataset</p>
</blockquote>
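<p>For the squared error loss the least squares estimate has a well-known closed form, $\hat{a} = (X^{T}X)^{-1}X^{T}y$, where a column of ones is prepended to X for the intercept. A minimal NumPy sketch on made-up data (the true coefficients 2, 1, -3, 0.5 are arbitrary):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = 2.0 + X @ np.array([1.0, -3.0, 0.5]) + 0.01 * rng.normal(size=n)

# Prepend a column of ones so a_0 (the intercept) is estimated too
X1 = np.column_stack([np.ones(n), X])

# Solve the least squares problem; lstsq is the numerically stable way
# to solve the normal equations (X^T X) a = X^T y
a_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)

print(np.round(a_hat, 2))  # close to [ 2.  1. -3.  0.5]
```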
<p>Now to fit the model , we will run the following code</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.linear_model</span> <span class="kn">import</span> <span class="n">LinearRegression</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">LinearRegression</span><span class="p">()</span>
<span class="n">model</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span><span class="n">y_train</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
</code></pre></div></div>
<p>Now we can predict using <code class="language-plaintext highlighter-rouge">.predict</code> method as follows</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">prediction</span><span class="o">=</span><span class="n">model</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_test</span><span class="p">)</span>
</code></pre></div></div>
<p>As we have seen, the metric used to measure the performance of a model in the classification section is <strong>Accuracy</strong>; for regression, however, we cannot use accuracy. One of the most widely used metrics for regression is $R^2$, defined as the <strong>proportion of variability in Y that can be explained using X</strong>. It can be calculated with the following code</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">model</span><span class="p">.</span><span class="n">score</span><span class="p">(</span><span class="n">X_test</span><span class="p">,</span><span class="n">y_test</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0.736994702163782
</code></pre></div></div>
<p>Generally $R^2$ ranges from 0 to 1.
When $R^2$ is near 1 the model fits the data well, and when it is near 0 the fitted model is poor</p>
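<p>For intuition, $R^2$ can be computed from its definition, $1 - RSS/TSS$, in plain Python (a sketch with our own helper name, not what <code class="language-plaintext highlighter-rouge">.score</code> calls internally):</p>

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - RSS/TSS: proportion of variability in y explained by the model."""
    mean_y = sum(y_true) / len(y_true)
    rss = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))  # residual sum of squares
    tss = sum((yt - mean_y) ** 2 for yt in y_true)               # total sum of squares
    return 1 - rss / tss

# A perfect fit gives R^2 = 1; predicting the mean everywhere gives R^2 = 0
print(r_squared([1, 2, 3, 4], [1, 2, 3, 4]))              # 1.0
print(r_squared([1, 2, 3, 4], [2.5, 2.5, 2.5, 2.5]))      # 0.0
```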
<h2 id="cross-validation">Cross Validation</h2>
<p>Cross-Validation is a method that reduces our dependency on how the data happens to split into train and test sets. A single split may, purely by chance, make our performance metrics represent the model as good, because we do not use all of the data to calculate them. To remove this dependence on one particular train test split we use cross-validation, more precisely k-fold cross-validation, where k is a positive integer parameter. With k=5, for example, the observations in the dataset are divided into 5 groups, commonly known as folds. We hold out the first fold as the test set, merge all the other folds into the train set, and calculate the performance metric we are interested in; then we repeat the process holding out the second fold, giving the performance metric for the second split, and so on. In k-fold cross-validation we therefore calculate the performance metric k times, once for each of the k splits, and afterwards we can compute whatever statistic of interest we want over these k values, such as their mean, median or mode</p>
<p><img src="/assets/images/Slide3.png" alt="" /></p>
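<p>The fold construction described above can be sketched in plain Python (a toy illustration assuming the number of observations is divisible by k; scikit-learn's own splitters handle the general case):</p>

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    fold_size = n_samples // k
    for fold in range(k):
        # Hold out one fold as the test set, merge the rest into the train set
        test = indices[fold * fold_size:(fold + 1) * fold_size]
        train = indices[:fold * fold_size] + indices[(fold + 1) * fold_size:]
        yield train, test

# 10 observations, 5 folds: each fold holds out 2 observations as the test set
for train, test in k_fold_splits(10, 5):
    print(test)   # [0, 1], then [2, 3], ..., then [8, 9]
```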
<p>k-Fold Cross Validation is computationally expensive, since the whole process of training, prediction and metric calculation has to be repeated k times. The following is the way to do it</p>
<ol>
<li>Import <code class="language-plaintext highlighter-rouge">cross_val_score</code></li>
<li>Call <code class="language-plaintext highlighter-rouge">cross_val_score</code> with the model, the features array, the labels and the number of folds (for 5-fold, <code class="language-plaintext highlighter-rouge">cv=5</code>) as arguments, and store the result in a variable</li>
<li>Call a statistics function on the variable, such as <code class="language-plaintext highlighter-rouge">np.mean()</code> for the mean</li>
</ol>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">cross_val_score</span> <span class="c1">#Importing class
</span><span class="n">cross_validation_result</span> <span class="o">=</span> <span class="n">cross_val_score</span><span class="p">(</span><span class="n">model</span><span class="p">,</span><span class="n">X</span><span class="p">,</span><span class="n">y</span><span class="p">,</span><span class="n">cv</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span> <span class="c1">#Initializing
</span><span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">cross_validation_result</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0.3532759243958772
</code></pre></div></div>
<h2 id="shrinkage-method">Shrinkage Method</h2>
<p>Shrinkage is also known as <strong>Regularization</strong>. In general we estimate the parameters freely, but sometimes they become too large and lead to higher variance, so it is advisable to shrink the parameters toward 0. This can be done in various ways; two of the most famous are <strong>Ridge Regression</strong> and the <strong>Lasso</strong></p>
<h3 id="ridge-regression">Ridge Regression</h3>
<p>For Ridge Regression we just modify our general loss function as follows</p>
\[Loss(y \ ; \hat{y}) = \sum_{i=0}^n (y_i-\hat{y}_i)^2 + \alpha \sum_{i=1}^p a_i^2\]
<p>Where $\alpha \geq 0$ is a <em>tuning parameter</em> and $\alpha \sum_{i=1}^p a_i^2$ is known as the shrinkage penalty; note that the intercept $a_0$ is not included in the penalty. Unlike the least squares estimate, here we get a different set of parameters for each value of the tuning parameter: a tuning parameter equal to zero recovers the least squares estimate and may carry a greater chance of overfitting, while a very large tuning parameter penalizes the parameters too much, which can lead to underfitting. So we have to choose the tuning parameter so that it optimizes our model</p>
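<p>The penalized loss above can be sketched in plain Python (the function name and toy numbers are ours):</p>

```python
def ridge_loss(y_true, y_pred, coeffs, alpha):
    """Squared error plus the shrinkage penalty alpha * sum of squared coefficients.

    coeffs excludes the intercept a_0, which is never penalized.
    """
    rss = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    penalty = alpha * sum(a ** 2 for a in coeffs)
    return rss + penalty

# alpha = 0 reduces to the ordinary least squares loss
print(ridge_loss([1, 2], [1.5, 2], [2.0, -1.0], alpha=0.0))  # 0.25
print(ridge_loss([1, 2], [1.5, 2], [2.0, -1.0], alpha=0.1))  # 0.25 + 0.1*5 = 0.75
```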
<p>To do Ridge Regression</p>
<ol>
<li>Import <code class="language-plaintext highlighter-rouge">Ridge</code> from the module sklearn.linear_model</li>
<li>Then initialize the <code class="language-plaintext highlighter-rouge">Ridge()</code> class, passing the tuning parameter to the <code class="language-plaintext highlighter-rouge">alpha</code> argument</li>
<li>Then Train and predict as usual</li>
</ol>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.linear_model</span> <span class="kn">import</span> <span class="n">Ridge</span>
<span class="n">ridge_model</span> <span class="o">=</span> <span class="n">Ridge</span><span class="p">(</span><span class="n">alpha</span><span class="o">=</span> <span class="mf">0.9</span> <span class="p">)</span>
<span class="n">ridge_model</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span><span class="n">y_train</span><span class="p">)</span>
<span class="n">ridge_model</span><span class="p">.</span><span class="n">score</span><span class="p">(</span><span class="n">X_test</span><span class="p">,</span><span class="n">y_test</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0.7345197081669743
</code></pre></div></div>
<h3 id="lasso-regression">Lasso Regression</h3>
<p>Ridge Regression has the demerit that it shrinks the parameters toward 0 but never sets them exactly equal to 0. There may be features that explain no variance in the label at all; their coefficients should be set exactly to zero to increase the model interpretability. For the Lasso, we simply use the modulus of the parameters in place of the square of the parameters in the ridge loss</p>
\[Loss(y \ ; \hat{y}) = \sum_{i=0}^n (y_i-\hat{y}_i)^2 + \alpha \sum_{i=1}^p |{a_i}|\]
<p>Lasso shrinks the coefficients of the less important features all the way to 0</p>
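<p>The lasso loss can be sketched in plain Python just like the ridge loss (the function name and the toy numbers are ours):</p>

```python
def lasso_loss(y_true, y_pred, coeffs, alpha):
    """Squared error plus alpha * sum of |a_i| (intercept a_0 excluded)."""
    rss = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    return rss + alpha * sum(abs(a) for a in coeffs)

# For a small coefficient such as 0.1, |a| = 0.1 dominates a^2 = 0.01, which is
# why the absolute penalty keeps pressing small coefficients toward exactly 0
print(lasso_loss([1.0, 2.0], [1.0, 2.0], [0.1], alpha=1.0))  # 0.1
```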
<p>Lasso Regression uses nearly the same code script as Ridge Regression</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.linear_model</span> <span class="kn">import</span> <span class="n">Lasso</span>
<span class="n">lasso_model</span> <span class="o">=</span> <span class="n">Lasso</span><span class="p">(</span><span class="n">alpha</span><span class="o">=</span> <span class="mi">10</span><span class="p">)</span>
<span class="n">lasso_model</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span><span class="n">y_train</span><span class="p">)</span>
<span class="n">lasso_model</span><span class="p">.</span><span class="n">score</span><span class="p">(</span><span class="n">X_test</span><span class="p">,</span><span class="n">y_test</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0.7257977554026047
</code></pre></div></div>
<h1 id="logistic-regression">Logistic Regression</h1>
<p>Logistic Regression, despite being a regression, is mostly used for classification problems. It finds the probability that a given observation belongs to a particular class; if that probability is greater than 0.5, or 50%, the model predicts that the observation belongs to that class. The probability is estimated using the following function</p>
\[p = \sigma\left(\sum_{i=0}^p a_ix_i\right)= \frac{1}{1+ e^{-\sum_{i=0}^p a_ix_i}}\]
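<p>The logistic function $\sigma$ above can be evaluated directly in plain Python (a sketch for intuition, not part of scikit-learn):</p>

```python
import math

def sigmoid(z):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# z = 0 gives probability 0.5, the usual decision threshold
print(sigmoid(0.0))         # 0.5
print(sigmoid(3.0) > 0.5)   # True  -> predict the positive class
print(sigmoid(-3.0) > 0.5)  # False -> predict the other class
```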
<p>But we will not go too deep into the theory here, and will focus on practical use.
Using Logistic Regression is similar to the work we have done earlier: import the class, import the data, split the data, then test your model using performance metrics. Let us do this on the breast cancer data that is already available in the sklearn module</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn</span> <span class="kn">import</span> <span class="n">datasets</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="n">bcancer</span> <span class="o">=</span> <span class="n">datasets</span><span class="p">.</span><span class="n">load_breast_cancer</span><span class="p">()</span> <span class="c1">#Loading Data
</span><span class="kn">from</span> <span class="nn">sklearn.linear_model</span> <span class="kn">import</span> <span class="n">LogisticRegression</span> <span class="c1">#Importing class for logistic regression
</span><span class="n">LogReg_MODEL</span> <span class="o">=</span> <span class="n">LogisticRegression</span><span class="p">()</span> <span class="c1">#Initializing Logistic Regression class
</span><span class="n">X</span> <span class="o">=</span> <span class="n">bcancer</span><span class="p">.</span><span class="n">data</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">bcancer</span><span class="p">.</span><span class="n">target</span>
<span class="n">X_train</span> <span class="p">,</span> <span class="n">X_test</span> <span class="p">,</span> <span class="n">y_train</span> <span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">X</span><span class="p">,</span><span class="n">y</span><span class="p">,</span><span class="n">test_size</span> <span class="o">=</span> <span class="mf">0.25</span> <span class="p">,</span> <span class="n">stratify</span><span class="o">=</span><span class="n">y</span><span class="p">)</span> <span class="c1">#Splitting Data
</span><span class="n">LogReg_MODEL</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span><span class="n">y_train</span><span class="p">)</span> <span class="c1">#Training the model
</span><span class="n">AccuracyLogReg</span> <span class="o">=</span> <span class="n">LogReg_MODEL</span><span class="p">.</span><span class="n">score</span><span class="p">(</span><span class="n">X_test</span><span class="p">,</span><span class="n">y_test</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">AccuracyLogReg</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0.958041958041958
</code></pre></div></div>
<h1 id="roc-curve">ROC Curve</h1>
<p>ROC Curve is short for receiver operating characteristic curve.</p>
<p><strong>Threshold</strong></p>
<p>We generally take the threshold to be 0.5. In <em>kNN</em> this means that when more than half of the neighbors carry a particular class label, we predict that label. Suppose we fitted <strong>kNN</strong> with k=100 and there are two class labels, red and blue: we predict red for an observation if more than 50 of its neighbors are red. That 50 is the threshold number of red neighbors needed to classify it as red, and 50 is 0.5$\times$100, so the threshold here is 0.5. Similarly, in logistic regression p=0.5 is the threshold in general</p>
<p><strong>True Positive Rate and False Positive Rate (TPR and FPR)</strong></p>
<p>True Positive Rate is also known as <em>Recall</em>, and the False Positive Rate is given by</p>
\[FPR = \frac{FP}{FP+TN}\]
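<p>Both rates can be computed directly from the four confusion-matrix counts; a minimal sketch in plain Python (the helper name and toy labels are ours):</p>

```python
def tpr_fpr(y_true, y_pred):
    """True and false positive rates from binary labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp / (tp + fn), fp / (fp + tn)

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
print(tpr_fpr(y_true, y_pred))  # TPR = 2/3, FPR = 1/3
```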
<p>A model does not always perform best at a threshold of 0.5; sometimes it performs better at some other threshold. To investigate this we use the ROC curve, a plot of TPR against FPR in which each threshold contributes one point on the curve</p>
<ul>
<li>When the threshold is 0, we predict every observation as positive, so the TPR equals 1 and the FPR also equals 1</li>
<li>When the threshold is 1, both the TPR and the FPR equal 0</li>
</ul>
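<p>These two extreme cases can be checked with a small plain-Python sketch that sweeps thresholds over predicted scores (toy data and our own helper function, not scikit-learn's implementation):</p>

```python
def roc_points(y_true, y_score, thresholds):
    """(FPR, TPR) pair for each threshold: predict positive when score >= threshold."""
    points = []
    for t in thresholds:
        y_pred = [1 if s >= t else 0 for s in y_score]
        tp = sum(1 for yt, yp in zip(y_true, y_pred) if yt == 1 and yp == 1)
        fn = sum(1 for yt, yp in zip(y_true, y_pred) if yt == 1 and yp == 0)
        fp = sum(1 for yt, yp in zip(y_true, y_pred) if yt == 0 and yp == 1)
        tn = sum(1 for yt, yp in zip(y_true, y_pred) if yt == 0 and yp == 0)
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points

y_true  = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
# Threshold 0 predicts everything positive -> (FPR, TPR) = (1, 1);
# a threshold above every score predicts everything negative -> (0, 0)
print(roc_points(y_true, y_score, [0.0, 0.5, 1.1]))  # [(1.0, 1.0), (0.0, 0.5), (0.0, 0.0)]
```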
<p>To judge how good our model is, we use the area under the curve (AUC) as a performance metric for the ROC curve. A perfect classifier has TPR equal to 1 with FPR equal to 0, which corresponds to an area under the curve of 1, so we can use the ROC AUC as a performance metric</p>
<p><img src="/assets/images/Slide5.png" alt="" /></p>
<p>Now to create ROC curve we have to do the following</p>
<ol>
<li>Import <code class="language-plaintext highlighter-rouge">roc_curve</code> from sklearn.metrics</li>
<li>Use the <code class="language-plaintext highlighter-rouge">roc_curve()</code> function with the following two arguments
<ol>
<li>
<p><strong>y_true array, shape = [n_samples]</strong></p>
<p>True binary labels. If labels are not either {-1, 1} or {0, 1}, then pos_label should be explicitly given.</p>
</li>
<li>
<p><strong>y_score array, shape = [n_samples]</strong></p>
<p>Target scores, can either be probability estimates of the positive class, confidence values, or non-thresholded measure of decisions (as returned by “decision_function” on some classifiers).</p>
</li>
</ol>
</li>
<li>
<p>Now to calculate <em>y_score</em> we will use probability estimates, which we can get by using the <code class="language-plaintext highlighter-rouge">.predict_proba()</code> method on the test set. It outputs an array with two columns: the first column holds the estimated probabilities of the negative class and the second the probabilities of the positive class. The second column is our y_score, which we extract by subsetting with <code class="language-plaintext highlighter-rouge">[:,1]</code></p>
</li>
<li>Further, <code class="language-plaintext highlighter-rouge">roc_curve</code> has three outputs, which we store in the variables <strong>FPR, TPR and thresholds</strong></li>
<li>After that we import matplotlib.pyplot under the alias plt and use <code class="language-plaintext highlighter-rouge">.plot</code> to plot the ROC curve</li>
</ol>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.metrics</span> <span class="kn">import</span> <span class="n">roc_curve</span>
<span class="n">y_score</span> <span class="o">=</span> <span class="n">LogReg_MODEL</span><span class="p">.</span><span class="n">predict_proba</span><span class="p">(</span><span class="n">X_test</span><span class="p">)</span>
<span class="n">y_score</span> <span class="o">=</span> <span class="n">y_score</span><span class="p">[:,</span><span class="mi">1</span><span class="p">]</span> <span class="c1">#Keeping only the second column (positive-class probabilities)
</span><span class="n">fpr</span><span class="p">,</span> <span class="n">tpr</span><span class="p">,</span> <span class="n">thresholds</span> <span class="o">=</span> <span class="n">roc_curve</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">y_score</span><span class="p">)</span>
<span class="c1"># Now to plot the ROC curve
</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="n">plt</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">fpr</span><span class="p">,</span> <span class="n">tpr</span><span class="p">,</span> <span class="n">linewidth</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">plot</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span> <span class="s">'k--'</span><span class="p">)</span> <span class="c1"># to plot the dashed diagonal
</span><span class="n">plt</span><span class="p">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'False Positive Rate'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'True Positive Rate or Recall'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span> <span class="c1"># to show the plot
</span></code></pre></div></div>
<p><img src="/assets/images/output_53_0.png" alt="png" /></p>
<p>Now we want a performance metric for the model, and that is the AUC. To calculate it we just need to import <code class="language-plaintext highlighter-rouge">roc_auc_score</code> and pass the same arguments as we passed to <code class="language-plaintext highlighter-rouge">roc_curve</code></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.metrics</span> <span class="kn">import</span> <span class="n">roc_auc_score</span>
<span class="k">print</span><span class="p">(</span><span class="n">roc_auc_score</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">y_score</span><span class="p">))</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0.9903563941299791
</code></pre></div></div>
<h1 id="tuning-the-model">Tuning the Model</h1>
<h2 id="hyperparameters">Hyperparameters</h2>
<p>Hyperparameters are the parameters of the learning algorithm itself, such as the value of k in the k-Nearest Neighbors model, or the <em>tuning parameter</em> in ridge and lasso regression. For a finer model we have to tune the hyperparameters to their best setting. There is no cut-and-dried procedure for hyperparameter tuning; one philosophy is to select hyperparameter values, train and test with each, and choose whichever performs better. Manually fiddling with hyperparameters and then repeating the whole lot of training and testing is a tedious job, so Scikit Learn provides GridSearchCV to help us.</p>
<h2 id="grid-search">Grid Search</h2>
<p>GridSearchCV uses cross validation so that the hyperparameter selection is not affected by a single train test split. The GridSearchCV class takes the following arguments</p>
<ol>
<li>The model, an initialized model object for fitting</li>
<li><code class="language-plaintext highlighter-rouge">param_grid</code> a dictionary or a list of dictionary , this is the manual values of the hyperparameters we want to feed in</li>
<li><code class="language-plaintext highlighter-rouge">cv</code> number of folds for cross validation</li>
</ol>
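<p>Conceptually, what GridSearchCV automates looks like the following plain-Python sketch (the scoring function here is a made-up stand-in for a full cross-validated evaluation, not scikit-learn internals):</p>

```python
def grid_search(param_values, cv_score):
    """Try every candidate value, keep the one with the best cross-validated score."""
    best_value, best_score = None, float('-inf')
    for value in param_values:
        score = cv_score(value)  # stands in for a full k-fold evaluation
        if score > best_score:
            best_value, best_score = value, score
    return best_value, best_score

# Toy score peaking at k = 6, mimicking an accuracy-versus-k curve
toy_score = lambda k: 1.0 - 0.01 * (k - 6) ** 2
print(grid_search(range(1, 11), toy_score))  # (6, 1.0)
```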
<p>Let us tune the number of neighbors for a kNN classifier, using the iris dataset</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">GridSearchCV</span>
<span class="n">knn_model</span> <span class="o">=</span> <span class="n">KNeighborsClassifier</span><span class="p">()</span>
<span class="n">param_grid</span> <span class="o">=</span> <span class="p">{</span><span class="s">'n_neighbors'</span> <span class="p">:</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><span class="mi">4</span><span class="p">,</span><span class="mi">5</span><span class="p">,</span><span class="mi">6</span><span class="p">,</span><span class="mi">7</span><span class="p">,</span><span class="mi">8</span><span class="p">,</span><span class="mi">9</span><span class="p">,</span><span class="mi">10</span><span class="p">,</span><span class="mi">20</span><span class="p">,</span><span class="mi">30</span><span class="p">,</span><span class="mi">40</span><span class="p">,</span><span class="mi">50</span><span class="p">,</span><span class="mi">60</span><span class="p">,</span><span class="mi">70</span><span class="p">,</span><span class="mi">80</span><span class="p">,</span><span class="mi">90</span><span class="p">,</span><span class="mi">100</span><span class="p">]}</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">iris</span><span class="p">.</span><span class="n">data</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">iris</span><span class="p">.</span><span class="n">target</span>
<span class="n">knn_modelGridSearch</span><span class="o">=</span><span class="n">GridSearchCV</span><span class="p">(</span><span class="n">knn_model</span><span class="p">,</span><span class="n">param_grid</span><span class="p">,</span><span class="n">cv</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
<span class="n">knn_modelGridSearch</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">,</span><span class="n">y</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">knn_modelGridSearch</span><span class="p">.</span><span class="n">best_score_</span> <span class="p">,</span> <span class="n">knn_modelGridSearch</span><span class="p">.</span><span class="n">best_params_</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0.9800000000000001 {'n_neighbors': 6}
</code></pre></div></div>
<p>Here we can see that we get the best results with 6 nearest neighbors.</p>
<p>Now here it ends, Happy Learning <i class="far fa-laugh-beam"></i></p>Machine Learning is the art of giving computers the ability to learn from data and make decisions on their own without explicitly programmed for example The determination of benign and malign according to the tumor size Google News Selecting similar news and making a cluster of news which are related Classifying emails in the spam or not a spam Prediction of house pricing according to the number of rooms, furnishing, age, etc. Detection of where a bank transaction is fraud or not There are many more examples of machine learning, here we are going to discuss Supervised Machine Learning, There are two parts of data features and labels, features are the input for the model just like the size of tumors is if we put the size of tumors the model will tell us whether it is malign or benign, the prediction here whether malign or benign are the labels, there some types of data which does not contain labels such as a grouping of news which are related does not require any labels, but here in supervised learning, we are concerned with data labels, so loosely we can say the Machine Learning modeling with labels are known as supervised learning. For further understanding, we are going to use iris datasets, which have 4 features Sepal.Length, Sepal.Width, Petal.Length and Petal.Width and one target variable Species This is a long dataset with labels virginica, setosa and Versicolor however we are representing only part of data so we can see in the target column we have the only setosa The realization of the target variable is known as labels however most of the data scientists use them interchangeably. 
The predictor variable and feature are the same thing and also known as the independent variable, while the target variable is known as the dependent variable Classification Classification is a machine learning models which classify things , such as classifying mail is spam or not , or in the iris data classifying where the plant is virginica , setosa or versicolor is the classification. First of all we gonna load our dataset using the following codes, which also imports pandas and numpy under their usual aliases. from sklearn import datasets import pandas as pd import numpy as np iris = datasets.load_iris() type(iris) sklearn.utils.Bunch We can see that iris dataset is a bunch, bunch is a datatypes which have a key value pairs, we can look at the pairs using following code print(iris.keys()) dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename']) type(iris.data),type(iris.target),type(iris.target_names),type(iris.DESCR),type(iris.feature_names) (numpy.ndarray, numpy.ndarray, numpy.ndarray, str, list) We can see that iris.data and iris.target is numpy array , also target names is also an array , DESCR is string and features names is string, if we iris.data.shape and iris.target.shape we can see data has shape 150 rows and 4 columns and this is our features,we can take a look at our data by the command print(iris.data) , similarly the shape of target variable have 150 rows and 1 columns as we expected and we can look at it using print(iris.target) However our target variable is encoded where 0 represent setosa 1 represent versicolor 2 represent virginica It can be seen using iris.targets and it is also described in iris.descr, let us store iris.data in variable X and iris.target in y X = iris.data y = iris.target Let us construct a dataframe from the X which have header as iris.feature_names and show how our dataframe actuaaly looks like using head() method df = pd.DataFrame(X , columns = iris.feature_names) df.head() sepal length (cm) 
sepal width (cm) petal length (cm) petal width (cm) 0 5.1 3.5 1.4 0.2 1 4.9 3.0 1.4 0.2 2 4.7 3.2 1.3 0.2 3 4.6 3.1 1.5 0.2 4 5.0 3.6 1.4 0.2 k-Nearest Neighbours Now let us train our first model using the k-Nearest Neighbors (or kNN), it is quite simple, first suppose there are only two features in our dataset then we can plot each observation (that is a single row in a dataset ) simply on the 2D plane as a point where the first feature is on the x-axis and second feature on the y-axis, and suppose the color of the point is a label that can be red or blue, suppose we get a feature with know label on it only with two features now we can plot that point on the same 2D plane but we cannot determine the color of the point since it is not labeled, now we have to predict label suppose we take 3 nearest observation on the plane then it is kNN with k=3 now we have to take the majority vote of 3 nearest neighbors, 2 of them is blue so our prediction is blue, our prediction may change with change in k, suppose k=5 now out of the 5 nearest neighbors 3 are red and 2 are blue then we predict red This algorithm can be extended to n features where n number of features is greater than 2, by plotting the points in an n-dimensional euclidean plane and then computing the nearest neighbors Training and Prediction In Scikit Learn there are two important methods .fit that will be useful for training the model and .predict to predict the label using a trained model, now to use kNN we have to import sklearn.neighbors from sklearn library using from import KNeighborsClassifier and then we have to initialize it and set the value for k let set it to 5 using KNeighborsClassifier(n_neighbors=5) then we will fit the data using .fit method from sklearn.neighbors import KNeighborsClassifier knn_model = KNeighborsClassifier(n_neighbors=5) #Storing the model in varible knn_model knn_model.fit(X,y) #Fitting ot training the model KNeighborsClassifier(algorithm='auto', leaf_size=30, 
metric='minkowski', metric_params=None, n_jobs=None, n_neighbors=5, p=2, weights='uniform') Now we have trained our model and stored it into the variable knn_model , Now if we have given the sepal length, width, and petal length, we can predict the species let us predict for 4.6,3.8,3.7,0.9 as sepal length, sepal width, petal length and petal width using .predict method knn_model.predict([[4.4,3.8,3.7,0.9]]) array([1]) Hence we can see we have predicted 1 which represents Versicolor similarly we can do a lot of prediction at once by creating a NumPy array and then passing it as an argument to the knn_model.predict(), we must take care that the number of columns is equal to the number of features that we have used to train the model, now let us see an example array = np.array([[4.4,3.8,3.7,0.9], [3.2,5.7,2.0,1.3], [5.5,1.9,2.8,4.7], [3.2,9.7,6.2,1.0]]) prediction=knn_model.predict(array) print(prediction) [1 0 2 2] iris.target_names[prediction] Now we can get decoded species name by passing the prediction to iris.target as an index array(['versicolor', 'setosa', 'virginica', 'virginica'], dtype='<U10') Measuring the performance Now we have trained or model , now we must measure the performance of our model to get the idea of how good or how bad is our model,there are various metric to measure the performance such as Accuracy , Precision , F-Measure etc. 
but one of the questions we have is which data to use for calculating performance. The data used for training will give a metric that is too optimistic, and may be good only for the data we trained on; however, our main target in machine learning is to train a model so that it predicts the labels for new data. So we need to calculate our metric on new data, but that is not possible since new data will not be labeled. A typical operating procedure for a data scientist is therefore to split the data into train and test sets, where the train set is used for training and the test set for testing, and then to calculate a metric such as Accuracy on the test set. We will use Accuracy here, which is equal to the total number of true predictions divided by the total number of predictions. Suppose we have 100 observations in the test set and our model predicts 75 of them correctly; that means 75 predictions are right and 25 are wrong, so the Accuracy is 75/100 = 0.75.

Splitting the dataset into Train and Test sets

To split the dataset, first of all, we import train_test_split from sklearn.model_selection. The method train_test_split() takes some arguments: the first argument is the feature data and the second is the labels, so train_test_split(X,y) will work fine. To increase its usability it can take more arguments, such as:

test_size : the proportion of the test set; the default is 0.25, which means it will split off 25% of the data as a test set and keep 75% as the train set. If someone wants the test set to be 20% and the train set 80% they can use test_size = 0.2.

random_state : the seed for the generation of random numbers. train_test_split splits the dataset randomly; it does not just take the first 25% of the data for the test set, it randomly selects observations. If we want to generate the same train and test sets again in the future for our dataset, we can do so by using the same random_state.

stratify : set this argument to y if we want our test set to have the same proportion of labels as the whole dataset; it stratifies the split according to the labels. In our iris dataset there are three labels: setosa, versicolor and virginica. In this case our dataset is first divided into three groups, the first containing only the observations labeled setosa, the second versicolor and the third virginica; then 25% is taken randomly from each of them and merged to create the test set. This way we know the proportions of setosa, versicolor and virginica are the same in the test set as in the iris dataset.

The output of train_test_split is four arrays: the features of the train set, the features of the test set, the labels of the train set and the labels of the test set. Let us split our dataset and train our model on the training set:

from sklearn.model_selection import train_test_split
X_train , X_test , y_train , y_test = train_test_split(X,y,test_size = 0.25 , stratify=y)
knn_model.fit(X_train,y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

Now we use the trained model to predict the labels of the test set X_test:

X_test_prediction = knn_model.predict(X_test)
X_test_prediction

array([2, 1, 0, 2, 0, 1, 0, 0, 1, 2, 1, 0, 1, 2, 1, 0, 2, 0, 1, 0, 1, 2,
       2, 1, 0, 1, 0, 1, 0, 2, 1, 1, 0, 2, 1, 2, 0, 2])

Further, we use the .score method to calculate Accuracy; it takes the test set features and the test set labels as arguments:

knn_model.score(X_test,y_test)

0.9736842105263158

Hence we can see that we have about 97% Accuracy. Here we have used k=5, but the question is what will happen if we increase k.
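Before moving on, here is roughly what the stratify option is doing under the hood, sketched in pure Python. This is a simplified illustration, not scikit-learn's actual implementation, and the function name stratified_split is made up for this sketch:

```python
import random

def stratified_split(X, y, test_size=0.25, seed=0):
    # Group row indices by label, then sample the test set separately
    # inside each group so class proportions are preserved.
    rng = random.Random(seed)
    by_label = {}
    for i, label in enumerate(y):
        by_label.setdefault(label, []).append(i)
    test_idx = []
    for label, idx in by_label.items():
        k = round(len(idx) * test_size)
        test_idx.extend(rng.sample(idx, k))
    test_set = set(test_idx)
    train_idx = [i for i in range(len(y)) if i not in test_set]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

# 40 observations, two labels in equal proportion
y = ["a"] * 20 + ["b"] * 20
X = list(range(40))
X_tr, X_te, y_tr, y_te = stratified_split(X, y)
print(len(y_te), y_te.count("a"), y_te.count("b"))  # 10 5 5
```

The test set keeps the 50/50 label balance of the full dataset, which is exactly the guarantee stratify=y gives.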
kNN models create a decision boundary, which divides the whole euclidean space into different regions, where the number of regions is the number of classes. In our example kNN will divide the 4-dimensional euclidean space (4-dimensional because there are four features) into 3 regions, and the label of any new data point will be decided by the region in which it falls. Our question here is what happens if we increase k: as we increase k, the decision boundary smoothens.

Photo credit: An Introduction to Statistical Learning with Applications in R (Available for FREE!!!)

As we can see, for k=1 our decision boundary (represented in black) is fitted too closely, while for k=100 the decision boundary is smoothed too much. So if k is large the decision boundary will be smoother, hence a less complex model; for small k the decision boundary will be less smooth and give a more complex model, which will be more sensitive to the noise in the data and may give good predictions on the training data but fail on new data. This is known as overfitting. If we increase k too much the decision boundary will be smoothed too much (it tends toward a straight line) and may not perform well on either the test or the train set, as we can see in the figure for k=100; this is commonly known as underfitting. So we must choose k such that the model is neither underfitted nor overfitted, that is, k neither too large nor too small. For k=10 we get the following:

Photo credit: An Introduction to Statistical Learning with Applications in R (Available for FREE!!!)

Confusion Matrix

Accuracy is not always a good metric for measuring the performance of classification models. Suppose we have data for transactions from a bank and we have to create a model which classifies whether a transaction is fraudulent or not. Usually a lot of transactions are non-fraudulent, let us say 95% are not fraudulent. This type of data, where one of the classes is far more frequent, is known as imbalanced data, and for imbalanced data the Accuracy metric does not perform well. So there are other metrics to measure the performance of a model, and they can be obtained from a very famous matrix known as the Confusion Matrix.

In binary classification there are two classes, Positive and Negative. We call the class we are interested in the positive class: if we want to model transaction fraud, we are interested in the fraudulent transactions, so the class fraud is the positive class and the non-fraud class is negative. With TP, FP, TN and FN denoting true positives, false positives, true negatives and false negatives, the various metrics can be calculated by the following formulas:

\[Precision = \frac{TP}{TP+FP} \qquad Recall = \frac{TP}{TP+FN} \qquad Accuracy = \frac{TP+TN}{TP+FP+TN+FN}\]

The F1-Score can also be interpreted as the harmonic mean of Precision and Recall, and is given by

\[F1 \ Score = 2 \cdot \frac{precision \cdot recall}{precision + recall}\]

The Confusion Matrix can be calculated as follows: import the confusion_matrix method from sklearn.metrics, then call confusion_matrix with the actual test labels as the first argument and the predicted labels as the second:

from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test,X_test_prediction))

[[13  0  0]
 [ 0 13  0]
 [ 0  1 11]]

Here we got a $3\times 3$ matrix because we have 3 classes; we are not limited to only two classes, positive and negative, and here we have three classes of labels, i.e. 'versicolor', 'setosa' and 'virginica'. Now to get the performance metrics we run the following code:

from sklearn.metrics import classification_report
print(classification_report(y_test,X_test_prediction))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        13
           1       0.93      1.00      0.96        13
           2       1.00      0.92      0.96        12

    accuracy                           0.97        38
   macro avg       0.98      0.97      0.97        38
weighted avg       0.98      0.97      0.97        38

Regression

In regression the target variable is a continuous variable, such as the price of a mobile, a temperature, etc. To get started let us take the Boston housing dataset, which is already present in the sklearn module:

from sklearn import datasets
import pandas as pd
import numpy as np
boston = datasets.load_boston()

Let us take a look at what we have imported into the boston variable using the .keys() attribute:

boston.keys()

dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])

Now we have the data and the feature names, so we can create a dataframe from them and take a look with the head method, as we did in Classification:

X = boston.data
y = boston.target
df = pd.DataFrame(X , columns = boston.feature_names)
df.head()

      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO       B  LSTAT
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0     15.3  396.90   4.98
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0     17.8  396.90   9.14
2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0     17.8  392.83   4.03
3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0     18.7  394.63   2.94
4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0     18.7  396.90   5.33

Before training the model let us split our data; we cannot use the stratify argument here because our target variable is not categorical.

X_train , X_test , y_train , y_test = train_test_split(X,y,test_size = 0.25 )

Linear Regression

When we assume the target variable y is a linear function of the columns of X, or we can say a linear function of the features, the model is known as linear regression. It can be represented as

\[\hat{y} = \sum_{i=0}^p a_{i}x^{i}\]

Linear regression is the equation of a line, and the $a_i$ are known as the parameters of the linear regression.
Now our main aim is to set the $a_i$ such that the predicted value of y, generally represented by $\hat{y}$, is nearest to the actual value of y. To measure the amount of difference between the predicted and actual values we use loss functions; these are a special type of function which give 0 when the predicted label is equal to the actual label. One of the most common loss functions is the squared error loss, given by

\[Loss(\hat{y} ;y)= \sum_{i=0}^n(y_i - \hat{y}_i)^{2}\]

So our problem is to reduce the loss, and to reduce it we have to find the optimized parameters which minimize it. For this loss the estimates of the parameters $a_i$ are known as the least square estimates; for different loss functions we get different types of estimates, but least square estimates are the most used, so we will discuss those. Here p is the number of features, hence there are (p+1) parameters, where we added 1 because we also have to estimate the constant term $a_0$, and n is the number of observations, or we can say the number of rows in the dataset.

Now to fit the model, we run the following code:

from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train,y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Now we can predict using the .predict method as follows:

prediction=model.predict(X_test)

As we have seen, the metric used to measure the performance of a model in the classification section was Accuracy; however, for regression we cannot use Accuracy. One of the most used metrics for regression is $R^2$, which is defined as the proportion of the variability in Y that can be explained using X. It can be calculated by the following code:

model.score(X_test,y_test)

0.736994702163782

Generally the $R^2$ value ranges from 0 to 1.
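To see what $R^2$ measures, it can be computed by hand from its definition: one minus the ratio of the residual sum of squares to the total sum of squares around the mean. A minimal pure-Python sketch (the function name r_squared is made up for this illustration):

```python
def r_squared(y_true, y_pred):
    # R^2 = 1 - SS_res / SS_tot : the proportion of the variability
    # in y explained by the model's predictions.
    mean_y = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    ss_res = sum((y - f) ** 2 for y, f in zip(y_true, y_pred))
    return 1 - ss_res / ss_tot

# Perfect predictions give R^2 = 1; always predicting the mean gives R^2 = 0.
y_true = [1.0, 2.0, 3.0, 4.0]
print(r_squared(y_true, y_true))     # 1.0
print(r_squared(y_true, [2.5] * 4))  # 0.0
```

This also shows why values near 1 indicate a good fit: the residuals are small compared to the natural spread of y.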
When $R^2$ is near 1 it indicates that the model is good, and when it is near 0 the fitted model is not good.

Cross Validation

Cross-Validation is a method that reduces our dependency on how the data is split into train and test sets. It may be only by chance that our performance metric represents our model as good, because we do not use all the data to calculate the performance metric. To remove the dependency on a single train-test split we use Cross-Validation, or we can say k-fold Cross-Validation, where k is a positive integer parameter. Suppose k=5, meaning 5-fold cross-validation: it divides the observations in our dataset into 5 groups, commonly known as folds. We hold out the first fold as a test set, merge all the other folds to create the train set, and calculate the performance metric we are interested in. We then do the same again, holding out the second fold as the test set and using the remaining folds as the train set, and calculate the performance metric; this is the performance metric for the second split. Similarly, in k-fold cross-validation we calculate the performance metric k times for k splits, and after calculating the metric for every split we can compute a statistic of our interest, such as the mean, median or mode of these k performance metrics.

k-Fold Cross-Validation is computationally expensive, since we have to do the whole process of training, prediction and metric calculation k times. The following is the way to do it:

Import cross_val_score.
Call cross_val_score with arguments the model, the feature array, the labels and the number of folds (for 5 folds, cv=5), and store the result in a variable.
Call a statistics function, such as np.mean() for the mean, on the variable.

from sklearn.model_selection import cross_val_score #Importing class
cross_validation_result = cross_val_score(model,X,y,cv=5)
np.mean(cross_validation_result)

0.3532759243958772
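The fold bookkeeping described above can be sketched without scikit-learn. This is an illustrative sketch (round-robin fold assignment; scikit-learn's actual splitters are more sophisticated) showing how each fold takes a turn as the test set:

```python
def k_fold_indices(n, k):
    # Split row indices 0..n-1 into k roughly equal folds; each fold in
    # turn is the test set and the remaining folds form the train set.
    folds = [list(range(i, n, k)) for i in range(k)]  # simple round-robin assignment
    for held_out in range(k):
        test = folds[held_out]
        train = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
        yield train, test

for train, test in k_fold_indices(10, 5):
    print(len(train), len(test))  # 8 2, five times
```

Every observation appears in exactly one test fold, so across the k splits all the data contributes to the performance estimate.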
Shrinkage Method

Shrinkage is also known as Regularization. In general we estimate the parameters, but sometimes they are too large and lead to higher variance, so it is advisable to shrink the parameters toward 0. It can be done in various ways; two of the famous ones are Ridge Regression and the Lasso.

Ridge Regression

For Ridge Regression we just edit our general loss function as follows:

\[Loss(y \ ; \hat{y}) = \sum_{i=0}^n (y_i-\hat{y}_i)^2 + \alpha \sum_{i=1}^p a_i^2\]

where $\alpha \geq 0$ is a tuning parameter and $\alpha \sum_{i=1}^p a_i^2$ is known as the shrinkage penalty. Here we must note that there is no term for the intercept $a_0$ in the shrinkage penalty. Unlike the Least Square Estimate, here we get different sets of parameters for different values of the tuning parameter. A tuning parameter equal to zero leads to the Least Square Estimate and may have a greater chance of overfitting, while a very large tuning parameter penalizes the parameters too much, which can lead to underfitting; so we have to choose the tuning parameter such that it optimizes our model.

To do Ridge Regression:

Import Ridge from the module sklearn.linear_model.
Initialize the Ridge() class, passing the tuning parameter to the alpha argument.
Then train and predict as usual.

from sklearn.linear_model import Ridge
ridge_model = Ridge(alpha= 0.9 )
ridge_model.fit(X_train,y_train)
ridge_model.score(X_test,y_test)

0.7345197081669743

Lasso Regression

Ridge Regression has a demerit: it shrinks the parameters towards 0 but never sets a parameter exactly equal to 0. There may be some features which do not explain any variance in the label, and their coefficients need to be set equal to zero to improve the interpretability of the model.
For the Lasso, we just use the absolute values of the parameters in place of the squares of the parameters in the Ridge Regression loss:

\[Loss(y \ ; \hat{y}) = \sum_{i=0}^n (y_i-\hat{y}_i)^2 + \alpha \sum_{i=1}^p |{a_i}|\]

The Lasso shrinks the coefficients of the less important features all the way to 0. Lasso Regression has a similar code script to Ridge Regression (note that here we must initialize the Lasso class, not Ridge):

from sklearn.linear_model import Lasso
lasso_model = Lasso(alpha= 10)
lasso_model.fit(X_train,y_train)
lasso_model.score(X_test,y_test)

Logistic Regression

Logistic Regression, despite being a regression, is mostly used for classification problems. It finds the probability that a given observation belongs to a particular class; if that probability is greater than 0.5, or we can say 50%, then our model predicts that the observation belongs to that class. It estimates the probability using the following function:

\[p = \sigma\left(\sum_{i=0}^p a_ix^i\right)= \frac{1}{1+ e^{-\sum_{i=0}^p a_ix^i}}\]

But we will not go too much into theory, and will focus on practical use. Using Logistic Regression is similar to the work we have done earlier: import the class, import the data, split the data, train, then test your model using performance metrics. Let us do that on the breast cancer data, which is already available in the sklearn module:

from sklearn import datasets
import pandas as pd
import numpy as np
bcancer = datasets.load_breast_cancer() #Loading Data
from sklearn.linear_model import LogisticRegression #Importing class for logistic regression
LogReg_MODEL = LogisticRegression() #Initializing Logistic Regression class
X = bcancer.data
y = bcancer.target
X_train , X_test , y_train , y_test = train_test_split(X,y,test_size = 0.25 , stratify=y) #Splitting Data
LogReg_MODEL.fit(X_train,y_train) #Training the model
AccuracyLogReg = LogReg_MODEL.score(X_test,y_test)
print(AccuracyLogReg)

0.958041958041958

ROC Curve

ROC Curve is the short form for receiver operating characteristic curve.
Threshold

We generally take the threshold to be 0.5. In kNN this means that when the fraction of neighbors with a particular class label is greater than 0.5 of the total, we predict that the observation belongs to that class. Suppose we fitted kNN with k=100 and we have two class labels, red and blue; then we predict red for an observation if more than 50 of the neighbors are red. That 50 is the threshold number of red neighbors required to classify it as red, and 50 is 0.5 $\times$ 100, so the threshold here is 0.5. Similarly, in logistic regression p=0.5 is the threshold in general.

True Positive Rate and False Positive Rate (TPR and FPR)

The True Positive Rate is also known as Recall, and the False Positive Rate is given by

\[FPR = \frac{FP}{FP+TN}\]

A model does not always perform best with a threshold of 0.5; sometimes a model performs better with some other threshold. To find out, we use the ROC curve, which is a graph of TPR against FPR; different thresholds give different points on the ROC curve.

When the threshold is 0, we predict every observation as positive, and then the TPR will be equal to 1 and the FPR will also be 1. When the threshold is 1, both TPR and FPR will be equal to 0.

To know how good our model is we use the area under the curve (AUC) as a performance metric for the ROC curve. Say we have a perfect classifying model: then TPR will be equal to 1 and FPR will be equal to 0, and the area under the curve will be equal to 1, so we can use ROC AUC as a performance metric.

Now to create the ROC curve we have to do the following:

Import roc_curve from sklearn.metrics.
Use the roc_curve() function with the following two arguments:

y_true : array, shape = [n_samples]. True binary labels. If the labels are not either {-1, 1} or {0, 1}, then pos_label should be explicitly given.
y_score : array, shape = [n_samples]. Target scores; can either be probability estimates of the positive class, confidence values, or a non-thresholded measure of decisions (as returned by decision_function on some classifiers).

To calculate y_score we will use probability estimates, which we can get by using the .predict_proba() method on the test set. It outputs an array with two columns: the first column is the probability of the negative class and the second column is the probability of the positive class, which is our y_score. To get it we subset the array and take only the second column with [:,1].

Further, roc_curve has three outputs, which we store in the variables fpr, tpr and thresholds. After that we import matplotlib.pyplot with the alias plt and use .plot to plot the ROC curve:

from sklearn.metrics import roc_curve
y_score = LogReg_MODEL.predict_proba(X_test)
y_score = y_score[:,1] #Keeping only the second column (positive class probabilities)
fpr, tpr, thresholds = roc_curve(y_test, y_score)
# Now to plot the ROC curve
import matplotlib.pyplot as plt
plt.plot(fpr, tpr, linewidth=1)
plt.plot([0, 1], [0, 1], 'k--') # to plot the dashed diagonal
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate or Recall')
plt.show() # to show the plot

Now we want a performance metric for the model, and that is AUC. To calculate it we just need to import roc_auc_score and pass the same arguments as we passed to roc_curve:

from sklearn.metrics import roc_auc_score
print(roc_auc_score(y_test, y_score))

0.9903563941299791

Tuning the Model

Hyperparameters

Hyperparameters are the parameters of the learning algorithm itself, such as the value k in the k-Nearest Neighbors model, or the tuning parameter in Ridge and Lasso Regression. For a finer model we have to tune the hyperparameters to their best setting. There is no cut-and-clear procedure for hyperparameter tuning. One philosophy is to randomly select hyperparameters, train and test, and choose the one which performs better.
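That random-selection idea can be sketched in a few lines of pure Python. Here score_fn is a hypothetical stand-in for "train the model and cross-validate it"; the toy score below just peaks at k = 6, mimicking a validation curve:

```python
import random

def random_search(score_fn, candidates, n_iter=10, seed=0):
    # Randomly try hyperparameter values and keep the best-scoring one.
    rng = random.Random(seed)
    best_k, best_score = None, float("-inf")
    for _ in range(n_iter):
        k = rng.choice(candidates)
        s = score_fn(k)
        if s > best_score:
            best_k, best_score = k, s
    return best_k, best_score

# Toy score that is highest at k = 6 (a placeholder for real cross-validation).
score = lambda k: -abs(k - 6)
best_k, best = random_search(score, list(range(1, 101)), n_iter=50)
print(best_k, best)
```

With enough iterations this tends to land near the best candidate, though unlike a grid search it does not guarantee trying every value.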
Manually fiddling with hyperparameters and then doing the whole lot of training and testing is a tedious job, so Scikit-Learn has GridSearchCV to help us.

Grid Search

GridSearchCV uses cross-validation so that the hyperparameter selection is not affected by the train-test split. The class GridSearchCV takes the following arguments:

model : the initialized model for fitting.
param_grid : a dictionary or a list of dictionaries; these are the manual values of the hyperparameters we want to feed in.
cv : the number of folds for cross-validation.

Let us tune the number of neighbors for the kNN model, using the iris dataset:

from sklearn.model_selection import GridSearchCV
knn_model = KNeighborsClassifier()
param_grid = {'n_neighbors' : [1,2,3,4,5,6,7,8,9,10,20,30,40,50,60,70,80,90,100]}
X = iris.data
y = iris.target
knn_modelGridSearch=GridSearchCV(knn_model,param_grid,cv=10)
knn_modelGridSearch.fit(X,y)
print(knn_modelGridSearch.best_score_ , knn_modelGridSearch.best_params_)

0.9800000000000001 {'n_neighbors': 6}

Here we can see that we get the best results with 6 nearest neighbors.

Now here it ends, Happy Learning

Tensorflow Hello World (2020-06-23) https://www.iroblack.com/Tensorflow%20Hello%20World

<p>Tensorflow is made up of two words, tensor and flow, where tensor means a multidimensional array and flow means a graph of operations. It is developed by the Google Brain team and is released under the Apache 2.0 license. It is a package in Python and is concurrently spreading to other languages such as R, Julia, etc. Tensorflow has a very smooth learning curve, and it is easy for newcomers to grasp the vast field of machine learning with it.</p>
<p>To get started we have to download the Anaconda distribution of Python, which will automatically install Jupyter Notebook, and then install Tensorflow; to do this read our article <a href="/2020-06-22-how-to-install-python-anaconda-distribution-and-start-using-tensorflow-on-windows/">here</a>.</p>
<blockquote>
<p>You have to type Shift+Enter to run a cell in jupyter notebook</p>
</blockquote>
<p>We will use the <a href="http://yann.lecun.com/exdb/mnist/">MNIST</a> dataset, which was developed by <a href="http://yann.lecun.com/">Yann LeCun</a>, Courant Institute, NYU; <a href="http://homepage.mac.com/corinnacortes/">Corinna Cortes</a>, Google Labs, New York; and <a href="http://research.microsoft.com/en-us/people/cburges/">Christopher J.C. Burges</a>, Microsoft Research, Redmond. The dataset contains 60,000 labeled images of handwritten digits for training and 10,000 images for testing. MNIST is already divided into train and test sets, so we do not have to take care of that.</p>
<p>First of all we have to import tensorflow , with an alias <code class="language-plaintext highlighter-rouge">tf</code> to use tensorflow</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">tensorflow</span> <span class="k">as</span> <span class="n">tf</span>
</code></pre></div></div>
<p>Now we will import dataset and store it</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">x_train</code> the trainig images</li>
<li><code class="language-plaintext highlighter-rouge">y_train</code> label of the training images</li>
<li><code class="language-plaintext highlighter-rouge">x_test</code> testing images</li>
<li><code class="language-plaintext highlighter-rouge">y_test</code> label of the testing images</li>
</ul>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">mnist</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">keras</span><span class="p">.</span><span class="n">datasets</span><span class="p">.</span><span class="n">mnist</span>
<span class="p">(</span><span class="n">x_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">),</span> <span class="p">(</span><span class="n">x_test</span><span class="p">,</span> <span class="n">y_test</span><span class="p">)</span> <span class="o">=</span> <span class="n">mnist</span><span class="p">.</span><span class="n">load_data</span><span class="p">()</span>
</code></pre></div></div>
<p>Now let us take a look at first image of the handwritten digit</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">matplotlib</span> <span class="kn">import</span> <span class="n">pyplot</span>
<span class="n">pyplot</span><span class="p">.</span><span class="n">imshow</span><span class="p">(</span><span class="n">x_train</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
<span class="n">pyplot</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>
<p><img src="/assets/images/output_5_0.png" alt="png" /></p>
<p>Now we will divide <code class="language-plaintext highlighter-rouge">x_train</code> and <code class="language-plaintext highlighter-rouge">x_test</code> by 255. Our images are grayscale, so each pixel can take any value between 0 and 255, and neural networks work best with inputs in the range 0 to 1; so to normalize our dataset to the range 0 to 1 we divide both the train and test datasets by 255.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">x_train</span><span class="p">,</span> <span class="n">x_test</span> <span class="o">=</span> <span class="n">x_train</span> <span class="o">/</span> <span class="mf">255.0</span><span class="p">,</span> <span class="n">x_test</span> <span class="o">/</span> <span class="mf">255.0</span>
</code></pre></div></div>
<p>Now we need to create a model we will use <code class="language-plaintext highlighter-rouge">tf.keras.models.Sequential</code> to create a model and we will use four layers in it from the module <a href="https://www.tensorflow.org/api_docs/python/tf/keras/layers"><code class="language-plaintext highlighter-rouge">tf.keras.layers</code></a>, layers are as follows</p>
<ul>
<li>
<p><code class="language-plaintext highlighter-rouge">tf.keras.layers.Flatten</code> : it will flatten our data , our image is 28 $\times$ 28 pixels , it will flatten the image and convert it into 784 $\times$ 1 , it will take argument <code class="language-plaintext highlighter-rouge">input_shape</code> which will be a tuple that define the shape of our input data</p>
</li>
<li>
<p><code class="language-plaintext highlighter-rouge">tf.keras.layers.Dense</code> : it is just a layer with units here 128, and a activation function here <strong>relu</strong></p>
</li>
<li><code class="language-plaintext highlighter-rouge">tf.keras.layers.Dropout</code> : This layer drop input with a probability of <code class="language-plaintext highlighter-rouge">rate</code>(here 0.2) and multiply each non dropped input by $\frac{1}{1-rate}$</li>
<li><code class="language-plaintext highlighter-rouge">tf.keras.layers.Dense</code> : it is similar to the second layer</li>
<li><code class="language-plaintext highlighter-rouge">tf.keras.layers.Softmax</code>: it is used because the output of the final dense layer consists of logits (raw, unnormalized scores); the softmax function maps these logits to probabilities</li>
</ul>
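<p>A minimal sketch of what that softmax layer computes, in pure Python and independent of TensorFlow (subtracting the maximum first is the usual numerical-stability trick):</p>

```python
import math

def softmax(logits):
    # Map raw scores (logits) to probabilities: positive and summing to 1.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)  # roughly [0.66, 0.24, 0.10]
```

<p>Larger logits get larger probabilities, and the outputs always sum to 1, which is exactly what we need to interpret the model's 10 outputs as class probabilities.</p>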
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">model</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">keras</span><span class="p">.</span><span class="n">models</span><span class="p">.</span><span class="n">Sequential</span><span class="p">([</span>
<span class="n">tf</span><span class="p">.</span><span class="n">keras</span><span class="p">.</span><span class="n">layers</span><span class="p">.</span><span class="n">Flatten</span><span class="p">(</span><span class="n">input_shape</span><span class="o">=</span><span class="p">(</span><span class="mi">28</span><span class="p">,</span> <span class="mi">28</span><span class="p">)),</span>
<span class="n">tf</span><span class="p">.</span><span class="n">keras</span><span class="p">.</span><span class="n">layers</span><span class="p">.</span><span class="n">Dense</span><span class="p">(</span><span class="mi">128</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'relu'</span><span class="p">),</span>
<span class="n">tf</span><span class="p">.</span><span class="n">keras</span><span class="p">.</span><span class="n">layers</span><span class="p">.</span><span class="n">Dropout</span><span class="p">(</span><span class="mf">0.2</span><span class="p">),</span>
<span class="n">tf</span><span class="p">.</span><span class="n">keras</span><span class="p">.</span><span class="n">layers</span><span class="p">.</span><span class="n">Dense</span><span class="p">(</span><span class="mi">10</span><span class="p">),</span>
<span class="n">tf</span><span class="p">.</span><span class="n">keras</span><span class="p">.</span><span class="n">layers</span><span class="p">.</span><span class="n">Softmax</span><span class="p">()</span>
<span class="p">])</span>
</code></pre></div></div>
<p>So for an input, i.e. a 28 $\times$ 28 image, the model gives us an array of 10 floating point numbers: the output of the last dense layer passed through the softmax. Let us get an output without training the model:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">predictions</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">x_train</span><span class="p">[:</span><span class="mi">1</span><span class="p">]).</span><span class="n">numpy</span><span class="p">()</span>
<span class="n">predictions</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>££ array([[0.09967916, 0.09987953, 0.09993076, 0.10024416, 0.10007039,
££ 0.10004147, 0.10008495, 0.09998867, 0.10009976, 0.09998112]],
££ dtype=float32)
</code></pre></div></div>
<p>Now we can check our model using <code class="language-plaintext highlighter-rouge">model.summary()</code></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">model</span><span class="p">.</span><span class="n">summary</span><span class="p">()</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>££ Model: "sequential_5"
££ _________________________________________________________________
££ Layer (type) Output Shape Param #
££ =================================================================
££ flatten_7 (Flatten) (None, 784) 0
££ _________________________________________________________________
££ dense_13 (Dense) (None, 128) 100480
££ _________________________________________________________________
££ dropout_7 (Dropout) (None, 128) 0
££ _________________________________________________________________
££ dense_14 (Dense) (None, 10) 1290
££ _________________________________________________________________
££ softmax_1 (Softmax) (None, 10) 0
££ =================================================================
££ Total params: 101,770
££ Trainable params: 101,770
££ Non-trainable params: 0
££ _________________________________________________________________
</code></pre></div></div>
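<p>The parameter counts in the summary can be checked by hand: a dense layer has one weight per input per unit, plus one bias per unit, while Flatten, Dropout and Softmax introduce no parameters.</p>

```python
# Dense layer parameters = inputs * units + units (one bias per unit).
flatten_out = 28 * 28             # 784 values per image after Flatten
dense1 = flatten_out * 128 + 128  # first Dense layer
dense2 = 128 * 10 + 10            # second Dense layer
print(dense1, dense2, dense1 + dense2)  # 100480 1290 101770
```

<p>These match the 100,480 and 1,290 entries in the summary, and the 101,770 total trainable parameters.</p>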
<p>Now we will define a loss function. We should choose the loss function such that if our model predicts the wrong label, our loss is high. Tensorflow has a lot of built-in loss functions; here we will use Sparse Categorical Crossentropy. For an in-depth description check <a href="https://keras.io/api/losses/probabilistic_losses/#sparse_categorical_crossentropy-function">here</a>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">loss_fn</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">keras</span><span class="p">.</span><span class="n">losses</span><span class="p">.</span><span class="n">SparseCategoricalCrossentropy</span><span class="p">()</span>
</code></pre></div></div>
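<p>To build intuition, sparse categorical cross-entropy for a single example is just the negative log of the probability the model assigned to the true class index. A minimal pure-Python sketch (an illustration, not the TensorFlow implementation, which also handles batches and numerical stability):</p>

```python
import math

def sparse_categorical_crossentropy(y_true, probs):
    # Negative log-probability of the true class: near 0 when the model is
    # confidently right, large when it is confidently wrong.
    return -math.log(probs[y_true])

probs = [0.05, 0.90, 0.05]  # model is confident the label is class 1
print(sparse_categorical_crossentropy(1, probs))  # ~0.105 (correct, low loss)
print(sparse_categorical_crossentropy(0, probs))  # ~3.0   (wrong, high loss)
```

<p>"Sparse" refers to the labels being plain integer class indices (as in MNIST's y_train) rather than one-hot vectors.</p>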
<p>Now our main target is to set the <code class="language-plaintext highlighter-rouge">trainable params</code> such that we get the minimum loss. We are using accuracy to measure the performance; it is calculated by dividing the number of true class predictions by the total number of predictions. We will use <code class="language-plaintext highlighter-rouge">adam</code> as the optimizer:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">model</span><span class="p">.</span><span class="nb">compile</span><span class="p">(</span><span class="n">optimizer</span><span class="o">=</span><span class="s">'adam'</span><span class="p">,</span>
<span class="n">loss</span><span class="o">=</span><span class="n">loss_fn</span><span class="p">,</span>
<span class="n">metrics</span><span class="o">=</span><span class="p">[</span><span class="s">'accuracy'</span><span class="p">])</span>
</code></pre></div></div>
<p>Now let us train our model</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">model</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">x_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">epochs</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>££ Epoch 1/5
££ 1875/1875 [==============================] - 5s 3ms/step - loss: 0.1909 - accuracy: 0.9445
££ Epoch 2/5
££ 1875/1875 [==============================] - 5s 3ms/step - loss: 0.1854 - accuracy: 0.9465
££ Epoch 3/5
££ 1875/1875 [==============================] - 5s 3ms/step - loss: 0.1821 - accuracy: 0.9471
££ Epoch 4/5
££ 1875/1875 [==============================] - 5s 3ms/step - loss: 0.1773 - accuracy: 0.9493
££ Epoch 5/5
££ 1875/1875 [==============================] - 5s 3ms/step - loss: 0.1741 - accuracy: 0.9498
</code></pre></div></div>
<p>Now let us evaluate our model on the test set</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">model</span><span class="p">.</span><span class="n">evaluate</span><span class="p">(</span><span class="n">x_test</span><span class="p">,</span> <span class="n">y_test</span><span class="p">,</span> <span class="n">verbose</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>££ 313/313 - 1s - loss: 0.1480 - accuracy: 0.9569
££ [0.14800186455249786, 0.9569000005722046]
</code></pre></div></div>
<p>As we can see, we have approximately 95% accuracy.</p>
<p>Here we have created a model, trained it, and made predictions with it.</p>
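<p>One detail worth spelling out: the model's softmax output is an array of ten probabilities, and the predicted digit is the index of the largest one. Here is a minimal sketch with a made-up probability vector standing in for <code class="language-plaintext highlighter-rouge">model(x_test[:1]).numpy()[0]</code>.</p>

```python
import numpy as np

# A hypothetical softmax output for one image: ten class probabilities.
# In the model above this would come from model(x_test[:1]).numpy()[0].
probs = np.array([0.01, 0.02, 0.03, 0.05, 0.02,
                  0.04, 0.03, 0.70, 0.05, 0.05])

predicted_digit = np.argmax(probs)   # index of the largest probability
print(predicted_digit)               # -> 7
```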
<p>Happy Learning <strong><i class="far fa-laugh-beam"></i></strong></p>Introduction To Non-Informative Priors2020-06-01T13:24:23+00:002020-06-01T13:24:23+00:00https://www.iroblack.com/Introduction-to-noninformative-priors<blockquote>
<p>Prior density is denoted by $g(.)$ in this article</p>
</blockquote>
<h5 id="introduction">Introduction</h5>
<p>Non-informative priors are the priors we assume when we do not have any prior belief about the parameter, say $ \theta $. A noninformative prior therefore does not favor any value of $ \theta $: it gives equal weight to every value in $\Theta$. For example, if we have three hypotheses, the prior which attaches a weight of $ \frac{1}{3}$ to each hypothesis is a noninformative prior.
<!--more--></p>
<blockquote>
<p><strong>Note: most noninformative priors are improper.</strong></p>
</blockquote>
<h6 id="an-example">An Example</h6>
<p>Now let us consider a simple example. Assume our parameter space $\Theta$ is a finite set containing n elements:</p>
\[\Theta = \{\theta_1,\theta_2,\theta_3,\ldots,\theta_n\}\]
<p>When we have no prior beliefs, the obvious weight given to each $\theta_i$ is $\frac{1}{n}$. Since $\frac{1}{n}$ is a constant, say $\frac{1}{n}=c$, the prior is proportional to a constant:</p>
\[g(\theta) = c\]
<p>Now let us consider a transformation $\eta=e^{\theta} $, that is, $\theta = \log(\eta)$. If $ g(\theta)$ is the density of $\theta$, then we can write the density of $\eta$ as</p>
\[g^*(\eta)=g(\theta)\frac{d\theta}{d\eta} \\
g^*(\eta)=g(log \ \eta)\frac{d \ log \ \eta }{d\eta} \\
g^*(\eta)=\frac{g(log \ \eta)}{\eta} \\
g^*(\eta) \propto \frac{1}{\eta}\]
<p>Thus if we choose a constant prior for $\theta$, then to arrive at the same answer whether we parameterize by $\theta$ or by $\eta$, we must take the prior for $\eta$ proportional to $\eta^{-1}$. Hence we cannot consistently assume both priors are proportional to a constant. This motivates the search for noninformative priors which are invariant under transformations.</p>
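<p>This change-of-variables argument can be checked numerically. As an illustration (using a proper uniform prior on $(0,1)$ in place of the improper flat prior), if $\theta$ has constant density on $(0,1)$, then $\eta = e^{\theta}$ satisfies $P(\eta \le t) = P(\theta \le \log t) = \log t$ for $t \in (1,e)$, i.e. its density is $\frac{1}{\eta}$ there:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# A proper stand-in for the flat prior: theta ~ Uniform(0, 1).
theta = rng.uniform(0.0, 1.0, size=200_000)
eta = np.exp(theta)          # the transformed parameter, eta = e^theta

# If eta has density 1/eta on (1, e), then P(eta <= t) = log(t).
for t in [1.5, 2.0, 2.5]:
    empirical = np.mean(eta <= t)
    print(t, empirical, np.log(t))   # empirical CDF vs log(t)
```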
<h5 id="noninformative-priors-for-location-parameter">Noninformative Priors for Location Parameter</h5>
<blockquote>
<p>A Parameter is said to be location parameter if the density $f(x ; \theta)$ can be written as a function of $(x - \theta)$</p>
</blockquote>
<p>Let X be a random variable with location parameter $\theta$; then its density can be written as $h(x- \theta)$. Suppose that instead of observing X we observed <strong>Y = X+c</strong>, and take <strong>$\eta=\theta+c$</strong>; then we can see that the density of Y is given by $h(y - \eta)$. Now $(X,\theta)$ and $(Y,\eta)$ have the same parameter and sample space, which suggests that they must have the same noninformative prior.</p>
<p>Let $g$ and $g^*$ be the noninformative priors for $(X,\theta)$ and $(Y,\eta)$ respectively. According to our argument both must be the same noninformative prior, so for <strong>any subset A of the real line</strong></p>
\[P^g(\theta \ \in \ A ) = P^{g^*}(\eta \ \in \ A )\]
<p>Now we have assumed <strong>$\eta=\theta+c$</strong> so</p>
\[P^{g^*}(\eta \ \in \ A )=P^{g}(\theta +c \ \ \in \ A )=P^{g}(\theta \ \in \ A-c )\]
<p>which leads us to</p>
\[P^{g}(\theta \ \in \ A)=P^{g}(\theta \ \in \ A-c ) \tag{*}\\
\int_Ag(\theta)d\theta=\int_{A-c}g(\theta)d\theta=\int_Ag(\theta-c)d\theta\]
<p>This holds for any set A of the real line and any c on the real line, so it leads us to</p>
\[g(\theta)=g(\theta-c)\]
<p>Now if we take $\theta=c$ we get $g(c)=g(0)$, and since this holds for all c, we conclude that the noninformative prior for a location parameter is a constant function. For simplicity most statisticians take it equal to 1, i.e. $g(.) = 1$</p>
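<p>A quick numerical sanity check of why the constant prior is the natural choice for a location parameter (a consequence of the argument above, not part of it): with a Normal$(\theta,1)$ likelihood and the flat prior, shifting the data by $c$ shifts the posterior by exactly $c$. The sketch below (data, shift, and grid are arbitrary illustrative choices) computes the posterior mean on a grid:</p>

```python
import numpy as np

def posterior_mean(y, grid):
    # Posterior on a grid under the flat prior g(theta) = 1:
    # posterior(theta) is proportional to the Normal(theta, 1) likelihood.
    log_lik = np.array([-0.5 * np.sum((y - t) ** 2) for t in grid])
    w = np.exp(log_lik - log_lik.max())
    w /= w.sum()
    return np.sum(grid * w)

y = np.array([0.3, -0.1, 0.7])        # some data
c = 2.0                               # a shift
grid = np.linspace(-10, 15, 5001)

m1 = posterior_mean(y, grid)
m2 = posterior_mean(y + c, grid)
print(m2 - m1)   # should be (approximately) the shift c
```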
<h5 id="noninformative-priors-for-scale-parameter">Noninformative Priors for Scale Parameter</h5>
<blockquote>
<p>A parameter is said to be a scale parameter if the density $f(x ; \theta)$ can be written as $\frac{1}{\theta}h(\frac{x}{\theta})$ where $\theta>0$</p>
</blockquote>
<p>For example, in the normal distribution $N(\mu,\sigma^2)$, $\sigma$ is a scale parameter.</p>
<p>To get a noninformative prior for the scale parameter $\theta$ of a random variable X, suppose that instead of observing X we observe $Y = cX$ for some $c > 0 $, and define $\eta = c\theta$; then the density of $Y $ is given by $\frac{1}{\eta}h(\frac{y}{\eta})$.</p>
<p>As in the previous part, $(X,\theta)$ and $(Y,\eta)$ have the same sample and parameter space, so both will have the same noninformative prior. Let $g$ and $g^*$ be the noninformative priors for $(X,\theta)$ and $(Y,\eta)$ respectively; according to our argument they must agree:</p>
\[P^g(\theta \in A)= P^{g^*}(\eta \in A)\]
<p>Here A is a subset of the positive real line, i.e. $A \subset \mathbb{R}^+$. Now putting $\eta = c\theta$,</p>
\[P^{g^*}(\eta \in A) = P^g(\theta \in \frac{A}{c}) \\
P^g(\theta \in A) = P^g(\theta \in \frac{A}{c}) \\
\int_Ag(\theta)d\theta=\int_{\frac{A}{c}}g(\theta)d\theta=\int_A\frac{1}{c}g(\frac{\theta}{c})d\theta\]
<p>so</p>
\[g(\theta)=\frac{1}{c}g(\frac{\theta}{c})\]
<p>Now taking $\theta=c$ , we get</p>
\[g(c)=\frac{1}{c}g(1)\]
<p>Now this equation is true for any value $c>0$, so taking $g(1)=1$ for convenience gives us the noninformative prior $g(\theta)= \frac{1}{\theta}$</p>
<blockquote>
<p>Note: it is an improper prior, since $\int_0^{\infty}\frac{1}{\theta}d\theta = \infty $</p>
</blockquote>
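<p>We can verify numerically that $g(\theta)=\frac{1}{\theta}$ really is scale invariant: the mass it assigns to an interval $A=(a,b)$ equals the mass it assigns to $A/c = (a/c, b/c)$, both being $\log\frac{b}{a}$. A small sketch (the interval and $c$ are arbitrary illustrative choices):</p>

```python
import numpy as np

def mass(a, b, n=200_000):
    # Midpoint-rule integral of g(theta) = 1/theta over (a, b).
    edges = np.linspace(a, b, n + 1)
    mid = 0.5 * (edges[:-1] + edges[1:])
    return np.sum(((b - a) / n) / mid)

a, b, c = 2.0, 5.0, 3.0
print(mass(a, b))          # integral over A = (2, 5)
print(mass(a / c, b / c))  # integral over A/c = (2/3, 5/3)
print(np.log(b / a))       # both equal log(b/a)
```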
<h5 id="flaw-and-introduction-of-relatively-location-invariant-prior">Flaw and introduction of relatively location invariant prior</h5>
<p>Now we know the noninformative priors for both location and scale parameters, but there is a flaw: the priors we obtained in the previous parts are improper. If two random variables have identical form, then they have the same noninformative prior, but because of the impropriety, noninformative priors are not unique. Let's say we have an improper prior <strong>g</strong>; if we multiply <strong>g</strong> by any constant <strong>k</strong>, the resulting prior <strong>kg</strong> gives the same Bayesian decisions as <strong>g</strong>.</p>
<p>So in the previous parts we did not really need two separate priors $g$ and $g^* $: we can get $g^*$ by just multiplying $g$ by a constant, and vice-versa.</p>
<p>Now equation $(*)$ can be written as</p>
\[P^g(A)=l(c)P^{g}(A-c)\]
<p>where $l(c)$ is some positive function of the shift $c$,</p>
\[\int_Ag(\theta)d\theta=l(c)\int_{A-c}g(\theta)d\theta=l(c)\int_Ag(\theta-c)d\theta\]
<p>This holds for all A, so $g(\theta)=l(c)g(\theta-c)$, and taking $\theta=c$ gives us $l(c)=\frac{g(c)}{g(0)}$; substituting this value back gives us</p>
<p>\(g(\theta-c)=\frac{g(0)g(\theta)}{g(c)} \tag{**}\)
There are many priors other than $g(\theta)=c$ which satisfy equation (**), so any prior of this form is known as <em>relatively location invariant</em></p>
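<p>As a concrete illustration of a relatively location invariant prior that is not constant, take $g(\theta)=e^{\beta\theta}$ for some fixed $\beta$ (a hypothetical choice): then $g(\theta-c)=e^{-\beta c}g(\theta)=\frac{g(0)g(\theta)}{g(c)}$, so equation (**) holds. A quick numerical check:</p>

```python
import numpy as np

# A non-constant prior that is relatively location invariant:
# g(theta) = exp(beta * theta) for a fixed beta (hypothetical choice).
beta = 0.7

def g(theta):
    return np.exp(beta * theta)

theta, c = 1.3, 0.4
lhs = g(theta - c)             # left side of equation (**)
rhs = g(0) * g(theta) / g(c)   # right side of equation (**)
print(lhs, rhs)                # the two sides agree
```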