statML

fastkme : Faster Kaplan-Meier Estimator using JIT

2024-03-11T00:00:00+00:00

Nonparametric survival models such as Nearest Neighbour, Kernel Survival, COBRA Survival, and adaptive nearest neighbour require fitting a Kaplan-Meier estimator. While tuning these models, the Kaplan-Meier estimator is computed thousands of times. This motivated us to create a Kaplan-Meier estimator that is faster than the one provided by the scikit-survival library.

Numba is one of the most popular libraries for speeding up Python code. It is a just-in-time compiler that translates a subset of Python and NumPy code into fast machine code. We have used Numba to speed up the Kaplan-Meier estimator.

Kernelized models additionally need a weighted Kaplan-Meier estimator, so we have used Numba to create both a faster Kaplan-Meier estimator and a faster weighted Kaplan-Meier estimator. The repository is available at fastkme.

Here is how to use it. First, we need to install the libraries:

%%capture
!pip install git+https://github.com/yuvrajiro/fastkme
!pip install scikit-survival

Now let us import the kaplan meier estimator from both the packages

# Importing Libraries
from fastkme.kme import kaplan_meier_estimator as proposed_kme
from sksurv.nonparametric import kaplan_meier_estimator as scikit_kme
import numpy as np

Let us check whether the proposed Kaplan-Meier estimator is accurate by comparing it with the tried and tested scikit-survival Kaplan-Meier estimator.

# Compare both estimators on 100 random datasets, one per seed
for seed in range(100):
    np.random.seed(seed)
    time = np.random.randint(0, 5000, 100)
    event = np.random.randint(0, 2, 100).astype(bool)

    unique_time, surv = scikit_kme(event, time)
    unique_time2, surv2 = proposed_kme(event, time)

    assert np.allclose(unique_time, unique_time2), f"The unique times are not the same {unique_time}, {unique_time2}"
    assert np.allclose(surv, surv2), f"The survival probabilities are not the same {surv}, {surv2}"

print("The proposed and scikit-survival Kaplan-Meier estimators are the same")

The proposed and scikit-survival Kaplan-Meier estimators are the same

Now let us look at the speed-up. First, the speed of the scikit-survival estimator:

%%timeit
scikit_kme(event, time)

351 µs ± 15.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Now the proposed estimator:

%%timeit
proposed_kme(event, time)

9.11 µs ± 2.15 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)

The weighted estimator can be computed as follows:

from fastkme.kme import kaplan_meier_estimator_w as weighted_kme

weight = np.random.rand(100)
weighted_kme(event,time,weight)
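For intuition about what both estimators compute, here is a minimal pure-NumPy sketch of the product-limit formula. The function name `km_reference` is hypothetical, and the weighting rule (weights simply replace counts in the at-risk and event totals) is an assumption for illustration; it is not taken from the fastkme source.

```python
import numpy as np

def km_reference(event, time, weight=None):
    """Product-limit (Kaplan-Meier) estimate; illustrative, not optimized."""
    event = np.asarray(event, dtype=bool)
    time = np.asarray(time, dtype=float)
    # Assumption: unit weights recover the ordinary estimator.
    w = np.ones(len(time)) if weight is None else np.asarray(weight, dtype=float)

    unique_time = np.unique(time)
    surv = np.empty(len(unique_time))
    s = 1.0
    for k, t in enumerate(unique_time):
        at_risk = w[time >= t].sum()              # (weighted) number still at risk at t
        deaths = w[(time == t) & event].sum()     # (weighted) number of events at t
        s *= 1.0 - deaths / at_risk
        surv[k] = s
    return unique_time, surv

t, s = km_reference([True, True, False, False, True], [1, 2, 2, 3, 4])
print(t, s)
```

The loop is quadratic in the number of distinct times, which is exactly the kind of plain Python loop that a JIT compiler such as Numba is meant to speed up.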

Thanks for reading. If you work at the intersection of ML and survival analysis, I would like to connect.

Probability Theory (Series)

2024-01-21T21:24:23+00:00

This series is an introduction to Probability Theory. It closely follows the book “Probability Essentials” by Jean Jacod and Philip Protter.

Day 0 : Philosophical Introduction to Probability Theory

Let us start with a random experiment. A random experiment is an experiment whose outcome is not predictable with certainty, for example tossing a coin or rolling a die.

A random experiment consists of three things:

State Space: The set of all possible outcomes of a random experiment is called the state space, denoted by $\Omega$. For example, in the case of tossing a coin, the state space is $\Omega = \{H, T\}$, where $H$ denotes head and $T$ denotes tail.
Event: An event is a question about the outcome of a random experiment whose answer is either true or false. For example, in the case of tossing a coin, the event “Head” is true if the outcome is head and false otherwise. In mathematical terms, an event is a subset of the state space $\Omega$. Since it is a set, the usual set operations such as union, intersection, and complement apply. Suppose we have two events $A$ and $B$; then the set operations and special events are:
  - Complement: The complement of an event $A$ is denoted by $A^c$ and is defined as $A^c = \Omega - A$. In the case of tossing a coin, the complement of the event “Head” is “Tail”. Hence if $A$ is the event “Head”, then $A^c = \{T\}$.
  - Union: The union of two events $A$ and $B$ is denoted by $A \cup B$ and is defined as $A \cup B = \{x \in \Omega : x \in A \text{ or } x \in B\}$. In the case of tossing a coin, the union of the events “Head” and “Tail” is the entire state space $\Omega$.
  - Intersection: The intersection of two events $A$ and $B$ is denoted by $A \cap B$ and is defined as $A \cap B = \{x \in \Omega : x \in A \text{ and } x \in B\}$. In the case of tossing a coin, the intersection of the events “Head” and “Tail” is the empty set $\emptyset$.
  - Sure Event: The sure event is the event that is always true. It is denoted by $\Omega$.
  - Impossible Event: The impossible event is the event that is always false. It is denoted by $\emptyset$.
  - Elementary Event: An elementary event is an event that contains only one outcome, i.e. a singleton subset $\{\omega\}$ of the state space $\Omega$.

The family of all events is called the $\sigma$-algebra denoted by $\mathcal{A}$. The $\sigma$-algebra must satisfy the following properties:

$\Omega \in \mathcal{A}$
If $A \in \mathcal{A}$, then $A^c \in \mathcal{A}$
If $A_1, A_2, \ldots \in \mathcal{A}$, then $\cup_{i=1}^{\infty} A_i \in \mathcal{A}$

These conditions ensure that $\mathcal{A}$ contains the sure event and the impossible event and is closed under complement, countable union, and (by De Morgan's laws) countable intersection.
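On a finite state space the three properties can be checked mechanically. The sketch below is only an illustration; the helper names `power_set` and `is_sigma_algebra` are made up for this example. It uses the fact that on a finite space, closure under pairwise unions already gives closure under countable unions.

```python
from itertools import chain, combinations

def power_set(omega):
    """All subsets of omega, each as a frozenset."""
    items = list(omega)
    subsets = chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))
    return {frozenset(s) for s in subsets}

def is_sigma_algebra(family, omega):
    """Check the three defining properties on a finite state space."""
    omega = frozenset(omega)
    if omega not in family:                                   # contains the sure event
        return False
    if any(omega - a not in family for a in family):          # closed under complement
        return False
    return all(a | b in family for a in family for b in family)  # closed under union

omega = {"H", "T"}
print(is_sigma_algebra(power_set(omega), omega))                           # all events
print(is_sigma_algebra({frozenset(), frozenset(omega)}, omega))            # trivial family
print(is_sigma_algebra({frozenset({"H"}), frozenset(omega)}, omega))       # lacks {T}
```

The first two families pass the check, while the third fails because it contains “Head” but not its complement.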

Probability: The probability is a function that assigns a number between 0 and 1 to each event in the $\sigma$-algebra $\mathcal{A}$. In the conventional approach, probability can be seen as the limit of the relative frequency of occurrence of an event over a large number of trials under identical conditions. For example, the probability of getting a head in a fair coin toss is 0.5, which means that in a large number of tosses the proportion of heads will be close to one half. The probability function must satisfy the following properties:
- Non-Negativity: The probability of any event is non-negative, i.e. $P(A) \geq 0$.
- Normalization: The probability of the sure event is 1, i.e. $P(\Omega) = 1$.
- Additivity: The probability of the union of two disjoint events is the sum of the probabilities of the individual events, i.e. $P(A \cup B) = P(A) + P(B)$ if $A \cap B = \emptyset$.

We will discuss probability in detail in the upcoming posts. A fourth notion that is closely related to probability is the random variable, which we will discuss in the next post.
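The frequency interpretation described above can be illustrated with a small simulation. This is only a sketch: the function name `relative_frequency`, the seed, and the numbers of tosses are arbitrary choices for the example.

```python
import random

def relative_frequency(n_tosses, seed=0):
    """Fraction of heads in n_tosses simulated fair coin tosses."""
    rng = random.Random(seed)
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses

for n in (10, 1_000, 100_000):
    print(n, relative_frequency(n))
```

As the number of tosses grows, the printed fraction should settle near 0.5, which is the value the probability function assigns to the event “Head”.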

Day 1 : Random Variables (Upcoming)

Convergence of Markov Chain

2021-04-11T00:00:00+00:00

What is a Markov Chain?

A Markov chain is a stochastic model in which the future depends only on the present, not on the past. That is,

\[P(X^{t+1}|X^t,X^{t-1},...X^2,X^1) = P(X^{t+1}| X^t)\]

Transition Probability Matrix

Let us denote

\[p_{ij} = P(X^{n+1} = i | X^n = j)\]

where $p_{ij}$ denotes the probability of going from state “j” to state “i” in one step. Similarly, we can define $p_{ij}^n$ as the probability of going from state “j” to state “i” in n steps. We can arrange the one-step probabilities into the Transition Probability Matrix

\[TPM = \begin{bmatrix} p_{11} & p_{12} & p_{13} & \cdots \\ p_{21} & p_{22} & p_{23} & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{bmatrix}\]
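With this convention, column j of the matrix lists the probabilities of leaving state j, so each column sums to one and the n-step probabilities are the entries of the n-th matrix power. A minimal sketch with a hypothetical two-state chain (the numbers are made up for illustration, and states are indexed from 0 in the code):

```python
import numpy as np

# Column j holds P(X^{n+1} = i | X^n = j), matching the definition of p_ij above.
P = np.array([[0.9, 0.5],
              [0.1, 0.5]])
assert np.allclose(P.sum(axis=0), 1.0)   # every column is a probability vector

P10 = np.linalg.matrix_power(P, 10)      # 10-step transition probabilities
P50 = np.linalg.matrix_power(P, 50)      # 50-step transition probabilities
print(P10)
print(P50)
```

For this particular chain the columns of the high powers become nearly identical, meaning the chain forgets its starting state; that limiting column is the stationary distribution.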

Why Is Convergence of a Markov Chain Important?

Revisit MCMC

MCMC is widely used in Statistics, Mathematics, and Computer Science. Here we discuss a simple problem in Bayesian computation and assess why convergence of the Markov chain is important.

Let us assume that we want to estimate a certain parameter $t(\theta)$, and the model is such that $g(\theta)$ is the prior density for $\theta$ and $f(y | \theta)$ is the likelihood of $y = \{y_1, y_2, \ldots, y_n\}$ given the value of $\theta$. Then the posterior can be written as

\[g(\theta | y ) \propto g(\theta)f(y|\theta)\]

which has to be normalized. The posterior density is then given by

\[g(\theta | y ) = \frac{g(\theta)f(y|\theta)}{\int g(\theta)f(y|\theta)d\theta}\]

For the sake of simplicity, let us assume $t(\theta) = \theta$ and let $\hat{\theta}$ be an estimate. Take the squared error loss function

\[L(\theta , \hat{\theta}) = (\theta - \hat\theta)^2\]

Then the Classical Risk will be given by $R_{\hat{\theta}}(\theta) = E_{\theta}(L(\theta,\hat{\theta}))$ and Bayes Risk is given by

\[r(\hat{\theta}) = \int R_{\hat{\theta}}(\theta)g(\theta)d\theta\]

Now our target is to minimize the Bayes risk to get the Bayes estimate

\[\begin{align*} r(\hat{\theta}) &= \int R_{\hat{\theta}}(\theta)g(\theta)d\theta \\ &= \int E_{\theta}(L(\theta,\hat{\theta})) g(\theta)d\theta \\ &= \int \left( \int (\theta - \hat\theta)^2f(y|\theta)dy\right)g(\theta)d\theta \\ &= \int \int (\theta - \hat\theta)^2f(y|\theta)g(\theta)dyd\theta \\ &= \int \left( \int (\theta - \hat\theta)^2g(\theta|y)d\theta\right)f(y)dy \tag{1} \end{align*}\]

Equation $({1})$ is minimized when the inner integral is minimized for each $y$, which happens when

\[\hat{\theta} = E(\theta |y)\]

We may not always be able to calculate the mean of the posterior density, that is,

\[\hat\theta = \int\theta g(\theta|y)d\theta\]

This happens when the normalizing constant of the posterior is unknown and the integral is intractable. In that case we rely on the law of large numbers to estimate $\theta$: we draw random samples from the posterior $g(\theta | y)$ and calculate the mean of the samples. Mathematically,

\[X^1, X^2, \ldots, X^n \ \text{are samples from} \ g(\theta|y), \ \text{and} \\ \frac{1}{n}\sum_{i=1}^n X^i \to \hat\theta \ \text{as} \ n \to \infty\]
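The sample-mean approximation can be tried on a posterior whose mean is known exactly. The sketch below assumes, purely for illustration, a Beta(3, 5) posterior (exact mean 3/8) and independent draws; in MCMC the draws come from a Markov chain instead, which is why its convergence matters.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = 3.0, 5.0                    # hypothetical Beta(a, b) posterior
exact_mean = a / (a + b)           # known in closed form for this toy case

for n in (100, 10_000, 1_000_000):
    draws = rng.beta(a, b, size=n)
    estimate = draws.mean()        # Monte Carlo estimate of the posterior mean
    print(n, estimate, abs(estimate - exact_mean))
```

The absolute error should shrink roughly like one over the square root of the number of draws.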

When Does a Markov Chain Converge?

Now let us write $g(\theta | y) = \pi(\theta)$. This is allowed because $y$ is realized, so $g(\theta | y)$ is a function of $\theta$ only. Now comes MCMC: if we can create a chain whose stationary distribution is $\pi(\theta)$, then we can treat the chain as samples whose distribution converges to $\pi(\theta)$, and that is the reason we need the Markov chain to converge. Before we move forward, let us state some definitions.

Let $\pi$ be a probability measure on $(\mathcal{X},\mathcal{B})$ and let $\Phi = \{X^0, X^1, \ldots\}$ be a discrete-time Markov chain on $(\mathcal{X},\mathcal{B})$. Let $P$ be its transition kernel and $k$ its transition density, related by

\[P(x,A) = Pr(X^{i+1} \in A | X^i = x ) = \int_A k(x,y)dy\]

That is, $P(x,A)$ gives the probability of a one-step transition from state x to any state in A. The transition kernel acts as two linear operators:

$\lambda P$, where $\lambda$ is a probability distribution on $(\mathcal{X},\mathcal{B})$
$Pf$, where $f$ is a non-negative measurable function on $(\mathcal{X},\mathcal{B})$

where

\[\lambda P(A) = \int_{\mathcal{X}}\lambda(x)P(x,A)dx\]

so if $X^i \sim \lambda$ then $\lambda P$ is the marginal distribution of $X^{i+1}$, and

\[Pf(x) = \int_{\mathcal{X}}P(x,dy)f(y) = E_P[f(X^{i+1})|X^i = x]\]

and the m-step transition probability is given by

\[P^m(x,A) = \int_A k^m(x,y)dy\]

Invariant Density - $\pi$

\[\pi = \pi P \\ \Rightarrow \pi(x) = \int_{\mathcal{X}}\pi(y)k(y,x)dy\]

There are several ways to ensure that $\pi$ is the invariant (or stationary) distribution. One of them is to satisfy the detailed balance condition, i.e.

\[\pi(x)k(x,y) = \pi(y)k(y,x) \ \ \ \ \ \ \ \ \ \ \ for \ all \ x,y \in \mathcal{X}\]

Proof

Suppose $\pi$ satisfies the balance condition. Then

\[\begin{align*} \pi(x)k(x,y) = \pi(y)k(y,x) \ \ \ \ \ \ \ \ \ \ \ for \ all \ x,y \in \mathcal{X} \\ \\ \int_{\mathcal{X}}\pi(y)k(y,x)dy = \int_{\mathcal{X}}\pi(x)k(x,y)dy = \pi(x) \ \ \ \ \ \ \ \ \ \ \ \ \end{align*}\]
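The same argument can be checked numerically on a finite state space, where integrals become sums. The sketch below builds a Metropolis-type kernel for a hypothetical four-state target (both the target and the uniform proposal are assumptions for this example), then verifies detailed balance and invariance.

```python
import numpy as np

pi = np.array([0.1, 0.2, 0.3, 0.4])   # hypothetical target distribution
n = len(pi)

# K[x, y] = probability of moving from x to y: propose y uniformly,
# accept with probability min(1, pi[y] / pi[x]), otherwise stay at x.
K = np.zeros((n, n))
for x in range(n):
    for y in range(n):
        if x != y:
            K[x, y] = (1.0 / n) * min(1.0, pi[y] / pi[x])
    K[x, x] = 1.0 - K[x].sum()

flow = pi[:, None] * K                 # flow[x, y] = pi(x) k(x, y)
print(np.allclose(flow, flow.T))       # detailed balance: pi(x)k(x,y) = pi(y)k(y,x)
print(np.allclose(pi @ K, pi))         # invariance: pi = pi P
```

Both checks should print `True`, mirroring the proof: symmetry of the flow matrix is the balance condition, and summing it over one index gives invariance.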

However, the balance condition is only sufficient, not necessary: reversibility is not required for $\pi$ to be invariant. If $X^i \sim \pi$ and the chain preserves this distribution over any number of transitions, we say that the Markov chain is stationary. MCMC requires more, namely that the chain converges to $\pi$ from its starting value.

Let us define

$\phi$-irreducible: A Markov chain is $\phi$-irreducible for some measure $\phi$ on $(\mathcal{X},\mathcal{B})$ if for all $x \in \mathcal{X}$ and $A \in \mathcal{B}$ for which $\phi(A) > 0$, there exists n for which $P^n(x,A)>0$

Aperiodic: A chain is aperiodic if its period is 1

Harris Recurrent: A $\phi$-irreducible Markov chain is Harris recurrent if, for every $\phi$-positive set A and every starting value, the chain reaches A with probability 1

Harris Ergodic: A Markov chain is said to be Harris ergodic if it is $\phi$-irreducible, aperiodic, Harris recurrent, and possesses an invariant distribution $\pi$, for some measures $\phi$ and $\pi$

Total Variation Distance: The total variation distance between two measures $\mu(\cdot)$ and $\nu(\cdot)$ is defined by

\[\| \mu(\cdot) - \nu(\cdot)\| = \sup_{A \in \mathcal{B}}|\mu(A)-\nu(A)|\]
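For distributions on a finite set the supremum can be computed by brute force over all subsets, and it agrees with half the sum of absolute differences of the probabilities. A small sketch; the function names and the two example distributions are made up for illustration.

```python
from itertools import combinations

def tv_by_definition(mu, nu):
    """Supremum of |mu(A) - nu(A)| over every subset A of the state space."""
    states = range(len(mu))
    best = 0.0
    for r in range(len(mu) + 1):
        for A in combinations(states, r):
            gap = abs(sum(mu[i] for i in A) - sum(nu[i] for i in A))
            best = max(best, gap)
    return best

def tv_half_l1(mu, nu):
    """Equivalent shortcut on a finite space: half the L1 distance."""
    return 0.5 * sum(abs(m - v) for m, v in zip(mu, nu))

mu = [0.5, 0.3, 0.2]
nu = [0.2, 0.3, 0.5]
print(tv_by_definition(mu, nu), tv_half_l1(mu, nu))
```

The brute-force version is exponential in the number of states, so the shortcut is the practical choice; the brute-force version is here only to mirror the definition.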

What Does Harris Ergodicity Guarantee?

The chain is guaranteed to explore the entire space without getting stuck
Strong consistency of Markov chain averages
Convergence of the Markov chain to its stationary distribution in total variation distance

The following two theorems are very important for MCMC

Ergodic Theorem: Suppose a Markov chain $\Phi$ is Harris ergodic with invariant distribution $\pi$ and $E_{\pi} | g(X) | < \infty$ for some function $g : \mathcal{X} \to \Bbb{R}$. Then for any starting value $x \in \mathcal{X}$,

\[\bar{g}_n = \frac{1}{n}\sum_{i=0}^{n-1}g(X^i) \to E_{\pi}g(X) \ almost \ surely \ as \ n \ \to \infty\]

and that is the main requirement that we use generally in MCMC

Birkhoff, George D. “Proof of the Ergodic Theorem.” Proceedings of the National Academy of Sciences of the United States of America, vol. 17, no. 12, 1931, pp. 656–660. JSTOR, www.jstor.org/stable/86016. Accessed 9 Apr. 2021.

The other Theorem is as follows

Suppose a Markov chain $\Phi$ is Harris ergodic with invariant distribution $\pi$. Then for any starting value $x \in \mathcal{X}$, $\Phi$ converges to $\pi$ in total variation distance, i.e.

\[||P^n(x,.) - \pi(.)|| \to 0 \ as \ n \to \infty\]

Further, $\| P^n(x,\cdot) - \pi(\cdot)\|$ is monotonically non-increasing in n.
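Both claims can be observed on a small finite chain. The sketch below uses a hypothetical three-state, row-stochastic matrix (an assumption for this example), computes its stationary distribution from the left eigenvector, and tracks the total variation distance from a fixed starting state.

```python
import numpy as np

# Hypothetical row-stochastic kernel: P[x, y] = probability of moving from x to y.
P = np.array([[0.5, 0.4, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])

# Stationary distribution: left eigenvector of P for eigenvalue 1, normalized to sum to 1.
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
pi = pi / pi.sum()

dist = np.array([1.0, 0.0, 0.0])         # the chain starts in state 0
tv = []
for n in range(15):
    tv.append(0.5 * np.abs(dist - pi).sum())   # ||P^n(x, .) - pi(.)||
    dist = dist @ P                             # advance one step
print(np.round(tv, 6))
```

The printed sequence should never increase and should shrink toward zero.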

Rate of Convergence

The theorem above tells us that the Markov chain converges, but it says nothing about the rate of convergence. We call a Markov chain that converges at a geometric rate geometrically ergodic, i.e. there exist $M:\mathcal{X} \to \Bbb{R}$ and some constant $t \in (0,1)$ that satisfy

\[||P^n(x,.)-\pi|| \leq M(x)t^n \ \ \ \ \ for \ any \ x \in \mathcal{X}\]

If M is bounded, the Markov chain is uniformly ergodic.

As long as the starting value x is such that M(x) is not large, geometric ergodicity guarantees quick convergence of the Markov chain
Geometric ergodicity holds for every irreducible and aperiodic Markov chain on a finite state space

What is Needed for Geometric Ergodicity

Drift and Minorization Condition

A Type I drift condition holds if there exist some non-negative function $V:\mathcal{X} \to \Bbb{R}_{\geq 0}$ and constants $0 < \gamma <1$ and $L < \infty$ such that

\[PV(x) \leq \gamma V(x) + L \ \ \ \ \ \ \ \ \ \ \ \ for \ any \ x \in \mathcal{X}\]

Further, we call V a drift function and $\gamma$ a drift rate.

A minorization condition holds on a set $C \in \mathcal{B}$ if there exist some positive integer $m$, some $\epsilon > 0$, and a probability measure Q on $(\mathcal{X},\mathcal{B})$ for which, for all $x \in C$ and $A \in \mathcal{B}$,

\[P^m(x,A) \geq \epsilon Q(A)\]

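We also call this an m-step minorization condition, and C is called a small set. In terms of densities, it is implied by the following condition, where $q$ is the density of $Q$:

\[k^m(x,y) \geq \epsilon q(y) \ \ \ \ \ for \ all \ x \in C \ and \ y \in \mathcal{X}\]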

Proposition

Suppose a Markov chain $\Phi$ is irreducible and aperiodic with invariant distribution $\pi$. Then $\Phi$ is geometrically ergodic if the following two conditions are met:

A Type I drift condition holds
There exists some constant $d > 2L/(1-\gamma)$ for which a one-step minorization condition holds on the set $C= \{x:V(x)\leq d\}$

This proposition is a corollary of the following result of Rosenthal (1995a).

Let $\Phi$ be an aperiodic and irreducible Markov chain with invariant distribution $\pi$.

Suppose Conditions 1 and 2 of the Proposition hold, let $X^0 = x_0$ be the starting value, and define

\[\alpha = \frac{1+d}{1+2L+\gamma d} \ \ \ \ \ \ and \ \ \ \ \ \ \ U = 1+2(\gamma d+L)\]

Then for any $r \in (0,1)$

\[||P^n(x_0 ,.) - \pi(.)|| \leq (1-\epsilon)^{rn} +\left(\frac{U^r}{\alpha^{1-r}} \right)^n\left(1 + \frac{L}{1-\gamma} + V(x_0)\right)\]

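Since $d > 2L/(1-\gamma)$ gives $\alpha > 1$, we can choose $r$ small enough that $U^r/\alpha^{1-r} < 1$. Both terms then decay geometrically in $n$, so the bound satisfies the geometric ergodicity condition.

M(x) is proportional to V(x) + 1, hence the starting point should make V(x) small.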

Type II Drift Condition: A Type II drift condition holds if there exist some function $W : \mathcal{X} \to [1,\infty)$, finite at some $x \in \mathcal{X}$, some set $D \in \mathcal{B}$, and constants $0 < \rho < 1$ and $b < \infty$ for which

\[PW(x) \leq \rho W(x) + bI_D(x) \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ for \ all \ x \in \mathcal{X}\]

It can be shown that Type I Drift Condition $\Leftrightarrow$ Type II Drift Condition

Finally we can say that

Suppose a Markov chain $\Phi$ is aperiodic and $\phi$-irreducible with invariant distribution $\pi$. Then $\Phi$ is geometrically ergodic if there exist some small set D, a drift function $W: \mathcal{X} \to [1,\infty)$, and some constants $0 < \rho < 1$ and $b < \infty$ for which a Type II drift condition holds.

Now let me restate the earlier theorem

Suppose a Markov chain $\Phi$ is Harris ergodic with invariant distribution $\pi$. Then for any starting value $x \in \mathcal{X}$, $\Phi$ converges to $\pi$ in total variation distance, i.e.

\[||P^n(x,.) - \pi(.)|| \to 0 \ as \ n \to \infty\]

Further, $\| P^n(x,\cdot) - \pi(\cdot)\|$ is monotonically non-increasing in n.

Jain and Jamison (1967) have shown that for every $\phi$-irreducible Markov chain on $(\mathcal{X},\mathcal{B})$ there exists some small set $C \in \mathcal{B}$ for which $\phi(C) > 0$. Furthermore, the corresponding minorization measure $Q(\cdot)$ can be defined so that $Q(C) > 0$.

The Jain and Jamison result allows us to assume a set $C \in \mathcal{B}$ such that

\[P(x , A) \geq \epsilon Q(A) \ \ \ \ \ for \ all \ x \in C\]

That is a one-step minorization condition. Now we can write

\[P(x,A) = \epsilon Q(A) + (1-\epsilon)R(x,A) \ \ \ \ \ \ \ for \ all \ x \in C \ and \ A \in \mathcal{B}\]

Here $R(x,\cdot)$ is a probability measure on $(\mathcal{X},\mathcal{B})$. This allows us to construct two separate chains which couple with probability 1:

\[\Phi(X) = \{X^0,X^1 ...........\} \\ \Phi(Y) = \{Y^0,Y^1 ............\}\]

Now $(X^{n},Y^n) \to (X^{n+1},Y^{n+1})$ is updated with the following algorithm

While $X^n \neq Y^n$
1. If $(X^n,Y^n) \not\in C \times C$
  1. Draw $X^{n+1} \sim P(X^n,\cdot)$ and $Y^{n+1} \sim P(Y^n,\cdot)$ independently
2. If $(X^n,Y^n) \in C \times C$
  1. Draw $\delta_n \sim Bern(\epsilon)$
  2. If $\delta_{n} = 0$, draw $X^{n+1} \sim R(X^n,\cdot)$ and $Y^{n+1} \sim R(Y^n,\cdot)$ independently
  3. Otherwise, draw $X^{n+1} = Y^{n+1} \sim Q(\cdot)$
Once $X^n = x = Y^n$, draw $X^{n+1} = Y^{n+1} \sim P(x,\cdot)$

Now define the coupling time T as the first n for which $(X^{n-1},Y^{n-1}) \in C \times C$ and $\delta_{n-1}=1$. Once the chains couple, they remain equal.

Now let us assume

\[X^0 = x \ and \ Y^0 \sim \pi\]

Let $Pr_x$ denote probability with respect to the starting point x. Since $Y^0 \sim \pi$, the chain $\Phi(Y)$ is stationary, so

\[\begin{align*} |P^n(x,A) - \pi(A)| &= |Pr_x(X^n \in A) - Pr_x(Y^n \in A)| \\ &= |Pr_x(X^n \in A,X^n = Y^n) +Pr_x(X^n \in A,X^n \neq Y^n)- Pr_x(Y^n \in A,X^n \neq Y^n)- Pr_x(Y^n \in A,X^n = Y^n)| \\ &= |Pr_x(X^n \in A,X^n \neq Y^n)- Pr_x(Y^n \in A,X^n \neq Y^n)| \\ &\leq \max\{Pr_x(X^n \in A,X^n \neq Y^n),\ Pr_x(Y^n \in A,X^n \neq Y^n)\} \\ &\leq Pr_x(X^n \neq Y^n) \\ &\leq Pr_x(T > n) \end{align*}\]

Thus

\[||P^n(x,.) - \pi(.)|| \leq Pr_x(T>n)\]

Now let us suppose the minorization condition holds over the entire space, i.e. $C = \mathcal{X}$. In this case every pair generated belongs to $C \times C$ for all n, so

\[T \sim Geo(\epsilon) \\ P(T>n) = (1-\epsilon)^n\]

\[||P^n(x,.) - \pi(.)|| \leq (1-\epsilon)^n\]

So when $C = \mathcal{X}$, $\|P^n(x,\cdot) - \pi(\cdot)\| \to 0$ as $n \to \infty$.

When $C \neq \mathcal{X}$, the distribution of $T$ is complicated and beyond the scope of this presentation.
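The bound for the case $C = \mathcal{X}$ can be checked numerically on a finite chain, where a whole-space minorization is obtained from the column minima of the transition matrix. The three-state matrix below is a hypothetical example, and taking $\epsilon Q(y) = \min_x P(x,y)$ is one convenient choice of minorization, not the only one.

```python
import numpy as np

# Hypothetical row-stochastic kernel: P[x, y] = probability of moving from x to y.
P = np.array([[0.5, 0.4, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])

eps_Q = P.min(axis=0)        # eps * Q(y) = min over x of P(x, y)
eps = eps_Q.sum()            # minorization constant
Q = eps_Q / eps              # minorization measure, so P(x, A) >= eps * Q(A) for all x

# Stationary distribution from the left eigenvector for eigenvalue 1.
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
pi = pi / pi.sum()

bound_holds = True
for x in range(P.shape[0]):
    dist = np.eye(P.shape[0])[x]                 # start at state x
    for n in range(11):
        tv = 0.5 * np.abs(dist - pi).sum()       # ||P^n(x, .) - pi(.)||
        bound_holds = bound_holds and tv <= (1 - eps) ** n + 1e-12
        dist = dist @ P
print(eps, bound_holds)
```

Here the minorization constant works out to 0.5, so the distance to stationarity is at most $0.5^n$ from every starting state.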

Deterministic Update Gibbs Sampler (DUGS)

Let us assume our target distribution is $\pi(\theta)$ with $\theta = (\theta_1,\theta_2, \ldots, \theta_d)$

Notation: $\theta_{-i}$ is the vector of parameters except $\theta_i$

Initialization: $\theta^0 = (\theta_1^0,\theta_2^0, \ldots, \theta_d^0)$

Iteration: For $i \geq 1$

Sample $\theta_1^i \sim \pi(\theta_1 | \theta^{i-1}_{-1})$
Sample $\theta_2^i \sim \pi(\theta_2 | \theta_1^i , \theta^{i-1}_{-(1,2)})$
$\vdots$
Sample $\theta_d^i \sim \pi(\theta_d | \theta^{i}_{-d})$

The transition kernel for two parameters is given by

\[k((\theta_1,\theta_2),(\tilde{\theta}_1,\tilde{\theta}_2)) = \pi(\tilde\theta_1|\theta_2)\cdot \pi(\tilde\theta_2|\tilde\theta_1)\]

Let us check stationarity for two parameters

\[\begin{align*} \int\int \pi(\theta_1,\theta_2)k((\theta_1,\theta_2),(\tilde{\theta}_1,\tilde{\theta}_2))d\theta_1d\theta_2 &= \int\int \pi(\theta_1,\theta_2)\pi(\tilde\theta_1|\theta_2)\cdot \pi(\tilde\theta_2|\tilde\theta_1)d\theta_1d\theta_2 \\ &= \int \pi(\theta_2)\pi(\tilde\theta_1|\theta_2)\cdot \pi(\tilde\theta_2|\tilde\theta_1)d\theta_2 \\ &= \int \pi(\tilde\theta_1,\theta_2)\cdot \pi(\tilde\theta_2|\tilde\theta_1)d\theta_2 \\ &= \pi(\tilde\theta_1)\cdot \pi(\tilde\theta_2|\tilde\theta_1) \\ &= \pi(\tilde\theta_2,\tilde\theta_1) \\ \end{align*}\]

However, invariance alone does not suffice for convergence. Aperiodicity is needed to ensure that the chain does not cycle, and irreducibility ensures that it does not get stuck, so that the whole space is explored. We can also show that the marginal chain of $\theta_1$ satisfies the balance condition. Let $\Phi_1=\{\theta_1^0,\theta_1^1, \ldots\}$ and let $k_1(\theta_1,\tilde\theta_1)$ be the transition density of $\Phi_1$ from $\theta_1$ to $\tilde\theta_1$, obtained by first drawing $\theta_2 \sim \pi(\theta_2|\theta_1)$ and then $\tilde\theta_1 \sim \pi(\tilde\theta_1|\theta_2)$. Then

\[\begin{align*} \pi(\theta_1) k_1(\theta_1,\tilde\theta_1) &= \pi(\theta_1)\int \pi(\theta_2|\theta_1)\cdot \pi(\tilde\theta_1|\theta_2)d\theta_2 \\ &= \int \frac{\pi(\theta_1,\theta_2)\cdot \pi(\tilde\theta_1,\theta_2)}{\pi(\theta_2)} d\theta_2\\ &= \pi(\tilde\theta_1) \int \pi(\theta_2|\tilde\theta_1)\cdot \pi(\theta_1|\theta_2)d\theta_2\\ &= \pi(\tilde\theta_1) k_1(\tilde\theta_1,\theta_1) \end{align*}\]

Example

Let us suppose

\[Y_1 , Y_2, \ldots, Y_m \overset{iid}{\sim} N(\mu, \theta)\]

where $m \geq 5$ , Let us assume the joint prior density as

\[g(\mu,\theta) \propto \frac{1}{\sqrt{\theta}}\]

Let $y = (y_1,y_2, \ldots, y_m)$ be the sample data with mean $\bar y$ and $s^2 = \sum(y_i - \bar y)^2$. Then the posterior is given by

\[g(\mu , \theta | y) \propto \theta^{-\frac{m+1}{2}}exp \bigg( -\frac{1}{2\theta} \sum_{j=1}^m (y_j - \mu)^2\bigg)\]

and

\[\theta | \mu,y \sim IG\left(\frac{m-1}{2}, \frac{s^2+m(\mu -\bar{y})^2}{2}\right) \\ \mu | \theta ,y \sim N(\bar y,\frac{\theta}{m})\]

Recall that the Inverse Gamma distribution with parameters (a,b) has kernel $x^{-(a+1)}e^{-b/x}$.

Let us use the DUGS sampler with the following update scheme

\[(\theta^{'},\mu^{'}) \to (\theta,\mu^{'}) \to (\theta,\mu)\]

so the kernel density will be given by

\[k((\mu^{'},\theta^{'}),(\mu,\theta)) = \pi(\theta|\mu^{'},y)\pi(\mu|\theta,y)\]
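This two-step update can be written as a short sampler. The sketch below is an illustration under stated assumptions: the function name `gibbs_normal`, the toy data `1, 2, ..., 10`, the starting value, and the number of iterations are all made up for this example. An Inverse Gamma draw with parameters (a, b) is obtained as the reciprocal of a Gamma draw with shape a and scale 1/b.

```python
import numpy as np

def gibbs_normal(y, n_iter=20_000, mu0=None, seed=0):
    """DUGS for (mu, theta): draw theta | mu, y and then mu | theta, y."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    m, ybar = len(y), y.mean()
    s2 = np.sum((y - ybar) ** 2)             # s^2 as defined above (a sum of squares)
    mu = ybar if mu0 is None else mu0         # hypothetical default starting value
    draws = np.empty((n_iter, 2))
    for i in range(n_iter):
        shape = (m - 1) / 2
        rate = (s2 + m * (mu - ybar) ** 2) / 2
        theta = 1.0 / rng.gamma(shape, 1.0 / rate)   # theta | mu, y ~ IG(shape, rate)
        mu = rng.normal(ybar, np.sqrt(theta / m))    # mu | theta, y ~ N(ybar, theta / m)
        draws[i] = mu, theta
    return draws

y = np.arange(1.0, 11.0)                      # toy data, m = 10
draws = gibbs_normal(y)
print(draws.mean(axis=0))                     # averages of the mu and theta draws
```

As a sanity check, integrating $\mu$ out of the posterior above gives $\theta | y \sim IG(m/2 - 1, s^2/2)$, whose mean is $s^2/(m-4)$; the average of the $\theta$ draws should be near that value and the average of the $\mu$ draws near $\bar y$.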

Type 1 Drift Condition

Let us define $V(\mu , \theta) = (\mu - \bar{y})^2$

\[E[V(\mu,\theta)|\mu^{'},\theta^{'}] = E[V(\mu,\theta)|\mu^{'}] =E[E[V(\mu,\theta)|\theta]|\mu^{'}]\]

where

\[E[V(\mu,\theta)|\theta] = E[(\mu-\bar{y})^2|\theta] = Var[\mu|\theta] = \frac{\theta}{m}\]

Then

\[\begin{align*} E[V(\mu,\theta)|\mu^{'},\theta^{'}] &= E\left[\frac\theta m \,\Big|\, \mu^{'}\right] \\ &= \frac{1}{m} \cdot \frac{s^2+m(\mu^{'}-\bar{y})^2}{m-3} \\ &= \frac{(\mu^{'}-\bar{y})^2}{m-3} + \frac{s^2}{m(m-3)} \\ &= \frac{1}{m-3}V(\mu^{'},\theta^{'}) + \frac{s^2}{m(m-3)} \end{align*}\]

Now $m \geq 5$ guarantees that $\frac{1}{m-3} < 1$, hence

\[PV(\mu^{'},\theta^{'}) =E[V(\mu,\theta)|\mu^{'},\theta^{'}] \leq \frac{1}{m-3}V(\mu^{'},\theta{'}) + \frac{s^2}{m(m-3)}\]

So it satisfies the drift condition with any $\gamma \in [1/(m-3),1)$ and $L =s^2/(m(m-3))$.
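The drift identity can be checked by simulation: fix a previous value $\mu^{'}$, run one update many times, and compare the average of $V$ with the closed form. The data, the choice $\mu^{'} = \bar y + 2$, and the sample size below are assumptions made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
y = np.arange(1.0, 11.0)                  # toy data, m = 10
m, ybar = len(y), y.mean()
s2 = np.sum((y - ybar) ** 2)

mu_prev = ybar + 2.0                      # hypothetical previous value of mu
V_prev = (mu_prev - ybar) ** 2

n = 200_000
# One update from mu_prev: theta | mu_prev, y is Inverse Gamma, then mu | theta, y is normal.
theta = 1.0 / rng.gamma((m - 1) / 2, 2.0 / (s2 + m * V_prev), size=n)
mu = rng.normal(ybar, np.sqrt(theta / m))

lhs = np.mean((mu - ybar) ** 2)                       # Monte Carlo estimate of PV
rhs = V_prev / (m - 3) + s2 / (m * (m - 3))           # closed form derived above
print(lhs, rhs)
```

The two printed numbers should agree up to Monte Carlo error.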

Minorization Condition

Let $C = \{(\mu,\theta) : V(\mu,\theta) \leq d \}$ for some $d > 2L/(1-\gamma)$. We need a density q and $\epsilon > 0$ for which

\[k((\mu^{'},\theta^{'}),(\mu, \theta)) \geq \epsilon q(\mu,\theta)\ for \ all \ (\mu^{'},\theta^{'}) \in C \ and \ (\mu, \theta) \in \Bbb{R} \times \Bbb{R}_+\]

Note that

\[k((\mu^{'},\theta^{'}),(\mu, \theta)) = \pi(\mu|\theta,y)\pi(\theta | \mu^{'},y) \geq \pi(\mu|\theta,y) \inf_{(\mu^{'},\theta^{'}) \in C} \pi(\theta | \mu^{'},y)\]

Let $IG(a,b ; x)$ denote the Inverse Gamma density at $x>0$

\[\begin{align*} g(\theta) &= \inf_{(\mu^{'},\theta^{'}) \in C} \pi(\theta | \mu^{'},y) \\ &= \inf_{(\mu^{'},\theta^{'}) \in C} IG\left(\frac{m-1}{2},\frac{s^2}{2}+\frac{m}{2}(\mu^{'}-\bar{y})^2;\theta\right) \\ &= \left\{ \begin{array}{ll} IG(\frac{m-1}{2},\frac{s^2}{2}+\frac{md}{2} ; \theta ) & if \ \theta < \theta^* \\ IG(\frac{m-1}{2},\frac{s^2}{2} ; \theta ) & if \ \theta \geq \theta^* \end{array} \right. \end{align*}\]

where $\theta^{*} = md[(m-1)\log(1+md/s^2)]^{-1}$

\[k((\mu^{'},\theta^{'}),(\mu, \theta)) \geq \pi(\mu | \theta,y)g(\theta) = \epsilon q(\mu,\theta)\]

where $\epsilon = \int_{\Bbb{R}_+} g(\theta)d\theta$ and $q(\mu , \theta) = \epsilon^{-1}\pi(\mu | \theta,y)g(\theta)$

Hence the minorization condition holds.

Highest Posterior Density Interval

2020-11-01T00:00:00+00:00

The Highest Posterior Density (HPD) interval is the interval of parameter values in which the posterior density is higher than at any point outside the interval. Formally, a $100(1-\alpha)\%$ HPD region for a parameter $\theta$ is $\mathcal{C} = \{ \theta : \pi(\theta \vert x) \geq k \}$, where k is the largest number such that

\[\int_{\theta : \pi(\theta | x) \geq k } \pi(\theta | x) \mathrm{d} \theta = 1 - \alpha\]

We can think of k as a horizontal line drawn across the posterior density: the line intersects the density at the endpoints of the interval, and the area under the density between those endpoints equals $1-\alpha$.

Example

The following is a class assignment from my master's in the fall of 2019.

Suppose the following dataset follows an exponential distribution with scale parameter ${\theta}$. Choose a prior for ${\theta}$, then obtain the posterior distribution, the Bayes estimator, and the 0.95 HPD interval for the parameter.

3.29, 7.53, 0.48, 2.03, 0.36, 0.07, 4.49, 1.05, 9.15,3.67, 2.22, 2.16, 4.06, 11.62, 8.26, 1.96, 9.13, 1.78, 3.81, 17.02

The density of the data model will be given by

\[f(x|\theta) = \frac{1}{\theta}e^{\frac{-x}{\theta}}\]

Let us denote $\sum_{i=1}^n x_i =S_n$. The likelihood is then given by

\[L(x|\theta) = \left(\frac{1}{\theta}\right)^ne^{\frac{-S_n}{\theta}}\]

Since we do not have any information about $\theta$, let us assume the non-informative prior

\[\pi{(\theta)} = \frac{1}{\theta}\]

Then the posterior will be given by

\[\pi{(\theta|x)} = \frac{\frac{1}{\theta} \cdot \left(\frac{1}{\theta}\right)^ne^{\frac{-S_n}{\theta}}}{\int_0^{\infty}\frac{1}{\theta} \cdot \left(\frac{1}{\theta}\right)^ne^{\frac{-S_n}{\theta}}d\theta}\]

\[\pi{(\theta|x)} = \frac{S_{n}^n}{\Gamma(n)}{ \cdot \left(\frac{1}{\theta}\right)^{n+1}e^{\frac{-S_n}{\theta}}}\]

This is the density of the Inverse Gamma distribution, so

\[\theta | x \sim \text{Inv-Gamma}(n,S_n)\]

So the Bayes estimate under squared error loss, i.e. the posterior mean $E(\theta | x)$, is given by $\frac{S_n}{n-1}$

Code

xobs <- c(3.29, 7.53, 0.48, 2.03, 0.36, 0.07, 4.49, 1.05, 9.15,3.67, 2.22,
 2.16, 4.06, 11.62, 8.26, 1.96, 9.13, 1.78, 3.81, 17.02)
Bayes_Estimate = sum(xobs)/(length(xobs)-1) # Bayes Estimate
cat("Bayes Estimate of scale parameter is given by ",Bayes_Estimate)

## Bayes Estimate of scale parameter is given by  4.954737

Now the HPD interval is given by

\[\int_{\theta : \pi(\theta|X) \geq k} \pi(\theta|X)d\theta = 1-\alpha\]

where $1- \alpha = 0.95$. As before, this can be thought of as a horizontal line across the posterior density: the area under the density between the two points where it intersects the line is 0.95.

Let us take a look at the posterior density function

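library(invgamma)  # assumed source of dinvgamma()/pinvgamma(); the original post does not show which package was loaded
s = sum(xobs)
l = length(xobs)
curve(dinvgamma(x, rate = s, shape = l), from = 0, to = 10)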

Now let us find the HPD interval. The posterior here is given by

\[\pi{(\theta|x)} = \frac{S_{n}^n}{\Gamma(n)}{ \cdot \left(\frac{1}{\theta}\right)^{n+1}e^{\frac{-S_n}{\theta}}}\]

Code for HPDI

ruler1 <- seq(2, s/(l+1),length=3500 )  # s/(l+1) is the mode of the posterior
ruler2 <- seq(s/(l+1), 8 ,length = 5000)
target = 0.95 
tolerance = 0.0005
done<- FALSE
for(i in ruler1)
{
  for(j in ruler2)
  {
    if(round(dinvgamma(i,rate=s,shape = l),3)==round(dinvgamma(j,rate=s,shape = l),3))
    {
      #print(paste(i,"and",j))
      L <- pinvgamma(i,rate=s,shape=l)
      H <- pinvgamma(j,rate=s,shape=l)
      if (((H-L)<(target+tolerance)) & ((H-L)>(target-tolerance)))
      { 
        done <- TRUE
        break
      }
    }
  }
 if (done){break}
}
HPD.L <- i; HPD.U <- j
print(paste(target*100, "% HPD interval:", HPD.L, "to", HPD.U))

## [1] "95 % HPD interval: 2.94588413015964 to 7.2851736061498"
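The horizontal-line idea can also be implemented directly on a grid: evaluate the posterior density, keep the highest-density grid points until they hold 95% of the mass, and report the smallest and largest kept values. The sketch below is a Python counterpart to the R search above, not a translation of it; the grid range 0.5 to 40 and the grid size are assumptions chosen so that the mass outside the grid is negligible.

```python
import numpy as np

xobs = np.array([3.29, 7.53, 0.48, 2.03, 0.36, 0.07, 4.49, 1.05, 9.15, 3.67, 2.22,
                 2.16, 4.06, 11.62, 8.26, 1.96, 9.13, 1.78, 3.81, 17.02])
n, S = len(xobs), xobs.sum()

theta = np.linspace(0.5, 40.0, 200_001)           # assumed grid for theta
log_post = -(n + 1) * np.log(theta) - S / theta   # log of the Inv-Gamma(n, S_n) kernel
mass = np.exp(log_post - log_post.max())
mass /= mass.sum()                                # grid approximation of posterior mass

order = np.argsort(mass)[::-1]                    # grid points, highest density first
cutoff = np.searchsorted(np.cumsum(mass[order]), 0.95)
keep = order[: cutoff + 1]                        # smallest high-density set with >= 95% mass
lo, hi = theta[keep].min(), theta[keep].max()
print(lo, hi)
```

Because the posterior is unimodal, the kept points form a single interval, and the endpoints should land close to the interval found by the R code.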

Introduction to Logistic Regression

2020-10-12T00:00:00+00:00

Usually in linear regression we have a matrix $X$ of explanatory variables whose columns $X_1 , X_2, \ldots, X_{p}$ are the independent variables we use to predict the dependent variable $y$, and $y$ is measured on a continuous scale. Logistic regression handles the case where the dependent variable y is dichotomous, such as Male or Female, Pass or Fail, Malignant or Benign.

When the dependent variable y is qualitative, we can represent it by an indicator variable such as

\[y = 0\ \ \ if\ female \\ y = 1 \ \ \ if \ male\]

\[y_i = \beta_0 + \beta_1x_{i1}+ \beta_2x_{i2}+.....+ \beta_px_{ip} + \epsilon_i \ \ \ \ \ \ i = 1,2,3,........,n\]

or in the matrix form we can write

\[Y = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{bmatrix} \ \ X = \begin{bmatrix} 1 & x_{1,1} & x_{1,2} & \cdots & x_{1,p}\\ 1 & x_{2,1} & x_{2,2} & \cdots & x_{2,p}\\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n,1} & x_{n,2} & \cdots & x_{n,p} \end{bmatrix} \ \ \beta = \begin{bmatrix} \beta_{0} \\ \beta_{1} \\ \beta_{2} \\ \vdots \\ \beta_p \end{bmatrix} \ \ \epsilon = \begin{bmatrix} \epsilon_{1} \\ \epsilon_{2} \\ \epsilon_{3} \\ \vdots \\ \epsilon_n \end{bmatrix}\]

that is

\[Y = X\beta + \epsilon\]

Remember that the first column of the matrix X is $\underline{1}$, for the constant $\beta_0$

Our dependent variable y, which we have to predict, is an indicator that takes two values, so assume y follows a Bernoulli distribution

\[y_i = 1 \ with \ P(y_i = 1 ) = \pi_i \\ y_i = 0 \ with \ P(y_i = 0 ) = 1-\pi_i\]

Assuming $E(\epsilon_i) = 0$,

\[E(y_i) = 1 \cdot \pi_i + 0 \cdot(1 - \pi_i) = \pi_i \\ E(Y) = X\beta = \pi\]

where

\[\pi = \begin{bmatrix} \pi_{1} & \pi_{2} & \pi_{3}& \cdots & \pi_{n}\\ \end{bmatrix}^{T}\]

In linear regression $\epsilon$ is assumed to follow a normal distribution, whereas here we cannot make that assumption, because for a given $x_i$ the error takes only two discrete values.

So we have $E(y_i) =\pi_{i} = \beta_0 + \beta_1x_{i1}+ \beta_2x_{i2}+ \cdots + \beta_px_{ip}$ where $E(y_i) \in [0,1]$, which puts a bound on the expected value of y that an unrestricted linear function does not respect.

In logistic regression we use the standard logistic function, which some people call the sigmoid function. It is given by

\[E(y_i) = \pi_i = \frac{1}{1+e^{-(\beta_0 + \beta_1x_{i1}+ \beta_2x_{i2}+.....+ \beta_px_{ip})}} \tag{1}\]

Our main aim in logistic regression is to predict $\pi$, the Bernoulli parameter for $Y$, and generally we make the decision according to whether $\pi_i$ is greater or less than 0.5.

Link Function

Usually every model has a link function, which relates the linear predictor $ \eta_i $ to the mean response $ \mu_i $. First of all we have to understand what the linear predictor is: it is the systematic component $ \eta_i = \beta_0 + \beta_1x_{i1}+ \beta_2x_{i2}+ \dots + \beta_px_{ip} $. So if $g( . )$ is a link function then

\[g(\mu_i ) = \eta_i \ \ or \ \ \mu_i =g^{-1}(\eta_i)\]

In linear regression this link is the identity link, whereas in logistic regression $ \mu_i = E(y_i) =\pi_{i} $ and the relation between $\pi_i$ and $\eta_i$ is the logistic function, so

\[g(\pi) = X\beta\]

We already have equation $\eqref{1}$, and we can use it to get the link function

\[\pi = \frac{\exp(X\beta)}{1+\exp(X\beta)} \\ X\beta=\eta = \ln\left(\frac{\pi}{1-\pi}\right)\]

where $\frac{\pi}{1-\pi}$ is the odds and its log is known as the log-odds; this transformation is the logit transformation.
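To make the relation between the linear predictor, the sigmoid and the logit concrete, here is a minimal sketch in plain Python. The coefficients and the observation are made-up values chosen only for illustration; the point is that applying the logit to the sigmoid output recovers the linear predictor.

```python
import math

def sigmoid(eta):
    """Standard logistic function: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

def logit(pi):
    """Log-odds, the inverse of the sigmoid, defined for 0 < pi < 1."""
    return math.log(pi / (1.0 - pi))

# Hypothetical coefficients and one observation (leading 1.0 for the intercept).
beta = [-1.0, 0.8, 0.5]
x = [1.0, 2.0, -1.0]

eta = sum(b * xj for b, xj in zip(beta, x))  # linear predictor
pi = sigmoid(eta)                            # estimated P(y = 1 | x)

print(eta, pi, logit(pi))  # logit(pi) matches eta up to rounding
```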

There is no closed-form solution for $\beta$, so we use an iterative algorithm such as gradient descent to calculate the parameters

Supervised Learning with Scikit Learn

2020-07-07T00:00:00+00:00

Machine learning is the art of giving computers the ability to learn from data and make decisions on their own without being explicitly programmed, for example:

Determining whether a tumor is benign or malignant according to its size
Google News selecting similar news items and clustering those that are related
Classifying emails as spam or not spam
Predicting house prices according to the number of rooms, furnishing, age, etc.
Detecting whether a bank transaction is fraudulent or not

There are many more examples of machine learning; here we are going to discuss supervised machine learning. The data has two parts, features and labels. Features are the input for the model: if we give the model the size of a tumor, it will tell us whether the tumor is malignant or benign, and that prediction, malignant or benign, is the label. Some types of data do not contain labels; grouping related news items, for example, does not require any labels. In supervised learning we are concerned with labeled data, so loosely we can say that machine learning with labels is known as supervised learning.

For further understanding, we are going to use the iris dataset, which has 4 features, Sepal.Length, Sepal.Width, Petal.Length and Petal.Width, and one target variable, Species.

The full dataset has the labels virginica, setosa and versicolor; however, only part of the data is shown here, so the target column contains only setosa.

The realizations of the target variable are known as labels, although most data scientists use the two terms interchangeably. Predictor variable and feature mean the same thing and are also known as the independent variable, while the target variable is known as the dependent variable.

Classification

Classification is the machine learning task of assigning observations to classes, such as classifying whether a mail is spam or not, or, in the iris data, classifying whether a plant is virginica, setosa or versicolor.

First of all we are going to load our dataset using the following code, which also imports pandas and numpy under their usual aliases.

from sklearn import datasets
import pandas as pd
import numpy as np
iris = datasets.load_iris()
type(iris)

sklearn.utils.Bunch

We can see that the iris dataset is a Bunch, a datatype that holds key-value pairs. We can look at the keys using the following code

print(iris.keys())

dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])

type(iris.data),type(iris.target),type(iris.target_names),type(iris.DESCR),type(iris.feature_names)

(numpy.ndarray, numpy.ndarray, numpy.ndarray, str, list)

We can see that iris.data, iris.target and iris.target_names are NumPy arrays, DESCR is a string and feature_names is a list. If we check iris.data.shape and iris.target.shape, we see that the data has 150 rows and 4 columns, which are our features, and we can take a look at it with print(iris.data). Similarly, the target has shape (150,), one label per row as we expected, and we can look at it using print(iris.target). However, our target variable is encoded, where

0 represents setosa
1 represents versicolor
2 represents virginica

This can be seen using iris.target_names and it is also described in iris.DESCR. Let us store iris.data in the variable X and iris.target in y

X = iris.data
y = iris.target

Let us construct a dataframe from X with iris.feature_names as the header and see what it actually looks like using the head() method

df = pd.DataFrame(X , columns = iris.feature_names)
df.head()

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)
0	5.1	3.5	1.4	0.2
1	4.9	3.0	1.4	0.2
2	4.7	3.2	1.3	0.2
3	4.6	3.1	1.5	0.2
4	5.0	3.6	1.4	0.2

k-Nearest Neighbours

Now let us train our first model using k-Nearest Neighbors (or kNN). It is quite simple. First suppose there are only two features in our dataset; then we can plot each observation (that is, a single row of the dataset) on the 2D plane as a point, with the first feature on the x-axis and the second feature on the y-axis, and suppose the color of the point is its label, which can be red or blue. Now suppose we get a new observation with an unknown label. We can plot that point on the same 2D plane, but we cannot determine its color since it is not labeled, so we have to predict the label. Suppose we take the 3 nearest observations on the plane; this is kNN with k=3, and we take the majority vote of these 3 nearest neighbors. If 2 of them are blue, our prediction is blue. Our prediction may change with k: suppose k=5 and out of the 5 nearest neighbors 3 are red and 2 are blue, then we predict red.

This algorithm extends to n features, where n is greater than 2, by treating the observations as points in n-dimensional Euclidean space and then computing the nearest neighbors
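The majority vote described above can be sketched in a few lines of plain Python. This is only an illustration on made-up 2D points, not the implementation scikit-learn uses; the points are chosen so that the prediction flips from blue to red when k goes from 3 to 5.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Predict the label of `query` by majority vote among its k nearest points."""
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])          # nearest first
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical training data: two blue points very close to the query,
# three red points a little further away, one blue point far away.
train_points = [(2.1, 2.0), (1.9, 2.2), (2.5, 2.0), (2.0, 2.7), (1.2, 2.0), (4.0, 4.0)]
train_labels = ["blue", "blue", "red", "red", "red", "blue"]
query = (2.0, 2.0)

print(knn_predict(train_points, train_labels, query, k=3))  # 2 blue vs 1 red
print(knn_predict(train_points, train_labels, query, k=5))  # 2 blue vs 3 red
```

Because math.dist works for points of any dimension, the same function covers the n-feature case without changes.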

Training and Prediction

In Scikit Learn there are two important methods: .fit, which trains the model, and .predict, which predicts labels using a trained model. To use kNN we import KNeighborsClassifier from sklearn.neighbors using from sklearn.neighbors import KNeighborsClassifier, then initialize it and set the value of k, say to 5, using KNeighborsClassifier(n_neighbors=5), and then fit the data using the .fit method

from sklearn.neighbors import KNeighborsClassifier
knn_model = KNeighborsClassifier(n_neighbors=5)  #Storing the model in the variable knn_model
knn_model.fit(X,y)                               #Fitting, or training, the model

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

Now we have trained our model and stored it in the variable knn_model. If we are given the sepal length, sepal width, petal length and petal width, we can predict the species. Let us predict for 4.4, 3.8, 3.7, 0.9 using the .predict method

knn_model.predict([[4.4,3.8,3.7,0.9]])

array([1])

Hence we have predicted 1, which represents versicolor. Similarly we can make many predictions at once by creating a NumPy array and passing it as an argument to knn_model.predict(); we must take care that the number of columns is equal to the number of features used to train the model. Let us see an example

array = np.array([[4.4,3.8,3.7,0.9],
                  [3.2,5.7,2.0,1.3],
                  [5.5,1.9,2.8,4.7],
                  [3.2,9.7,6.2,1.0]])

prediction=knn_model.predict(array)

print(prediction)

[1 0 2 2]

Now we can get the decoded species names by passing the prediction to iris.target_names as an index

iris.target_names[prediction]

array(['versicolor', 'setosa', 'virginica', 'virginica'], dtype='<U10')



Measuring the performance

Now we have trained our model, and we must measure its performance to get an idea of how good or bad it is. There are various metrics to measure performance, such as accuracy, precision, F-measure, etc., but one question is which data to use for calculating them. The data used for training will give a too optimistic metric, since the model may be good only for the data it was trained on, whereas our main target in machine learning is to train a model that predicts the labels of new data. So we need to calculate our metric on new data, but that is not possible since new data will not be labeled. A typical operating procedure for data scientists is therefore to split the data into train and test sets, where the train set is used for training and the test set is used for testing and for calculating metrics such as accuracy. We are going to use accuracy here, which is equal to the number of correct predictions divided by the total number of predictions. Suppose we have 100 observations in the test set and our model predicts 75 of them correctly; that means 75 predictions are right and 25 are wrong, so the accuracy is 75/100 = 0.75

Splitting the dataset into Train and Test sets

To split the dataset, first of all we import train_test_split from sklearn.model_selection. The function train_test_split() takes the feature data as its first argument and the labels as its second, that is train_test_split(X,y). This works fine on its own, but the function accepts more arguments, such as


  test_size, the proportion of the test set. The default is 0.25, which means 25% of the data goes to the test set and 75% to the train set; if someone wants the test set to be 20% and the train set 80%, they can use test_size = 0.2
  random_state, the seed for the random number generator. train_test_split splits the dataset randomly: it does not just take the first 25% of the data for the test set, it selects the test observations at random. If we want to generate the same train and test sets again in the future, we can do so by using the same random_state
  stratify, set to y if we want our test set to have the same proportion of labels as our dataset. This argument stratifies the dataset according to the labels. In our iris dataset there are three labels, setosa, versicolor and virginica, so the dataset is in effect split into three groups, the first containing only the observations labeled setosa, the second versicolor and the third virginica; then 25% of each group is drawn at random and merged to create the test set. In this way the proportion of setosa, versicolor and virginica is the same in the test set as in the iris dataset
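The idea behind stratify can be sketched in plain Python. The helper stratified_split below is hypothetical and written only for illustration; it is not how train_test_split is implemented, but it follows the same recipe of grouping by label, sampling the same fraction from every group and merging.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.25, seed=0):
    """Return (train_indices, test_indices) with equal class proportions in both."""
    rng = random.Random(seed)               # fixed seed plays the role of random_state
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)       # group row indices by label

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                # random selection inside each group
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Made-up labels: 8 observations of each species.
labels = ["setosa"] * 8 + ["versicolor"] * 8 + ["virginica"] * 8
train_idx, test_idx = stratified_split(labels, test_size=0.25, seed=42)

print(Counter(labels[i] for i in test_idx))   # 2 of each species
print(Counter(labels[i] for i in train_idx))  # 6 of each species
```

Calling the function again with the same seed returns the same split, which is the behaviour random_state provides.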


Let us talk about the output of train_test_split: it gives four arrays, the features of the train set, the features of the test set, the labels of the train set and the labels of the test set. Let us split our dataset and train our model on the training set

from sklearn.model_selection import train_test_split
X_train , X_test , y_train , y_test = train_test_split(X,y,test_size = 0.25 , stratify=y)
knn_model.fit(X_train,y_train)


KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')


Now we will use the trained model to predict the labels of test set X_test

X_test_prediction = knn_model.predict(X_test)
X_test_prediction


array([2, 1, 0, 2, 0, 1, 0, 0, 1, 2, 1, 0, 1, 2, 1, 0, 2, 0, 1, 0, 1, 2,
       2, 1, 0, 1, 0, 1, 0, 2, 1, 1, 0, 2, 1, 2, 0, 2])


Further, we use the .score method to calculate accuracy; it takes the test set features and the test set labels as arguments

knn_model.score(X_test,y_test)


0.9736842105263158


Hence we can see that we have about 97% accuracy. Here we have used k=5, but what will happen if we increase k? A kNN model creates a decision boundary, which divides the whole Euclidean space into regions, one for each class. In our example kNN divides the 4-dimensional Euclidean space (4-dimensional because there are four features) into regions for the 3 classes, and the label of any new observation is decided by the region in which it falls. As we increase k, the decision boundary becomes smoother.


Photo credit: An Introduction to Statistical Learning with Applications in R (available for free)


As we can see, for k=1 the decision boundary, drawn in black, follows the training data too closely, while for k=100 it is smoothed too much. If k is large, the decision boundary is smoother and the model less complex; for small k the decision boundary is less smooth and the model more complex, which makes it more sensitive to noise in the data. Such a model may give good predictions for the training data but fail on new data, which is known as overfitting. If we increase k too much, the decision boundary becomes too smooth (it tends toward a straight line) and the model may perform poorly on both the train and test sets, as the figure for k=100 shows; this is commonly known as underfitting. So we must choose k neither too small nor too large, so that the model is neither overfitted nor underfitted. For k=10 we get the following


Photo credit: An Introduction to Statistical Learning with Applications in R (available for free)


Confusion Matrix

Accuracy is not always a good metric for measuring the performance of a classifier. Suppose we have data on bank transactions and we have to create a model which classifies whether a transaction is fraudulent or not. Usually most transactions are non-fraudulent, let us say 95%. Data in which one of the classes is much more frequent than the others is known as imbalanced data, and accuracy does not work well for it: a model that labels every transaction as non-fraudulent would already reach 95% accuracy. So there are other metrics to measure the performance of a model, and they can be obtained from a very famous matrix known as the confusion matrix.

In binary classification there are two classes, positive and negative. We call the class we are interested in the positive class: if we want to model transaction fraud, we are interested in the fraudulent transactions, so fraud is the positive class and non-fraud is the negative class. Writing TP, FP, TN and FN for the number of true positives, false positives, true negatives and false negatives, various metrics can be calculated by the following formulas

\[Accuracy = \frac{TP+TN}{TP+TN+FP+FN} \\ Precision = \frac{TP}{TP+FP} \\ Recall = \frac{TP}{TP+FN}\]



The F1 score can be interpreted as the harmonic mean of precision and recall, and is given by

\[F1 \ Score = 2 \cdot \frac{precision \cdot recall}{precision + recall}\]
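As a small worked example of why accuracy misleads on imbalanced data, the sketch below computes the metrics from the four confusion-matrix counts in plain Python. The data is made up to match the fraud scenario (5 fraudulent transactions out of 100), and the helper binary_metrics is a hypothetical name, not a scikit-learn function.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 from the four confusion-matrix counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0   # TP / (TP + FP)
    recall = tp / (tp + fn) if tp + fn else 0.0      # TP / (TP + FN)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# 5 fraudulent (1) and 95 legitimate (0) transactions.
y_true = [1] * 5 + [0] * 95

# Model A never flags anything; model B catches 4 frauds and raises 2 false alarms.
always_negative = [0] * 100
better_model = [1, 1, 1, 1, 0] + [1, 1] + [0] * 93

print(binary_metrics(y_true, always_negative))  # high accuracy, zero recall
print(binary_metrics(y_true, better_model))
```

Model A reaches 95% accuracy while finding no fraud at all, which only recall and F1 reveal.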

The confusion matrix can be calculated as follows

  Import the confusion_matrix function from sklearn.metrics
  Call confusion_matrix with the actual test labels as the first argument and the predicted labels as the second argument


from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test,X_test_prediction))


[[13  0  0]
 [ 0 13  0]
 [ 0  1 11]]


Here we got a $3\times 3$ matrix because we have 3 classes; we are not limited to only two classes, positive and negative. Here we have three classes of labels, i.e. ‘setosa’, ‘versicolor’ and ‘virginica’, where rows correspond to actual labels and columns to predicted labels. To get the performance metrics we run the following code

from sklearn.metrics import classification_report
print(classification_report(y_test,X_test_prediction))


              precision    recall  f1-score   support

           0       1.00      1.00      1.00        13
           1       0.93      1.00      0.96        13
           2       1.00      0.92      0.96        12

    accuracy                           0.97        38
   macro avg       0.98      0.97      0.97        38
weighted avg       0.98      0.97      0.97        38




Regression

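In regression the target variable is continuous, such as the price of a mobile phone, a temperature, etc. To get started let us take the Boston housing dataset, which ships with the sklearn version used here (load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this code needs an older release)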

from sklearn import datasets
import pandas as pd
import numpy as np
boston = datasets.load_boston()


Let us take a look at what we have loaded into the boston variable using the .keys() method

boston.keys()


dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])


Now we have the data and the feature names, so we can create a dataframe from them and take a look at it with the head method, as we did in the classification section

X = boston.data
y = boston.target
df = pd.DataFrame(X , columns = boston.feature_names)
df.head()



  
    
      
	CRIM	ZN	INDUS	CHAS	NOX	RM	AGE	DIS	RAD	TAX	PTRATIO	B	LSTAT
0	0.00632	18.0	2.31	0.0	0.538	6.575	65.2	4.0900	1.0	296.0	15.3	396.90	4.98
1	0.02731	0.0	7.07	0.0	0.469	6.421	78.9	4.9671	2.0	242.0	17.8	396.90	9.14
2	0.02729	0.0	7.07	0.0	0.469	7.185	61.1	4.9671	2.0	242.0	17.8	392.83	4.03
3	0.03237	0.0	2.18	0.0	0.458	6.998	45.8	6.0622	3.0	222.0	18.7	394.63	2.94
4	0.06905	0.0	2.18	0.0	0.458	7.147	54.2	6.0622	3.0	222.0	18.7	396.90	5.33
  


Before training the model let us split our data. We cannot use the stratify argument here because our target variable is not categorical.

X_train , X_test , y_train , y_test = train_test_split(X,y,test_size = 0.25 )


Linear Regression

When we assume the target variable y is a linear function of the columns of X, that is, a linear function of the features, the model is known as linear regression. For a single observation with features $x_1, x_2, \dots, x_p$ it can be represented as

\[\hat{y} = \sum_{i=0}^p a_{i}x_{i}, \qquad x_0 = 1\]

With one feature this is the equation of a line (with several features, a hyperplane), and the $a_i$ are known as the parameters of linear regression.

Now our main aim is to set the $ a_i $ such that the predicted value of y, generally represented by $\hat{y}$, is as near as possible to the actual value of y. To measure the difference between the predicted and actual values we use loss functions, which give 0 when the predicted value is equal to the actual value. One of the most common loss functions is the squared error loss, given by

\[Loss(\hat{y} ;y)= \sum_{i=1}^n(y_i - \hat{y}_i)^{2}\]

where $\hat{y}_i$ is the prediction for the i-th observation. So our problem is to choose the parameters that minimize the loss. For the squared error loss, the resulting estimates of the $a_i$ are known as least squares estimates. Different loss functions give different estimates, but least squares estimates are the most used, so we are going to discuss them.


  p is the number of features, hence there are (p+1) parameters, where the extra 1 comes from the constant term $a_0$, and “n” is the number of observations, that is, the number of rows in the dataset


Now to fit the model , we will run the following code

from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train,y_train)


LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)


Now we can predict using .predict method as follows

prediction=model.predict(X_test)


As we saw in the classification section, accuracy is the metric used there to measure the performance of a model; for regression, however, we cannot use accuracy. One of the most used metrics for regression is $R^2$, which is defined as the proportion of variability in Y that can be explained using X. It can be calculated with the following code

model.score(X_test,y_test)


0.736994702163782


Generally $R^2$ ranges from 0 to 1, although on a test set it can be negative when the model predicts worse than simply using the mean of y.
When $R^2$ is near 1 the model fits well, and when it is near 0 the fitted model is not good
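To see where these values come from, here is a minimal sketch that computes $R^2$ by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The numbers are made up, and r2_manual is a hypothetical helper name used only in this example.

```python
def r2_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total variation
    return 1.0 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]

good_fit = [2.8, 5.1, 7.2, 8.9]    # close to the actual values
mean_only = [6.0, 6.0, 6.0, 6.0]   # always predicts the mean of y
bad_fit = [9.0, 7.0, 5.0, 3.0]     # worse than predicting the mean

print(r2_manual(y_true, good_fit))   # close to 1
print(r2_manual(y_true, mean_only))  # 0
print(r2_manual(y_true, bad_fit))    # negative
```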

Cross Validation

Cross-validation is a method that reduces our dependency on how the data is split into train and test sets. With a single split, our performance metric may make the model look good only by chance, because we do not use all the data to calculate it. To remove the dependency on one particular train/test split we use cross-validation, or more precisely k-fold cross-validation, where k is a positive integer; k=5, for example, means 5-fold cross-validation. It simply divides the observations in our dataset into 5 groups, commonly known as folds. We hold out the first fold as a test set, merge all the other folds to create the train set, and calculate the performance metric we are interested in. Then we do the same again, holding out the second fold as the test set and using the remaining folds as the train set, which gives the performance metric for the second split. In this way k-fold cross-validation calculates the performance metric k times, once per split, and afterwards we can calculate any statistic of interest on these k values, such as the mean or the median



k-fold cross-validation is computationally expensive, since we have to do the whole process of training, prediction and metric calculation k times. The following is the way to do so


  Import cross_val_score
  Call cross_val_score with the model, the features array, the labels and the number of folds (cv=5 for 5 folds) as arguments, and store the result in a variable
  Call a statistics function on that variable, such as np.mean() for the mean


from sklearn.model_selection import cross_val_score           #Importing the function
cross_validation_result = cross_val_score(model,X,y,cv=5)     #One score per fold
np.mean(cross_validation_result)


0.3532759243958772
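The fold bookkeeping behind k-fold cross-validation can be sketched in plain Python. This is an illustration under simplifying assumptions: folds are contiguous and not shuffled, the data is made up, and the "model" is a hypothetical stand-in that predicts the training mean and is scored by mean squared error on the held-out fold.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous, non-shuffled folds."""
    fold_sizes = [n_samples // k + (1 if fold < n_samples % k else 0) for fold in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))                               # held-out fold
        train = list(range(0, start)) + list(range(start + size, n_samples))  # all other folds
        yield train, test
        start += size

def fold_mse(y, train, test):
    """Stand-in for fit and score: predict the training mean, return held-out MSE."""
    prediction = sum(y[i] for i in train) / len(train)
    return sum((y[i] - prediction) ** 2 for i in test) / len(test)

y = [2.0, 4.0, 3.0, 5.0, 7.0, 6.0, 8.0, 9.0, 11.0, 10.0]
folds = list(k_fold_indices(len(y), k=5))
scores = [fold_mse(y, train, test) for train, test in folds]

print(scores)                     # one metric per split
print(sum(scores) / len(scores))  # their mean, the cross-validated estimate
```

Every observation is used for testing exactly once, which is what removes the dependence on a single split.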


Shrinkage Method

Shrinkage is also known as regularization. In general we estimate the parameters $a_i$, but sometimes they are too large and lead to higher variance, so it is advisable to shrink the parameters toward 0. This can be done in various ways; two of the best known are ridge regression and lasso

Ridge Regression

For ridge regression we just modify our general loss function as follows

\[Loss(y \ ; \hat{y}) = \sum_{i=1}^n (y_i-\hat{y}_i)^2 + \alpha \sum_{i=1}^p a_i^2\]

where $\alpha \geq 0$ is a tuning parameter and $\alpha \sum_{i=1}^p a_i^2$ is known as the shrinkage penalty. Note that there is no term for $a_0$ in the shrinkage penalty. Unlike the least squares estimate, here we get a different set of parameters for each value of the tuning parameter. A tuning parameter equal to zero leads to the least squares estimate, which may have a greater chance of overfitting, and a very large tuning parameter penalizes the parameters too much, which can lead to underfitting, so we have to choose the tuning parameter such that it optimizes our model

To do ridge regression

  Import Ridge from the module sklearn.linear_model
  Then initialize the Ridge() class, passing the tuning parameter to the alpha argument
  Then train and predict as usual


from sklearn.linear_model import Ridge
ridge_model = Ridge(alpha= 0.9 )
ridge_model.fit(X_train,y_train)
ridge_model.score(X_test,y_test)


0.7345197081669743


Lasso Regression

Ridge regression has the demerit that it shrinks the parameters towards 0 but never sets them exactly equal to 0. There may be some features which do not explain any variance in the label, and their coefficients should be set equal to zero to make the model easier to interpret. For lasso, we just use the absolute value of the parameters in place of their squares in the ridge regression loss

\[Loss(y \ ; \hat{y}) = \sum_{i=1}^n (y_i-\hat{y}_i)^2 + \alpha \sum_{i=1}^p |{a_i}|\]

Lasso shrinks the coefficients of the less important features to exactly 0
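The difference between the two penalties is easiest to see in a special case. The sketch below assumes, purely for illustration, that the features are orthonormal and there is no intercept, so each coefficient can be minimized separately; under that assumption ridge divides the least squares coefficient by $1+\alpha$, while lasso moves it toward zero by $\alpha/2$ and clips at zero (soft thresholding). These closed forms do not hold for general data, and scikit-learn scales its lasso loss differently, so the alpha values are not comparable to the ones used with sklearn.

```python
def ridge_shrink(a, alpha):
    """Minimizer of (a - b)^2 + alpha * b^2: proportional shrinkage, never exactly 0."""
    return a / (1.0 + alpha)

def lasso_shrink(a, alpha):
    """Minimizer of (a - b)^2 + alpha * |b|: soft thresholding at alpha / 2."""
    half = alpha / 2.0
    if a > half:
        return a - half
    if a < -half:
        return a + half
    return 0.0

# Hypothetical least squares coefficients: two large, two small.
least_squares = [3.0, -1.5, 0.4, -0.2]
alpha = 1.0

ridge = [ridge_shrink(a, alpha) for a in least_squares]
lasso = [lasso_shrink(a, alpha) for a in least_squares]

print(ridge)  # every coefficient halved, none is zero
print(lasso)  # the two small coefficients become exactly zero
```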

Lasso regression uses the same code pattern as ridge regression; the exact score depends on the random train/test split

from sklearn.linear_model import Lasso
lasso_model = Lasso(alpha= 10)
lasso_model.fit(X_train,y_train)
lasso_model.score(X_test,y_test)


0.7257977554026047


Logistic Regression

Logistic regression, despite its name, is mostly used for classification problems. It estimates the probability that a given observation belongs to a particular class; if that probability is greater than 0.5, or 50%, the model predicts that the observation belongs to that class. It estimates the probability using the following function

\[p = \sigma\left(\sum_{i=0}^p a_ix_i\right)= \frac{1}{1+ e^{-\sum_{i=0}^p a_ix_i}}\]

But we will not go into the theory too much, and will focus on practical use.
Using logistic regression is similar to the work we have done earlier: import the class, load the data, split the data, train the model, then test it using performance metrics. Let us do this on the breast cancer data, which is already available in the sklearn module

from sklearn import datasets
import pandas as pd
import numpy as np
bcancer = datasets.load_breast_cancer()              #Loading Data
from sklearn.linear_model import LogisticRegression  #Importing class for logistic regression
LogReg_MODEL = LogisticRegression()                  #Initializing Logistic Regression class
X = bcancer.data
y = bcancer.target
X_train , X_test , y_train , y_test = train_test_split(X,y,test_size = 0.25 , stratify=y)  #Splitting Data
LogReg_MODEL.fit(X_train,y_train)                                                          #Training the model
AccuracyLogReg = LogReg_MODEL.score(X_test,y_test)
print(AccuracyLogReg)


0.958041958041958


ROC Curve

ROC Curve is short form for receiver operating characteristic curve.

Threshold

We generally take a threshold of 0.5. In kNN this means that we predict a class when more than 0.5 of the neighbors carry that class label. Suppose we fit kNN with k=100 and have two class labels, red and blue; then we predict red for an observation if more than 50 of its neighbors are red. That 50, the number of red neighbors needed to classify the observation as red, is 0.5$\times$100, so here the threshold is 0.5. Similarly, in logistic regression p=0.5 is the usual threshold

True Positive Rate and False Positive Rate (TPR and FPR)

True positive rate is also known as recall, and false positive rate is given by

\[FPR = \frac{FP}{FP+TN}\]

A model does not always perform best when the threshold is 0.5; sometimes it performs better with another threshold. To examine this we use the ROC curve, which is a graph of TPR against FPR in which each threshold gives one point on the curve


  When the threshold is 0, we predict all observations as positive, so TPR will be equal to 1 and FPR will also be 1
  When the threshold is 1, we predict all observations as negative, so both TPR and FPR will be equal to 0


To know how good our model is, we use the area under the curve (AUC) as a performance metric for the ROC curve. A perfect classifier reaches TPR equal to 1 with FPR equal to 0, in which case the area under the curve is equal to 1, so we can use the ROC AUC as a performance metric



Now to create ROC curve we have to do the following


  Import roc_curve from sklearn.metrics
  Use the roc_curve() function with the following two arguments
    y_true array, shape = [n_samples]: True binary labels. If labels are not either {-1, 1} or {0, 1}, then pos_label should be explicitly given.
    y_score array, shape = [n_samples]: Target scores, can either be probability estimates of the positive class, confidence values, or non-thresholded measure of decisions (as returned by “decision_function” on some classifiers).
  To calculate y_score we use probability estimates, which we get from the .predict_proba() method on the test set. It returns an array with two columns: the first column is the estimated probability of class 0 and the second column is the estimated probability of class 1, the positive class. The second column is our y_score, so we take only that column with [:,1]
  roc_curve returns three outputs, which we store in the variables fpr, tpr and thresholds
  After that we import matplotlib.pyplot with the alias plt and use plt.plot to plot the ROC curve


from sklearn.metrics import roc_curve
y_score = LogReg_MODEL.predict_proba(X_test)
y_score = y_score[:,1]   #Keeping only the second column (probability of class 1)
fpr, tpr, thresholds = roc_curve(y_test, y_score)


# Now to plot the ROC curve

import matplotlib.pyplot as plt
plt.plot(fpr, tpr, linewidth=1)
plt.plot([0, 1], [0, 1], 'k--')             # to plot the dashed diagonal
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate or Recall')
plt.show()                                  # to show the plot




Now we want a performance metric for the model, namely the AUC. To calculate it we just import roc_auc_score and pass it the same arguments as we passed to roc_curve

from sklearn.metrics import roc_auc_score
print(roc_auc_score(y_test, y_score))


0.9903563941299791
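To make the link between thresholds and points on the ROC curve concrete, here is a small sketch in plain Python on made-up labels and scores. The helper roc_point is hypothetical and written for this example; it is not part of sklearn.metrics, and roc_curve chooses its own thresholds.

```python
def roc_point(y_true, y_score, threshold):
    """Return (FPR, TPR) when every score >= threshold is predicted positive."""
    tp = fp = tn = fn = 0
    for label, score in zip(y_true, y_score):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    return fp / (fp + tn), tp / (tp + fn)

# Made-up test labels and predicted probabilities of the positive class.
y_true = [0, 0, 1, 0, 1, 1]
y_score = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]

for threshold in [0.0, 0.5, 1.0]:
    print(threshold, roc_point(y_true, y_score, threshold))
```

Threshold 0 gives the top-right corner (1, 1), threshold 1 gives the origin (0, 0), and intermediate thresholds trace the curve between them.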


Tuning the Model

Hyperparameters

Hyperparameters are the parameters of the learning algorithm itself, such as the value of k in the k-Nearest Neighbors model or the tuning parameter in ridge and lasso regression. For a better model we have to tune the hyperparameters to their best setting. There is no clear-cut recipe for hyperparameter tuning. One approach is to try different hyperparameter values, train and test with each, and choose the one that performs best. Manually changing hyperparameters and then repeating the whole round of training and testing is tedious, so Scikit Learn has GridSearchCV to help us.

Grid Search

GridSearchCV uses cross-validation so that the hyperparameter selection is not affected by a particular train/test split. The class GridSearchCV takes the following arguments


  estimator, an initialized model for fitting
  param_grid, a dictionary or a list of dictionaries holding the hyperparameter values we want to try
  cv, the number of folds for cross-validation


Let us tune the number of neighbors for the kNN model, using the iris dataset

from sklearn.model_selection import GridSearchCV
knn_model = KNeighborsClassifier()
param_grid = {'n_neighbors' : [1,2,3,4,5,6,7,8,9,10,20,30,40,50,60,70,80,90,100]}
X = iris.data
y = iris.target
knn_modelGridSearch=GridSearchCV(knn_model,param_grid,cv=10)
knn_modelGridSearch.fit(X,y)
print(knn_modelGridSearch.best_score_ , knn_modelGridSearch.best_params_)


0.9800000000000001 {'n_neighbors': 6}


Here we can see, we get best results with 6 nearest neighbors.
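What a grid search does can be sketched by hand: for every candidate value, compute a cross-validated score and keep the best one. The example below is a simplified illustration in plain Python with a one-feature kNN on made-up data and interleaved folds; it is not how GridSearchCV is implemented, and all helper names are hypothetical.

```python
from collections import Counter

def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training points closest to `query` (one feature)."""
    nearest = sorted(zip(train_x, train_y), key=lambda pair: abs(pair[0] - query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def cv_accuracy(x, y, k, n_folds):
    """Mean accuracy of kNN over n_folds interleaved folds."""
    scores = []
    for fold in range(n_folds):
        test_idx = [i for i in range(len(x)) if i % n_folds == fold]
        train_x = [x[i] for i in range(len(x)) if i % n_folds != fold]
        train_y = [y[i] for i in range(len(x)) if i % n_folds != fold]
        hits = sum(knn_predict(train_x, train_y, x[i], k) == y[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / n_folds

# Made-up data: class "a" on the left, class "b" on the right, one noisy label.
x = [1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 3.0, 3.2, 3.4, 3.6, 3.8, 4.0]
y = ["a", "a", "b", "a", "a", "a", "b", "b", "b", "b", "b", "b"]

param_grid = [1, 3, 5]                                   # candidate values of k
cv_scores = {k: cv_accuracy(x, y, k, n_folds=3) for k in param_grid}
best_k = max(cv_scores, key=cv_scores.get)               # highest mean accuracy

print(cv_scores)
print(best_k, cv_scores[best_k])
```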

Now here it ends, Happy Learning




Tensorflow Hello World
2020-06-23T00:00:00+00:00
Tensorflow is made up of two words, tensor and flow, where tensor means a multidimensional array and flow refers to the graph of operations through which tensors flow. It was developed by the Google Brain team and is released under the Apache 2.0 license. It is a Python package and is also available from other languages such as R, Julia, etc. Tensorflow has a gentle learning curve, which makes it easy for newcomers to get into machine learning.

To get started we have to download the Anaconda distribution of Python, which automatically installs Jupyter Notebook, and then install Tensorflow; to do this, read our article here


  You have to press Shift+Enter to run a cell in Jupyter Notebook


We will use the MNIST dataset, which was developed by Yann LeCun (Courant Institute, NYU), Corinna Cortes (Google Labs, New York) and Christopher J.C. Burges (Microsoft Research, Redmond). The dataset contains 60,000 labeled images of handwritten digits for training and 10,000 images for testing. MNIST is already divided into train and test sets, so we do not have to take care of that.

First of all we import tensorflow under the alias tf

import tensorflow as tf


Now we will load the dataset and store it in four variables


  x_train the training images
  y_train labels of the training images
  x_test the testing images
  y_test labels of the testing images


mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()


Now let us take a look at the first image of a handwritten digit

from matplotlib import pyplot
pyplot.imshow(x_train[0])
pyplot.show()




Now we will divide x_train and x_test by 255. The images are grayscale, so each pixel can take any integer value from 0 to 255, and neural networks work best with inputs in the range 0 to 1. To normalize our dataset to this range we divide both the train and test datasets by 255

x_train, x_test = x_train / 255.0, x_test / 255.0
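
To see what this division does, here is a small sketch on a made-up 2 $\times$ 2 array, assuming only NumPy; the array `image` is hypothetical and simply stands in for one MNIST image.

```python
import numpy as np

# A hypothetical 2x2 "image" with 8-bit grayscale intensities.
image = np.array([[0, 51], [102, 255]], dtype=np.uint8)

# Dividing by 255.0 promotes the array to float and rescales it to [0, 1].
scaled = image / 255.0
print(scaled.dtype, scaled.min(), scaled.max())
print(scaled)
```

Using the float literal 255.0 makes it explicit that the result is a floating point array rather than an integer one.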


Now we need to create a model. We will use tf.keras.models.Sequential to create it, with five layers from the module tf.keras.layers. The layers are as follows


  
  tf.keras.layers.Flatten : it flattens our data. Each image is 28 $\times$ 28 pixels, and this layer converts it into a vector of 784 values. It takes the argument input_shape, a tuple that defines the shape of our input data
  tf.keras.layers.Dense : a fully connected layer, here with 128 units and the relu activation function
  tf.keras.layers.Dropout : during training, this layer drops each input with a probability of rate (here 0.2) and multiplies each non-dropped input by $\frac{1}{1-rate}$
  tf.keras.layers.Dense : similar to the second layer, but with 10 units (one per digit) and no activation function
  tf.keras.layers.Softmax : the output of the last dense layer is a vector of unnormalized scores (logits), and the softmax function maps these logits to probabilities
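
The dropout scaling and the softmax mapping described above can be sketched in plain NumPy. This is only an illustration of the formulas, not the Keras implementation; the function names `dropout` and `softmax` and the toy inputs are our own.

```python
import numpy as np

def dropout(x, rate, rng):
    """Inverted dropout: zero each entry with probability `rate`,
    then scale the survivors by 1 / (1 - rate)."""
    keep = rng.random(x.shape) >= rate
    return np.where(keep, x / (1.0 - rate), 0.0)

def softmax(logits):
    """Map a vector of logits to probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

rng = np.random.default_rng(0)
activations = np.ones(10000)
dropped = dropout(activations, rate=0.2, rng=rng)
# Survivors are scaled by 1 / 0.8, so the mean stays close to 1.
print(dropped.mean())

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs, probs.sum())
```

The rescaling is what keeps the expected value of each input unchanged, so the next layer sees inputs of a similar size whether dropout is active or not.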


model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10),
  tf.keras.layers.Softmax()
])


So for an input, i.e. a 28 $\times$ 28 image, the model gives us an array of 10 floating point numbers, the class probabilities produced by the softmax layer. Let us get an output without training the model

predictions = model(x_train[:1]).numpy()
predictions


££ array([[0.09967916, 0.09987953, 0.09993076, 0.10024416, 0.10007039,
££       0.10004147, 0.10008495, 0.09998867, 0.10009976, 0.09998112]],
££       dtype=float32)


Now we can check our model using model.summary()

model.summary()


££  Model: "sequential_5"
££  _________________________________________________________________
££  Layer (type)                 Output Shape              Param #   
££  =================================================================
££  flatten_7 (Flatten)          (None, 784)               0         
££  _________________________________________________________________
££  dense_13 (Dense)             (None, 128)               100480    
££  _________________________________________________________________
££  dropout_7 (Dropout)          (None, 128)               0         
££  _________________________________________________________________
££  dense_14 (Dense)             (None, 10)                1290      
££  _________________________________________________________________
££  softmax_1 (Softmax)          (None, 10)                0         
££  =================================================================
££  Total params: 101,770
££  Trainable params: 101,770
££  Non-trainable params: 0
££  _________________________________________________________________


Now we will define a loss function. We should choose a loss function such that the loss is large when our model predicts the wrong label. Tensorflow has a lot of inbuilt loss functions; here we will use sparse categorical cross-entropy. For a detailed description check here

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
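
To make the idea concrete, here is a minimal NumPy sketch of sparse categorical cross-entropy on probabilities: the mean of the negative log of the probability given to the true class. The function name and the toy arrays are our own, and the sketch leaves out details such as the clipping of probabilities that a library implementation may apply.

```python
import numpy as np

def sparse_categorical_crossentropy(probs, labels):
    """Mean of -log(probability assigned to the true class).
    probs: (n_samples, n_classes) rows of probabilities.
    labels: (n_samples,) integer class ids."""
    true_class_probs = probs[np.arange(len(labels)), labels]
    return float(np.mean(-np.log(true_class_probs)))

probs = np.array([[0.9, 0.05, 0.05],
                  [0.1, 0.8, 0.1]])

labels = np.array([0, 1])        # the predictions agree with these labels
good = sparse_categorical_crossentropy(probs, labels)

wrong_labels = np.array([2, 0])  # same predictions, different true classes
bad = sparse_categorical_crossentropy(probs, wrong_labels)
print(good, bad)
```

The loss is small when the true class receives a high probability and grows quickly when it receives a low one. The word sparse refers to the labels being plain integers rather than one-hot vectors.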


Now our main target is to set the trainable parameters such that we get the minimum loss. We use accuracy to measure the performance; it is calculated by dividing the number of correct predictions by the total number of predictions. We will use adam as the optimizer

model.compile(optimizer='adam',
              loss=loss_fn,
              metrics=['accuracy'])
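
As a small illustration of how accuracy is computed from predicted probabilities, consider the NumPy sketch below. The arrays are made up for the example and are not MNIST outputs.

```python
import numpy as np

# Hypothetical predicted probabilities for 4 samples and 3 classes.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6],
                  [0.4, 0.5, 0.1],
                  [0.2, 0.2, 0.6]])
labels = np.array([0, 2, 0, 2])

predicted = probs.argmax(axis=1)            # most probable class per row
accuracy = float(np.mean(predicted == labels))
print(predicted, accuracy)
```

Here three of the four predicted classes match the labels, so the accuracy is 0.75.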


Now let us train our model

model.fit(x_train, y_train, epochs=5)


££  Epoch 1/5
££  1875/1875 [==============================] - 5s 3ms/step - loss: 0.1909 - accuracy: 0.9445
££  Epoch 2/5
££  1875/1875 [==============================] - 5s 3ms/step - loss: 0.1854 - accuracy: 0.9465
££  Epoch 3/5
££  1875/1875 [==============================] - 5s 3ms/step - loss: 0.1821 - accuracy: 0.9471
££  Epoch 4/5
££  1875/1875 [==============================] - 5s 3ms/step - loss: 0.1773 - accuracy: 0.9493
££  Epoch 5/5
££  1875/1875 [==============================] - 5s 3ms/step - loss: 0.1741 - accuracy: 0.9498


Now let us evaluate our model on test set

model.evaluate(x_test,  y_test, verbose=2)


££ 313/313 - 1s - loss: 0.1480 - accuracy: 0.9569
££ [0.14800186455249786, 0.9569000005722046]


As we can see, we have approximately 95.7% accuracy on the test set

Here we have created a model, trained it, and made predictions with it.

Happy Learning 


Introduction To Non-Informative Priors
2020-06-01T13:24:23+00:00

  Prior density is denoted by $g(.)$ in this article


Introduction

Non-informative priors are the priors we assume when we do not have any belief about the parameter, say $ \theta $. A noninformative prior therefore does not favor any value of $ \theta $ and gives equal weight to every value that belongs to $\Theta$. For example, if we have three hypotheses, the prior that attaches a weight of $ \frac{1}{3}$ to each hypothesis is a noninformative prior.


  Note : most of the noninformative priors are improper.


An Example

Let us take a simple example and assume that our parameter space $\Theta$ is a finite set containing n elements,

\[\Theta = \{\theta_1,\theta_2,\theta_3,\theta_4,\ldots,\theta_n\}\]

The obvious weight to give each $\theta_i$ when we do not have any prior beliefs is $\frac{1}{n}$. Since $\frac{1}{n}$ is a constant, say $\frac{1}{n}=c$, the prior is constant and we can write

\[g(\theta) = c\]

Now let $\theta$ be a continuous parameter with a constant prior density and consider the transformation $\eta=e^{\theta}$, that is $\theta = \log(\eta)$. If $g(\theta)$ is the density of $\theta$ then we can write the density of $\eta$ as

\[g^*(\eta)=g(\theta)\frac{d\theta}{d\eta} \\
g^*(\eta)=g(\log \eta)\frac{d \log \eta }{d\eta} \\
g^*(\eta)=\frac{g(\log \eta)}{\eta} \\
g^*(\eta) \propto \frac{1}{\eta}\]

Thus if we choose a constant prior for $\theta$, we must take the prior for $\eta$ proportional to $\eta^{-1}$ to arrive at the same answer whether we work with $\theta$ or $\eta$. We therefore cannot maintain consistency and take both priors to be constant. This leads to the search for noninformative priors that are invariant under transformations.
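
The change of variables above can be checked numerically. The sketch below is only an illustration: to have a proper density to integrate, it assumes a flat prior restricted to $[0, 2]$ (our own choice), and the helper names are hypothetical.

```python
import math

def g_theta(theta, low=0.0, high=2.0):
    """Flat prior on theta, normalised on [low, high] (an assumed range)."""
    return 1.0 / (high - low) if low <= theta <= high else 0.0

def g_eta(eta):
    """Induced prior on eta = exp(theta): g(log eta) / eta."""
    return g_theta(math.log(eta)) / eta

def integrate(f, a, b, steps=100000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width

# P(0.5 <= theta <= 1.5) under the flat prior on theta ...
p_theta = integrate(g_theta, 0.5, 1.5)
# ... should match P(e^0.5 <= eta <= e^1.5) under the prior proportional to 1/eta.
p_eta = integrate(g_eta, math.exp(0.5), math.exp(1.5))
print(p_theta, p_eta)
```

Both probabilities come out as 0.5 up to numerical error, which is the consistency requirement: the same event must get the same probability whether we describe it through $\theta$ or through $\eta$.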

Noninformative Priors for Location Parameter


  A parameter $\theta$ is said to be a location parameter if the density $f(x ; \theta)$ can be written as a function of $(x - \theta)$


Let X be a random variable with location parameter $\theta$, so that its density can be written as $h(x- \theta)$. Suppose that instead of observing X we observe $Y = X+c$, and let $\eta=\theta+c$; then the density of Y is given by $h(y - \eta)$. Now $(X,\theta)$ and $(Y,\eta)$ have the same sample space and parameter space, which suggests that they must have the same noninformative prior.

Let $g$ and $g^*$ be the noninformative priors for $(X,\theta)$ and $(Y,\eta)$ respectively. According to our argument both must be the same, so for any subset A of the real line

\[P^g(\theta \ \in \  A ) = P^{g^*}(\eta \ \in \  A )\]

Now we have assumed $\eta=\theta+c$ so

\[P^{g^*}(\eta \ \in \  A )=P^{g}(\theta +c \ \ \in \  A )=P^{g}(\theta \ \in \  A-c )\]

which leads us to

\[P^{g}(\theta \ \in \  A)=P^{g}(\theta \ \in \  A-c ) \tag{*}\]

\[\int_Ag(\theta)d\theta=\int_{A-c}g(\theta)d\theta=\int_Ag(\theta-c)d\theta\]

This holds for any subset A of the real line and any real number c, so it leads us to

\[g(\theta)=g(\theta-c)\]

Now if we take $\theta=c$ we get $g(c)=g(0)$, and since this is true for all c, we conclude that the noninformative prior for a location parameter is a constant function. For simplicity most statisticians take it equal to 1, $g(.) = 1$

Noninformative Priors for Scale Parameter


  A parameter $\theta$ is said to be a scale parameter if the density $f(x ; \theta)$ can be written as $\frac{1}{\theta}h(\frac{x}{\theta})$ where $\theta>0$


For example, in the normal distribution $N(\mu,\sigma^2)$ with $\mu = 0$, $\sigma$ is a scale parameter.

To get a noninformative prior for the scale parameter $\theta$ of a random variable X, suppose that instead of observing X we observe $Y = cX$ for any $c > 0$, and define $\eta = c\theta$. Then the density of $Y$ is given by $\frac{1}{\eta}h(\frac{y}{\eta})$.

As in the previous part, $(X,\theta)$ and $(Y,\eta)$ have the same sample and parameter space, so both will have the same noninformative prior. Let $g$ and $g^*$ be the noninformative priors for $(X,\theta)$ and $(Y,\eta)$ respectively. Then

\[P^g(\theta \in A)= P^{g^*}(\eta \in A)\]

Here A is a subset of the positive real line, i.e. $A \subset R^+$. Now putting $\eta = c\theta$

\[P^{g^*}(\eta \in A) = P^g(\theta \in \frac{A}{c}) \\
P^g(\theta \in A) = P^g(\theta \in \frac{A}{c}) \\
\int_Ag(\theta)d\theta=\int_{\frac{A}{c}}g(\theta)d\theta=\int_A\frac{1}{c}g(\frac{\theta}{c})d\theta\]

so

\[g(\theta)=\frac{1}{c}g(\frac{\theta}{c})\]

Now taking $\theta=c$ , we get

\[g(c)=\frac{1}{c}g(1)\]

This equation is true for any value $c>0$, so, taking $g(1)=1$ for convenience, it gives us the noninformative prior $g(\theta)= \frac{1}{\theta}$


  Note : It is an improper prior , $\int_0^{\infty}\frac{1}{\theta}d\theta = \infty $
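
The invariance of this prior can be checked on an interval $A = [a, b]$ with $0 < a < b$, where the integral of $\frac{1}{\theta}$ is $\log(b/a)$. The sketch below compares the mass of $A$ with the mass of $\frac{A}{c}$ for a few rescalings; the interval and the values of c are arbitrary choices of ours, and the helper name is hypothetical.

```python
import math

def mass_under_reciprocal_prior(a, b):
    """Integral of 1/theta over [a, b] with 0 < a < b, i.e. log(b / a)."""
    return math.log(b / a)

a, b = 2.0, 5.0                      # an arbitrary interval A = [a, b]
results = []
for c in (0.1, 3.0, 40.0):           # arbitrary positive rescalings
    mass_A = mass_under_reciprocal_prior(a, b)
    mass_A_over_c = mass_under_reciprocal_prior(a / c, b / c)
    results.append((mass_A, mass_A_over_c))
    print(c, mass_A, mass_A_over_c)
```

The two masses agree for every c, because the factor c cancels in the ratio $b/a$. A constant prior would not pass this check, since the length of $\frac{A}{c}$ is $\frac{b-a}{c}$ rather than $b-a$.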


Flaw and introduction of relatively location invariant prior

We now know the noninformative prior for both scale and location parameters, but there is a flaw. The priors we obtained in the previous parts are improper. We argued that if two problems have an identical form, then they have the same noninformative prior, but because the priors are improper, noninformative priors are not unique: if g is an improper prior and we multiply it by any constant $k > 0$, the resulting prior $kg$ gives the same Bayesian decisions as g.

In the previous parts we required the two priors $g$ and $g^*$ to be equal, but we do not need that; it is enough that $g^*$ can be obtained by multiplying $g$ by a constant, and vice versa.

Now equation $(*)$ can be written as

\[P^g(A)=l(k)P^{g}(A-c)\]

where $l(k)$ is some positive function (the constant may depend on c),

\[\int_Ag(\theta)d\theta=l(k)\int_{A-c}g(\theta)d\theta=l(k)\int_Ag(\theta-c)d\theta\]

This holds for all A, so $g(\theta)=l(k)g(\theta-c)$, and taking $\theta=c$ gives us $l(k)=\frac{g(c)}{g(0)}$. Putting this value back gives us

\[g(\theta-c)=\frac{g(0)g(\theta)}{g(c)} \tag{**}\]

There are many priors other than the constant prior that satisfy equation (**), and any prior of this form is known as relatively location invariant.
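
One family that satisfies equation (**) is $g(\theta)=e^{z\theta}$ for a constant z, since $e^{z(\theta-c)} = \frac{e^{0}e^{z\theta}}{e^{zc}}$. The short check below evaluates both sides numerically; the value of z and the test points are arbitrary choices for illustration.

```python
import math

def g(theta, z=0.7):
    """Candidate prior g(theta) = exp(z * theta); z is an arbitrary constant."""
    return math.exp(z * theta)

checks = []
for theta, c in [(0.0, 1.0), (2.5, -1.2), (-3.0, 0.4)]:
    left = g(theta - c)                  # left-hand side of (**)
    right = g(0.0) * g(theta) / g(c)     # right-hand side of (**)
    checks.append((left, right))
    print(theta, c, left, right)
```

Setting z = 0 recovers the constant prior, so the constant prior is one member of this family.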

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)
0	5.1	3.5	1.4	0.2
1	4.9	3.0	1.4	0.2
2	4.7	3.2	1.3	0.2
3	4.6	3.1	1.5	0.2
4	5.0	3.6	1.4	0.2
