---
title: "countprop"
output: 
  pdf_document: default
  extra_dependencies: ["amsmath","amsthm","amssymb","bbm","mathrsfs","mathtools","xfrac","breqn","bm"]
header-includes: 
  \newcommand{\Omegab}{\mathbf{\Omega}}
  \newcommand{\Sigmab}{\mathbf{\Sigma}}
  \newcommand{\mub}{\boldsymbol{\mu}}
  \newcommand{\pb}{\textbf{p}}
  \newcommand{\ev}{\mathbb{E}}
  \newcommand{\Ib}{\textbf{I}}
  \newcommand{\Xb}{\textbf{X}}
  \newcommand{\Bb}{\textbf{B}}
  \newcommand{\Fb}{\textbf{F}}
  \newcommand{\Yb}{\textbf{Y}}
  \newcommand{\Wb}{\textbf{W}}
  \newcommand{\wb}{\textbf{w}}
  \newcommand{\Vb}{\textbf{V}}
  \newcommand{\Qb}{\textbf{Q}}
  \DeclareMathOperator{\alr}{alr}
  \DeclareMathOperator{\alri}{alr^{-1}}
  \DeclareMathOperator*{\argmax}{arg\,max}
  \DeclareMathOperator{\tr}{tr}
  \newcommand{\lb}{\left[}
  \newcommand{\rb}{\right]}
  \newcommand{\lc}{\left\{}
  \newcommand{\rc}{\right\}}
  \newcommand{\lp}{\left(}
  \newcommand{\rp}{\right)}
vignette: >
  %\VignetteIndexEntry{countprop}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# Introduction 
The \texttt{countprop} package allows estimation of several types of proportionality metrics for count-basedcompositional data such as 16S, metagenomic, and single-cell sequencing data.  The package includes functions that allow standard empirical estimates of these proportionality metrics, as well as estimates based on the multinomial logit-normal model.

First, we'll define the model.  Assume $n$ samples and $J+1$ features.  Suppose the counts for sample $i$ for feature $j$ are denoted by $y_{ij}$ for $i=1,\dots,n$ and $j=1,\dots,J+1$.  They are modelled using the multinomial distribution:
```{=latex}
\begin{align*}
    y_i &\sim \text{Multinomial}(M_i; p_{i1},\dots,p_{i(J+1)}),
\end{align*}
```

where $M_i=\sum_{j=1}^{J+1} y_{ij}$ and proportion vector $\pb_i=(p_{i1},\dots,p_{i(J+1)})$.  The proportions themselves are modelled using a logit-normal model, which can be formulated through a set of latent vectors $\left( w_{i1}, \dots, w_{iJ} \right)$ which are related to the proportions by:
```{=latex}
\begin{align*}
p_{ij} &= \alr^{-1}\lp\wb_i\rp \\
    &= \begin{cases} 
	    \frac{\exp\lc w_{ij} \rc}{1+\sum_{j=1}^J \exp\lc w_{ij} \rc} & \mbox{if } j=1,\dots,J \\
		\frac{1}{1+\sum_{j=1}^J \exp\lc w_{ij} \rc} & \mbox{if } j=J+1.
		\end{cases}
\end{align*}
```

The latent vectors are distributed as multivariate normal:
```{=latex}
\begin{align*}
	\left( w_{i1}, \dots, w_{iJ} \right) &\sim \mbox{MV-Normal}_J \left( \mub, \Sigmab \right).
\end{align*}
```

The read-depths are assumed to be distributed as log-normal:
```{=latex}
\begin{align*}
    M_i \sim \text{Log-Normal} \left(\mu_\ell, \sigma_\ell^2 \right).
\end{align*}
```

Finally, to guard against spurious correlations, we apply the $L_1$-penalty to the inverse covariance matrix $\Sigmab$ (i.e.\ the ``graphical lasso" penalty).
```{=latex}
\begin{align*}
    \ell\lp w_{i1}, \dots, w_{iJ} \rp = \log\det \Sigmab^{-1} - \tr(S \Sigmab^{-1}) - \lambda || \Sigmab^{-1} ||_1
\end{align*}
```

# Fitting the model

The \texttt{countprop} package has a built-in function to estimate the model parameters.  First, let's load the countprop library and look at the first few lines of the murine single cell sequencing dataset included with the package:
```{r setup}
library(countprop)
data(singlecell)

head(singlecell, 2)
```

To fit the multinomial logit-normal model, we can use the \texttt{mleLR()} function:
```{r mle}
mle <- mleLR(singlecell)

# Maximum likelihood estimates of model parameters
mle$mu
mle$Sigma.inv
```

For the \texttt{mleLR()} function, it is necessary to specify a value for $\lambda$, which is the graphical lasso penalty parameter.  The default is 0.  However, we can also run multiple values of $\lambda$ to find which one leads to the best fit based on the Extended Bayesian Information Criterion (EBIC).  To do this, we use the \texttt{mlePath()} function.  This allows us to choose the number of $\lambda$ values we want to run the model on (\texttt{n.lambda} parameter).  This can also be parallelized by setting \texttt{n.cores>1}.  Once we've obtained the model fit, we can visualize the EBIC values for each $\lambda$ value using \texttt{ebicPlot()}.
```{r lambdaselect}
mle2 <- mlePath(singlecell, n.lambda=10, n.cores=1)
mle2$min.idx # Index of smallest lambda value

# Plot EBIC for different lambda values
ebicPlot(mle2)
```

In this case, the optimal value of $\lambda$ is the one in the first position of the lambda vector.  When the optimal $\lambda$ value is the smallest one considered, then it's possible that an even smaller $\lambda$ value would be optimal and was not considered.  In this case, the argument \texttt{lambda.min.ratio} can be reduced from its default of \texttt{0.1}:
```{r lambdaselect2}
mle3 <- mlePath(singlecell, n.lambda=10, lambda.min.ratio = 0.0001, n.cores=1)
mle3$min.idx

ebicPlot(mle3)
```

The minimum EBIC now corresponds to the 3rd smallest value of $\lambda$.


# Estimating the proportionality metrics

Once the model parameters have been estimated, the model-based proportionality metrics can be estimated:
```{r lnvariation}
# Variation matrix
logitNormalVariation(mle3$est.min$mu, mle3$est.min$Sigma)

# Phi matrix
logitNormalVariation(mle3$est.min$mu, mle3$est.min$Sigma, type="phi")

# Rho matrix
logitNormalVariation(mle3$est.min$mu, mle3$est.min$Sigma, type="rho")
```

The package also provides the standard naive (empirical) estimates of the proportionality metrics.
```{r naivevar}
# Naive (empirical) variation matrix
naiveVariation(singlecell)

# Naive (empirical) Phi matrix
naiveVariation(singlecell, type="phi")

# Naive (empirical) Rho matrix
naiveVariation(singlecell, type="rho")
```