Bayesian Cell Counts
     

Bayesian Cell Counts

R example

We have provided an example of our analysis that is hopefully easy to translate to other examples. This is in:

github.com/BayesianCellCounts/R_example

This contains an example data file, data.csv. This file is based on the one used in the case study 1 but has been preprocessed, in case study 1 some of the processing is performed by the R script, here, to make the R script more generic, these steps are done in advance, the details are given in data.README. The sample data includes 23 brain regions for 40 animals with 10 animals in each of four cohorts. The file data.csv contains four columns:

  • id A unique identity number for each animal, in our case this is a number from 1 to 40; importantly the identity number is different for animals even if they could be distinguished by cohort.
  • region This is the name of the different brain regions, a string in each case.
  • group There are four groups, again, labelled by a string.
  • count This is the cell count, in our case this is not an integer because we have normalization by area, often it will be an integer, the normalization was performed because the data included a quantification of the `useful portion' of the sample, but it is not required.

Our dataset does not have a count for every (animal,region) pair, there is missing data and this is allowed by the model. However, it does not allow more than one count for each (animal,region); the preprocessing we have done includes a set which sums over repeated counts and duplicates will cause an error.

To run the code use:


	      Rscript sample_posterior.r
	  
This will sample the posterior using the model stored as model/model_hs.stan. This will produce four chains stored as fits/fit_hs.rds.

The chains contain multiple samples from the posterior which allows probabalistics statements to be made about the model. Typically this means plotting graphs and most often these are `contrast' graphs, as in Figure 4B/4C in Dimmock et al. (2025). This graph is produced by the plot_results.r and running:


	      Rscript plot_results.r --help
will give options; an example of this command for our sample data is

              Rscript plot_results.r Sham_Familiar Sham_Novel -o tabs
This computes the contrast between two of our conditions, Sham_Familiar and Sham_Novel and orders the regions in the graph by the absolute value of the frequentist mean. This plotting only works if the chains are available; the graph is stored in results/. As described under --help there is an option for changing the format of the output file, this probably works but we have only actually tests png and pdf.

To produce the diagnostic plots use plot_diagnostics.r, this has a help file accessed as


	      Rscript plot_diagnostics.r --help
and adds the graphs in the diagnostics folder.

Stan model

The Stan model contains the priors. The business part is the line


gamma[i] = theta[group_idx[i]] + kappa[i] .* tau[group_idx[i]] .* gamma_raw[i];
the exponential of gamma[i] is the mean of the Poisson process that generates the count for sample i. The paper makes more of an attempt to describe this well and at length, but roughly and in short, the theta[group_idx[i]] is the value that gamma[i] is expected to take because of the cohort and region that sample comes from. The rest of the expression deals with variability around this value and includes the machinery of the horseshoe prior. The prior for theta is provided by

theta[g] ~ normal(5, 2);
this implies that our rough a priori guess for the mean of the logged cell count values is 5, plus or minus 2. If the mean of your log cell counts differ significantly from this, it might be useful to change these values before running the model on your data.