Bulk Linking Analysis for Single-cell Experiments • blase

Overview

The goal of BLASE is to enable you to map bulk RNA-seq samples onto single-cell RNA-seq (scRNA-seq) data for further analysis, with an emphasis on trajectories (although it can work for any continuous variable across your data).

It provides:

Configurable discretisation of pseudotime into “pseudotime bins”
A custom “Gene Peakedness” method for identifying temporally variable genes
Annotation of scRNA-seq based on bulk samples
Mapping of bulk RNA-seq onto these bins
Plotting functions

Installation

You can install the development version of BLASE from GitHub with:

# install.packages("devtools")
devtools::install_github("andrewmccluskey-uog/BLASE")

Getting Started & Usage Notes

Take a look at the vignette for use with a SingleCellExperiment.

Selection of genes is important for BLASE, as it bases its predictions on the expression of these genes only. Any genes omitted from this list are ignored. Selecting too few genes reduces BLASE’s ability to distinguish different stages from each other. Conversely, selecting too many genes (i.e. including genes whose expression does not change over the biological process) only introduces noise, reducing BLASE’s precision. Ideally, the list should contain only genes which show substantial change over the pseudotime trajectory in the single-cell reference that you wish to map to.

For organisms in which many genes are highly expressed for a limited period of the biological process in a tightly regulated way (for example, Plasmodium spp.), we recommend using every gene in the genome.

For organisms which do not show this pattern, such as human or mouse, we recommend using a subset of genes, selected by either tradeSeq or BLASE’s gene peakedness selection. To make this process easier, we provide a function get_top_n_genes(), which selects a given number of genes from an associationTest result generated by the tradeSeq package, and convenience functions for calculating the gene peakedness as described here.
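The idea of picking a fixed number of genes from an association test can be illustrated with base R alone. The sketch below is not BLASE's get_top_n_genes() implementation, and its signature is not reproduced here. The helper top_n_by_wald() is hypothetical, the data frame is invented toy data, and ranking by the waldStat column is an assumption about what "top" means.

```r
# Toy stand-in for a tradeSeq associationTest() result: one row per gene.
# The column names mirror tradeSeq's output; the values are invented.
assoc <- data.frame(
    waldStat = c(120.5, 3.2, 88.1, NA, 45.0),
    df = c(6, 6, 6, 6, 6),
    pvalue = c(1e-20, 0.78, 1e-14, NA, 1e-6),
    row.names = c("geneA", "geneB", "geneC", "geneD", "geneE")
)

# Hypothetical helper (NOT BLASE's get_top_n_genes()): drop genes without a
# test statistic, rank the rest by Wald statistic, and keep the n strongest.
top_n_by_wald <- function(assoc, n) {
    assoc <- assoc[!is.na(assoc$waldStat), , drop = FALSE]
    ranked <- assoc[order(assoc$waldStat, decreasing = TRUE), , drop = FALSE]
    head(rownames(ranked), n)
}

selected_genes <- top_n_by_wald(assoc, n = 3)
print(selected_genes)
```

In practice the value of n is the quantity to tune: it controls the balance between lost signal and added noise described above.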

Each reference scRNA-seq trajectory has a unique fingerprint, defined by the genes activated over the course of the process. To use BLASE optimally, consider both how many genes meaningfully contribute to that fingerprint and how many bins to use. For the number of genes, the trade-off is the one described above: too few genes risks losing useful signal, whereas too many may introduce unhelpful noise. For the number of bins, the trade-off is between accuracy and precision: a small number of larger bins is more likely to give a correct call, whereas a larger number of smaller bins gives more granularity.

BLASE uses a discretised pseudotime value, which we refer to as “pseudotime bins.” BLASE calculates these bins when creating a BlaseData object, or when the assign_pseudotime_bins() function is used to add them to the metadata of a SingleCellExperiment or Seurat object. Because the BLASE algorithm relies heavily on these bins, it is important to split pseudotime in a reliable and consistent way. Depending on the dataset, different methods may be required to ensure high-quality mappings. We have found that the pseudotime_range splitting method works best for most datasets.

The pseudotime range bin assignment method is fast. Provided that the pseudotime calculation is correct and that pseudotime reflects transcriptional distance, it also means that adjacent bins are separated by a constant transcriptional distance. However, this method may perform poorly when a reference dataset contains stretches of pseudotime with no or very few cells. In this case, splitting by cells may be a better option. When bins are assigned to contain a constant number of cells, the pseudotime range covered by each bin is not constant, which may be less useful for mapping purposes, but this approach can overcome some of the issues with the pseudotime range method.
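The difference between the two splitting strategies can be seen on a toy pseudotime vector with a gap in it. This is a minimal base R sketch of the two ideas, not the code behind assign_pseudotime_bins(). The variable names and the simulated pseudotime values are assumptions made for illustration.

```r
set.seed(1)

# Toy pseudotime: 90 early cells, 10 late cells, and nothing in between.
pseudotime <- c(runif(90, 0, 3), runif(10, 9, 10))
n_cells <- length(pseudotime)
n_bins <- 5

# Strategy 1: equal-width bins over the pseudotime range.
# cut() with a single number of breaks divides the range into equal intervals.
range_bins <- cut(pseudotime, breaks = n_bins, labels = FALSE)

# Strategy 2: bins holding an equal number of cells.
# Rank the cells along pseudotime, then divide the ranks into n_bins groups.
cell_rank <- rank(pseudotime, ties.method = "first")
cell_bins <- ceiling(cell_rank * n_bins / n_cells)

# Cells per bin under each strategy.
range_counts <- tabulate(range_bins, nbins = n_bins)
cell_counts <- tabulate(cell_bins, nbins = n_bins)
print(range_counts) # the middle bins are empty because of the gap
print(cell_counts)  # every bin holds the same number of cells

# Pseudotime span covered by each equal-cell bin: no longer constant.
print(tapply(pseudotime, cell_bins, function(x) diff(range(x))))
```

Empty range bins have no cells to build a reference profile from, which is the failure mode described above. The equal-cell bins avoid it, at the cost of one bin stretching across the gap.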

BLASE’s main focus is on mapping bulk RNA-seq samples to the discretised pseudotime in a scRNA-seq dataset. Unlike other tools (e.g. CIBERSORTx, DWLS, MuSiC), which estimate a proportion of cells per reference group (typically cell type), BLASE calculates a score for how well each reference group (for BLASE, typically a pseudotime bin) matches a sample. It reports a single “best match,” together with the correlations for all other bins. The values produced by BLASE should not necessarily be treated as proportions of the population in the bulk sample.
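To make the contrast with deconvolution concrete, the sketch below scores one bulk sample against each bin and picks the best match. It is an illustration in base R, not BLASE's mapping code. The use of Spearman correlation on per-bin expression profiles is an assumption, and all data and names (bin_profiles, bulk, scores) are invented.

```r
set.seed(42)
n_genes <- 50
n_bins <- 4

# Toy reference: one expression profile per pseudotime bin (genes x bins).
# Each gene peaks in one bin and decays in the others.
peak_bin <- rep(seq_len(n_bins), length.out = n_genes)
bin_profiles <- sapply(seq_len(n_bins), function(b) {
    10 * exp(-(peak_bin - b)^2) + 1
})
rownames(bin_profiles) <- paste0("gene", seq_len(n_genes))
colnames(bin_profiles) <- paste0("bin", seq_len(n_bins))

# Toy bulk sample: the bin 3 profile with multiplicative noise.
bulk <- bin_profiles[, "bin3"] * exp(rnorm(n_genes, sd = 0.1))

# Score every bin by rank correlation with the bulk sample
# (assumed scoring method for this sketch).
scores <- apply(bin_profiles, 2, function(profile) {
    cor(profile, bulk, method = "spearman")
})

# A single best match, plus the scores for all other bins.
best_bin <- names(which.max(scores))
print(round(scores, 2))
print(best_bin)
```

The scores are correlations, so they do not sum to one and can be negative, which is why they should not be read as cell proportions.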

Development

Automatic Style Corrections

styler::style_pkg(transformers = styler::tidyverse_style(indent_by = 4))

Quality Checks

We subscribe to the R CMD check and BiocCheck guides: