Assigning bulk RNA-seq to pseudotime • blase

library(blase)
library(SingleCellExperiment)
library(tradeSeq)
library(slingshot)
library(scran)
library(scater)
library(BiocParallel)
library(ami)
library(patchwork)

RNGversion("3.5.0")
#> Warning in RNGkind("Mersenne-Twister", "Inversion", "Rounding"): non-uniform
#> 'Rounding' sampler used
SEED <- 7
set.seed(SEED)

N_CORES <- 4
if (ami::using_ci()) {
    N_CORES <- 2
}

In this vignette, we will use BLASE to try to assign bulk RNA-seq samples to the pseudotime trajectory in a scRNA-seq data. We need to have the following to use Blase in this case:

A Single Cell Experiment with pseudotime.
A list of genes that explain the trajectory well, in this case generated by tradeSeq’s fitGAM and associationTest

Setting up the Single Cell Experiment

First, let’s generate a Single Cell Experiment to use the tool with, based on the tradeSeq vignette.

data(countMatrix, package = "tradeSeq")
counts <- as.matrix(countMatrix)
rm(countMatrix)
data(crv, package = "tradeSeq")
data(celltype, package = "tradeSeq")

pseudotime <- slingPseudotime(crv, na = FALSE)
cellWeights <- slingCurveWeights(crv)
sce <- fitGAM(
    counts = counts, pseudotime = pseudotime, cellWeights = cellWeights,
    nknots = 6, verbose = FALSE, BPPARAM = MulticoreParam(N_CORES)
)
sce$pseudotime <- pseudotime[, "curve1"]
sce$celltype <- celltype

sce <- subset(sce, , celltype != "Erythrocyte")

sce <- computeSumFactors(sce)
sce <- logNormCounts(sce)
normcounts(sce) <- exp(logcounts(sce))
sce <- runUMAP(sce)

plotUMAP(sce, text_by = "celltype", colour_by = "celltype") +
    plotUMAP(sce, colour_by = "pseudotime") +
    patchwork::plot_layout(ncol = 1, axis_title = "collect")

Plots showing a UMAP of the single-cell data, coloured by celltype and pseudotime. The UMAP is shaped like a horizontal line, showing the differentiation of a set of immune cell types.

Finding the most descriptive genes with tradeSeq

Now we’ll find the genes we want to use. We select the top 200 so that we can do some parameter tuning below.

# Use consecutive for genes that change over time
assoRes <- associationTest(
    sce,
    lineages = TRUE,
    global = FALSE,
    contrastType = "consecutive"
)
genelist <- blase::get_top_n_genes(assoRes, n_genes = 200, lineage = 1)

TODO: Add the custom BLASE genes method here

Parameter Tuning for BLASE

When using BLASE, it can be a good idea to tune the number of genes and bins used.

Too few genes can lead to poor fitting
Too many genes can lead to slower execution, without any benefit, as the less informative genes still need to be checked.
Too few bins can oversimplify the trajectory
Too many bins can lead to too few cells being available for reliable values.

We can do this using the following commands, provided by BLASE:

res <- blase::find_best_params(
    sce,
    genelist,
    split_by = "pseudotime_range",
    pseudotime_slot = "pseudotime",
    bins_count_range = c(5, 10),
    gene_count_range = c(40, 80)
)

blase::plot_find_best_params_results(res)

Figures showing the minimum and mean convexity, as well as % confident mappings different numbers of genes and bins calculated.

It looks like 80 genes and 10 bins will give us good specificity, but let’s double check. In a non trivial dataset, this may take some repetition. We ignore the 5 bin result because this might not be enough resolution - this depends on your dataset. In general, more bins will reduce specificity because each bin will have a more similar cell composition. In general, aim for about as many bins as you have clusters.

Inspect Bin Choice

Now we can check if this is a good fit by showing how other bins from the SC dataset map using these genes. Ideally, each bin is very specific, with a high “worst specificity”.

blase_data <- as.BlaseData(sce, pseudotime_slot = "pseudotime", n_bins = 10)
genes(blase_data) <- genelist[1:80]

#| fig.alt: >
#|   These plots show the correlations of each bin to itself and all other
#|   bins. This shows how distinct each bin is compared to others, based on
#|   the number of bins and genes selected.
evaluate_parameters(blase_data, make_plot = TRUE)
#> [1]  0.01060  0.03695 30.00000

Inspect Genes Choice

We can also see how the genes change over pseudotime, by plotting expression of the top genes changing over each pseudotime bin.

evaluate_top_n_genes(blase_data)

Mapping Bulk Samples to SC with BLASE

If we’re happy, now we can try to map a bulk sample onto the single cell. We’ll do this by pseudobulking cell types in the SingleCellExperiment but in reality you should use a real bulk dataset. See some of our other articles for examples.

bulks_df <- DataFrame(row.names = rownames(counts(sce)))
for (type in unique(sce$celltype)) {
    bulks_df <- cbind(
        bulks_df,
        rowSums(normcounts(subset(sce, , celltype == type)))
    )
}

colnames(bulks_df) <- unique(sce$celltype)

multipotent_progenitors_result <- map_best_bin(
    blase_data, "Multipotent progenitors", bulks_df
)
multipotent_progenitors_result
#> MappingResult for 'Multipotent progenitors': best_bin=1 correlation=0.982747304266292 top_2_distance=0.0731
#>   Confident Result: TRUE (next max upper  0.932760097129518 )
#>   with history for scores against 10  bins
#>   Bootstrapped with 200 iterations
basophils_result <- map_best_bin(blase_data, "Basophils", bulks_df)
basophils_result
#> MappingResult for 'Basophils': best_bin=6 correlation=0.941795593061416 top_2_distance=0.003
#>   Confident Result: FALSE (next max upper  0.981516882263948 )
#>   with history for scores against 10  bins
#>   Bootstrapped with 200 iterations
neutrophils_result <- map_best_bin(blase_data, "Neutrophils", bulks_df)
neutrophils_result
#> MappingResult for 'Neutrophils': best_bin=10 correlation=0.98019221753399 top_2_distance=0.0584
#>   Confident Result: TRUE (next max upper  0.952078251079002 )
#>   with history for scores against 10  bins
#>   Bootstrapped with 200 iterations

We can also generate this result with map_all_best_bins, which can also allow for parallelisation of the process for large numbers of bulks:

map_all_best_bins(blase_data, bulks_df)
#> [[1]]
#> MappingResult for 'Multipotent progenitors': best_bin=1 correlation=0.982747304266292 top_2_distance=0.0731
#>   Confident Result: TRUE (next max upper  0.938595771129692 )
#>   with history for scores against 10  bins
#>   Bootstrapped with 200 iterations
#> 
#> [[2]]
#> MappingResult for 'Monocytes': best_bin=9 correlation=0.980496952648851 top_2_distance=0.0072
#>   Confident Result: FALSE (next max upper  0.987944601456498 )
#>   with history for scores against 10  bins
#>   Bootstrapped with 200 iterations
#> 
#> [[3]]
#> MappingResult for 'Neutrophils': best_bin=10 correlation=0.98019221753399 top_2_distance=0.0584
#>   Confident Result: TRUE (next max upper  0.949560708965291 )
#>   with history for scores against 10  bins
#>   Bootstrapped with 200 iterations
#> 
#> [[4]]
#> MappingResult for 'Basophils': best_bin=6 correlation=0.941795593061416 top_2_distance=0.003
#>   Confident Result: FALSE (next max upper  0.979755090550812 )
#>   with history for scores against 10  bins
#>   Bootstrapped with 200 iterations
#> 
#> [[5]]
#> MappingResult for 'Megakaryocytes': best_bin=2 correlation=0.852156586966713 top_2_distance=0.0444
#>   Confident Result: FALSE (next max upper  0.907196547720956 )
#>   with history for scores against 10  bins
#>   Bootstrapped with 200 iterations
#> 
#> [[6]]
#> MappingResult for 'GMP': best_bin=4 correlation=0.982887951242382 top_2_distance=0.0338
#>   Confident Result: FALSE (next max upper  0.968805338274442 )
#>   with history for scores against 10  bins
#>   Bootstrapped with 200 iterations

Plotting Heatmap of Mappings

To see how well we plotted the differences between these cell types, we can plot a heatmap of all these results. The neutrophils and Erythrocytes are confidently mapped, but the GMP population doesn’t map as well.

plot_mapping_result_heatmap(list(
    multipotent_progenitors_result,
    basophils_result,
    neutrophils_result
), annotate_correlation = TRUE, text_background = TRUE)

A heatmap showing the correlations BLASE found for a subset of cell types in this data. Confident mappings are indicated with asterisks over the mapped tile. This information is also available as text from the results of the previous code block.

Plotting Detailed Correlation Maps

To look into the Basophils mapping, we can plot the correlation over each bin. Because the the lower boundary of the best selection (bin 3) is higher than the higher bounds of the next-best bin (bin 4), BLASE makes a confident call.

plot_mapping_result_corr(basophils_result)

A plot of the correlation of each bin with the basophils pseudobulk sample, also showing upper and lower bounds as calculated by BLASE. The best mapping is for bin 3, and bin 4 is the next most similar.

Plotting Summary Plots of Mappings

Now we can do some more detailed plotting about where the bins mapped onto the Single Cell data, and the cell type proportions we expect , but we need to add the pseudotime bins to the metadata of the SCE. We can see that the true proportions of cell types in the mapped bin do indeed map to the cell type we expect.

binned_sce <- assign_pseudotime_bins(
    sce,
    split_by = "pseudotime_range",
    n_bins = 10,
    pseudotime_slot = "pseudotime"
)

plot_mapping_result(
    binned_sce,
    multipotent_progenitors_result,
    group_by_slot = "celltype"
)

To view the bin population chart in full, use:

plot_bin_population(binned_sce, 1, group_by_slot = "celltype")

A full size bar plot of the cell-type populations of bin 1.

Session Info

sessionInfo()
#> R version 4.5.1 (2025-06-13)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> Random number generation:
#>  RNG:     Mersenne-Twister 
#>  Normal:  Inversion 
#>  Sample:  Rounding 
#>  
#> locale:
#>  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
#>  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
#>  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
#> [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
#> 
#> time zone: UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] patchwork_1.3.2             ami_0.2.1                  
#>  [3] BiocParallel_1.42.1         scater_1.36.0              
#>  [5] ggplot2_3.5.2               scran_1.36.0               
#>  [7] scuttle_1.18.0              slingshot_2.16.0           
#>  [9] TrajectoryUtils_1.16.1      princurve_2.1.6            
#> [11] tradeSeq_1.22.0             SingleCellExperiment_1.30.1
#> [13] SummarizedExperiment_1.38.1 Biobase_2.68.0             
#> [15] GenomicRanges_1.60.0        GenomeInfoDb_1.44.2        
#> [17] IRanges_2.42.0              S4Vectors_0.46.0           
#> [19] BiocGenerics_0.54.0         generics_0.1.4             
#> [21] MatrixGenerics_1.20.0       matrixStats_1.5.0          
#> [23] blase_0.99.0               
#> 
#> loaded via a namespace (and not attached):
#>  [1] pbapply_1.7-4           gridExtra_2.3           rlang_1.1.6            
#>  [4] magrittr_2.0.3          compiler_4.5.1          mgcv_1.9-3             
#>  [7] systemfonts_1.2.3       vctrs_0.6.5             pkgconfig_2.0.3        
#> [10] crayon_1.5.3            fastmap_1.2.0           XVector_0.48.0         
#> [13] labeling_0.4.3          rmarkdown_2.29          UCSC.utils_1.4.0       
#> [16] ggbeeswarm_0.7.2        ragg_1.5.0              xfun_0.53              
#> [19] bluster_1.18.0          cachem_1.1.0            beachmat_2.24.0        
#> [22] jsonlite_2.0.0          DelayedArray_0.34.1     irlba_2.3.5.1          
#> [25] parallel_4.5.1          cluster_2.1.8.1         R6_2.6.1               
#> [28] bslib_0.9.0             RColorBrewer_1.1-3      limma_3.64.3           
#> [31] boot_1.3-31             jquerylib_0.1.4         Rcpp_1.1.0             
#> [34] knitr_1.50              FNN_1.1.4.1             Matrix_1.7-3           
#> [37] splines_4.5.1           igraph_2.1.4            tidyselect_1.2.1       
#> [40] abind_1.4-8             yaml_2.3.10             viridis_0.6.5          
#> [43] codetools_0.2-20        lattice_0.22-7          tibble_3.3.0           
#> [46] withr_3.0.2             evaluate_1.0.5          desc_1.4.3             
#> [49] pillar_1.11.0           scales_1.4.0            glue_1.8.0             
#> [52] metapod_1.16.0          tools_4.5.1             BiocNeighbors_2.2.0    
#> [55] ScaledMatrix_1.16.0     locfit_1.5-9.12         fs_1.6.6               
#> [58] cowplot_1.2.0           grid_4.5.1              edgeR_4.6.3            
#> [61] nlme_3.1-168            GenomeInfoDbData_1.2.14 beeswarm_0.4.0         
#> [64] BiocSingular_1.24.0     vipor_0.4.7             cli_3.6.5              
#> [67] rsvd_1.0.5              textshaping_1.0.3       S4Arrays_1.8.1         
#> [70] viridisLite_0.4.2       dplyr_1.1.4             uwot_0.2.3             
#> [73] gtable_0.3.6            sass_0.4.10             digest_0.6.37          
#> [76] SparseArray_1.8.1       ggrepel_0.9.6           dqrng_0.4.1            
#> [79] htmlwidgets_1.6.4       farver_2.1.2            htmltools_0.5.8.1      
#> [82] pkgdown_2.1.3           lifecycle_1.0.4         httr_1.4.7             
#> [85] statmod_1.5.0