Starfysh model
Model overview
Deep generative models parameterized by neural networks have proven effective in analyzing single-cell RNA expression data (scvi-tools19, scVI20, totalVI21, scArches22, trVAE23, scANVI24, MrVI25 and so on). However, the presence of multiple cell types in each spot in ST data makes it difficult for these models to disentangle cell type-specific features. To overcome this limitation, Starfysh introduces a generative model with a special variational family that is structured to model the presence of multiple cell states per spot in ST data. The Starfysh generative model leverages gene set signatures (either existing signatures or signatures computed with archetypal analysis) as an empirical prior to help disentangle cell types72. We first detail the generative model of Starfysh and then introduce its structured variational family.
Starfysh generative process
Starfysh models the vectors of gene expression \({x}_{i}\in {{\mathbb{R}}}^{G}\) (with G the number of observed genes) for each spot i with a generative model. The generative model (Fig. 1c) is parameterized by K, representing the expected number of cell states in the data. The determination of K can be automated through archetypal analysis beforehand, or an expert can provide guidance on the K most important cell states in the sample. Each cell state k ∈ [K] is characterized by a low-dimensional latent variable, \({u}_{k}\in {{\mathbb{R}}}^{D}\) (with D defaulting to ten dimensions), capturing the specific mechanisms underlying that cell state. Moreover, each cell state k has a scalar variable, σk > 0, indicating its variability and heterogeneity.
Subsequently, Starfysh models each spot i with a specific low-dimensional representation zi. In the context of single-cell data, each cell state k would usually be represented by a low-dimensional vector z centered around uk, with a standard deviation of σk. However, for ST data, where each spot captures a mixture of cells with different cell states, Starfysh associates each spot i with a proportion vector, ck ∈ ΔK, representing the proportions of each cell state in that spot. Starfysh then constructs the low-dimensional representation zi with a mixture distribution that combines the cell state proportions ci and the cell state-specific representations uk: \({z}_{i}|{c}_{i},{u;}\,\sigma \sim N({\sum }_{k}{c}_{{ik}}{u}_{k},{\sum }_{k}{c}_{{ik}}{\sigma }_{k})\).
Following this, zi is transformed using a neural network f to obtain the normalized mean expression of each gene for spot i, which is further scaled by the library size li. The observed raw transcript count xig for gene g in spot i is then sampled from a negative binomial distribution centered around the upscaled mean.
Cell state proportions, ci, are also considered as random variables with a carefully crafted prior. Each cell state k ∈ [K] needs to be associated with a preliminary gene set signature, sk, which can be provided by the user or automatically discovered through archetypal analysis. By calculating the signature scores in each spot, denoted as A(xi, sk), Starfysh establishes a prior distribution over the cell state proportions in each spot. Specifically, the proportions of cell states ci are sampled from a Dirichlet distribution with a prior parameter α[A(xi, sk)]k∈[K]. For instance, if spot i highly expresses known marker genes for cell state k, then a larger value of A(xi, sk) will favor the probability of allocating cell state k for spot i according to the empirical Dirichlet prior parameter. The parameter α modulates the prior strength and represents the belief in the signature gene sets: a larger value corresponds to a stronger prior, while a smaller value results in a less constraining prior.
The generative model is defined as \(p(u,{c},{z},{l},{x})={\prod }_{k=1}^{K}p({u}_{k}){\prod }_{i=1}^{n}\)\(p({c}_{i})p({z}_{i}|{c}_{i},u)p({l}_{i})p({x}_{i}|{z}_{i},{l}_{i})\), with
p(uk) = Normal (0, 10ID)
p(ci; α, A) = Dirichlet (α⋅A), where α controls the prior strength on the signature scores A.
p(zi|ci, u; σ) = \({\rm{Normal}}(\sum _{k}{c}_{{ik}}{u}_{k},\sum _{k}{c}_{{ik}}{\sigma }_{k})\), where the parameters σk represent cell state-specific heterogeneity.
\(p({l}_{i}{\rm{;}}\widetilde{{l}_{i}})={\rm{logNormal}}(\widetilde{{l}_{i}},1)\), where \(\widetilde{{l}_{i}}\) is the locally averaged library size observed in spot i’s spatial neighborhood.
p(xi|zi, li) = \({\prod }_{g=1}^{G}p\left({x}_{{ig}}{\rm{|}}{l}_{i},{z}_{i}\right)\,\)
p(xig|li, zi; θg, f) = NegativeBinomial (lif(zi), θg), where θg denotes gene-specific dispersions and f is a neural network with a softmax output.
In the generative process, the parameters \(A,\alpha ,\widetilde{{l}_{i}}\) are fixed. The prior strength α is set by default to 50. Robustness analysis on α demonstrates that the model consistently outperforms the signature prior given a reasonable range (α ≥ 1) (Supplementary Fig. 2c). The optimal choice of the prior strength term depends on the specific dataset and markers. The locally averaged library size is computed as \(\widetilde{{l}_{i}}=\frac{1}{|{N}_{i}|}\,\sum _{j\in {N}_{i}}{\sum }_{g}\;{x}_{{jg}}\), where Ni is the set of spots physically located adjacent to spot i and also includes i. The cell state heterogeneities σk are initialized as 1, and the gene dispersions θg are initialized at random. Finally, the neural network f has by default one linear layer followed by a softmax. σk, θg and f are all learned during the inference.
Integration with histology images
Although histology hematoxylin-and-eosin (H&E) images are usually provided along with ST data (for example, the commercial Visium platform), current methods fail to use such modality in deconvolving cell types. Histology, however, provides useful information about morphology, tissue structure, cell density and spatial dependency of cells. Integrating histology and transcriptomes in a joint model is challenging, as the two data modalities are very different: the genome-level transcripts are high-dimensional vectors, whereas the histology data consist of multichannel images. Thus, it is essential to address the mismatch of these two types of data while preserving cell type-specific information of gene expression and cell morphology-specific information of histology images. The integrative approach in Starfysh is formulated with a deep variational information bottleneck26.
The original H&E images are first normalized to [0, 1] per channel. The alignment between H&E images and ST spot i produces the histology image patches \({y}_{i}\in {{\mathbb{R}}}^{P\times P\times C}\) (with P as the side length of the patch and C as the number of image channels, for example, C = 3 for RGB images and C = 1 for grayscale images). We set P = 26 by default to approximate the number of pixels surrounding each spot. The image patch yi is then flattened in the Starfysh model and assumed to be generated from the same latent variable zi that informs gene expression (Fig. 1c and Supplementary Fig. 1a) with a distribution p(yi|zi) parameterized by two neural networks gμ, gσ, for mean and variance of distribution for yi, respectively. Both consist of a linear layer followed by a batch normalization layer. They define:
$$p\left(\;{y}_{i}{\rm{|}}{z}_{i}\right)={\rm{Normal}}\left(\;{g}_{\mu }({z}_{i}),{g}_{\sigma }({z}_{i})\right).$$
Construction of the empirical prior
For cell states expected to reside in the tissue, Starfysh first filters out marker genes that are either unavailable in the ST data or not expressed in any spots to obtain binary variable \({s}_{k}\in {{\mathbb{R}}}^{G}\), k = {1,…, K}. Next, two priors are calculated before running Starfysh, including a prior for the cell state proportions that reflects their spot enrichment and a prior for the library size:
1.
Prior for the cell type proportion:
A(xi, sk) is defined as the enrichment score74 of the marker genes for cell state k at spot i. The score is first calculated with the Scanpy function ‘scanpy.tl.score_genes’, which computes the marker genes’ average expression and subtracts from it the average expression of a reference gene set G′ randomly sampled from binned expressions: \({A}^{{\rm{raw}}}({x}_{i},{s}_{k})=\frac{1}{|{s}_{k}{|}}{\sum }_{g\in G}\;{x}_{{ig}}\cdot {s}_{{kg}}-\frac{1}{{|G^\prime|}}{\sum }_{g\in {G^\prime}}{x}_{{ig}}\). We further transformed the scores using the function ReLU(x) = max(0, x) to ensure the positive constraints of Dirichlet parameters and make them comparable across spots (with ϵ defaulting as 1 × 10−5):
$$A({x}_{i},{s}_{k})={\rm{ReLU}}({A}^{{\rm{raw}}}({x}_{i},{s}_{k}))+\epsilon$$
$$A({x}_{i},{s}_{k})=\frac{A({x}_{i},{s}_{k})}{{\varSigma }_{k}A({x}_{i},{s}_{k})}.$$
For each cell state, the prior assigns unique enrichment scores across all spots, and we thus can define the anchor spots \(R\in {{\mathbb{R}}}^{S\times K}\) specifying the ranking of each spot i based the enrichment score \(A(:,{k})\) for each state \(k\), which can be updated with archetypal analysis detailed below.
2.
Prior for the library size:
Starfysh also considers the spatial dependency of spots when generating the prior for library size. \(\widetilde{{l}_{i}}=\frac{1}{|{N}_{i}|}\,\sum _{j\in {N}_{i}}{\sum }_{g\,}{x}_{{jg}}\), where \({N}_{i}\) is the set of spots physically located around the spot i, which includes all spots j such that \(|{r}_{{j}}-{r}_{i}| < w\), where w is an adjustable parameter for window size (default set to 3). \({r}_{i}\) is the spatial coordinates for spot i.
Archetypal analysis
Marker genes that represent cell states may be context dependent or unknown. To address these limitations and improve the characterization of tissue-dependent cell states, we developed a geometric preprocessing step, leveraging archetypal analysis75, to refine marker genes and identify new cell states.
Archetypal analysis fits a convex polytope to the observed data, finding the prototypes (archetypes) that are most adjacent to the extrema of the data manifold in high dimension. Previous works76,77,78 have applied archetypal analysis to scRNA-seq data to characterize meaningful cell types. In the context of ST, we hypothesize that the archetypes are closest to the purest spots that contain only one or the fewest number of cell states, while the rest of the spots are modeled as the mixture of the archetypes.
We applied the PCHA algorithm79 to find archetypes that best approximate the ‘extrema’ spots on a low-dimensional manifold. Specifically, let \(\hat{X}\in {{\mathbb{R}}}^{S\times G}\) be the normalized spot (S) by gene (G) expression from the original spatial count matrix. We further selected the first P = 30 principal components (\({X^\prime}\in {{\mathbb{R}}}^{S\times P}\,\)) to denoise the data. We denote matrices \(W\in {{\mathbb{R}}}^{S\times D},{B}\in {{\mathbb{R}}}^{D\times S}\) and \(H={BX^\prime}\in {{\mathbb{R}}}^{D\times P}\), where D represents the number of archetypes. The algorithm optimizes the parameters of W and B alternately, minimizing \({\Vert X{\prime} -WH\Vert }^{2}={\Vert X{\prime} -WBX{\prime} \Vert }^{2}\) subject to \({W}_{:,i} > 0\;\&\; {\sum }_{i=1}^{D}{W}_{:,i}=1\) and \({B}_{:,i} > 0\;\&\; {\sum }_{i=1}^{S}{B}_{:,i}=1\), where S spot counts and D archetypes are convex combinations of each other74. We applied Fisher separability analysis80 to infer the intrinsic dimension as its lower bound and iterated through different K values until the explained variance converges. We also implemented a hierarchical structure to fine tune the archetypes’ granularity with a resolution parameter r (ref. 81) (default set to 100). For archetype ai, i ∈ 2,…, D, if it resides within a Euclidean distance of r from any archetype aj, j ∈ 1,…, i − 1, we merge ai with the closest aj. The archetypes distant from each other are kept after the shrinkage iteration and used in subsequent steps.
We define archetypal communities as the r-nearest neighbors (same as the resolution parameter) to each archetype by constructing D clusters. Next, for each cluster i, we identify the top 30 marker genes by performing a Wilcoxon rank-sum test between in-group and out-of-group spots with Scanpy82. We then refine cell state markers by assigning archetypal communities to the closest cell states. First, we align D archetypal communities with the best one-to-one matched K cell states with stable marriage matching83 and then append the archetypal marker genes to the given cell state. Next, we update the anchor spots according to the updated gene list. Alternatively, to find new cell states, we rank the archetypal clusters from the most distant to the least distant to the anchor spots of known cell states, and the archetypal clusters distant from all anchor spots represent potential new states for further study.
The overall archetypal analysis algorithm in Starfysh is summarized as follows:
1.
Estimate the intrinsic dimension of the count matrix, and find k archetypes that identify the hypothesized purest spots.
2.
Find the N-nearest neighbors of each archetype, and construct archetypal communities.
3.
Find the most highly and differentially expressed genes for each archetypal community, and select the top n genes (default, n = 30) as the ‘archetypal marker genes’.
4.
If the signature gene sets are provided, align the archetypal communities to the best matched known cell types, update the signature genes by appending archetypal marker genes to the aligned cell type and recalculate the anchors.
5.
If the signature gene sets are absent, apply the archetypes and their corresponding marker genes as the signatures.
We found that archetypes alone are sufficient for disentangling major cell types but not fine-grained cell states (Supplementary Fig. 3e); however, when used as empirical priors to the deep generative model, they can guide the successful deconvolution of cell states (Supplementary Fig. 3a).
Starfysh structured variational inference
Starfysh uses variational inference to approximate the posterior. We first describe the inference procedure without integrating the histology variable yi. The posterior on variables uk (cell states representations) are approximated by mean-field distributions q(uk), while the posterior on the variables ci and li (cell state proportions and library size) are approximated by amortized mean-field distributions q(ci|xi) and q(li|xi). Next, for each spot i, we use a specially structured variational distribution q(zi|ci, xi) that uses cell state proportions to sample the latent variables zi. Because each spot contains multiple cell states with proportions ci, the structured variational distribution is assumed to decompose as a combination of cell state-specific terms (denoted by ζ(k, xi) for each cell state k), weighted by the proportion of cell states ci. The variational family factorizes in the form \(q(u,{c},{z},{l|x})={\prod }_{k=1}^{K}q({u}_{k}){\prod }_{i=1}^{n}q({c}_{i}|{x}_{i})q({l}_{i}|{x}_{i})q({z}_{i}|{c}_{i},{x}_{i}\;)\), parametrized by new variational parameters mk and vk and neural networks λ, γ and ζ as follows:
$$\begin{array}{ll}{\qquad\quad\,}q({u}_{k}) \, = \, {\rm{Normal}}({m}_{k},{v}_{k})\\ {\qquad\,\,}q({l}_{i}{\rm{|}}{x}_{i}) \, = \, {\rm{Normal}}\Big({\lambda }_{\mu }({x}_{i}),{\lambda }_{\sigma }({x}_{i})\Big)\\{\quad\,\,}q({c}_{i}{\rm{|}}{x}_{i}{\rm{;}}\,\alpha ) \, = \, {\rm{Dirichlet}}\Big(\alpha \cdot \gamma ({x}_{i})\Big)\\{\quad}q({z}_{i}{\rm{|}}{c}_{i},{x}_{i}) \, = \, {\rm{Normal}}\Big({\sum }_{k}{c}_{{ik}}\cdot {\zeta }_{\mu }(k,{x}_{i}),{\sum }_{k}{c}_{{ik}}\cdot {\zeta }_{\sigma }(k,{x}_{i})\Big).\end{array}$$
In summary, for each cell state k, the function ζ(k, xi) deconvolves the contribution of cell state k to the latent representation of zi. Each zi is a combination of the cell state contributions ζ(k, xi) weighted by the proportions ci. The cell state proportions are inferred with the neural network γ, which is guided toward the prior to match the cell type gene sets. The prior strength parameter α also premultiplies the neural network γ to obtain a posterior of similar strength, which helps for the gradient optimization.
Next, the standard variational inference that maximizes the evidence lower bound (ELBO) is performed84. The ELBO in our case can be written as:
$$\begin{array}{ll}{\rm{ELBO}}\left(q\right) \, = \, \mathbb{E}_{q(z,c,l,u{\rm{|}}x)}\left[\log \frac{p\left(x,z,l,c,u{\rm{;}}\alpha ,A,\widetilde{l},\sigma \right)}{q(z,c,l,u{\rm{|}}x)}\right]\\ \qquad\qquad\;\,\, = \, \,\mathbb{E}_{q\left(z,c,l,u,{|x}\right)}[\log p(x{\rm{|}}z,l\;)]\\ \qquad\qquad\qquad\, \, -\mathbb{E}_{q\left(c,|,x\right)q\left(u\right)}\left[{D}_{{\rm{KL}}}\Big(q({z|c},x)\| p(z{\rm{|}}u,c{\rm{;}}\sigma )\Big)\right]\\ \qquad\qquad\qquad\, \, -{D}_{{\rm{KL}}}\Big(q({c|x}{\rm{;}}\alpha ){\rm{||}}p(c{\rm{;}}\alpha ,A)\Big)\\ \qquad\qquad\qquad\, \, -{D}_{{\rm{KL}}}\Big(q(l{\rm{|}}x){\rm{||}}p(l{\rm{;}}\widetilde{l})\Big)-{D}_{{\rm{KL}}}\Big(q(u){\rm{||}}p(u)\Big),\end{array}$$
where DKL(p || q) is the Kullback–Leibler divergence between distribution p and q, defined as DKL(p || q) = 𝔼p(x)[log p(x)/q(x)]. We find the q that maximizes the ELBO by running stochastic gradient descent.
Starfysh structured variational inference with histology integration
To integrate the histology in the inference method, we model the approximate posterior over the latent low-dimensional representation z with the PoE distributions (Supplementary Fig. 1a). For each spot i, we denote the view-specific encoders qθ1 (zi|ci, xi) and qθ2 (zi|yi) from the corresponding expression xi and image patch yi, respectively. The expression view \({q}_{{\theta }_{1}}({z}_{i}|{c}_{i},{x}_{i})={\rm{Normal}}({\mu }_{1},{{\sigma }_{1}}^{2})\) is the same as described. For the histology view, zi is approximated by amortized mean-field distribution \({q}_{{\theta }_{2}}({z}_{i}|\;{y}_{i})={\rm{Normal}}({\mu }_{2},{{\sigma }_{2}}^{2})={\rm{Normal}}({\xi }_{\mu }({y}_{i}),{\xi }_{\sigma }({y}_{i}))\) with a single-layer neural network \(\xi\). For the joint latent variables \({z}_{i}\), the posterior distribution q(zi|ci, xi, yi) is parameterized as a product of view-specific Gaussian distributions as described in the original method26:
$${q}_{\theta }({z}_{i}{\rm{|}}{c}_{i},{x}_{i},{y}_{i})=\frac{{\mu }_{1}/{{\sigma }_{1}}^{2}+{\mu }_{2}/{{\sigma }_{2}}^{2}}{1/{{\sigma }_{1}}^{2}+1/{{\sigma }_{2}}^{2}}.$$
The previous ELBO can be updated with this new variational approximation for the joint modeling of histology and transcriptome. We leverage the information bottleneck approach26 to optimize the joint ELBO as well as the view-specific marginal ELBOs through a single objective function \({{\mathscr{L}}}_{{\rm{total}}}={{\mathscr{L}}}_{{\rm{joint}}}+a\cdot {{\mathscr{L}}}_{{\rm{marginal}}}\), where:
$$\begin{array}{ll}\quad{{\mathscr{L}}}_{{\rm{joint}}} \, = \, {\rm{ELBO}}({q}_{\theta })={E}_{{q}_{\theta }(z,l,c,u{\rm{|}}x,y)}\log \frac{p(x,y,z,l,c,u{\rm{;}}\sigma )}{{q}_{\theta }(z,l,c,u{\rm{|}}x,y)}\\\qquad\qquad= \, {E}_{{q}_{\theta }(z{\rm{|}}x,y){q}_{\theta }(l{\rm{|}}x)}\,\log p(x{\rm{|}}z,l)+{E}_{{q}_{\theta }(z{\rm{|}}x,y)}\log p(y{\rm{|}}z)\\ \qquad\quad\qquad-\,{E}_{{q}_{\theta }(c{\rm{|}}x){q}_{\theta }(u)}{D}_{{\rm{KL}}}\Big({q}_{\theta }(z{\rm{|}}c,x,y)\| p(z{\rm{|}}c,u{\rm{;}}\sigma )\Big)\\ {{\mathscr{L}}}_{{\rm{marginal}}} \, = \, {\rm{ELBO}}({q}_{{\theta }_{1}})+{\rm{ELBO}}({q}_{{\theta }_{2}}).\end{array}$$
The variational family for the joint objective function is factorized as \({q}_{\theta }(z,{l},{c},{u|x},{y})={q}_{\theta }({z|x},{y}){q}_{\theta }({l}|\;{y}){q}_{\theta }({c|x}){q}_{\theta }(u)\). Hyperparameter a (set by default as 5) balances the weights between joint and view-specific objectives26. The expression view \({\rm{ELBO}}({q}_{{\theta }_{1}})\) remains the same with above, and the histology view \({\rm{ELBO}}({q}_{{\theta }_{2}})\) is written as:
$$\begin{array}{ll}{\rm{ELBO}}({q}_{{\theta }_{2}}) \, = \, {E}_{{q}_{{\theta }_{2}}(z{\rm{|}}y)}\log \frac{p(y,z,c,u{\rm{;}}\sigma )}{{q}_{{\theta }_{2}}(z{\rm{|}}y)}\\\qquad\qquad\quad\,= \, {E}_{{q}_{{\theta }_{2}}(z{\rm{|}}y)}\log p(y{\rm{|}}z)-{E}_{{q}_{{\theta }_{2}}\left(c{|}y\right){q}_{{\theta }_{2}}\left(u\right)}{D}_{{\rm{KL}}}\left({q}_{{\theta }_{2}}(z{\rm{|}}\;y){\rm{||}}p(z{\rm{|}}u,c{\rm{;}}\,\sigma )\right).\end{array}$$
The same conditional prior p(z|c, u; σ) is applied across the joint and view-specific ELBOs. We find the \(\{{q}_{\theta },{q}_{{\theta }_{1}},{q}_{{\theta }_{2}}\}\) that maximize \({{\mathscr{L}}}_{{\rm{total}}}\) by running stochastic gradient descent.
Starfysh implementation
The Starfysh model is implemented as a Python package using PyTorch85 with the Adam86 optimizer. The model by default is trained for 200 epochs with a learning rate at 0.001. During the training, the learning rate decays, guided by an exponential scheduler with the multiplicative factor set as 0.98. Kaiming initialization is applied to all neural network parameters. Hyperparameters are adjustable in the package.
Prediction of cell state-specific expression
To predict cell state-specific expression, we use the decoder in which the parameters have been learned and optimized by the variational inference. The proportion ci is adjusted to 1 for a specific cell state and 0 for other cell states. Reconstructed expression and histology are considered as cell state-specific expression and histology.
Integration of multiple samples
To effectively integrate multiple samples, Starfysh initially identifies anchors in each sample by combining spots enriched for cell types and archetypal communities. The gene markers for each sample are then updated based on the newly defined anchors. Subsequently, we aggregate the gene markers for each cell type across all samples. These updated markers are used to calculate priors for the cell state proportions when fitting to all samples simultaneously. Priors for library size are separately calculated for spots in each sample. Finally, transcriptomic counts along with their corresponding histological patches are incorporated as inputs to train an integrated model, synergizing data across samples.
Simulation of ST data
We construct our ST simulations using mixtures of scRNA-seq data previously collected from primary TNBC tumor tissues (CID44971_TNBC)18 with different levels of cell type granularities.
Spatially dependent simulation
To address spatial dependencies among neighboring spots, we adopt the pipeline from Cell2location8. Specifically, synthetic ST spots are defined on a 50 × 50-pixel grid. For the major cell type simulation, we select five cell types (CAFs, cancer epithelial cells, myeloid cells, normal epithelial cells, T cells) from the reference scRNA-seq data and simulate their spatial proportions with separate 2D Gaussian process models (Supplementary Fig. 2a). We further assign an expected library size for each spot with a γ distribution fitted from the real ST dataset, representing the spatial variation of capture rates among spots. For each spot, we then sample single-cell transcriptomes from the reference by searching for candidate cells with a library size closest to the expected library size. We follow the same procedure to generate another ten-cell type simulation with finer cell states: basal cells, inflammatory CAFs, myofibroblast CAFs, endothelial cells, immature PVL cells, central memory T cells, Treg cells, activated CD8+ T cells, memory B cells and plasmacytoid dendritic cells.
Simulation with paired histology images
We further generate pseudo-histology images paired with the aforementioned major cell type simulation to verify multimodel integration. Specifically, we design a supervised encoder–decoder neural network model (Supplementary Fig. 1c), with real ST expression as input and their histology images as output. First, the expression matrix is projected to a low-dimensional latent space with a ResNet18 encoder, and the histology image is reconstructed with a standard linear decoder with dimension transformation. Two thousand image patches and corresponding expression matrices were trained from 14 ST samples, and an extra 500 images patches were used for held-out validation. The learning rate was set as 0.001 with the Adam optimizer for training. Mean-squared loss was used to fit the predictions to the real ST images. The final paired synthetic histology images were generated by running the trained model.
Signature gene set retrieval in simulated data
For fair benchmarking not favoring Starfysh, we build the signature gene sets in an unbiased fashion by choosing the top 30 differentially expressed genes for each cell type (highest log (FC) scores) across 20 breast cancer scRNA-seq samples reported by Wu et al.18.
Benchmarking of Starfysh and comparison to other methods with simulated ST data
We benchmarked Starfysh against reference-based (DestVI, Cell2location, Tangram, BayesPrism) and reference-free (CARD, BayesTME, STdeconvolve) deconvolution methods with the aforementioned simulations. For the reference-based method, we used paired scRNA-seq data for sample TNBC sample CID44971 as the reference. For reference-free methods without inferred cell state annotations, we report the best alignment with the ground truth proportions upon permutation.
For each deconvolution, we trained Starfysh with three independent restarts and selected the model with the lowest \({{\mathscr{L}}}_{c}\). The variational mean q(cik|xi; α) is used as the inferred cell state proportions.
For BayesPrism, we followed the tutorial on the BayesPrism website: https://www.bayesprism.org/pages/tutorial_deconvolution. We subsetted the common protein-coding genes between the scRNA-seq and ST data with highly variable gene selection by default. We ran the BayesPrism Gibbs sampler ‘run.prism’ with four cores and extracted the updated cell type fractions θn for deconvolution.
For Cell2location, we followed the tutorial on the Cell2location website: https://cell2location.readthedocs.io/en/latest/notebooks/cell2location_tutorial.html. We trained the reference regression with 1,000 epochs and spatial mapping models with 10,000 epochs, in which ELBO losses were ensured. The normalized 5% quantile values of the posterior distribution \({\hat{w}}_{{sf}}=\frac{{w}_{{sf}}}{{\varSigma }_{f}{w}_{{sf}}}\) were used for deconvolution.
For DestVI, we followed the DestVI tutorial with default parameters at https://docs.scvi-tools.org/en/stable/tutorials/notebooks/DestVI_tutorial.html.
For Tangram, we followed the Tangram tutorial using default settings: https://github.com/broadinstitute/Tangram/blob/master/tutorial_tangram_with_squidpy.ipynb. We found the optimal alignment for scRNA-seq profiles with 1,000 epochs.
For CARD (reference free), we followed the CARD reference-free tutorial: https://yingma0107.github.io/CARD/documentation/04_CARD_Example.html. Default settings were used to generate cell type proportions (minCountGene = 100 and minCountSpot = 5).
BayesTME (reference free) deconvolves cell types with a hierarchical probabilistic model that corrects technical artifacts. We followed the official BayesTME tutorial with default parameters: https://github.com/tansey-lab/bayestme/blob/main/notebooks/deconvolution.ipynb.
For STdeconvolve (reference free), we followed the tutorial on the STdeconvolve website (https://jef.works/STdeconvolve/) and selected the top 1,000 overdispersed genes from the input matrix. We set the optimal number of cell types K to 5 and 10 for the major and fine cell type simulations, respectively. The predicted cell type proportions were obtained from the output ‘deconProp’.
Quantification of performance in deconvolution of cell types
The performance of each method was summarized by the RMSE and Jensen–Shannon divergence (JSD) against the ground truth to quantify per-spot accuracy (Supplementary Fig. 2d,e):
$$\begin{array}{ll}{\rm{RMSE}}\left({{c}_{i}}^{{gt}},{{c}_{i}}^{{\rm{pred}}}\right) \, = \, \sqrt{\frac{\mathop{\sum }\nolimits_{k=1}^{K}{\left({{c}_{{ik}}}^{{gt}}-{{c}_{{ik}}}^{{\rm{pred}}}\right)}^{2}}{K}}\\\quad\; {\rm{JSD}}\left({{c}_{i}}^{{gt}},{{c}_{i}}^{{\rm{pred}}}\right) \, = \, \frac{1}{2}{D}_{{\rm{KL}}}\left({{c}_{i}}^{{gt}}{\rm{||}}{{c}_{i}}^{{\rm{pred}}}\right)+\frac{1}{2}{D}_{{\rm{KL}}}\left({{c}_{i}}^{{\rm{pred}}}{\rm{||}}{{c}_{i}}^{{gt}}\right),\end{array}$$
where \({{c}_{i}}^{{gt}},{{c}_{i}}^{{\rm{pred}}}\in {\varDelta }^{K}\) represent the ground truth and predicted cell type compositions in spot i. We report the average RMSE across all spots as the overall performance for each method (Fig. 1d).
Benchmarking of Starfysh and comparison to other methods with real ST data
We further benchmarked Starfysh with reference-based (Cell2loation and BayesPrism) and reference-free (STdeconvolve) deconvolution methods on TNBC sample CID44971 ST data (Supplementary Fig. 3b–d). We calculated the correlation \(A\in {{\mathbb{R}}}^{K\times K}\) between the average expression of gene sets (normalized to sum to 1 per spot) (Supplementary Table 2) and the deconvolution profile for each cell state:
$$\begin{array}{ccc}{A}_{{kl}} & = & {\rm{Corr}}\Big({{c}_{:k}}^{{\rm{sig}}},{{c}_{:l}}^{{\rm{pred}}}\Big)\\ {\bar{c}}_{{ik}} & = & \frac{{\sum }_{g}{x}_{{ig}}\cdot {s}_{{kg}}}{{\sum }_{g}{s}_{{kg}}},{c}_{{ik}}^{{\rm{sig}}}=\frac{{\bar{c}}_{{ik}}}{\mathop{\sum }\nolimits_{k=1}^{K}{\bar{c}}_{{ik}}},\end{array}$$
where \({c}_{:k}^{{\rm{sig}}},{c}_{:l}^{{\rm{pred}}}\in {{\mathbb{R}}}^{S}\) represent signature marker’s expression and deconvolution proportions for cell states k and l, respectively.
For Starfysh, we followed the same procedure from the simulation benchmark and reported the variational mean q(cik|xi; α) as the deconvolution profile.
For both BayesPrism and Cell2location, we followed the same procedures as the simulation benchmark, except for replacing the synthetic ST data with real ST data from TNBC sample CID44971. We applied the TNBC sample CID44971 scRNA-seq annotation from the ‘subset’ classification tier from Wu et al.18. For correlation calculation, intersections between single-cell annotations18 and our signature cell types are shown, as BayesPrism and Cell2location only deconvolve cell types that appear in the reference.
For STdeconvolve, we iterated the number of factors (k) from 20 to 30 and chose the optimal k as 30 given the lowest perplexity following the official tutorial. Because STdeconvolve does not explicitly annotate factors, we performed hierarchical clustering between factors (x axis) and cell types (y axis).
We applied archetypal analysis (Starfysh) to the ST data and identified 18 distinct archetypes. We reported the overlapping percentage between anchor spots and archetypal communities for each cell state (Supplementary Fig. 3e).
Quantification of performance in deconvolution of cell states in real ST data
Performance in disentangling cell states was evaluated using the Frobenius norm \(d={\Vert A-{A}^{\rm{sig}}\Vert }_{F}\) as the distance between the deconvolution-to-signature correlation A to the ‘reference’ matrix \({{A}_{{kl}}}^{{\rm{sig}}}={\rm{Corr}}({{c}_{:k}}^{k},{{c}_{:l}}^{l})\), defined as the correlation between signature expressions across cell states. To ensure a fair comparison across reference-based and reference-free methods, we reported a Frobenius norm distance computed as follows: for each method, (1) 1,000 10 × 10 submatrices {A(1),…, A(1,000)} were sampled from the original correlation matrix A without replacement with randomly permuted cell states; (2) an array of Frobenius norm distance \(\overrightarrow{d}=({d}^{(1)},\ldots ,{d}^{(1,000)}),\,{d}^{(i)}={\Vert {A}^{(i)}-{A}^{{\rm{sig}}(i)}\Vert }_{F}\) was computed; and (3) we reported the average value of \({d}_{i}\) in Supplementary Fig. 3a–d. To test the improvement of Starfysh, we performed a Mann–Whitney U-test between the distance array of Starfysh against the combination of all other methods (BayesPrism, Cell2location, STdeconvolve).
For reference-free methods in which the number of inferred factors and the number of cell types may differ, we permuted the correlation matrix such that each cell type (row) was aligned with the factor (column) with the highest correlation score, where the diagonal entries were sorted in descending fashion.
Runtime comparison across deconvolution methods on real ST data
Runtimes of the core deconvolution function in each method were measured on the same machine with 12-core AMD Ryzen 9 3900X CPU and a GeForce RTX 2080 GPU:
Starfysh: run_starfysh (GPU-enabled)
BayesPrism: run.prism
Cell2location: RegressionModel.train(),Cell2location.train() (GPU-enabled)
STdeconvolve: fitLDA
Starfysh validation with Xenium-mapped ST data
We further applied Starfysh to a recent breast cancer ST dataset, for which integrated multicellular (Visium, replicate 1) and subcellular in situ (Xenium) spatial technologies were performed on the same formalin-fixed, paraffin-embedded tissue blocks29. We first aligned the Visium H&E images and spots to the paired Xenium H&E images with SIFT registration87. The ground truth deconvolution profile was then constructed by assigning spots to their corresponding Xenium cells annotated by Janesick et al.29. A total of 2,567 spots with nine major cell types were kept after filtering out spots with unannotated cells (Supplementary Fig. 4a). Benchmarking metrics were computed the same way as for the simulation data. Original datasets as well as the signatures used by Starfysh are publicly available at https://www.10xgenomics.com/support/in-situ-gene-expression/documentation/steps/onboard-analysis/at-a-glance-xenium-output-files.
Starfysh validation with ST data of mouse cortex and human lymph node
We applied Starfysh to mouse brain data adapted from Cell2location8 and used the marker genes provided by the paper, which are collected from literature with known regional marker genes or the Allen Brain Atlas. Histology integration is applied in this dataset also. Starfysh successfully recognized enriched regions such as Bergmann glia of the cerebellum (ACBG), cortex pyramidal layer 6 (TEGLU3), the basolateral amygdala (TEGLU22) and the hippocampus (TEGLU24) (TEGLU, telencephalon projecting excitatory neurons; Supplementary Fig. 6a). Starfysh also reconstructed the histology data resembling original images (Supplementary Fig. 6b). Inferred spatial hubs recapitulated the brain regions identified from Cell2location (Supplementary Fig. 6c), such as the thalamus (hubs 8 and 9), the hypothalamus (hubs 7 and 19), the cortex (hubs 0, 1 and 5), the amygdala (hubs 6 and 12), the hippocampus (hubs 10 and 20), the striatum (hub 11) and white matter (hubs 4 and 13).
We also applied Starfysh to human lymph nodes with gene signatures from a comprehensive atlas of 34 cell types in human lymphoid organs88,89,90. The results recapitulated the identification of T cell and B cell zones and germinal centers with dark-zone, light-zone and follicular dendritic cells reported as in Cell2location (Supplementary Fig. 6d). Starfysh also distinguished blood vessel zones, similar to the results in Cell2location. The identified spatial hubs (Supplementary Fig. 6e) showed similar alignment with Cell2location (scRNA-seq reference based)-defined spatial clusters through the MIC (Supplementary Fig. 6e,f).
Starfysh validation with spatiotemporal analysis of prostate cancer
To evaluate Starfysh’s power in unraveling mechanisms in more complicated scenarios, such as spatiotemporal ST datasets, we applied it to ST datasets from prostate cancer tissues undergoing AD therapy30. ST profiling provided a unique perspective on the tumor and microenvironment in this specific prostate cancer, called castration-resistant PCa, a type with challenging tumor grade classification and unpredictable treatment outcomes.
Unlike the published study that used spatial transcriptome decomposition91 for patient-by-patient spatiotemporal analysis, Starfysh demonstrated superior efficacy in identifying more interpretable niches. It integrated samples from three patients with four biopsies each and two biological replicates per biopsy and samples from both pretreatment and post-treatment stages (Supplementary Fig. 7a,b).
UMAP visualization of the joint space of inferred cell type proportion highlighted specific features such as clustering of tumor cells, immune cells and stromal cells (Supplementary Fig. 7c). We defined 17 hubs within this joint space (Supplementary Fig. 7d), and their spatial distribution illustrated changes before and after AD treatment across patients and revealed similarities across replicates (Supplementary Fig. 7e). Each hub represented aggregations of specific cell types (Supplementary Fig. 7f), with ranking based on tumor cell proportions including tumor-enriched hubs (Supplementary Fig. 7g). For instance, hub 0 was enriched with prostate cancer and stromal cells such as CAFs and perivascular cells, whereas hub 1 had predominantly cancer cells.
Patient-specific variances were evident in the composition of these hubs, particularly in their response to AD treatment. Starfysh’s analysis aligned with clinical data, categorizing patients into responders (patient 1), moderate responders (patient 2) and nonresponders (patient 3). For example, tumor-enriched hub 0 predominated in the nonresponder (patient 3), while hub 15 was specific to the moderate responder (patient 2) (Supplementary Fig. 7h). Differential gene expression analysis of hub 0 revealed enrichment in EMT pathways and myogenesis, indicating resistance to treatment (Supplementary Fig. 7h,i). Additionally, hub 0 exhibited low AR activity (Supplementary Fig. 7j), aligning with findings that stromal cells adjacent to resistant clusters lacked androgen receptor expression and were enriched with EMT pathways. Starfysh not only identified similar regions but also highlighted specific cell type infiltrations, including those of CAFs and perivascular cells. Moreover, ST data indicated a trend from tumor hubs (hubs 13 and 15) to hub 0 upon treatment, which is beneficial for interpatient analysis.
Breast tumor ST data collection and analysis
Sample collection and preparation
Tissues were collected from women undergoing surgery for primary breast cancer. All samples were obtained after informed consent and approval from the institutional review board at Memorial Sloan Kettering Cancer Center. Samples were obtained using standard-of-care procedures. The samples were embedded fresh in Scigen Tissue-Plus O.C.T. Compound (Fisher Scientific) and stored at −80 °C before sectioning. Cryosections (10 μm) were mounted on Visium spatial gene expression slides (10x Genomics, 1000184). Two individual tumors were mounted in duplicate on the four 6.5-mm × 6.5-mm capture areas. The samples were processed as described in the manufacturer’s protocols.
Spatial transcriptomics by 10x Genomics Visium
Visium Spatial Gene Expression slides prepared by the Molecular Cytology Core at MSKCC were permeabilized at 37 °C for 6 min, and polyadenylated mRNA was captured by oligonucleotides bound to the slides. Reverse transcription, second-strand synthesis, complementary DNA (cDNA) amplification and library preparation proceeded using the Visium Spatial Gene Expression Slide & Reagent Kit (10x Genomics, 1000184) according to the manufacturer’s protocol. After evaluation by real-time PCR, cDNA amplification included 13–14 cycles; sequencing libraries were prepared with 15 cycles of PCR. Indexed libraries were pooled in an equimolar fashion and sequenced on a NovaSeq 6000 instrument in a PE28/120 run using the NovaSeq 6000 SP Reagent Kit (200 cycles) (Illumina). An average of 228 million paired reads were generated per sample.
Tissues were stained with H&E, and slides were scanned on a Pannoramic MIDI scanner (3DHISTECH) using a ×20, 0.8-NA objective.
Quality metrics for the collected ST data are shown in Supplementary Table 5.
CODEX data collection and preprocessing
Four fresh-frozen samples, adjacent slides with P3A_MBC, P3B_MBC, P4A_MBC and P4B_MBC, were processed for PhenoCycler (CODEX) imaging in Enable Lab (https://www.enablemedicine.com). Samples were prepared and stained, and images were acquired following CODEX User Manual Rev C (https://www.akoyabio.com) at Enable Medicine. Twenty-three antibodies were used for staining in this study (Supplementary Table 6). Image data were preprocessed using commercial software (Enable Medicine).
Analysis of ST data from breast tumor tissues
Data preprocessing
Starfysh is compatible with Scanpy82 and preprocesses the raw count matrix as input without normalization after filtering out ribosomal and mitochondrial genes. To account for expression sparsity and noise, we selected the top 2,000 highly variable genes including specified marker genes.
Identification of tumor-associated anchors
Tumor-associated archetypes were defined as the anchor spots highly associated with tumor cell types. First, an initial set of cell state-enriched spots (for example, 60 spots for each cell state) and M archetypes were identified based on the provided marker gene list and the PCHA algorithm, respectively. Because archetypes are vertices non-overlapping with observed data, the r = 20 nearest-neighbor spots for each archetype were identified, obtaining a set of ‘archetypal communities’ as a 20 × M matrix. Next, we aligned archetypal communities with the best one-to-one matched K cell states with the stable marriage algorithm. Anchor spots were then updated based on the new marker gene list. The final anchors that are associated with any tumor cell gene set (including TNBC, MBC, LumA, LumB and ER+) were considered as TAAs (Figs. 2d,h and 4c).
Diffusion component analysis
Diffusion components were computed using normalized gene counts as the input. Computation was performed with the Scanpy package. Scanpy computes diffusion components by first constructing a nearest-neighbor graph from the high-dimensional input data. Next, it simulates a diffusion process on the graph.
Definition of hubs
Hubs were defined as groups of spots with a similar composition of cell states. To integrate ST samples from different patients, anchors were defined on merged data from all samples, and Starfysh then inferred the cell state proportion and latent variables for each spot in each sample using the same anchor set. Spots were then clustered according to the inferred cell state proportion using PhenoGraph clustering (Supplementary Fig. 11c).
Entropy of spots
We used an entropy-based metric previously used for batch correction in single-cell data35 for evaluating the integration of samples. The Shannon entropy of spots denotes mixing of spots across samples. Specifically, we constructed a kNN graph for each spot i to determine its nearest neighbors using Euclidean distance in the Starfysh latent space (z). These nearest-neighbor spots formed a distribution of patients \((m\in \{1,\ldots 14\}\,)\) for the overall 14 patients studied in this paper, represented as \({{e}_{{i}^{}}}^{m}\). The Shannon entropy is calculated as \({H}_{i}=-{\sum }_{m=1}^{14}{{e}_{i}}^{m}\log {{e}_{i}}^{m}\). Higher entropy represents higher localized sample mixing across patients (Fig. 3d).
Kendall’s τ correlation
Kendall’s τ correlation is a metric for measuring the ordinal association between two measured quantities. We used this metric to quantify the heterogeneity of TAAs. Genes for TAAs were ranked based on differential expression scores for each sample. Samples having similar TAAs were assumed to have a similar rank of differential genes, thus having higher scores of Kendall’s τ correlation (Fig. 2p).
Copy number variation
Copy number variation was performed following the instructions for inferCNV (https://github.com/broadinstitute/inferCNV). The inferred copy number variation cluster lineage was plotted as a dendrogram tree using toytree92.
Definition of intratumoral, peritumoral and stromal regions
We applied Starfysh to TNBC and MBC samples to avoid the bias introduced by those ER+ samples and redefined the hubs among six TNBC and four MBC samples. Intratumoral regions were defined as hubs with the mean of inferred proportions of all tumor states being larger than 0.2 (Supplementary Fig. 13b). Histology information was also considered to confirm the enrichment of tumor cells in these regions. Other hubs were ranked by the average distance (unit, pixel) to intratumoral hubs. With the incorporation of histology and total proportion of immune cells and stromal cells, hub 8 was considered as the boundary between peritumoral regions and stromal regions (Supplementary Fig. 13c). To summarize, hubs 5, 2, 11 and 12 were considered as intratumoral hubs, hubs 0, 9, 3, 6 and 8 were considered as peritumoral hubs, and hubs 1, 7, 4 and 10 were recognized as stromal hubs. Notably, the determined peritumoral regions were shared across all samples, while some intratumoral regions and stromal regions were sample specific (Supplementary Fig. 13a,d and Fig. 4b).
Spatial correlation
To measure colocalization between cell states, we slightly modified the spatial cross-correlation index (SCI)54. SCI is defined as:
$${\rm{SCI}}\Big({S}_{x},{S}_{y}\Big)=\frac{N}{2\mathop{\sum }\nolimits_{i}^{N}\mathop{\sum }\nolimits_{j}^{N}{\tau }_{{ij}}}\frac{\mathop{\sum }\nolimits_{i}^{N}\mathop{\sum }\nolimits_{j}^{N}{\tau }_{{ij}}({x}_{i}-\bar{x})(\;{y}_{i}-\bar{y})}{\sqrt{\mathop{\sum }\nolimits_{i}^{N}{({x}_{i}-\bar{x})}^{2}}\sqrt{\mathop{\sum }\nolimits_{j}^{N}{(\;{y}_{j}-\bar{y})}^{2}}},$$
where x and y denote the predicted proportion for two cell states Sx and Sy, i and \(j\in [1,\mathrm{.}.,N]\) are indexes of spots within a certain hub and \(\bar{x},\bar{y}\) are the mean proportion of two cell states in the hubs. We defined the weight matrix \(\tau\) as information between adjacent neighbors, as τij = 1 if the coordinate distance of spot i and spot j was less than \(\sqrt{3}\), else wij = 0.
Inference of intercellular ligand–receptor interactions
To investigate the intercellular interactions in a hub, the top 5% spots with the highest inferred proportion of each cell state in the hub were selected. CellPhoneDB55 was then applied to the selected spots with normalized gene expression. Visualization was performed with the Sankey diagram with plotly and the Circos plot93.
Diffusion map analysis with intratumoral hubs
Intratumoral hubs were selected for diffusion map analysis (Fig. 2h), and diffusion map components showing gradients between intratumoral hubs were chosen. Diffusion map coordinates were used as inputs for the trajectory inference algorithm SCORPIUS49. Modules of genes that significantly (q values < 0.05) contributed to the trajectory of transitions between tumor hubs were identified (Fig. 2i). Over-representation analysis was conducted to understand the biological processes via the Python package gseapy with gene sets including KEGG_2021_Human, GO_Biological_Process_2021 and Hallmark.
Genes with diffused expression patterns
Treg-enriched (proportion > 0.05) spots in intratumoral hubs were selected, and the distance between all spots to the selected spots was calculated with the ‘sklearn.neighbors’ Python package with the function KDTree. For each gene, expression of spots with the same distance was averaged and smoothed with a window size of 7 for each sample. The mean and s.d. of expression across all samples were computed and smoothed with ‘Gaussian_filter1d(sigma = 1.5)’ with the Python package SciPy (mean and s.d. are shown as a solid line and shaded area in Fig. 4i).
CODEX data analysis
Raw CODEX images were segmented to enable cell-level quantification from biomarker signals. The results were then checked with quality control to filter out segmentation artifacts. The data thus were transformed as a U × P matrix, where U is the number of single cells detected in the CODEX images and P represents the number of antibodies profiled. The data were then processed by quantile normalization, asinh transform and z-score normalization. PCA, neighbor graphs and UMAP were performed sequentially on single-cell CODEX data (Supplementary Fig. 15a). Annotations of cell types were based on the clustering and distribution of normalized CODEX data such as Ki67 and CD3 expression (Supplementary Fig. 15b,c and Supplementary Table 6). Annotations were validated with a dendrogram tree of the clusters (Supplementary Fig. 15d). The single-cell CODEX was also visualized in the spatial arrangement aligning with the histology and ST Visium data (Supplementary Fig. 15e and Fig. 5c).
Spatial profiling of T cell receptors
To capture spatial TCR clonotype information, we adapted an established protocol that allows spatial mapping of TCRs from cDNA libraries of our samples46. The process involves three qPCR steps: (1) the first step begins with 43 pooled TCRB primers and the truncated read 1 primer (2 µl cDNA, 1 µl of each forward and reverse primers and 12.5 µl NEBNext Master Mix, 0.5 µl SYBR and 8 µl water). (2) The second step uses 43 TCRB primers with R2 sequences and the truncated read 1 primer with 1 µl of the PCR product from step 1. (3) The third step involves indexed TruSeq P5 primers and indexed Nextera P7 primers, with 1 µl of the PCR product from step 2. All PCR steps were stopped before the plateau phase, and the PCR products were cleaned with 0.8× AMPure beads and eluted in 50 µl.
Sequencing was conducted on an Illumina NextSeq 500 instrument with the following cycle settings: R1 28, I1 10, I2 10, R2 110. Clonotype analyses were performed with MiXCR.
The PCR cycling conditions are as follows: initial denaturation, 98 °C for 3 min; denaturation, 98 °C for 15 s; annealing, 62 °C (72 °C for qPCR step 3) for 20 s; extension, 72 °C for 1 min; repeat of the denaturation step to the extension step before the plateaus phase; final extension, 72 °C for 1 min.
We further provide the full spatial TCR primer sequences in Supplementary Table 8.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.