scDRP package

Submodules

scDRP.data module

class scDRP.data.CombinedDataset(X, a, t, c)

Bases: Dataset

Dataset for the combined data

__init__(X, a, t, c)
Parameters:
  • X – numpy array of shape (n_samples, n_features)

  • a – one-hot encoded numpy array of shape (n_samples, num_perturbations)

  • t – one-hot encoded numpy array of shape (n_samples, num_celltypes) or None

  • c – one-hot encoded numpy array of shape (n_samples, batch_dim) or None

class scDRP.data.LabelDataset(a, t)

Bases: Dataset

Dataset for the combined data

__init__(a, t)
Parameters:
  • a – one-hot encoded numpy array of shape (n_samples, num_perturbations)

  • t – one-hot encoded numpy array of shape (n_samples, num_celltypes) or None

class scDRP.data.LatentDataset(z, c)

Bases: Dataset

Dataset for the combined data

__init__(z, c)
Parameters:
  • z – numpy array of shape (n_samples, latemt_dim)

  • c – one-hot encoded numpy array of shape (n_samples, batch_dim) or None

scDRP.loss module

class scDRP.loss.HSICLoss(sigma_x: float | None = None, sigma_y: float | None = None, sigma_method: Literal['median', 'scott'] = 'median', sigma_scale: float = 1.0)

Bases: Module

Hilbert-Schmidt Independence Criterion (HSIC) for Independence Testing.

Measures the dependence between X and Y. HSIC is zero if and only if X and Y are independent (assuming characteristic kernels like Gaussian RBF).

Based on: Gretton et al. “Measuring Statistical Dependence with Hilbert-Schmidt Norms” (2005)

Parameters:
  • sigma_x (float, optional) – Fixed bandwidth for X. If None, uses automatic selection.

  • sigma_y (float, optional) – Fixed bandwidth for Y. If None, uses automatic selection.

  • sigma_method (str) – Method for automatic sigma selection when sigma_x/sigma_y is None. - ‘median’: Median heuristic (default, recommended) - ‘scott’: Scott’s rule

  • sigma_scale (float) – Additional scaling factor for computed sigma. Default: 1.0

Example

>>> # Automatic sigma selection (recommended)
>>> loss_fn = HSICLoss()
>>> X = torch.randn(100, 5)
>>> Y = torch.randn(100, 3)
>>> loss = loss_fn(X, Y)
>>>
>>> # Manual sigma specification
>>> loss_fn = HSICLoss(sigma_x=1.0, sigma_y=0.5)
>>> loss = loss_fn(X, Y)
__init__(sigma_x: float | None = None, sigma_y: float | None = None, sigma_method: Literal['median', 'scott'] = 'median', sigma_scale: float = 1.0)

Initializes the HSIC loss.

Parameters:
  • sigma_x – Fixed bandwidth for X. If None, auto-computed per forward pass.

  • sigma_y – Fixed bandwidth for Y. If None, auto-computed per forward pass.

  • sigma_method – Method for automatic bandwidth selection.

  • sigma_scale – Global scaling factor applied to all computed sigmas.

_compute_sigma(x: Tensor, fixed_sigma: float | None) float

Computes or returns sigma for a given tensor.

Parameters:
  • x – Input tensor.

  • fixed_sigma – Pre-specified sigma value. If None, auto-compute.

Returns:

Sigma value to use.

_gaussian_kernel(x: Tensor, y: Tensor, sigma: float) Tensor

Computes Gaussian RBF kernel matrix.

K(x, y) = exp(-||x - y||^2 / (2sigma^2))

Parameters:
  • x – Tensor of shape [batch_size, dim_x].

  • y – Tensor of shape [batch_size, dim_y].

  • sigma – Bandwidth parameter.

Returns:

Kernel matrix of shape [batch_size, batch_size].

_median_heuristic(x: Tensor) float

Computes sigma using median heuristic.

sigma^2 = median(||x_i - x_j||^2) * sigma_scale

Parameters:

x – Tensor of shape [batch_size, dim].

Returns:

Optimal sigma value.

_scotts_rule(x: Tensor) float

Computes sigma using Scott’s rule.

sigma = n^(-1/(d+4)) * std(X) * sigma_scale

Parameters:

x – Tensor of shape [batch_size, dim].

Returns:

Optimal sigma value.

forward(X: Tensor, Y: Tensor) Tensor

Computes HSIC between X and Y.

Parameters:
  • X – Input tensor of shape [batch_size, dim_x].

  • Y – Input tensor of shape [batch_size, dim_y].

Returns:

HSIC value (scalar). Zero indicates independence.

Raises:

ValueError – If batch sizes don’t match.

class scDRP.loss.NOCCOLoss(epsilon: float = 0.0001, sigma_x: float | None = None, sigma_y: float | None = None, sigma_z: float | None = None, sigma_method: Literal['median', 'scott'] = 'median', sigma_scale: float = 1.0)

Bases: Module

Normalized Cross-Covariance Operator (NOCCO) Loss for Conditional Independence.

Measures the conditional dependence between X and Y given Z using the Hilbert-Schmidt norm of the normalized conditional cross-covariance operator.

Based on: “Kernel Measures of Conditional Dependence” (Fukumizu et al., NIPS 2007)

Parameters:
  • epsilon (float) – Regularization parameter for matrix inversion stability. Default: 1e-4

  • sigma_x (float, optional) – Fixed bandwidth for X. If None, uses automatic selection.

  • sigma_y (float, optional) – Fixed bandwidth for Y. If None, uses automatic selection.

  • sigma_z (float, optional) – Fixed bandwidth for Z. If None, uses automatic selection.

  • sigma_method (str) – Method for automatic sigma selection. Options: - ‘median’: Median heuristic (default, recommended) - ‘scott’: Scott’s rule

  • sigma_scale (float) – Additional scaling factor for computed sigma. Default: 1.0

Example

>>> # Automatic sigma selection (recommended)
>>> loss_fn = NOCCOLoss()
>>> X, Y, Z = torch.randn(100, 5), torch.randn(100, 3), torch.randn(100, 2)
>>> loss = loss_fn(X, Y, Z)
>>>
>>> # Manual sigma specification
>>> loss_fn = NOCCOLoss(sigma_x=1.0, sigma_y=1.0, sigma_z=0.5)
>>> loss = loss_fn(X, Y, Z)
__init__(epsilon: float = 0.0001, sigma_x: float | None = None, sigma_y: float | None = None, sigma_z: float | None = None, sigma_method: Literal['median', 'scott'] = 'median', sigma_scale: float = 1.0)

Initializes the NOCCO loss.

_center_gram(K: Tensor) Tensor

Centers the Gram matrix in feature space.

_compute_normalized_cov(G_Y: Tensor, G_X: Tensor, n: int) Tensor

Computes normalized cross-covariance operator V_Y_X.

_compute_sigma(x: Tensor, fixed_sigma: float | None) float

Computes or returns sigma for a given tensor.

_gaussian_kernel(x: Tensor, y: Tensor, sigma: float) Tensor

Computes Gaussian RBF kernel matrix.

_median_heuristic(x: Tensor) float

Computes sigma using median heuristic.

_scotts_rule(x: Tensor) float

Computes sigma using Scott’s rule.

forward(X: Tensor, Y: Tensor, Z: Tensor) Tensor

Computes the NOCCO loss for conditional independence testing.

Parameters:
  • X – Input tensor of shape [batch_size, dim_x].

  • Y – Input tensor of shape [batch_size, dim_y].

  • Z – Conditioning tensor of shape [batch_size, dim_z].

Returns:

Scalar loss value (non-negative). Zero indicates conditional independence.

class scDRP.loss.UnnormalizedHSCICLoss(epsilon: float = 0.0001, sigma_x: float | None = None, sigma_y: float | None = None, sigma_z: float | None = None, sigma_method: Literal['median', 'scott'] = 'median', sigma_scale: float = 1.0)

Bases: Module

Unnormalized Hilbert-Schmidt Conditional Independence Criterion (HSCIC).

Computes the unnormalized squared Hilbert-Schmidt norm of the conditional cross-covariance operator: ||sigma_Yẍ|Z||^2_HS.

Based on: Fukumizu et al. (2004) and Sheng & Sriperumbudur (2023)

Parameters:
  • epsilon (float) – Regularization parameter for matrix inversion. Default: 1e-4

  • sigma_x (float, optional) – Fixed bandwidth for X. If None, uses automatic selection.

  • sigma_y (float, optional) – Fixed bandwidth for Y. If None, uses automatic selection.

  • sigma_z (float, optional) – Fixed bandwidth for Z. If None, uses automatic selection.

  • sigma_method (str) – Method for automatic sigma selection. Options: - ‘median’: Median heuristic (default, recommended) - ‘scott’: Scott’s rule

  • sigma_scale (float) – Additional scaling factor for computed sigma. Default: 1.0

Example

>>> # Automatic sigma selection (recommended)
>>> loss_fn = UnnormalizedHSCICLoss()
>>> X, Y, Z = torch.randn(100, 5), torch.randn(100, 3), torch.randn(100, 2)
>>> loss = loss_fn(X, Y, Z)
>>>
>>> # Manual sigma specification
>>> loss_fn = UnnormalizedHSCICLoss(sigma_x=1.0, sigma_y=1.0, sigma_z=0.5)
>>> loss = loss_fn(X, Y, Z)
__init__(epsilon: float = 0.0001, sigma_x: float | None = None, sigma_y: float | None = None, sigma_z: float | None = None, sigma_method: Literal['median', 'scott'] = 'median', sigma_scale: float = 1.0)

Initializes the Unnormalized HSCIC loss.

_center_gram(K: Tensor) Tensor

Centers the Gram matrix in the RKHS feature space.

_compute_sigma(x: Tensor, fixed_sigma: float | None) float

Computes or returns sigma for a given tensor.

_gaussian_kernel(x: Tensor, y: Tensor, sigma: float) Tensor

Computes Gaussian RBF kernel matrix.

_median_heuristic(x: Tensor) float

Computes sigma using median heuristic.

_scotts_rule(x: Tensor) float

Computes sigma using Scott’s rule.

forward(x: Tensor, y: Tensor, z: Tensor) Tensor

Computes the unnormalized HSCIC loss value.

Parameters:
  • x – Input tensor X of shape [batch_size, dim_x].

  • y – Input tensor Y of shape [batch_size, dim_y].

  • z – Conditioning tensor Z of shape [batch_size, dim_z].

Returns:

Scalar loss value (non-negative). Zero indicates X ⊥⊥ Y | Z.

class scDRP.loss.ZINBLoss

Bases: Module

Zero-Inflated Negative Binomial Loss

__init__()

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(x, rho, dispersion, pi, s, eps=1e-08)
Parameters:
  • x – observed data

  • rho – mean gene expression (mean of the negative binomial distribution equals rho * s)

  • dispersion – dispersion of the negative binomial distribution

  • pi – zero-inflation parameter

  • s – scale parameter (library size)

  • eps – small value to prevent numerical instability

Returns:

negative log likelihood

Return type:

loss

scDRP.loss.klLoss(mu, logvar)

KL divergence between the latent distribution and the prior

Parameters:
  • mu – mean of the latent distribution

  • logvar – log variance of the latent distribution

  • Returns

  • kl – KL divergence

Returns:

KL divergence

Return type:

kl

scDRP.loss.klLoss_prior(mu_q, logvar_q, mu_p, logvar_p)

Compute KL(q || p) for two Gaussians q(z|x) ~ N(mu_p, exp(logvar_p)) and p(z) ~ N(mu_q, exp(logvar_q))

Parameters:
  • mu_q – mean of q

  • logvar_q – log variance of q

  • mu_p – mean of p

  • logvar_p – log variance of p

Returns:

KL divergence

Return type:

kl

scDRP.model module

class scDRP.model.Decoder(device, input_dim=3000, covariate_dim=1, layer_dims=[500, 100], latent_dim=20, dropout_rate=0.2, library_size_strategy='observed')

Bases: Module

Decoder network

__init__(device, input_dim=3000, covariate_dim=1, layer_dims=[500, 100], latent_dim=20, dropout_rate=0.2, library_size_strategy='observed')
Parameters:
  • input_dim – int, input dimension

  • covariate_dim – int, covariate dimension

  • layer_dims – list of int, hidden layer dimensions

  • latent_dim – int, latent dimension

  • dropout_rate – float, dropout rate in MLP

forward(z, c, dispersion_strategy='gene')
Parameters:
  • z – torch.Tensor, latent variables (batch_size x latent_dim)

  • c – torch.Tensor, covariate data (batch_size x covariate_dim)

  • dispersion_strategy – str, strategy to specify dispersion factor (we have two options: gene-wise and gene-cell wise but currently use gene-wise only)

Returns:

torch.Tensor, mean of the negative binomial distribution dispersion: torch.Tensor, dispersion of the negative binomial distribution pi: torch.Tensor, zero-inflation parameter

Return type:

rho

class scDRP.model.Encoder(device, input_dim=3000, layer_dims=[500, 100], latent_dim=20, dropout_rate=0.2)

Bases: Module

Encoder network

__init__(device, input_dim=3000, layer_dims=[500, 100], latent_dim=20, dropout_rate=0.2)
Parameters:
  • input_dim – int, input dimension

  • layer_dims – list of int, hidden layer dimensions

  • latent_dim – int, latent dimension

  • dropout_rate – float, dropout rate in MLP

forward(x)
Parameters:

x – torch.Tensor, input data (batch_size x input_dim)

Returns:

torch.Tensor, latent variable mu: torch.Tensor, mean of the latent variable logvar: torch.Tensor, log variance of the latent variable

Return type:

z

reparameterize(mu, logvar)

Reparameterization trick

Parameters:
  • mu – torch.Tensor, mean of the latent variable

  • logvar – torch.Tensor, log variance of the latent variable

Returns:

torch.Tensor, latent variable

Return type:

z

class scDRP.model.HardConcreteGate(size, beta=0.6666666666666666, gamma=-0.1, zeta=1.1)

Bases: Module

Hard Concrete Gate for L0 regularization

Parameters:
  • size – int, size of the gate

  • beta – float, temperature parameter. Default is 2/3

  • gamma – float, left stretch parameter. Default is -0.1

  • zeta – float, right stretch parameter. Default is 1.1

__init__(size, beta=0.6666666666666666, gamma=-0.1, zeta=1.1)

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(x, training=True)

Forward pass with gating

Parameters:
  • x – input tensor

  • training – whether the model is in training mode

Returns:

gated output tensor gate: sampled gate values

Return type:

output

regularization_loss()

Compute the expected L0 penalty

Returns:

a tensor representing the expected L0 penalty (prob of being nonzero)

Return type:

prob_nonzero

sample_gate(training=True)

Sample gate values using Hard Concrete distribution

Parameters:

training – bool, whether in training mode

Returns:

torch.Tensor, sampled gate values

Return type:

gate

class scDRP.model.MSEDecoder(device, input_dim=3000, covariate_dim=1, layer_dims=[500, 100], latent_dim=20, dropout_rate=0.2, distribution='Normal')

Bases: Module

Decoder network

__init__(device, input_dim=3000, covariate_dim=1, layer_dims=[500, 100], latent_dim=20, dropout_rate=0.2, distribution='Normal')
Parameters:
  • input_dim – int, input dimension

  • covariate_dim – int, covariate dimension

  • layer_dims – list of int, hidden layer dimensions

  • latent_dim – int, latent dimension

  • dropout_rate – float, dropout rate in MLP

forward(z, c, dispersion_strategy='gene')
Parameters:
  • z – torch.Tensor, latent variable (batch_size x latent_dim)

  • c – torch.Tensor, covariate data (batch_size x covariate_dim)

  • dispersion_strategy – str, strategy to specify dispersion factor (we have two options: gene-wise and gene-cell wise but currently use gene-wise only)

Returns:

torch.Tensor, mean of the negative binomial distribution dispersion: torch.Tensor, dispersion of the negative binomial distribution pi: torch.Tensor, zero-inflation parameter

Return type:

rho

class scDRP.model.NBDecoder(device, input_dim=3000, covariate_dim=1, layer_dims=[500, 100], latent_dim=20, dropout_rate=0.2, library_size_strategy='observed')

Bases: Module

Decoder network

__init__(device, input_dim=3000, covariate_dim=1, layer_dims=[500, 100], latent_dim=20, dropout_rate=0.2, library_size_strategy='observed')
Parameters:
  • input_dim – int, input dimension

  • covariate_dim – int, covariate dimension

  • layer_dims – list of int, hidden layer dimensions

  • latent_dim – int, latent dimension

  • dropout_rate – float, dropout rate in MLP

forward(z, c, dispersion_strategy='gene')
Parameters:
  • z – torch.Tensor, latent variables (batch_size x latent_dim)

  • c – torch.Tensor, covariate data (batch_size x covariate_dim)

  • dispersion_strategy – str, strategy to specify dispersion factor (we have two options: gene-wise and gene-cell wise but currently use gene-wise only)

Returns:

torch.Tensor, mean of the negative binomial distribution dispersion: torch.Tensor, dispersion of the negative binomial distribution pi: torch.Tensor, zero-inflation parameter

Return type:

rho

class scDRP.model.PerturbNet(device, input_dim, covariate_dim=0, celltype_num=0, perturbation_num=2, layer_dims=[500, 100], latent_dep_dim=50, latent_ind_dim=50, dropout_rate=0.2, lambda_sparse=0, l0_latent=0.001, beta=1, lambda_hsic=0.2, distribution='ZINB', encoder_covariates=False, library_size_strategy='observed', eps=1e-10)

Bases: Module

scPerturb model

__init__(device, input_dim, covariate_dim=0, celltype_num=0, perturbation_num=2, layer_dims=[500, 100], latent_dep_dim=50, latent_ind_dim=50, dropout_rate=0.2, lambda_sparse=0, l0_latent=0.001, beta=1, lambda_hsic=0.2, distribution='ZINB', encoder_covariates=False, library_size_strategy='observed', eps=1e-10)
Parameters:
  • input_dim – int, input dimension

  • covariate_dim – int, covariate dimension (default: 1)

  • celltype_num – int, number of cell types (default: 1)

  • perturbation_num – int, number of perturbations (default: 2)

  • layer_dims – list of int, hidden layer dimensions (default: [500,100])

  • latent_dep_dim – int, latent dimension for dependent variable (default: 20)

  • latent_ind_dim – int, latent dimension for independent variable (default: 10)

  • dropout_rate – float, dropout rate in MLP (default: 0.2)

  • lambda_sparse – float, sparsity penalty (default: 0)

  • beta – float, KL divergence weight (default: 10)

  • encoder_covariates – boolean, whether to include covariates in encoders (default: False)

  • eps – float, small value to prevent numerical instability (default: 1e-10)

forward(x, a, t, c, train=True)
Parameters:
  • x – torch.Tensor, input data (batch_size, input_dim)

  • a – torch.Tensor, perturbation data (batch_size, perturbation_num)

  • t – torch.Tensor, cell type data (batch_size, celltype_num)

  • c – torch.Tensor, covariate data (batch_size, covariate_dim)

  • train

Returns:

torch.Tensor, latent representation that depends on perturbation, (batch_size, latent_dep_dim) z_u: torch.Tensor, latent representation that is independent from perturbation, (batch_size, latent_ind_dim) mu_d: torch.Tensor, mean of latent representation that depends on perturbation, (batch_size, latent_dep_dim) mu_u: torch.Tensor, mean of latent representation that is independent from perturbation, (batch_size, latent_ind_dim) rho: torch.Tensor, mean of the negative binomial distribution, (batch_size, input_dim) dispersion: torch.Tensor, dispersion of the negative binomial distribution, (input_dim,) when gene-wise pi: torch.Tensor, zero-inflation parameter, (batch_size, input_dim) s: torch.Tensor, library size, (batch_size, 1) loss: torch.Tensor, total loss loss_dict: dict, dictionary of losses

Return type:

z_d

reparameterize(mu, logvar)

Reparameterization trick

Parameters:
  • mu – torch.Tensor, mean of the latent variable

  • logvar – torch.Tensor, log variance of the latent variable

Returns:

torch.Tensor, latent variable

Return type:

z

sample_sequencing_depth(x, strategy='observed')

Sample sequencing depth

Parameters:
  • x – torch.Tensor, observed data

  • strategy – str, strategy to sample sequencing depth. We have two options: batch_sample and observed, but will use observed only currently

Returns:

torch.Tensor, library size

Return type:

s

scDRP.module module

class scDRP.module.DosageModel(models)

Bases: object

__init__(models)

initialize the DosageModel with given models for mu and sigma.

Parameters:

models – a dictionary containing ‘mu’ and ‘sigma’ models.

predict(v_new_log)

Predict the mu and sigma for new dosage values.

Parameters:

v_new_log – a numpy array of log-transformed dosage values.

Returns:

a numpy array of predicted means. sigma_pred: a numpy array of predicted standard deviations.

Return type:

mu_pred

class scDRP.module.Perturb(adata, layer=None, perturbation_key='perturbation', celltype_key=None, batch_key=None, dose_key=None, distribution='ZINB')

Bases: object

__init__(adata, layer=None, perturbation_key='perturbation', celltype_key=None, batch_key=None, dose_key=None, distribution='ZINB')

Initialize the Perturb object.

Parameters:
  • adata – an AnnData object

  • layer – a string representing the layer name of count matrix in adata

  • perturbation_key – a string representing the key of perturbation in adata.obs

  • celltype_key – a string representing the key of cell type in adata.obs

  • batch_key – a string representing the key of batch in adata.obs

  • dose_key – a string representing the key of dose in adata.obs

  • distribution – a string representing the distribution of the data. Options (default: “ZINB”): - “ZINB”: Zero-Inflated Negative Binomial distribution - “NB”: Negative Binomial distribution - “Normal”: Gaussian distribution - “Normal_positive”: Gaussian distribution with positive output

_fit_dose_to_dist(v, zd, method='gpr')

Fit models to map dosage values to latent distributions.

Parameters:
  • v – a numpy array representing the dosage values.

  • zd – a numpy array representing the latent embeddings.

  • method – a string representing the method to fit dosage to latent distribution. Options are ‘linear’, ‘spline’, or ‘gpr’. Defaults to ‘gpr’.

Returns:

a DosageModel object containing the fitted models.

_fit_gpr(doses, means, stds)

Fit Gaussian Process Regressors for the given doses, means, and standard deviations.

Parameters:
  • doses – a numpy array representing the unique dosage values.

  • means – a numpy array representing the mean latent embeddings.

  • stds – a numpy array representing the standard deviation of latent embeddings.

Returns:

a dictionary containing the fitted mu and sigma models.

_fit_interpolator(doses, means, stds, method)

Fit interpolators for the given doses, means, and standard deviations.

Parameters:
  • doses – a numpy array representing the unique dosage values.

  • means – a numpy array representing the mean latent embeddings.

  • stds – a numpy array representing the standard deviation of latent embeddings.

  • method – a string representing the interpolation method. Options are ‘linear’ or ‘spline’.

Returns:

a dictionary containing the fitted mu and sigma models.

_generate_from_latent(latent, covariates=None, library_size=None, n_samples=1)

Generate samples from latent embeddings

Parameters:
  • latent – torch Tensor containing latent factors, (sample_size, latent_dimensions)

  • covariates – torch Tensor containing one-hot encoded covariates, (sample_size, covariate_dimensions). Default: None

  • library_size – torch Tensor containing library sizes for new generate samples, (sample_size,1). Inferred library size of original adata will be used if None

  • n_samples – an integer representing the number of samples (Note: if latent is assigned a n*p matrix, then n_samples*n samples will generated in total!)

Returns:

a numpy array representing the generated samples

_generate_latent(perturbations, celltype, batch_size=32)

Generate latent embeddings from perturbations and cell types.

Parameters:
  • perturbations – a numpy array representing one-hot encoded perturbations

  • celltype – a numpy array representing one-hot encoded cell types

  • batch_size – an integer representing the batch size for DataLoader

Returns:

a numpy array representing the generated latent embeddings

_representative_latent(mu, logvar)

Generate representative latent embeddings by sampling from the learned distribution.

Parameters:
  • mu – a numpy array representing the mean of the latent distribution

  • logvar – a numpy array representing the log variance of the latent distribution

Returns:

a numpy array representing the sampled latent embeddings

_sample_from_parameter(n_samples, inference=False)

Generate samples from inferred ZINB parameters.

Parameters:
  • n_samples – an integer representing the number of samples

  • inference – a boolean representing whether to reconduct the inference process each time or simply use inferred parameters stored in the object

Returns:

a numpy array representing the samples

counterfactual_samples(control, treatment, dose=None, strategy='ot', alpha=1, beta=1, projection_strategy='full', value='raw', method='sinkhorn', reg=0.01, reg_m=1.0)

Estimate the Individual Treatment Effect (ITE) for a pair of control and treatment.

Parameters:
  • control – a string representing the name of control group (should be a value in adata.obs[perturbation_key]).

  • treatment – a string representing the name of treatment group (should be a value in adata.obs[perturbation_key]).

  • dose – a float representing the dose level (should be a value in adata.obs[dose_key]). If None, all doses will be considered.

  • strategy – a string representing the counterfactual generation strategy. Options: - “ot”: Optimal Transport - “average”: Average Effect Addition

  • alpha – a float representing the weight for zu embeddings in OT calculation. Default is 1.

  • beta – a float representing the weight for zd embeddings in OT calculation. Default is 1.

  • projection_strategy – a string representing the projection strategy for OT calculation. Options: - “full”: use full embeddings for OT calculation - “zd_only”: project zd embeddings while keep zu embeddings unchanged

  • value – a string representing the value type for OT cost calculation. Options: - “raw”: use raw embeddings for OT cost calculation - “quantile”: use quantile normalized embeddings for OT cost calculation

  • method (str, optional) – Optimal transport method. Options: - “emd”: Exact Optimal Transport - “sinkhorn”: Sinkhorn Regularized OT - “unbalanced_sinkhorn”: Unbalanced Sinkhorn Regularized OT Defaults to “emd”.

  • reg (float, optional) – Entropic regularization parameter for Sinkhorn. Default is 0.1. Only useful when specifying method as “sinkhorn” or “unbalanced_sinkhorn”.

  • reg_m (float, optional) – Marginal relaxation parameter (higher allows more mass deviation). Default is 1.0. Only useful when specifying method as “unbalanced_sinkhorn”.

Returns:

a numpy array representing the ITE

create_dataset(adata)

Create a dataset from the AnnData object.

Parameters:

adata – an AnnData object.

Returns:

a CombinedDataset object.

dosage_extrapolate(control, treatment, dose_value=None, method='gpr')

Extrapolate latent embeddings for control group to new dosage levels based on treatment group data.

Parameters:
  • control – a string representing the name of control group (should be a value in adata.obs[perturbation_key]).

  • treatment – a string representing the name of treatment group (should be a value in adata.obs[perturbation_key]).

  • dose_value – a scalar or array-like representing the dosage value(s) for extrapolation.

  • method – a string representing the method to fit dosage to latent distribution. Options are ‘linear’, ‘spline’, or ‘gpr’. Defaults to ‘gpr’.

Returns:

a numpy array or a list of numpy arrays representing the counterfactual mean expressions at the specified dosage level(s).

effect_estimate(control, treatment, dose=None, strategy='ot', alpha=1, beta=1, projection_strategy='full', value='raw', method='emd', reg=0.01, reg_m=1.0)

Estimate the Individual Treatment Effect (ITE) for a pair of control and treatment.

Parameters:
  • control – a string representing the name of control group (should be a value in adata.obs[perturbation_key]).

  • treatment – a string representing the name of treatment group (should be a value in adata.obs[perturbation_key]).

  • method (str, optional) – Optimal transport method. Options: - “emd”: Exact Optimal Transport - “sinkhorn”: Sinkhorn Regularized OT - “unbalanced_sinkhorn”: Unbalanced Sinkhorn Regularized OT Defaults to “emd”.

  • reg (float, optional) – Entropic regularization parameter for Sinkhorn. Default is 0.1. Only useful when specifying method as “sinkhorn” or “unbalanced_sinkhorn”.

  • reg_m (float, optional) – Marginal relaxation parameter (higher allows more mass deviation). Default is 1.0. Only useful when specifying method as “unbalanced_sinkhorn”.

Returns:

a numpy array representing the ITE

get_adata()

Get the adata with latent variables and estimated generative parameters.

Returns:

an AnnData object with latent embeddings and estimated ZINB generative parameters.

get_counterfactual_adata(control, treatment, dose=None, covariates=None, strategy='ot', alpha=1, beta=1, projection_strategy='full', value='raw', method='sinkhorn', reg=0.01, reg_m=1.0)

Generate counterfactual AnnData for a pair of control and treatment.

Parameters:
  • control – a string representing the name of control group (should be a value in adata.obs[perturbation_key]).

  • treatment – a string representing the name of treatment group (should be a value in adata.obs[perturbation_key]).

  • dose – a float representing the dose level (should be a value in adata.obs[dose_key]). If None, all doses will be considered.

  • covariates – a numpy array representing the one-hot encoded covariates for control group. If None, inferred covariates will be used.

  • strategy – a string representing the counterfactual generation strategy. Options: - “ot”: Optimal Transport - “average”: Average Effect Addition

  • alpha – a float representing the weight for zu embeddings in OT calculation. Default is 1.

  • beta – a float representing the weight for zd embeddings in OT calculation. Default is 1.

  • projection_strategy – a string representing the projection strategy for OT calculation. Options: - “full”: use full embeddings for OT calculation - “zd_only”: project zd embeddings while keep zu embeddings unchanged

  • value – a string representing the value type for OT cost calculation. Options: - “raw”: use raw embeddings for OT cost calculation - “quantile”: use quantile normalized embeddings for OT cost calculation

  • method (str, optional) – Optimal transport method. Options: - “emd”: Exact Optimal Transport - “sinkhorn”: Sinkhorn Regularized OT - “unbalanced_sinkhorn”: Unbalanced Sinkhorn Regularized OT Defaults to “emd”.

  • reg (float, optional) – Entropic regularization parameter for Sinkhorn. Default is 0.1. Only useful when specifying method as “sinkhorn” or “unbalanced_sinkhorn”.

  • reg_m (float, optional) – Marginal relaxation parameter (higher allows more mass deviation). Default is 1.0. Only useful when specifying method as “unbalanced_sinkhorn”.

Returns:

an AnnData object representing the counterfactual samples

get_counterfactual_latent(control, treatment, dose=None, strategy='ot', alpha=1, beta=1, projection_strategy='full', value='raw', method='emd', reg=0.01, reg_m=1.0)

Generate counterfactual latent embeddings for a pair of control and treatment.

Parameters:
  • control – a string representing the name of control group (should be a value in adata.obs[perturbation_key]).

  • treatment – a string representing the name of treatment group (should be a value in adata.obs[perturbation_key]).

  • dose – a float representing the dose level (should be a value in adata.obs[dose_key]). If None, all doses will be considered.

  • strategy – a string representing the strategy to generate counterfactual latent embeddings. Options: - “ot”: use optimal transport to estimate the latent shift - “average”: use average effect to estimate the latent shift

  • alpha – a float number representing the weight for zu in optimal transport. Default is 1.

  • beta – a float number representing the weight for zd in optimal transport. Default is 1.

  • projection_strategy – a string representing the projection strategy in optimal transport. Options: - “full”: use full latent embeddings for optimal transport - “zd_only”: transport only on dependent latent embeddings with independent latent embeddings fixed

  • value – a string representing the value type in optimal transport. Options: - “raw”: use raw latent embeddings for optimal transport - “quantile”: use quantile transformed latent embeddings for optimal transport

  • method – a string representing the optimal transport method. Options: - “emd”: Exact Optimal Transport - “sinkhorn”: Sinkhorn Regularized OT - “unbalanced_sinkhorn”: Unbalanced Sinkhorn Regularized OT Defaults to “emd”.

  • reg – a float number representing the entropic regularization parameter for Sinkhorn. Default is 0.01. Only useful when specifying method as “sinkhorn” or “unbalanced_sinkhorn”.

  • reg_m – a float number representing the marginal relaxation parameter (higher allows more mass deviation). Default is 1.0. Only useful when specifying method as “unbalanced_sinkhorn”.

Returns:

a tuple of numpy arrays representing the counterfactual latent embeddings (z,zd,zu)

get_latent()

Return the disentangled latent embeddings.

get_parameter_from_latent(latent, covariates=None, batch_size=32)

Generate generative model (ZINB) parameters from latent embeddings

Parameters:
  • latent – torch Tensor containing latent factors, (batch_size, latent_dimensions)

  • covariates – torch Tensor containing one-hot encoded covariates, (batch_size, covariate_dimensions). Default: None

Returns:

a tuple, ZINB paramters rho, dispersion, and pi in generative models

inference(n_samples=1, dataset=None, batch_size=None, update=False, returns=False)

Perform inference.

Parameters:
  • n_samples – an integer representing the number of samples repeated for inference process

  • dataset – a CombinedDataset object. By default we use the training dataset

  • batch_size – an integer representing the batch size

  • update – a boolean representing whether to update the adata

  • returns – a boolean representing whether to return the results

Returns:

a tuple of numpy arrays representing the latent variables (z_d,z_u,mu_d,mu_u,rho,dispersion,dropout_rate,library_size)

joint_effect_estimate(control, treatment1, treatment2, dose1=None, dose2=None, strategy='ot', alpha=1, beta=1, projection_strategy='full', value='raw', method='sinkhorn', reg=0.01, reg_m=1.0)

Estimate the joint effect for a pair of treatments compared to a control.

Parameters:
  • control – a string representing the name of control group (should be a value in adata.obs[perturbation_key]).

  • treatment1 – a string representing the name of first treatment group (should be a value in adata.obs[perturbation_key]).

  • treatment2 – a string representing the name of second treatment group (should be a value in adata.obs[perturbation_key]).

  • dose1 – a float representing the dose level for first treatment (should be a value in adata.obs[dose_key]). If None, all doses will be considered.

  • dose2 – a float representing the dose level for second treatment (should be a value in adata.obs[dose_key]). If None, all doses will be considered.

  • strategy – a string representing the counterfactual generation strategy. Options: - “ot”: Optimal Transport - “average”: Average Effect Addition

  • alpha – a float representing the weight for zu embeddings in OT calculation. Default is 1.

  • beta – a float representing the weight for zd embeddings in OT calculation. Default is 1.

  • projection_strategy – a string representing the projection strategy for OT calculation. Options: - “full”: use full embeddings for OT calculation - “zd_only”: project zd embeddings while keep zu embeddings unchanged

  • value – a string representing the value type for OT cost calculation. Options: - “raw”: use raw embeddings for OT cost calculation - “quantile”: use quantile normalized embeddings for OT cost calculation

  • method (str, optional) – Optimal transport method. Options: - “emd”: Exact Optimal Transport - “sinkhorn”: Sinkhorn Regularized OT - “unbalanced_sinkhorn”: Unbalanced Sinkhorn Regularized OT Defaults to “emd”.

  • reg (float, optional) – Entropic regularization parameter for Sinkhorn. Default is 0.1. Only useful when specifying method as “sinkhorn” or “unbalanced_sinkhorn”.

  • reg_m (float, optional) – Marginal relaxation parameter (higher allows more mass deviation). Default is 1.0. Only useful when specifying method as “unbalanced_sinkhorn”.

Returns:

a numpy array representing the joint effect

latent_adaption(control_zd, treatment_zd, control_zu, treatment_zu, control_celltype, treatment_celltype, strategy='ot', alpha=1, beta=1, projection_strategy='full', value='raw', method='emd', reg=0.01, reg_m=1.0)

Generate counterfactual latent embeddings for a pair of control and treatment.

Parameters:
  • control_zd – a numpy array representing the dependent latent embeddings of control group

  • treatment_zd – a numpy array representing the dependent latent embeddings of treatment group

  • control_zu – a numpy array representing the independent latent embeddings of control group

  • treatment_zu – a numpy array representing the independent latent embeddings of treatment group

  • control_celltype – a numpy array representing the one-hot encoded cell types of control group

  • treatment_celltype – a numpy array representing the one-hot encoded cell types of treatment group

  • strategy – a string representing the strategy to generate counterfactual latent embeddings. Options: - “ot”: use optimal transport to estimate the latent shift - “average”: use average effect to estimate the latent shift

  • alpha – a float number representing the weight for zu in optimal transport. Default is 1.

  • beta – a float number representing the weight for zd in optimal transport. Default is 1.

  • projection_strategy – a string representing the projection strategy in optimal transport. Options: - “full”: use full latent embeddings for optimal transport - “zd_only”: transport only on dependent latent embeddings with independent latent embeddings fixed

  • value – a string representing the value type in optimal transport. Options: - “raw”: use raw latent embeddings for optimal transport - “quantile”: use quantile transformed latent embeddings for optimal transport

  • method – a string representing the optimal transport method. Options: - “emd”: Exact Optimal Transport - “sinkhorn”: Sinkhorn Regularized OT - “unbalanced_sinkhorn”: Unbalanced Sinkhorn Regularized OT Defaults to “emd”.

  • reg – a float number representing the entropic regularization parameter for Sinkhorn. Default is 0.01. Only useful when specifying method as “sinkhorn” or “unbalanced_sinkhorn”.

  • reg_m – a float number representing the marginal relaxation parameter (higher allows more mass deviation). Default is 1.0. Only useful when specifying method as “unbalanced_sinkhorn”.

Returns:

a numpy array representing the counterfactual latent embeddings counterfactual_zd: a numpy array representing the counterfactual dependent latent embeddings counterfactual_zu: a numpy array representing the counterfactual independent latent embeddings

Return type:

counterfactual_z

load(path, device=None)

Load the trained model.

Parameters:
  • path – a string representing the path to the model checkpoint.

  • device – the device to load the model onto. Default is None, which uses self.device.

sample_posterior(n_samples=1, mu=None, theta=None, pi=None)

Sample from the posterior distribution.

Parameters:
  • n_samples – an integer representing the number of samples (Note: if mu is assigned a n*p matrix, then n_sample*n samples will generated in total!)

  • mu – a numpy array representing the mean of the negative binomial distribution.

  • theta – a numpy array representing the dispersion of the negative binomial distribution

  • pi – a numpy array representing the dropout rate

Returns:

a numpy array representing the samples

save(path)

Save the trained model.

Parameters:

path – a string representing the path to the model checkpoint.

setup(hidden_layers=[128, 128], latent_dependent=50, latent_independent=50, beta=1, sparse_coef=0, l0_latent=0.001, lambda_hsic=0.2, library_size_strategy='observed', device=None)

Setup the model for training.

Parameters:
  • hidden_layers – a list of integers representing the number of neurons in each hidden layer

  • latent_dependent – an integer representing the dimensions of z_D

  • latent_independent – an integer representing the dimensions of z_I

  • beta – a float number representing the weight of the KL divergence term

  • sparse_coef – a float number representing the weight of the sparsity regularization on the Jacobian matrix

  • l0_latent – a float number representing the weight of the L0 regularization on the dimensions of latent variables

  • lambda_hsic – a float number representing the weight of the HSIC regularization

  • library_size_strategy – a string representing the library size normalization strategy. Options (default: “observed”): - “batch_sample”: sample from batch empirical distribution - “observed”: use the observed library size - “original”: set the library size as 1

  • device – the device to run the model on. Default is None, which uses ‘cuda’ if available, else ‘cpu’.

train(epoch_num=200, batch_size=64, lr=1e-06, accumulation_steps=1, adaptlr=True, valid_prop=0, early_stopping=False, patience=10, tensorboard=False, savepath='./')

Train the model.

Parameters:
  • epoch_num – an integer representing the number of epochs

  • batch_size – an integer representing the batch size

  • lr – a float number representing the learning rate

  • accumulation_steps – an integer representing the number of steps for gradient accumulation

  • adaptlr – a boolean representing whether to use adaptive learning rate

  • valid_prop – a float number representing the proportion of the dataset to use for validation

  • early_stopping – a boolean representing whether to use early stopping

  • patience – an integer representing the number of epochs to wait for improvement before stopping

  • tensorboard – a boolean representing whether to use tensorboard

  • savepath – a string representing the path to save the tensorboard logs

scDRP.train module

class scDRP.train.EarlyStopping(patience=25, delta=0.0, verbose=False)

Bases: object

Early stops the training if validation loss doesn’t improve after a given patience.

__init__(patience=25, delta=0.0, verbose=False)

Initialize the EarlyStopping object.

Parameters:
  • patience – number of epochs with no improvement after which training will be stopped

  • delta – minimum change in the monitored quantity to qualify as an improvement

  • verbose – whether to print messages when stopping

save_checkpoint(model, path='best_model.pt')

Saves model when validation loss decrease.

scDRP.train.inference_model(device, infer_dataset, model, batch_size, count_data=False)

Do inference using the trained model

Parameters:
  • device – device to run the model

  • infer_dataset – inference dataset

  • model – trained model to inference

  • batch_size – batch size

  • count_data – whether the data is count data

Returns:

latent representation that depends on perturbation zu: latent representation that is independent from perturbation mu_d: mean of latent representation that depends on perturbation mu_u: mean of latent representation that is independent from perturbation rho: mean expression level dispersion: dispersion parameter (only for count data) pi: zero-inflation parameter (only for count data) library_size: size of the library (only for count data)

Return type:

zd

scDRP.train.train_model(device, writer, train_dataset, valid_dataset, model, epoch_num, batch_size, num_batch, lr=1e-06, accumulation_steps=1, adaptlr=True, count_data=False, early_stopping=True, patience=25)

Train the model

Parameters:
  • device – device to run the model

  • writer – tensorboard writer

  • train_dataset – training dataset

  • model – model to train

  • epoch_num – number of epochs

  • batch_size – batch size

  • num_batch – number of batches

  • lr – learning rate

  • accumulation_steps – number of steps to accumulate gradients

  • adaptlr – whether to adapt learning rate

  • count_data – whether the data is count data

  • early_stopping – whether to use early stopping

  • patience – patience for early stopping

scDRP.train.validate_model(device, validate_dataset, model, batch_size, count_data)

Validate the model

Parameters:
  • device – device to run the model

  • validate_dataset – validation dataset

  • model – model to validate

  • batch_size – batch size

  • count_data – whether the data is count data

scDRP.utils module

scDRP.utils.add_effect(control_zd, treatment_zd, control_label, treatment_label)

Computes counterfactual latent representations by adding the average treatment effect within each group. :param control_zd: Source matrix of shape (N, D_d). :type control_zd: np.ndarray :param treatment_zd: Target matrix of shape (M, D_d). :type treatment_zd: np.ndarray :param control_label: Group labels of shape (N,), indicating which group each row in X belongs to. :type control_label: np.ndarray :param treatment_label: Group labels of shape (M,), indicating which group each row in Y belongs to. :type treatment_label: np.ndarray

scDRP.utils.condition_quantile(X, y=None)

Calculate the quantile of each column in X conditioned on y.

Parameters:
  • X – a numpy array

  • y – a numpy array

Returns:

a numpy array representing the quantile of each column in X conditioned on y

scDRP.utils.conditional_OT_latent(control_zd, treatment_zd, control_zu, treatment_zu, control_label, treatment_label, alpha=1, beta=1, projection_strategy='full', value='raw', method='emd', reg=0.1, reg_m=1.0, eps=1e-08)

Computes a global optimal transport matching matrix (quantile matching of zd and value matching of zip), ensuring that matches occur only within groups.

Parameters:
  • control_zd (np.ndarray) – Source matrix of shape (N, D_d).

  • control_zu (np.ndarray) – Source matrix of shape (N, D_i).

  • treatment_zd (np.ndarray) – Target matrix of shape (M, D_d).

  • treatment_zu (np.ndarray) – Target matrix of shape (M, D_i).

  • control_label (np.ndarray) – Group labels of shape (N,), indicating which group each row in X belongs to.

  • treatment_label (np.ndarray) – Group labels of shape (M,), indicating which group each row in Y belongs to.

  • alpha (float, optional) – Weight for the distance in zd space. Default is 1.

  • beta (float, optional) – Weight for the distance in zu space. Default is 1

  • projection_strategy (str, optional) – Strategy for constructing counterfactual latent representations. Options: - “full”: Use both zd and zu for counterfactual construction. - “zd_only”: Use only zd for counterfactual construction, keeping zu unchanged. Defaults to “full”.

  • method (str, optional) – Optimal transport method. Options: - “emd”: Exact Optimal Transport - “sinkhorn”: Sinkhorn Regularized OT - “unbalanced_sinkhorn”: Unbalanced Sinkhorn Regularized OT Defaults to “emd”.

  • reg (float, optional) – Entropic regularization parameter for Sinkhorn. Default is 0.1. Only useful when specifying method as “sinkhorn” or “unbalanced_sinkhorn”.

  • reg_m (float, optional) – Marginal relaxation parameter (higher allows more mass deviation). Default is 1.0. Only useful when specifying method as “unbalanced_sinkhorn”.

Returns:

A global matching matrix (N, M), where matches are restricted within the same group.

Return type:

np.ndarray

scDRP.utils.to_dense_array(x)

Transform a potential sparse array to numpy array :param x: input array, can be sparse or numpy array

Returns:

a numpy array

Module contents

A package to learn disentangled representations and estimate individual treatment effects in single-cell perturbation data.