Training a VAE on Gene Expression (GEX) data

This tutorial shows how to train a VAE on GEX data. We use the scAtlasVAE package to integrate multi-batch GEX data, with the human huARdb v2 reference dataset as an example.

Load the reference dataset

import tcr_deep_insight as tdi
import torch

gex_reference_adata = tdi.data.human_gex_reference_v2()
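The model in the next step is configured with batch_key=['study_name','sample_name'], so it can help to confirm that both columns exist in gex_reference_adata.obs before training. The helper below is a minimal sketch: find_missing_keys is a hypothetical function (not part of tcr_deep_insight or scAtlasVAE), and the column list is an illustrative stand-in for list(gex_reference_adata.obs.columns).

```python
def find_missing_keys(columns, batch_key):
    """Return the entries of batch_key that are not among columns."""
    # Accept a single key as well as a list of keys.
    keys = [batch_key] if isinstance(batch_key, str) else list(batch_key)
    available = set(columns)
    return [key for key in keys if key not in available]


# Stand-in for list(gex_reference_adata.obs.columns); values are illustrative.
example_columns = ["study_name", "sample_name", "cell_type"]

missing = find_missing_keys(example_columns, ["study_name", "sample_name"])
print(missing)  # an empty list means every batch key is present
```

With the real dataset, the same call would take gex_reference_adata.obs.columns as its first argument. A non-empty result names the columns to add or rename before constructing the model.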

Construct and train the GEX model

# GEXModelingVAE is an alias of scatlasvae.model.scAtlasVAE

model = tdi.model.GEXModelingVAE(
  gex_reference_adata,
  batch_key=['study_name','sample_name'],
  n_latent=10,
  batch_hidden_dim=24
)

model.fit()

Extract the GEX embedding and save the trained model

gex_reference_adata.obsm['X_gex'] = model.get_latent_representation()

torch.save(
  model.state_dict(),
  "/PATH/TO/tcr_deep_insight/data/pretrained_weights/human_scatlasvae_gex_v2.ckpt"
)
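A checkpoint written with torch.save(model.state_dict(), ...) stores only the weights, so restoring it requires a model built with the same architecture first. The sketch below shows that round trip on a small stand-in module (TinyModel is hypothetical and only for illustration). It assumes the GEX model behaves like a standard torch.nn.Module, which the state_dict() call above suggests but this tutorial does not confirm.

```python
import os
import tempfile

import torch
from torch import nn


class TinyModel(nn.Module):
    """Hypothetical stand-in for a trained model; not scAtlasVAE."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(4, 2)

    def forward(self, x):
        return self.encoder(x)


trained = TinyModel()

with tempfile.TemporaryDirectory() as tmp_dir:
    ckpt_path = os.path.join(tmp_dir, "toy.ckpt")

    # Save only the parameters, as in the tutorial.
    torch.save(trained.state_dict(), ckpt_path)

    # Rebuild the same architecture, then load the saved parameters into it.
    restored = TinyModel()
    state = torch.load(ckpt_path, map_location="cpu")
    restored.load_state_dict(state)

restored.eval()

# Both models should now produce identical outputs for the same input.
x = torch.ones(1, 4)
same_output = torch.allclose(trained(x), restored(x))
print(same_output)
```

For the GEX model, the analogous steps would presumably be to construct tdi.model.GEXModelingVAE with the same arguments used for training and then load the checkpoint into it.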

For more details on scAtlasVAE, please refer to the scAtlasVAE documentation.

Note

The trained model is available on Zenodo.

Clustering and UMAP visualization

Downstream analysis can be performed with the standard scanpy workflow.

import scanpy as sc

sc.pp.neighbors(gex_reference_adata, use_rep='X_gex', n_neighbors=15)
sc.tl.umap(gex_reference_adata)
sc.tl.leiden(gex_reference_adata)

sc.pl.umap(gex_reference_adata, color='leiden')
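Passing use_rep='X_gex' makes the neighbor graph depend on distances in the VAE latent space rather than on the expression matrix. As a rough illustration of that idea only, the sketch below runs a brute-force Euclidean nearest-neighbor search on a toy embedding. knn_indices is a hypothetical helper, and it does not reproduce scanpy's actual neighbor search or its edge weighting.

```python
import math


def knn_indices(embedding, n_neighbors):
    """Return, for each row, the indices of its nearest rows (Euclidean)."""
    result = []
    for i, point in enumerate(embedding):
        # Distance to every other row; the row itself is excluded.
        distances = [
            (math.dist(point, other), j)
            for j, other in enumerate(embedding)
            if j != i
        ]
        distances.sort()
        result.append([j for _, j in distances[:n_neighbors]])
    return result


# Toy 2-D "latent" coordinates forming two well-separated pairs of cells.
toy_embedding = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]

neighbors = knn_indices(toy_embedding, n_neighbors=1)
print(neighbors)  # [[1], [0], [3], [2]]
```

Cells that sit close together in the latent space become neighbors in the graph, which is why the quality of the Leiden clusters and the UMAP layout depends on how well the embedding removes batch effects.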