API
- class tcr_deep_insight.TCR(cdr3a: str | None = None, cdr3b: str | None = None, trav: str | None = None, trbv: str | None = None, traj: str | None = None, trbj: str | None = None, individual: str | None = None, species: Literal['human', 'mouse'] = 'human')[source]
TCR class to store TCR information and provide utility functions.
- Parameters:
cdr3a – CDR3 alpha sequence
cdr3b – CDR3 beta sequence
trav – TRAV gene
trbv – TRBV gene
traj – TRAJ gene
trbj – TRBJ gene
individual – Individual identifier
species – Species identifier
- property cdr3a
- property cdr3b
- property individual
- property species
- property traj
- property trav
- property trbj
- property trbv
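As a rough illustration of the constructor arguments listed above, the sketch below uses a hypothetical stand-in dataclass, `TCRRecord`, with the same keyword names and defaults as the documented signature. It is not the library class, and the sequences and gene names are illustrative values only.

```python
from dataclasses import dataclass
from typing import Literal, Optional


@dataclass(frozen=True)
class TCRRecord:
    """Hypothetical stand-in mirroring the documented TCR constructor arguments."""

    cdr3a: Optional[str] = None
    cdr3b: Optional[str] = None
    trav: Optional[str] = None
    trbv: Optional[str] = None
    traj: Optional[str] = None
    trbj: Optional[str] = None
    individual: Optional[str] = None
    species: Literal["human", "mouse"] = "human"


# Illustrative values; every argument is optional and species defaults to "human".
record = TCRRecord(
    cdr3a="CAVRDSNYQLIW",
    cdr3b="CASSIRSSYEQYF",
    trav="TRAV1-2",
    trbv="TRBV19",
    traj="TRAJ33",
    trbj="TRBJ2-7",
    individual="donor_1",
)

# With the package installed, the same keywords would be passed to
# tcr_deep_insight.TCR, whose read-only properties use the same names.
print(record.cdr3b, record.species)
```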
Preprocessing
- tcr_deep_insight.pp.update_anndata(gex_adata: AnnData, gex_embedding_key: str = 'X_gex', tcr_embedding_key: str = 'X_tcr', joint_embedding_key: str = 'X_gex_tcr') None[source]
Update the adata with the embedding keys
Note
TCR information should be included in gex_adata.obs. This method modifies gex_adata in place and adds the following columns to .obs: tcr, CDR3a, CDR3b, TRAV, TRAJ, TRBV, TRBJ.
- Parameters:
gex_adata – AnnData object
gex_embedding_key – embedding key for gex
tcr_embedding_key – embedding key for tcr
joint_embedding_key – embedding key for joint
- tcr_deep_insight.pp.unique_tcr_by_individual(gex_adata: ~anndata._core.anndata.AnnData, embedding_key: str | ~typing.Iterable[str] = 'X_gex', label_key: str | None = None, additional_label_keys: ~typing.Iterable[str] = None, aggregate_func: ~typing.Callable = <function majority_vote>) AnnData[source]
Deduplicate TCRs by individual and aggregate the GEX embedding per TCR. A unique TCR is defined by the combination of TRAV, TRAJ, TRBV, TRBJ, CDR3α, CDR3β and individual. The aggregated GEX embedding is added to tcr_adata.obsm[gex_embedding_key].
Note
“individual” should be in gex_adata.obs.columns.
- Parameters:
gex_adata – AnnData object of gene expression data
embedding_key – Key(s) in adata.obsm where GEX embedding is stored. Default: ‘X_gex’
label_key – Key in adata.obs where TCR type labels are stored; it must be present in adata.obs.columns. Default: None
additional_label_keys – Additional keys in adata.obs where TCR type labels are stored. Default: None
aggregate_func – Function to aggregate labels. Default: majority_vote
- Returns:
TCR adata
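The deduplication described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the dictionary-based cell records, the element-wise mean used to aggregate embeddings, and this `majority_vote` helper are all assumptions made for the example.

```python
from collections import Counter, defaultdict
from statistics import fmean

# Fields that together define a unique TCR, as stated in the description above.
KEY_FIELDS = ("TRAV", "TRAJ", "TRBV", "TRBJ", "CDR3a", "CDR3b", "individual")


def majority_vote(labels):
    """Return the most frequent label (illustrative, not the library's function)."""
    return Counter(labels).most_common(1)[0][0]


def unique_tcrs(cells, label_key="cell_type"):
    """Group cells by the unique-TCR key, average embeddings, and vote on labels."""
    groups = defaultdict(list)
    for cell in cells:
        groups[tuple(cell[f] for f in KEY_FIELDS)].append(cell)

    rows = []
    for key, members in groups.items():
        # Assumption: embeddings are aggregated by an element-wise mean.
        embedding = [fmean(dim) for dim in zip(*(m["X_gex"] for m in members))]
        rows.append(
            {
                **dict(zip(KEY_FIELDS, key)),
                "X_gex": embedding,
                label_key: majority_vote([m[label_key] for m in members]),
                "n_cells": len(members),
            }
        )
    return rows


base = {
    "TRAV": "TRAV1-2", "TRAJ": "TRAJ33", "TRBV": "TRBV19",
    "TRBJ": "TRBJ2-7", "CDR3a": "CAVRDSNYQLIW", "CDR3b": "CASSIRSSYEQYF",
}
cells = [
    {**base, "individual": "donor_1", "X_gex": [0.0, 2.0], "cell_type": "CD8 Tem"},
    {**base, "individual": "donor_1", "X_gex": [2.0, 4.0], "cell_type": "CD8 Tem"},
    {**base, "individual": "donor_1", "X_gex": [4.0, 0.0], "cell_type": "CD8 Tn"},
    {**base, "individual": "donor_2", "X_gex": [1.0, 1.0], "cell_type": "CD8 Tn"},
]
tcr_rows = unique_tcrs(cells)
```

The same receptor seen in two individuals stays as two rows, because the individual is part of the key.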
Data
- tcr_deep_insight.data.human_gex_reference_v2()[source]
Load the human gex reference v2. If the dataset is not found, it will be downloaded from Zenodo.
Tool
- tcr_deep_insight.tl.add_tcr_pseudosequence_to_dataframe(df: DataFrame, species: bool = 'human') None[source]
Add TCR pseudosequence to dataframe
- Parameters:
df – dataframe
- Note:
- This method modifies the df inplace and adds the following columns:
tcr_pseudosequence
- tcr_deep_insight.tl.add_pmhc_pseudosequence_to_dataframe(df: DataFrame) None[source]
Add PMHC pseudosequence to dataframe
- Parameters:
df – dataframe
- Note:
- This method modifies the df inplace and adds the following columns:
hla_pseudosequence
pmhc_pseudosequence
- tcr_deep_insight.tl.tcr_adata_to_datasets(adata: AnnData, tokenizer: TRabTokenizerForVJCDR3 | Tokenizer | None = None, show_progress: bool = False) Dataset[source]
Convert adata to tcr datasets
- Parameters:
adata – AnnData
tokenizer – tokenizer
- Returns:
tcr datasets
- tcr_deep_insight.tl.tcr_dataframe_to_datasets(df: DataFrame, tokenizer: TRabTokenizerForVJCDR3 | Tokenizer, show_progress: bool = False) Dataset[source]
Convert dataframe to tcr datasets
- Parameters:
df – dataframe
tokenizer – tokenizer
- Returns:
tcr datasets
- tcr_deep_insight.tl.to_embedding_tcr_only(model: TRabModelingBertForVJCDR3 | TRabModelingBertForPseudoSequence, tcr_dataset: Dataset, k: str = 'hidden_states', device: str = 'cuda', n_per_batch: int = 64, show_progress: bool = False) ndarray[source]
Get embedding from model
- Parameters:
model – nn.Module. The TCR model.
tcr_dataset – datasets.arrow_dataset.Dataset. evaluation datasets.
k – str. ‘hidden_states’ or ‘last_hidden_state’.
device – str. ‘cuda’ or ‘cpu’. If ‘cuda’, use GPU. If ‘cpu’, use CPU.
n_per_batch – int. Number of samples per batch.
show_progress – bool. If True, show progress bar.
- Returns:
embedding
- tcr_deep_insight.tl.to_embedding_tcr_only_from_pandas(model: TRabModelingBertForVJCDR3 | TRabModelingBertForPseudoSequence, df: DataFrame, tokenizer: TRabTokenizerForVJCDR3 | Tokenizer, device: str, n_per_batch: int = 64, pooling: TCR_BERT_POOLING = 'mean')[source]
- tcr_deep_insight.tl.get_pretrained_tcr_embedding(tcr_adata: AnnData, bert_config: Mapping[str, Any], checkpoint_path: str | PathLike, encoding: TCR_BERT_ENCODING = 'vjcdr3', pooling: TCR_BERT_POOLING = 'mean', species: Literal['human', 'mouse'] = 'human', pca_path: str | None = None, use_pca: bool = True, use_kernel_pca: bool = False, use_faiss_pca: bool = True, pca_n_components: int = 50, device: str = 'cuda:0', n_per_batch: int = 256)[source]
Get TCR embedding from pretrained BERT model
Note
- This method modifies the tcr_adata inplace. It adds the following fields to tcr_adata.obsm:
X_tcr: TCR embedding
X_tcr_pca: PCA of TCR embedding
- Parameters:
tcr_adata – AnnData object containing TCR data
bert_config – BERT config
checkpoint_path – Path to pretrained BERT model.
encoding – Encoding type of tcr sequence.
pooling – Pooling method for tcr representation.
species – Species.
pca_path – Path to PCA model, if previously saved.
use_pca – Whether to use PCA. Default: True
use_kernel_pca – Whether to use Kernel PCA. High memory required for large dataset.
use_faiss_pca – Whether to use Faiss PCA instead of scikit-learn.
pca_n_components – Number of PCA components.
device – Device for Faiss PCA computation.
n_per_batch – Number of samples per batch in getting TCR embeddings.
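A possible way to wrap this call is sketched below, using only the keyword arguments documented above. The wrapper name `embed_tcrs` is hypothetical, and `bert_config` and `checkpoint_path` are expected to be supplied by the caller. The import is deferred so the snippet can be read and parsed without the package installed.

```python
def embed_tcrs(tcr_adata, bert_config, checkpoint_path):
    """Run the documented embedding step and return the two arrays it stores."""
    # Deferred import: only needed when the function is actually called.
    import tcr_deep_insight as tdi

    tdi.tl.get_pretrained_tcr_embedding(
        tcr_adata,
        bert_config=bert_config,
        checkpoint_path=checkpoint_path,
        encoding="vjcdr3",
        pooling="mean",
        species="human",
        use_pca=True,
        pca_n_components=50,
        n_per_batch=256,
    )
    # The call works in place; results are read back from .obsm.
    return tcr_adata.obsm["X_tcr"], tcr_adata.obsm["X_tcr_pca"]
```

Because the function modifies tcr_adata in place, the return value here is only a convenience.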
Clustering
- tcr_deep_insight.tl.cluster_tcr(tcr_adata: ~anndata._core.anndata.AnnData, label_key: str | None = None, include_hla_keys: ~typing.Iterable[str] | None = None, use_gpu: bool = False, gpu: int = 0, pure_label: bool = True, pure_criteria: ~typing.Callable = <function default_pure_criteria>, same_trav: bool = False, same_trbv: bool = False, same_cdr3a_length: bool = False, same_cdr3b_length: bool = False, layer_norm: bool = True, max_distance: float = 4.0, max_cluster_size: int = 40, use_gex: bool = True, filter_intersection_fraction: float = 0.7, nk: int = -1, calculate_perm_test: bool = True, n_jobs: int = 1, species: ~tcr_deep_insight.utils._definitions.SPECIES = 'human', faiss_index_backend: ~tcr_deep_insight.tool._constants.FAISS_INDEX_BACKEND = 'kmeans') TDIResult[source]
Cluster TCRs by joint TCR-GEX embedding. All TCRs will be used as cluster anchors.
- Parameters:
tcr_adata – AnnData object containing TCR data
label_key – Key of the label to cluster. Should be in tcr_adata.obs.columns
use_gpu – Whether to use GPU.
gpu – GPU ID if use_gpu.
pure_label – Whether to constrain all TCRs in a cluster to have the same label.
pure_criteria – Pure criteria. A function that takes two arguments, a list of labels and a label, and checks whether the list satisfies the criteria. If pure_label is False, this function is used to determine whether a cluster is pure or not.
same_trav – Whether to constrain all TCRs in a cluster to have the same TRAV gene.
same_trbv – Whether to constrain all TCRs in a cluster to have the same TRBV gene.
same_cdr3a_length – Whether to constrain all TCRs in a cluster to have the same CDR3a length.
same_cdr3b_length – Whether to constrain all TCRs in a cluster to have the same CDR3b length.
layer_norm – Whether to use LayerNorm on tcr embedding. Default: True
max_distance – Maximum TrGx distance. Default: 4.
max_cluster_size – Maximum cluster size for dTCR clusters.
use_gex – Whether to use GEX embedding for clustering.
filter_intersection_fraction – Filter intersection fraction in pruning clusters that contain overlapping TCRs.
nk – Number of nearest neighbors for background comparison. Default: -1, which means background neighbors equal to cluster size
calculate_perm_test – Whether to calculate p-values from permutation test. Default: True
species – Species name.
n_jobs – Number of threads for parallel processing.
faiss_index_backend – Faiss index backend. Default: FAISS_INDEX_BACKEND.KMEANS
- Returns:
TDIResult containing clustered TCRs.
Note
The gpu parameter indicates GPU to use for clustering. If gpu is 0, CPU is used.
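The pure_criteria parameter accepts any callable with the two-argument shape described above. The sketch below shows one hypothetical custom criterion, `mostly_pure`, which treats a cluster as pure when a chosen fraction of its labels match. The 0.8 threshold is an assumption for illustration, and the behaviour of default_pure_criteria is not documented here.

```python
from collections import Counter
from typing import Sequence


def mostly_pure(labels: Sequence[str], label: str, min_fraction: float = 0.8) -> bool:
    """Hypothetical criterion: True when at least min_fraction of labels equal label."""
    if not labels:
        return False
    return Counter(labels)[label] / len(labels) >= min_fraction


# Nine of ten TCRs share the label, so the cluster passes.
print(mostly_pure(["COVID-19"] * 9 + ["healthy"], "COVID-19"))
# Three of five is below the threshold, so it does not.
print(mostly_pure(["COVID-19"] * 3 + ["healthy"] * 2, "COVID-19"))
```

Following the parameter description, such a function would be passed as pure_criteria=mostly_pure together with pure_label=False.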
- tcr_deep_insight.tl.cluster_tcr_from_reference(tcr_adata: ~anndata._core.anndata.AnnData, tcr_reference_adata: ~anndata._core.anndata.AnnData, label_key: str | None = None, include_hla_keys: ~typing.Iterable[str] | None = None, use_gpu: bool = False, gpu: int = 0, layer_norm: bool = True, pure_label: bool = True, pure_criteria: ~typing.Callable = <function default_pure_criteria>, same_trav: bool = False, same_trbv: bool = False, same_cdr3a_length: bool = False, same_cdr3b_length: bool = False, max_distance: float = 3.0, max_cluster_size: int = 40, use_gex: bool = True, filter_intersection_fraction: float = 0.7, nk: int = -1, calculate_perm_test: bool = True, n_jobs: int = 1, species: ~tcr_deep_insight.utils._definitions.SPECIES = 'human', faiss_index_backend: ~tcr_deep_insight.tool._constants.FAISS_INDEX_BACKEND = 'kmeans') TDIResult[source]
Cluster TCRs from reference. Only TCRs in the query dataset will be used as cluster anchors.
- Parameters:
tcr_adata – AnnData object containing TCR data
tcr_reference_adata – AnnData object containing reference TCR data
label_key – Key of the label to cluster. Should be in tcr_adata.obs
use_gpu – Whether to use GPU.
gpu – GPU ID if use_gpu.
pure_label – Whether to constrain all TCRs in a cluster to have the same label.
pure_criteria – Pure criteria. A function that takes two arguments, a list of labels and a label, and checks whether the list satisfies the criteria. If pure_label is False, this function is used to determine whether a cluster is pure or not.
same_trav – Whether to constrain all TCRs in a cluster to have the same TRAV gene.
same_trbv – Whether to constrain all TCRs in a cluster to have the same TRBV gene.
same_cdr3a_length – Whether to constrain all TCRs in a cluster to have the same CDR3a length.
same_cdr3b_length – Whether to constrain all TCRs in a cluster to have the same CDR3b length.
layer_norm – Whether to use LayerNorm on tcr embedding. Default: True
max_distance – Maximum TrGx distance. Default: 3.
max_cluster_size – Maximum cluster size for dTCR clusters.
use_gex – Whether to use GEX embedding for clustering.
filter_intersection_fraction – Filter intersection fraction in pruning clusters that contain overlapping TCRs.
nk – Number of nearest neighbors for background comparison. Default: -1, which means background neighbors equal to cluster size
calculate_perm_test – Whether to calculate the Morisita-Horn permutation test for TCR clusters.
species – Species name.
n_jobs – Number of threads for parallel processing. Default: 1
faiss_index_backend – Faiss index backend. Default: FAISS_INDEX_BACKEND.KMEANS
- Returns:
TDIResult object containing clustered TCRs.
Note
The gpu parameter indicates GPU to use for clustering. If gpu is 0, CPU is used.
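The idea of using only query TCRs as anchors can be illustrated with a simplified, pure-Python sketch. It assumes plain Euclidean distance on small embedding vectors and draws neighbours from both query and reference. The library's Faiss-based search, its TrGx distance, and its pruning steps are not reproduced, and the function name is hypothetical.

```python
from math import dist


def clusters_from_query_anchors(query, reference, max_distance=3.0, max_cluster_size=40):
    """For each query TCR, collect the nearest TCRs within max_distance.

    query and reference map TCR identifiers to embedding vectors.
    """
    pool = {**reference, **query}
    clusters = {}
    for anchor_id, anchor_vec in query.items():
        # Sort all other TCRs by distance to the anchor.
        neighbors = sorted(
            (dist(anchor_vec, vec), tcr_id)
            for tcr_id, vec in pool.items()
            if tcr_id != anchor_id
        )
        members = [tcr_id for d, tcr_id in neighbors if d <= max_distance]
        # The anchor counts toward the size cap.
        clusters[anchor_id] = [anchor_id] + members[: max_cluster_size - 1]
    return clusters


query = {"q1": [0.0, 0.0], "q2": [10.0, 0.0]}
reference = {"r1": [1.0, 0.0], "r2": [0.0, 2.0], "r3": [10.0, 5.0]}
clusters = clusters_from_query_anchors(query, reference)
```

Reference TCRs never start a cluster in this sketch; they only appear as members around a query anchor.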
- tcr_deep_insight.tl.inject_labels_for_tcr_cluster_adata(reference_data: ~anndata._core.anndata.AnnData | ~pandas.core.frame.DataFrame, tcr_cluster_adata: ~anndata._core.anndata.AnnData, label_key: str, map_function: ~typing.Callable = <function majority_vote>)[source]
Inject labels for tcr_cluster_adata based on reference_data
- Parameters:
reference_data – sc.AnnData or pandas.DataFrame. Reference object containing labels
tcr_cluster_adata – sc.AnnData
label_key – str. Key of the label to use for clustering in the reference data
map_function – Callable. Default: majority_vote, a function that returns the most frequent label.
- Note:
- This method modifies tcr_cluster_adata in place and adds the following columns:
label_key: The most frequent label in the cluster
- class tcr_deep_insight.tl.TDIResult(_data: AnnData, _tcr_df: DataFrame | None = None, _tcr_adata: AnnData | None = None, _gex_adata: AnnData | None = None, _cluster_label: str | None = None, faiss_index: IndexFlatL2 | None = None, low_memory: bool = False)[source]
Bases:
object
A class to store the TDI clustering result
- Parameters:
_data – the clustering result
_tcr_df – the tcr dataframe
_tcr_adata – the tcr anndata
_gex_adata – the gex anndata
_cluster_label – the cluster label
faiss_index – the faiss index
low_memory – whether to use low memory mode
- property D
- property I
- property cluster_label
- property data
- get_tcrs_by_cluster_index(cluster_index: int, _n_after: int = 0) List[str][source]
Get the tcrs for a specific cluster
- Parameters:
cluster_index – the cluster index
_n_after – the number of tcrs to skip
- get_tcrs_for_cluster(label: Mapping[str, str] | Mapping[str, List[str]] | None = None, rank: int = 0, rank_by: Literal['convergence', 'disease_association'] = 'convergence', min_unique_tcr_number: int = 4, min_individual_number: int = 2, min_cell_number: int = 10, min_tcr_convergence_score: float | None = None, min_disease_association_score: float | None = None, return_background_tcrs: bool = False, additional_label_key_values: Dict[str, List[str]] | None = None)[source]
Get the tcrs for a specific cluster
- Parameters:
label – the cluster label
rank – the rank of the tcrs to return
rank_by – the metric to rank the tcrs
min_unique_tcr_number – the minimum number of unique tcrs in the cluster
min_individual_number – the minimum number of individuals in the cluster
min_cell_number – the minimum number of cells in the cluster
min_tcr_convergence_score – the minimum convergence score
min_disease_association_score – the minimum disease association score
return_background_tcrs – whether to return other tcrs in the cluster
additional_label_key_values – additional label key values to filter the cluster
- Returns:
a dictionary containing the tcrs and their metadata
- property gex_adata
- classmethod load_from_disk(save_path: str, tcr_data_path: str | None = None, gex_adata_path: str | None = None)[source]
Load the cluster result from disk
- Parameters:
save_path – the path to load the cluster result
tcr_data_path – the path to load the tcr data
gex_adata_path – the path to load the gex data
- save_to_disk(save_path, save_cluster_result_as_csv=True, save_tcr_data=True, save_gex_data=True)[source]
Save the cluster result to disk
- Parameters:
save_path – the path to save the cluster result
save_cluster_result_as_csv – whether to save the cluster result as csv files
save_tcr_data – whether to save the tcr data
save_gex_data – whether to save the gex data
- property tcr_adata
- property tcr_df
- to_pandas_dataframe_tcr(rank_by: Literal['convergence', 'disease_association', 'individual', 'unique_tcr'] = 'convergence', return_background_tcrs: bool = False)[source]
Convert the cluster result to a pandas dataframe.
- Parameters:
rank_by – the metric to rank the tcrs
return_background_tcrs – whether to return background tcrs for each cluster
- Returns:
a pandas dataframe containing the cluster result
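A typical round trip with a TDIResult might look like the sketch below, which uses only the methods and keywords documented above. The helper names `export_clusters` and `reload_clusters` are hypothetical, and `result` is assumed to be the object returned by one of the clustering functions.

```python
def export_clusters(result, save_path):
    """Tabulate clusters ranked by convergence, then persist the result."""
    table = result.to_pandas_dataframe_tcr(
        rank_by="convergence",
        return_background_tcrs=False,
    )
    # Uses the documented defaults for what gets written to disk.
    result.save_to_disk(save_path)
    return table


def reload_clusters(save_path):
    """Reload a previously saved result from the same path."""
    # Deferred import so the sketch can be parsed without the package installed.
    import tcr_deep_insight as tdi

    return tdi.tl.TDIResult.load_from_disk(save_path)
```

load_from_disk is a classmethod, so it is called on the class, not on an existing result.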
Clustering options
Models
- tcr_deep_insight.model.modeling_bert.get_human_config(bert_type: Literal['tiny', 'small', 'base'] = 'small', vocab_size: int | None = None, alibi_starting_size: int = 512) BertConfig[source]
Get the configuration for the human TCR BERT
- Parameters:
bert_type – The size of the BERT model. Must be one of ‘tiny’, ‘small’, or ‘base’
vocab_size – The size of the vocabulary
alibi_starting_size – The size of the input sequence
- Returns:
The configuration for the human TCR BERT
- class tcr_deep_insight.model.TRabModelingBertForPseudoSequence(bert_config: BertConfig, pooling: Callable | Literal['cls', 'mean', 'max', 'cdr3a', 'cdr3b', 'weighted'] = 'mean', pooling_weight: Tensor | None = None, labels_number: int = 1, use_triton: bool = False, device='cuda')[source]
Bases:
Module
- class tcr_deep_insight.model.TRabModelingBertForVJCDR3(bert_config: BertConfig, pooling: Literal['cls', 'mean', 'max', 'pool', 'trb', 'tra', 'weighted'] = 'mean', pooling_cls_position: int = 0, pooling_weight=(0.1, 0.9), labels_number: int = 1, device='cuda')[source]
Bases:
ModuleBase
- tcr_deep_insight.model.GEXModelingVAE
alias of
scAtlasVAE
Model options
Training Utilities
- class tcr_deep_insight.model.TRabModelingBertForVJCDR3Trainer(model: TRabModelingBertForVJCDR3, collator: TRabCollatorForVJCDR3, train_dataset: Dataset | None = None, test_dataset: Dataset | None = None, optimizers: Tuple[Optimizer, LambdaLR] = (None, None), loss_weight: Mapping[str, float] = {'reconstruction_loss': 1}, device: str = 'cuda')[source]
Bases:
TrainerBase
- class tcr_deep_insight.model.TRabTokenizerForVJCDR3(*, tra_max_length: int, trb_max_length: int, pad_token: str | None = None, unk_token: str | None = None, mask_token: str | None = None, cls_token: str | None = None, sep_token: str | None = None, species: Literal['human', 'mouse'] = 'human', **kwargs)[source]
Bases:
AminoAcidTokenizer
Tokenizer for TRA and TRB sequences. Encodes V and J genes into tokens, and CDR3 into amino acids.
- convert_tokens_to_ids(sequence: List[Tuple[str]] | Tuple[str], alpha_vj: List[Tuple[str]] | Tuple[str] | None = None, beta_vj: List[Tuple[str]] | Tuple[str] | None = None)[source]
- to_dataset(df: DataFrame | None = None, ids: Iterable[str] | None = None, alpha_chains: Iterable[str] | None = None, beta_chains: Iterable[str] | None = None, alpha_v_genes: Iterable[str] | None = None, alpha_j_genes: Iterable[str] | None = None, beta_v_genes: Iterable[str] | None = None, beta_j_genes: Iterable[str] | None = None, pairing: Iterable[int] | None = None, split: bool = False)[source]
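The general idea of encoding V and J genes as single tokens and the CDR3 as one token per amino acid can be shown with a toy encoder. This is not the tokenizer above: the vocabulary, the special tokens, the token order, and the helper names `build_vocab` and `encode_chain` are all assumptions made for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_vocab(genes):
    """Toy vocabulary: special tokens, then gene names, then amino acids."""
    tokens = ["[PAD]", "[UNK]"] + sorted(genes) + list(AMINO_ACIDS)
    return {tok: i for i, tok in enumerate(tokens)}


def encode_chain(v_gene, j_gene, cdr3, vocab, max_length):
    """Encode one chain as [V token, J token, CDR3 residues], padded to max_length."""
    unk = vocab["[UNK]"]
    ids = [vocab.get(v_gene, unk), vocab.get(j_gene, unk)]
    ids += [vocab.get(aa, unk) for aa in cdr3]
    ids = ids[:max_length]  # truncate overly long chains
    return ids + [vocab["[PAD]"]] * (max_length - len(ids))


vocab = build_vocab({"TRBV19", "TRBJ2-7"})
beta_ids = encode_chain("TRBV19", "TRBJ2-7", "CASSF", vocab, max_length=8)
```

In this sketch the fixed max_length plays a role similar to tra_max_length and trb_max_length in the documented constructor.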
Plotting
- tcr_deep_insight.pl.set_plotting_params(dpi: int, fontsize: int = 12, fontfamily: str = 'Arial', linewidth: float = 0.5)[source]
Set default plotting parameters for matplotlib.
- Parameters:
dpi – dpi for saving figures
fontsize – default fontsize
fontfamily – default fontfamily
linewidth – default linewidth
- tcr_deep_insight.pl.create_fig(figsize=(8, 4)) Tuple[Figure, Axes][source]
Create a figure with a single axis.
- Parameters:
figsize – figure size. Default: (8, 4)
- Returns:
matplotlib.figure.Figure, matplotlib.axes.Axes
- tcr_deep_insight.pl.create_subplots(nrow, ncol, figsize=(8, 8), gridspec_kw={}) Tuple[Figure, ndarray][source]
Create a figure with multiple axes.
- Parameters:
nrow – number of rows
ncol – number of columns
figsize – figure size. Default: (8, 8)
gridspec_kw – gridspec_kw. Default: {}
- Returns:
matplotlib.figure.Figure, numpy.ndarray of matplotlib.axes.Axes
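The three figure helpers above could be combined as in the following sketch. The wrapper name `new_panels` and the chosen dpi are illustrative. The import is deferred so the snippet parses without the package installed.

```python
def new_panels(nrow=1, ncol=2, dpi=150, figsize=(8, 4)):
    """Apply plotting defaults, then create a grid of axes."""
    # Deferred import: only needed when the function is actually called.
    import tcr_deep_insight as tdi

    tdi.pl.set_plotting_params(dpi=dpi, fontsize=12, fontfamily="Arial", linewidth=0.5)
    fig, axes = tdi.pl.create_subplots(nrow, ncol, figsize=figsize)
    return fig, axes
```

For a single axis, tcr_deep_insight.pl.create_fig(figsize=(8, 4)) returns a figure and one Axes object, per the signature above.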
- tcr_deep_insight.pl.plot_cdr3_sequence(sequences: List[str], alignment: bool = False, labels: List[str] | None = None, labels_palette: Mapping[str, Any] | None = None, labels_postfix: str | None = None, labels_postfix_palette: Mapping[str, Any] | None = None, ax: Axes | None = None) Tuple[Figure, Axes][source]
Plot CDR3 sequences.
- Parameters:
sequences – a list of CDR3 sequences
alignment – whether to align the sequences. Default: False
labels – a list of labels. Default: None
labels_palette – a dictionary of labels and colors. Default: None
ax – matplotlib.axes.Axes. Default: None
- Returns:
matplotlib.figure.Figure, matplotlib.axes.Axes
- tcr_deep_insight.pl.plot_selected_tcrs(tcr_cluster_result: TDIResult, color: str, tcrs: List[str], tcrs_background: List[str], palette: dict | None = None)[source]
Plot the tcrs on the umap of the gex data, with the TCRs as a pie chart and logo plot
- Parameters:
tcr_cluster_result – TDIResult
color – str
tcrs – list
tcrs_background – list
palette – dict (optional)
- Returns:
fig, ax
Note
mafft must be installed on your system to use this function.
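Before calling the plotting function, the mafft requirement can be checked with the standard library. This sketch assumes the executable is named `mafft` and is expected on the PATH. The helper name is hypothetical.

```python
import shutil


def mafft_available() -> bool:
    """Return True when a mafft executable can be found on the PATH."""
    return shutil.which("mafft") is not None


if not mafft_available():
    print("mafft not found on PATH; install it before plotting selected TCRs.")
```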