API

class tcr_deep_insight.SPECIES(value)[source]

An enumeration.

HUMAN = 'human'

MOUSE = 'mouse'
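As a minimal sketch of how a value-backed enumeration like this behaves, the stand-in below mirrors the two documented members using Python's standard enum module. The class name Species is hypothetical and is not the library's own class; it only illustrates lookup by string value.

```python
from enum import Enum


class Species(Enum):
    """Hypothetical stand-in mirroring the two documented members."""
    HUMAN = "human"
    MOUSE = "mouse"


# Look a member up from its string value.
species = Species("mouse")
print(species.name, species.value)

# An unknown value raises ValueError, so typos fail early.
try:
    Species("rat")
except ValueError as err:
    print("rejected:", err)
```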

TCR class to store TCR information and provide utility functions.

Parameters:

cdr3a – CDR3 alpha sequence
cdr3b – CDR3 beta sequence
trav – TRAV gene
trbv – TRBV gene
traj – TRAJ gene
trbj – TRBJ gene
individual – Individual identifier
species – Species identifier

property cdr3a

property cdr3b

classmethod deserialize(string)[source]

classmethod from_string(string: str)[source]

property individual

serialize()[source]

property species

to_string()[source]

to_tcr_string()[source]

property traj

property trav

property trbj

property trbv

Preprocessing

tcr_deep_insight.pp.update_anndata(gex_adata: AnnData, gex_embedding_key: str = 'X_gex', tcr_embedding_key: str = 'X_tcr', joint_embedding_key: str = 'X_gex_tcr') → None[source]

Update the adata with the embedding keys

Note

TCR information should be included in gex_adata.obs. This method modifies gex_adata in place and adds the following columns to .obs: tcr, CDR3a, CDR3b, TRAV, TRAJ, TRBV, TRBJ.

Parameters:

gex_adata – AnnData object
gex_embedding_key – embedding key for gex
tcr_embedding_key – embedding key for tcr
joint_embedding_key – embedding key for joint

tcr_deep_insight.pp.unique_tcr_by_individual(gex_adata: ~anndata._core.anndata.AnnData, embedding_key: str | ~typing.Iterable[str] = 'X_gex', label_key: str | None = None, additional_label_keys: ~typing.Iterable[str] = None, aggregate_func: ~typing.Callable = <function majority_vote>) → AnnData[source]

Deduplicate TCRs within each individual and aggregate the GEX embedding per TCR. A unique TCR is defined by the combination of TRAV, TRAJ, TRBV, TRBJ, CDR3α, CDR3β, and individual. The aggregated GEX embedding is added to tcr_adata.obsm[gex_embedding_key].

Note

“individual” should be in gex_adata.obs.columns.

Parameters:

gex_adata – AnnData object of gene expression data
embedding_key – Key(s) in adata.obsm where GEX embedding is stored. Default: ‘X_gex’
label_key – Key in adata.obs where TCR type labels are stored. Default: ‘cell_type’, where ‘cell_type’ should be included in adata.obs.columns
additional_label_keys – Additional keys in adata.obs where TCR type labels are stored. Default: None
aggregate_func – Function to aggregate labels. Default: majority_vote

Returns:

TCR adata
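The deduplication described above can be sketched in plain Python, without AnnData. This is an illustration under stated assumptions, not the library's implementation: the function unique_tcrs, the number_of_cells field, and the example sequences are hypothetical, and element-wise averaging of embeddings is an assumption, since the documentation only says the embedding is aggregated. majority_vote here returns the most frequent label.

```python
from collections import Counter, defaultdict

# Fields that together define one unique TCR, as listed in the description.
TCR_KEYS = ("TRAV", "TRAJ", "TRBV", "TRBJ", "CDR3a", "CDR3b", "individual")


def majority_vote(labels):
    """Return the most frequent label (ties go to the first one seen)."""
    return Counter(labels).most_common(1)[0][0]


def unique_tcrs(cells):
    """Group cell records by TCR identity and aggregate each group.

    `cells` is a list of dicts holding the TCR_KEYS fields, a `cell_type`
    label, and an `embedding` given as a list of floats.
    """
    groups = defaultdict(list)
    for cell in cells:
        groups[tuple(cell[k] for k in TCR_KEYS)].append(cell)

    result = []
    for key, members in groups.items():
        n = len(members)
        # Assumption: embeddings are averaged element-wise across cells.
        mean_emb = [sum(col) / n for col in zip(*(m["embedding"] for m in members))]
        result.append({
            **dict(zip(TCR_KEYS, key)),
            "cell_type": majority_vote([m["cell_type"] for m in members]),
            "embedding": mean_emb,
            "number_of_cells": n,  # hypothetical bookkeeping field
        })
    return result


# Made-up records: two cells share a TCR in individual P1, one cell is from P2.
base = {"TRAV": "TRAV1-2", "TRAJ": "TRAJ33", "TRBV": "TRBV6-4", "TRBJ": "TRBJ2-1",
        "CDR3a": "CAVMDSNYQLIW", "CDR3b": "CASSDSGESGTEAFF"}
cells = [
    {**base, "individual": "P1", "cell_type": "MAIT", "embedding": [1.0, 3.0]},
    {**base, "individual": "P1", "cell_type": "MAIT", "embedding": [3.0, 5.0]},
    {**base, "individual": "P2", "cell_type": "Tem", "embedding": [0.0, 0.0]},
]
tcrs = unique_tcrs(cells)
for tcr in tcrs:
    print(tcr["individual"], tcr["cell_type"], tcr["embedding"], tcr["number_of_cells"])
```

Because individual is part of the key, the same receptor observed in two individuals yields two rows, which matches the definition of a unique TCR given above.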

Data

tcr_deep_insight.data.human_gex_reference_v2()[source]: Load the human gex reference v2. If the dataset is not found, it will be downloaded from Zenodo.

tcr_deep_insight.data.human_tcr_reference_v2()[source]: Load the human tcr reference v2. Can be generated from the human gex reference v2 via tdi.pp.unique_tcr_by_individual. If the dataset is not found, it will be downloaded from Zenodo.

tcr_deep_insight.data.mouse_gex_reference_v1()[source]: Load the mouse gex reference v1.

tcr_deep_insight.data.mouse_tcr_reference_v1()[source]: Load the mouse tcr reference v1. Can be generated from the mouse gex reference v1 via tdi.pp.unique_tcr_by_individual.

Tool

tcr_deep_insight.tl.add_tcr_pseudosequence_to_dataframe(df: DataFrame, species: bool = 'human') → None[source]

Add TCR pseudosequence to dataframe

Parameters:

df – dataframe

Note:

This method modifies the df inplace and adds the following columns:

tcr_pseudosequence

tcr_deep_insight.tl.add_pmhc_pseudosequence_to_dataframe(df: DataFrame) → None[source]

Add PMHC pseudosequence to dataframe

Parameters:

df – dataframe

Note:

This method modifies the df inplace and adds the following columns:

hla_pseudosequence
pmhc_pseudosequence

tcr_deep_insight.tl.tcr_adata_to_datasets(adata: AnnData, tokenizer: TRabTokenizerForVJCDR3 | Tokenizer | None = None, show_progress: bool = False) → Dataset[source]

Convert adata to tcr datasets

Parameters:

adata – AnnData
tokenizer – tokenizer

Returns:

tcr datasets

tcr_deep_insight.tl.tcr_dataframe_to_datasets(df: DataFrame, tokenizer: TRabTokenizerForVJCDR3 | Tokenizer, show_progress: bool = False) → Dataset[source]

Convert dataframe to tcr datasets

Parameters:

df – dataframe
tokenizer – tokenizer

Returns:

tcr datasets

tcr_deep_insight.tl.to_embedding_tcr_only(model: TRabModelingBertForVJCDR3 | TRabModelingBertForPseudoSequence, tcr_dataset: Dataset, k: str = 'hidden_states', device: str = 'cuda', n_per_batch: int = 64, show_progress: bool = False) → ndarray[source]

Get embedding from model

Parameters:

model – nn.Module. The TCR model.
tcr_dataset – datasets.arrow_dataset.Dataset. evaluation datasets.
k – str. ‘hidden_states’ or ‘last_hidden_state’.
device – str. ‘cuda’ or ‘cpu’. If ‘cuda’, use GPU. If ‘cpu’, use CPU.
n_per_batch – int. Number of samples per batch.
show_progress – bool. If True, show progress bar.

Returns:

embedding

tcr_deep_insight.tl.to_embedding_tcr_only_from_pandas(model: TRabModelingBertForVJCDR3 | TRabModelingBertForPseudoSequence, df: DataFrame, tokenizer: TRabTokenizerForVJCDR3 | Tokenizer, device: str, n_per_batch: int = 64, pooling: TCR_BERT_POOLING = 'mean')[source]

tcr_deep_insight.tl.get_pretrained_tcr_embedding(tcr_adata: AnnData, bert_config: Mapping[str, Any], checkpoint_path: str | PathLike, encoding: TCR_BERT_ENCODING = 'vjcdr3', pooling: TCR_BERT_POOLING = 'mean', species: Literal['human', 'mouse'] = 'human', pca_path: str | None = None, use_pca: bool = True, use_kernel_pca: bool = False, use_faiss_pca: bool = True, pca_n_components: int = 50, device: str = 'cuda:0', n_per_batch: int = 256)[source]

Get TCR embedding from pretrained BERT model

Note

This method modifies the tcr_adata inplace. It adds the following fields to tcr_adata.obsm:

X_tcr: TCR embedding
X_tcr_pca: PCA of TCR embedding

Parameters:

tcr_adata – AnnData object containing TCR data
bert_config – BERT config
checkpoint_path – Path to pretrained BERT model.
encoding – Encoding type of tcr sequence.
pooling – Pooling method for tcr representation.
species – Species.
pca_path – Path to PCA model, if previously saved.
use_pca – Whether to use PCA. Default: True
use_kernel_pca – Whether to use Kernel PCA. High memory required for large dataset.
use_faiss_pca – Whether to use Faiss PCA instead of scikit-learn.
pca_n_components – Number of PCA components.
device – Device for Faiss PCA computation.
n_per_batch – Number of samples per batch in getting TCR embeddings.

Clustering

tcr_deep_insight.tl.cluster_tcr(tcr_adata: ~anndata._core.anndata.AnnData, label_key: str | None = None, include_hla_keys: ~typing.Iterable[str] | None = None, use_gpu: bool = False, gpu: int = 0, pure_label: bool = True, pure_criteria: ~typing.Callable = <function default_pure_criteria>, same_trav: bool = False, same_trbv: bool = False, same_cdr3a_length: bool = False, same_cdr3b_length: bool = False, layer_norm: bool = True, max_distance: float = 4.0, max_cluster_size: int = 40, use_gex: bool = True, filter_intersection_fraction: float = 0.7, nk: int = -1, calculate_perm_test: bool = True, n_jobs: int = 1, species: ~tcr_deep_insight.utils._definitions.SPECIES = 'human', faiss_index_backend: ~tcr_deep_insight.tool._constants.FAISS_INDEX_BACKEND = 'kmeans') → TDIResult[source]

Cluster TCRs by joint TCR-GEX embedding. All TCRs will be used as cluster anchors.

Parameters:

tcr_adata – AnnData object containing TCR data
label_key – Key of the label to cluster. Should be in tcr_adata.obs.columns
use_gpu – Whether to use GPU.
gpu – GPU ID if use_gpu.
pure_label – Whether to constrain all TCRs in a cluster to have the same label.
pure_criteria – Pure criteria. A function that takes two arguments, a list of labels and a label, and checks whether the list satisfies the criteria. If pure_label is False, this function is used to determine whether a cluster is pure.
same_trav – Whether to constrain all TCRs in a cluster to have the same TRAV gene.
same_trbv – Whether to constrain all TCRs in a cluster to have the same TRBV gene.
same_cdr3a_length – Whether to constrain all TCRs in a cluster to have the same CDR3a length.
same_cdr3b_length – Whether to constrain all TCRs in a cluster to have the same CDR3b length.
layer_norm – Whether to use LayerNorm on tcr embedding. Default: True
max_distance – Maximum TrGx distance. Default: 4.
max_cluster_size – Maximum cluster size for dTCR clusters.
use_gex – Whether to use GEX embedding for clustering.
filter_intersection_fraction – Filter intersection fraction in pruning clusters that contain overlapping TCRs.
nk – Number of nearest neighbors for background comparison. Default: -1, which means background neighbors equal to cluster size
calculate_perm_test – Whether to calculate p-values from permutation test. Default: True
species – Species name.
n_jobs – Number of threads for parallel processing.
faiss_index_backend – Faiss index backend. Default: FAISS_INDEX_BACKEND.KMEANS

Returns:

TDIResult containing clustered TCRs.

Note

The gpu parameter indicates GPU to use for clustering. If gpu is 0, CPU is used.
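To make the roles of max_distance, max_cluster_size, and pure_label concrete, here is a minimal brute-force sketch of anchor-based clustering. It is not the library's algorithm: the function cluster_around_anchors is hypothetical, plain Euclidean distance stands in for the TrGx distance, and the Faiss index, overlap pruning (filter_intersection_fraction), and permutation test are all omitted.

```python
import math


def cluster_around_anchors(points, labels, max_distance=4.0,
                           max_cluster_size=40, pure_label=True):
    """For each anchor, collect its nearest neighbours within max_distance."""
    clusters = []
    for anchor, anchor_point in enumerate(points):
        # Rank every other point by Euclidean distance to the anchor.
        neighbours = sorted(
            (math.dist(anchor_point, p), i)
            for i, p in enumerate(points) if i != anchor
        )
        members = [anchor]
        for distance, i in neighbours:
            if distance > max_distance or len(members) >= max_cluster_size:
                break
            if pure_label and labels[i] != labels[anchor]:
                continue  # skip neighbours that carry a different label
            members.append(i)
        if len(members) > 1:  # keep only anchors that attracted a neighbour
            clusters.append(members)
    return clusters


# Made-up 2-D embeddings: two tight groups far apart from each other.
points = [(0, 0), (1, 0), (0, 1), (10, 10), (10, 11)]
labels = ["A", "A", "B", "B", "B"]
clusters = cluster_around_anchors(points, labels)
print(clusters)
```

Since every point acts as an anchor, the same group is found more than once (for example from both of its members). Redundancy of this kind is presumably what the pruning step controlled by filter_intersection_fraction addresses in the real function.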

tcr_deep_insight.tl.cluster_tcr_from_reference(tcr_adata: ~anndata._core.anndata.AnnData, tcr_reference_adata: ~anndata._core.anndata.AnnData, label_key: str | None = None, include_hla_keys: ~typing.Iterable[str] | None = None, use_gpu: bool = False, gpu: int = 0, layer_norm: bool = True, pure_label: bool = True, pure_criteria: ~typing.Callable = <function default_pure_criteria>, same_trav: bool = False, same_trbv: bool = False, same_cdr3a_length: bool = False, same_cdr3b_length: bool = False, max_distance: float = 3.0, max_cluster_size: int = 40, use_gex: bool = True, filter_intersection_fraction: float = 0.7, nk: int = -1, calculate_perm_test: bool = True, n_jobs: int = 1, species: ~tcr_deep_insight.utils._definitions.SPECIES = 'human', faiss_index_backend: ~tcr_deep_insight.tool._constants.FAISS_INDEX_BACKEND = 'kmeans') → TDIResult[source]

Cluster TCRs from reference. Only TCRs in the query dataset will be used as cluster anchors.

Parameters:

tcr_adata – AnnData object containing TCR data
tcr_reference_adata – AnnData object containing reference TCR data
label_key – Key of the label to cluster. Should be in tcr_adata.obs
use_gpu – Whether to use GPU.
gpu – GPU ID if use_gpu.
pure_label – Whether to constrain all TCRs in a cluster to have the same label.
pure_criteria – Pure criteria. A function that takes two arguments, a list of labels and a label, and checks whether the list satisfies the criteria. If pure_label is False, this function is used to determine whether a cluster is pure.
same_trav – Whether to constrain all TCRs in a cluster to have the same TRAV gene.
same_trbv – Whether to constrain all TCRs in a cluster to have the same TRBV gene.
same_cdr3a_length – Whether to constrain all TCRs in a cluster to have the same CDR3a length.
same_cdr3b_length – Whether to constrain all TCRs in a cluster to have the same CDR3b length.
layer_norm – Whether to use LayerNorm on tcr embedding. Default: True
max_distance – Maximum TrGx distance. Default: 3.
max_cluster_size – Maximum cluster size for dTCR clusters.
use_gex – Whether to use GEX embedding for clustering.
filter_intersection_fraction – Filter intersection fraction in pruning clusters that contain overlapping TCRs.
nk – Number of nearest neighbors for background comparison. Default: -1, which means background neighbors equal to cluster size
calculate_perm_test – Whether to calculate the Morisita-Horn permutation test for TCR clusters.
species – Species name.
n_jobs – Number of threads for parallel processing. Default: 1
faiss_index_backend – Faiss index backend. Default: FAISS_INDEX_BACKEND.KMEANS

Returns:

TDIResult object containing clustered TCRs.

Note

The gpu parameter indicates GPU to use for clustering. If gpu is 0, CPU is used.
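Both clustering functions accept a pure_criteria callable that receives a list of labels and a label. The sketch below shows one possible custom criterion. The function mostly_pure and its min_fraction threshold are hypothetical, and returning a bool is an assumption, since the return type is not documented here.

```python
def mostly_pure(labels, label, min_fraction=0.8):
    """Return True when at least min_fraction of labels equal label."""
    if not labels:
        return False
    matching = sum(1 for x in labels if x == label)
    return matching / len(labels) >= min_fraction


# Four of five labels match, so the 80% threshold is met.
print(mostly_pure(["A", "A", "A", "A", "B"], "A"))
# Only half match, so the cluster is not considered pure.
print(mostly_pure(["A", "B"], "A"))
```

The extra min_fraction argument has a default value, so the function can still be called with the two documented arguments, for example as pure_criteria=mostly_pure.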

tcr_deep_insight.tl.inject_labels_for_tcr_cluster_adata(reference_data: ~anndata._core.anndata.AnnData | ~pandas.core.frame.DataFrame, tcr_cluster_adata: ~anndata._core.anndata.AnnData, label_key: str, map_function: ~typing.Callable = <function majority_vote>)[source]

Inject labels for tcr_cluster_adata based on reference_data

Parameters:

reference_data – sc.AnnData or pandas DataFrame. Reference object containing labels
tcr_cluster_adata – sc.AnnData
label_key – str. Key of the label to use for clustering in the reference data
map_function – Callable. Default: function that returns the most frequent label.

Note:

This method modifies tcr_cluster_adata in place and adds the following column:

label_key: The most frequent label in the cluster

class tcr_deep_insight.tl.TDIResult(_data: AnnData, _tcr_df: DataFrame | None = None, _tcr_adata: AnnData | None = None, _gex_adata: AnnData | None = None, _cluster_label: str | None = None, faiss_index: IndexFlatL2 | None = None, low_memory: bool = False)[source]

Bases: object

A class to store the TDI clustering result

Parameters:

_data – the clustering result
_tcr_df – the tcr dataframe
_tcr_adata – the tcr anndata
_gex_adata – the gex anndata
_cluster_label – the cluster label
faiss_index – the faiss index
low_memory – whether to use low memory mode

property D

property I

calculate_cluster_additional_information()[source]

property cluster_label

property data

get_tcrs_by_cluster_index(cluster_index: int, _n_after: int = 0) → List[str][source]

Get the tcrs for a specific cluster

Parameters:

cluster_index – the cluster index
_n_after – the number of tcrs to skip

get_tcrs_for_cluster(label: Mapping[str, str] | Mapping[str, List[str]] | None = None, rank: int = 0, rank_by: Literal['convergence', 'disease_association'] = 'convergence', min_unique_tcr_number: int = 4, min_individual_number: int = 2, min_cell_number: int = 10, min_tcr_convergence_score: float | None = None, min_disease_association_score: float | None = None, return_background_tcrs: bool = False, additional_label_key_values: Dict[str, List[str]] | None = None)[source]

Get the tcrs for a specific cluster

Parameters:

label – the cluster label
rank – the rank of the tcrs to return
rank_by – the metric to rank the tcrs
min_unique_tcr_number – the minimum number of unique tcrs in the cluster
min_individual_number – the minimum number of individuals in the cluster
min_cell_number – the minimum number of cells in the cluster
min_tcr_convergence_score – the minimum convergence score
min_disease_association_score – the minimum disease specificity score
return_background_tcrs – whether to return other tcrs in the cluster
additional_label_key_values – additional label key values to filter the cluster

Returns:

a dictionary containing the tcrs and their metadata

property gex_adata

classmethod load_from_disk(save_path: str, tcr_data_path: str | None = None, gex_adata_path: str | None = None)[source]

Load the cluster result from disk

Parameters:

save_path – the path to load the cluster result
tcr_data_path – the path to load the tcr data
gex_adata_path – the path to load the gex data

save_to_disk(save_path, save_cluster_result_as_csv=True, save_tcr_data=True, save_gex_data=True)[source]

Save the cluster result to disk

Parameters:

save_path – the path to save the cluster result
save_cluster_result_as_csv – whether to save the cluster result as csv files
save_tcr_data – whether to save the tcr data
save_gex_data – whether to save the gex data

select(indices)[source]

property tcr_adata

property tcr_df

to_pandas_dataframe_cluster_index()[source]

to_pandas_dataframe_tcr(rank_by: Literal['convergence', 'disease_association', 'individual', 'unique_tcr'] = 'convergence', return_background_tcrs: bool = False)[source]

Convert the cluster result to a pandas dataframe.

Parameters:

rank_by – the metric to rank the tcrs
return_background_tcrs – whether to return background tcrs for each cluster

Returns:

a pandas dataframe containing the cluster result

Clustering options

class tcr_deep_insight.tl.FAISS_INDEX_BACKEND(value)[source]

An enumeration.

FLAT = 'flat'

KMEANS = 'kmeans'

Models

tcr_deep_insight.model.modeling_bert.get_human_config(bert_type: Literal['tiny', 'small', 'base'] = 'small', vocab_size: int | None = None, alibi_starting_size: int = 512) → BertConfig[source]

Get the configuration for the human TCR BERT

Parameters:

bert_type – The size of the BERT model. Must be one of ‘tiny’, ‘small’, or ‘base’
vocab_size – The size of the vocabulary
alibi_starting_size – The size of the input sequence

Returns:

The configuration for the human TCR BERT

class tcr_deep_insight.model.TRabModelingBertForPseudoSequence(bert_config: BertConfig, pooling: Callable | Literal['cls', 'mean', 'max', 'cdr3a', 'cdr3b', 'weighted'] = 'mean', pooling_weight: Tensor | None = None, labels_number: int = 1, use_triton: bool = False, device='cuda')[source]

Bases: Module

forward(*, input_ids: Tensor, attention_mask: Tensor, labels: Tensor | None = None, output_hidden_states=True)[source]

Forward pass of the model

Parameters:

input_ids – Input ids
attention_mask – Attention mask
labels – Labels
token_type_ids – Token type ids

Returns:

Output of the model

class tcr_deep_insight.model.TRabModelingBertForVJCDR3(bert_config: BertConfig, pooling: Literal['cls', 'mean', 'max', 'pool', 'trb', 'tra', 'weighted'] = 'mean', pooling_cls_position: int = 0, pooling_weight=(0.1, 0.9), labels_number: int = 1, device='cuda')[source]

Bases: ModuleBase

forward(*, input_ids: Tensor, attention_mask: Tensor, labels: Tensor, token_type_ids: Tensor, output_hidden_states=True)[source]

Forward pass of the model

Parameters:

input_ids – Input ids
attention_mask – Attention mask
labels – Labels
token_type_ids – Token type ids

Returns:

Output of the model

tcr_deep_insight.model.GEXModelingVAE: alias of scAtlasVAE

Model options

class tcr_deep_insight.model.TCR_BERT_ENCODING(value)[source]

An enumeration.

CDR123 = 'cdr123'

VJCDR3 = 'vjcdr3'

class tcr_deep_insight.model.TCR_BERT_POOLING(value)[source]

An enumeration.

CLS = 'cls'

MAX = 'max'

MEAN = 'mean'

POOL = 'pool'

SUM = 'sum'

TRA = 'tra'

TRB = 'trb'

WEIGHTED = 'weighted'
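As a rough illustration of what some of these pooling options mean, the sketch below reduces per-token vectors to a single vector in plain Python. The function pool is hypothetical and is not the library's code. Only 'cls', 'mean', 'max', and 'sum' are covered, and treating the first token as the CLS position is an assumption.

```python
def pool(hidden_states, attention_mask, method="mean"):
    """Pool per-token vectors into one vector, ignoring padded positions."""
    if method == "cls":
        # Assumption: the first token plays the role of the [CLS] token.
        return list(hidden_states[0])
    # Keep only the vectors whose mask entry is non-zero.
    kept = [vec for vec, keep in zip(hidden_states, attention_mask) if keep]
    columns = list(zip(*kept))
    if method == "mean":
        return [sum(col) / len(kept) for col in columns]
    if method == "max":
        return [max(col) for col in columns]
    if method == "sum":
        return [sum(col) for col in columns]
    raise ValueError(f"unsupported pooling method: {method}")


# Three token vectors; the last one is padding and is masked out.
hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
for method in ("cls", "mean", "max", "sum"):
    print(method, pool(hidden, mask, method))
```

Masking matters for 'mean', 'max', and 'sum': without it, the padded vector would dominate the result in this example.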

Training Utilities

class tcr_deep_insight.model.TRabModelingBertForVJCDR3Trainer(model: TRabModelingBertForVJCDR3, collator: TRabCollatorForVJCDR3, train_dataset: Dataset | None = None, test_dataset: Dataset | None = None, optimizers: Tuple[Optimizer, LambdaLR] = (None, None), loss_weight: Mapping[str, float] = {'reconstruction_loss': 1}, device: str = 'cuda')[source]

Bases: TrainerBase

evaluate(*, n_per_batch: int = 10, max_train_sequence: int | None = None, max_test_sequence: int | None = None, show_progress: bool = False)[source]: Main evaluation entry point

fit(*, max_epoch: int, max_train_sequence: int = 0, n_per_batch: int = 10, shuffle: bool = False, balance_label: bool = False, label_weight: Tensor | None = None, show_progress: bool = False, early_stopping: bool = False)[source]: Main training entry point

class tcr_deep_insight.model.TRabTokenizerForVJCDR3(*, tra_max_length: int, trb_max_length: int, pad_token: str | None = None, unk_token: str | None = None, mask_token: str | None = None, cls_token: str | None = None, sep_token: str | None = None, species: Literal['human', 'mouse'] = 'human', **kwargs)[source]

Bases: AminoAcidTokenizer

Tokenizer for TRA and TRB sequence. Encode V,J genes into tokens, and CDR3 into amino acids

convert_ids_to_tokens(ids: Tensor)[source]

convert_tokens_to_ids(sequence: List[Tuple[str]] | Tuple[str], alpha_vj: List[Tuple[str]] | Tuple[str] | None = None, beta_vj: List[Tuple[str]] | Tuple[str] | None = None)[source]

to_dataset(df: DataFrame | None = None, ids: Iterable[str] | None = None, alpha_chains: Iterable[str] | None = None, beta_chains: Iterable[str] | None = None, alpha_v_genes: Iterable[str] | None = None, alpha_j_genes: Iterable[str] | None = None, beta_v_genes: Iterable[str] | None = None, beta_j_genes: Iterable[str] | None = None, pairing: Iterable[int] | None = None, split: bool = False)[source]

class tcr_deep_insight.model.TRabCollatorForVJCDR3(tra_max_length: int, trb_max_length: int, mask_token_id: int = 22, mlm_probability: float = 0.1, mask_trb_probability: float = 0.5, species: Literal['human', 'mouse'] = 'human')[source]: Bases: object
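The mask_token_id and mlm_probability arguments suggest masked-language-model training. The sketch below shows the general masking technique in plain Python and is not the collator's actual code: the function mask_tokens, the special_ids and seed arguments, and the example token ids are hypothetical, the use of -100 for ignored label positions is an assumption borrowed from a common PyTorch convention, and mask_trb_probability is not modelled.

```python
import random


def mask_tokens(input_ids, mask_token_id=22, mlm_probability=0.1,
                special_ids=frozenset(), seed=None):
    """Randomly replace tokens with the mask id and build training labels."""
    rng = random.Random(seed)
    masked, labels = [], []
    for token in input_ids:
        if token not in special_ids and rng.random() < mlm_probability:
            masked.append(mask_token_id)
            labels.append(token)  # the model must recover the original token
        else:
            masked.append(token)
            labels.append(-100)   # assumption: -100 marks ignored positions
    return masked, labels


# Made-up ids: 1 and 2 stand for special tokens that are never masked.
ids = [1, 5, 6, 7, 2]
masked, labels = mask_tokens(ids, mlm_probability=1.0, special_ids={1, 2})
print(masked)
print(labels)
```

With mlm_probability=1.0 every non-special token is masked, which makes the behaviour easy to inspect. At the documented default of 0.1, roughly one token in ten would be masked on average.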

Plotting

tcr_deep_insight.pl.set_plotting_params(dpi: int, fontsize: int = 12, fontfamily: str = 'Arial', linewidth: float = 0.5)[source]

Set default plotting parameters for matplotlib.

Parameters:

dpi – dpi for saving figures
fontsize – default fontsize
fontfamily – default fontfamily
linewidth – default linewidth

tcr_deep_insight.pl.create_fig(figsize=(8, 4)) → Tuple[Figure, Axes][source]

Create a figure with a single axis.

Parameters:

figsize – figure size. Default: (8, 4)

Returns:

matplotlib.figure.Figure, matplotlib.axes.Axes

tcr_deep_insight.pl.create_subplots(nrow, ncol, figsize=(8, 8), gridspec_kw={}) → Tuple[Figure, ndarray][source]

Create a figure with multiple axes.

Parameters:

nrow – number of rows
ncol – number of columns
figsize – figure size. Default: (8, 8)
gridspec_kw – gridspec_kw. Default: {}

Returns:

matplotlib.figure.Figure, array of matplotlib.axes.Axes

tcr_deep_insight.pl.plot_cdr3_sequence(sequences: List[str], alignment: bool = False, labels: List[str] | None = None, labels_palette: Mapping[str, Any] | None = None, labels_postfix: str | None = None, labels_postfix_palette: Mapping[str, Any] | None = None, ax: Axes | None = None) → Tuple[Figure, Axes][source]

Plot CDR3 sequences.

Parameters:

sequences – a list of CDR3 sequences
alignment – whether to align the sequences. Default: False
labels – a list of labels. Default: None
labels_palette – a dictionary of labels and colors. Default: None
ax – matplotlib.axes.Axes. Default: None

Returns:

matplotlib.figure.Figure, matplotlib.axes.Axes

tcr_deep_insight.pl.plot_selected_tcrs(tcr_cluster_result: TDIResult, color: str, tcrs: List[str], tcrs_background: List[str], palette: dict | None = None)[source]

Plot the tcrs on the umap of the gex data, with the TCRs as a pie chart and logo plot

Parameters:

tcr_cluster_result – TDIResult
color – str
tcrs – list
tcrs_background – list
palette – dict (optional)

Returns:

fig, ax

Note

You should have mafft installed in your system to use this function
