1. scVMAP process ============================================== | scVMAP: https://bio.liclab.net/scvmap/ | scVMAP reproducibility: https://github.com/YuZhengM/scvmap_reproducibility | scVMAP tutorial: https://scvmap.readthedocs.io/en/latest/ | scVMAP front-end: https://github.com/YuZhengM/scvmap_web | scVMAP back-end: https://github.com/YuZhengM/scvmap | scVMAP API: https://bio.liclab.net/scvmap_service/scvmap.html `scVMAP `_ gathers and provides scATAC-seq datasets with established cell type labels (from scATAC-Ref, GreenleafLab and PlaqView) and cell type labels annotated by singleR (from scBlood), along with detailed fine-mapping data (from CAUSALdb2, FinnGen, UKBB and BBJ). scVMAP, by employing the SCAVENGE and g-chromVAR methods, assigns trait-relevant scores at single-cell resolution, uncovering biologically relevant cell types. The core original intention and purpose of `scVMAP `_ is to provide a resource of trait-relevant significant cell types, offering researchers interested in this objective a convenient and fast retrieval interface. It uses widely recognized methods to provide reference for key variants, TFs, and genes involved in relevant potential enrichment mechanisms, offering researchers a direction for reference, reducing their time, and making a modest contribution to the scientific research community. Throughout the process, data undergoes standardization, unified processing, and comprehensive analysis. The main contents are as follows: + Single-cell samples: * Differential gene activity analysis of cell types * Differential TF activity analysis of cell types * High-scoring gene pathway enrichment analysis of cell types + Traits or diseases: * LD-based MAGMA analysis (Trait-relevant genes) * HOMER motif enrichment analysis (Trait-relevant TFs) * MAGMA-based trait-relevant gene pathway enrichment analysis + Integrated analysis: * ATAC-based Cicero analysis (Trait-relevant genes) * ATAC-based GimmeMotifs analysis (Trait-relevant genes) * Gene hub regulatory network analysis * TF hub regulatory network analysis * Integrated regulatory network analysis The `scVMAP `_ database primarily provides the following functions: ``Data-browse``, ``Search``, ``Analysis``, ``Statistics``, ``Download``, ``Contact``, and ``Online analysis``. These functionalities streamline the process of studying single-cell genetic regulation, enabling users to easily explore and interpret biological mechanisms in specific contexts. In conclusion, (scVMAP, https://bio.liclab.net/scvmap/) is a user-centric database that facilitates the exploration and interpretation of single-cell genetic data through structured displays and dynamic visualizations. It offers intuitive workflows, customizable parameters and comprehensive data accessibility. .. image:: ./img/overview.png .. tip:: This section provides a comprehensive explanation of the entire pipeline for the scVMAP platform, encompassing the stages of data acquisition, curation, analysis, and the ultimate platform realization. 1.1 Data processing methods -------------------------------- We provide the software and version information required for data processing in the scVMAP platform, along with the corresponding code to reproduce the results. ================= =============== =================================================================================== Method Version Link ================= =============== =================================================================================== g-chromVAR 0.3.2 https://github.com/YuZhengM/scvmap_reproducibility/tree/main/R SCAVENGE 1.0.2 https://github.com/YuZhengM/scvmap_reproducibility/tree/main/R LiftOver - https://github.com/YuZhengM/scvmap_reproducibility/tree/main/variant/liftOver SnapATAC2 2.7.1 https://github.com/YuZhengM/scvmap_reproducibility/tree/main/scATAC/SnapATAC2 SCANPY 1.11.5 https://github.com/YuZhengM/scvmap_reproducibility/tree/main/scATAC/SnapATAC2 GSEApy 1.1.9 https://github.com/YuZhengM/scvmap_reproducibility/tree/main/scATAC/SnapATAC2 chromVAR 1.24.0 https://github.com/YuZhengM/scvmap_reproducibility/tree/main/scATAC/chromVAR Cicero 1.22.0 https://github.com/YuZhengM/scvmap_reproducibility/tree/main/scATAC/cicero MAGMA 1.10 https://github.com/YuZhengM/scvmap_reproducibility/tree/main/variant/magma BEDTools 2.31.1 https://github.com/YuZhengM/scvmap_reproducibility/tree/main/variant/interaction HOMER 5.1 https://github.com/YuZhengM/scvmap_reproducibility/tree/main/variant/homer GimmeMotifs 0.18 https://github.com/YuZhengM/scvmap_reproducibility/tree/main/variant/gimme ================= =============== =================================================================================== 1.2 Data collection and curation -------------------------------- The scVMAP platform collects scATAC-seq data and fine-mapping result data. 1.2.1 Collection of scATAC-seq data ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The collected scATAC-seq data mainly comes from ``scATAC-Ref``, ``scBlood``, ``GreenleafLab``, and ``PlaqView`` sources. A total of ``183`` single-cell sample data were collected, with detailed information as follows: | Download url: `sample_info_with_age_sex_drug.txt `_ .. note:: When preprocessing scATAC-seq data through `SnapATAC2 `_, only the ``min_tsse`` parameter varies for different samples (Excluding input files and reference genome information). 1.2.1.1 scATAC-Ref """""""""""""""""""""""""" | Source url: `https://bio.liclab.net/scATAC-Ref/ `_ We downloaded human data from scATAC-Ref and retained a total of 28 scATAC-seq datasets. We integrated all GSM samples from GSE195460 in scATAC-Ref to create the dataset "sample_id_30 (GSE195460_diabetic_kidney)". Detailed processing steps: 1. Download the RDS files. #. Standardize the filenames to the format: ``_ATAC.rds``. #. Generate the metadata files: ``barcodes.tsv``, ``matrix.mtx``, and ``peaks.bed``. #. Generate the ``_fragments.tsv.gz`` file. #. Perform preprocessing using the SnapATAC2 software with the following specific parameters: .. code-block:: python min_num_fragments = 200, bin_size = 500, min_tsse = , need_features = 500000, min_cells = 5, data = snap.pp.import_data( fragment_file="_fragments.tsv.gz", chrom_sizes=, file=, min_num_fragments=min_num_fragments, sorted_by_barcode=False ) # the standard procedures to add tile matrices, select features, and identify doublets snap.metrics.tsse(data, ) snap.pp.filter_cells(data, min_tsse=min_tsse) snap.pp.add_tile_matrix(data, bin_size=bin_size) snap.pp.select_features(data, n_features=need_features) snap.pp.scrublet(data, features=features) snap.pp.filter_doublets(data) Please see `scVMAP-reproducibility-SnapATAC2 `_ for the detailed workflow code. .. note:: The cell type labels were annotated based on the original publications of each scATAC-seq sample and are highly reliable. For all single-cell samples except "sample_id_20" (Alzheimer’s Disease (AD)), we utilized the uniform manifold approximation and projection (UMAP) coordinates from their original collection sources for clustering. Due to the poor clustering performance of the original UMAP coordinates for "sample_id_20", we processed the binarized sparse counts matrix using SnapATAC2, converted it to a term frequency-inverse document frequency (TF-IDF) matrix, and subsequently extracted representative low-dimensional features through latent semantic indexing (LSI). Finally, we performed UMAP on this low-dimensional feature matrix to obtain the coordinates for "sample_id_20". 1.2.1.2 scBlood """""""""""""""""""""""""" | Source url: `https://bio.liclab.net/scBlood/ `_ We downloaded human data from scBlood and retained a total of 152 scATAC-seq datasets. The processing pipeline is identical to that of scATAC-Ref. .. note:: The cell type labels were annotated with SingleR. Their reliability should be treated as indicative. 1.2.1.3 GreenleafLab """""""""""""""""""""""""" | Source url: `https://github.com/GreenleafLab/MPAL-Single-Cell-2019 `_ We downloaded a scATAC-seq dataset for PBMC. The processing pipeline is identical to that of scATAC-Ref. 1.2.1.4 PlaqView """""""""""""""""""""""""" | Source url: `https://www.plaqview.com/ `_ We downloaded a scATAC-seq dataset for coronary artery disease (CAD). The processing pipeline is identical to that of scATAC-Ref. 1.2.1.5 Summary """""""""""""""""""""""""" Here are the specific parameter settings for ``min_tsse``. ================= =============== Sample ID min_tsse ================= =============== sample_id_1-30 8 sample_id_31-183 5 ================= =============== The scATAC-seq data is obtained through `download `_ page. Cell type annotations were directly assigned from their original articles, whereas the scATAC-seq samples obtained from scBlood were annotated using the SingleR software. Besides cell type annotation, we also performed annotation for age, sex, and drug resistance, involving 24, 19, and 2 samples, respectively. It can be viewed via the `browser `_ page. 1.2.2 Collection of trait data ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | FINEMAP fine-mapping result download url (15805): `trait_info.xlsx `_ | SuSiE fine-mapping result download url (79): `trait_info_susie.xlsx `_ 1.2.2.1 FINEMAP """""""""""""""""""""""""" The collected FINEMAP fine-mapping result data comes from ``CAUSALdb2``, ``UKBB``, ``FinnGen``, and ``BJJ`` sources. ============ ============ ========================== ============================== ========= ============== ============ ============= ============================================================ ================ Source ID Source name Source author, year Source description PMID Source genome Trait count Filter count Source link Source version ============ ============ ========================== ============================== ========= ============== ============ ============= ============================================================ ================ source_id_1 CAUSALdb Jianhua Wang et al., 2024 CAUSALdb fine-mapping results 31691819 hg19 15038 14417 http://www.mulinlab.org/causaldb/index.html Release 2.1 source_id_2 UKBB Wang, Q.S. et al., 2021 UKBB fine-mapping results — hg19 94 94 https://www.finucanelab.org/data Release 1.1 source_id_3 FinnGen Kurki, M.I. et al., 2023 FinnGen fine-mapping results 36653562 hg38 1234 1215 https://www.finngen.fi/en/access_results R11 source_id_4 BBJ Kanai, M. et al., 2021 BBJ fine-mapping results — hg19 79 79 https://humandbs.dbcls.jp/en/hum0197-v18#hum0197.v5.gwas.v1 — ============ ============ ========================== ============================== ========= ============== ============ ============= ============================================================ ================ For each trait, we retained the variants with a causal variant ``probability value (PP) > 0.001`` calculated by FINEMAP. As a result, we retained ``15,805`` traits from the initial ``16,445`` traits and used them for subsequent analysis. 1.2.2.2 SuSiE """""""""""""""""""""""""" The collected 79 SuSiE fine-mapping results (PP > 0.001) come from ``BJJ`` source. 1.2.2.3 Summary """""""""""""""""""""""""" The trait data is obtained through `download `_ page. To harmonize genomic coordinates between variants and scATAC-seq data, we performed LiftOver to convert variant positions to match the reference genome version used in single-cell analysis. Next, we manually categorized them into a broad array of classifications, including diseases, indicators, drugs, chemical compounds, health care, treatments, therapies and symptoms. For the disease category, we annotated diseases according to ICD-10, encompassing all ``22`` major disease categories, and further subclassifying them into more than ``250`` specific subcategories to provide an intuitive and convenient reference. Please see `scVMAP-reproducibility-Trait `_ for the detailed workflow code. It can be viewed via the `browser `_ page. .. tip:: Currently, SuSiE fine-mapping results can only be viewed through the "Data-browse" interface. Other interfaces, such as "Search" and "Analysis", do not retrieve SuSiE data and are limited to ``FINEMAP`` results. Support for ``SuSiE`` data in these interfaces will be added in a future update. In addition, we will gradually expand the data of SuSiE method and other fine mapping methods. 1.3 Variant-function-mapping at single cell resolution ------------------------------------------------------ Our preferred method for calculating trait relevance scores (TRSs) is `SCAVENGE `_. Since the `SCAVENGE `_ method utilizes the `g-chromVAR `_ approach, scVMAP supports both `g-chromVAR `_ and `SCAVENGE `_ for computing TRSs. Specific process code: `scVMAP-R `_ 1.3.1 g-chromVAR ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Building upon the original `g-chromVAR `_ codebase, we have addressed the issues caused by ``NA/INF`` values and refined the corresponding implementation. 1.3.2 SCAVENGE ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The values of some parameter settings. ========================================================= =============== Hyper parameter Value ========================================================= =============== Dimensionality of latent semantic indexing (LSI) 30 Number of neighbors in mutual k-nearest neighbors (M-kNN) 30 Restart value for random walk 0.05 ========================================================= =============== 1.3.3 Code for using scVMAP ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ We have consolidated the SCAVENGE workflow into a single command. You can view the code here: `scVMAP-exec_R_code `_ The specific implementation process is as follows: 1.3.3.1 Install R packages """"""""""""""""""""""""""""""""""""" Refer to file `scVMAP-install_R_packages `_ for installation. 1.3.3.2 Download files """"""""""""""""""""""""""""""""""""" The following two files need to be downloaded through link `scVMAP-exec_R_code `_: - ``integration.R``: A file for g-chromVAR and SCAVENGE algorithm code. - ``run.R``: An R script for executing the g-chromVAR and SCAVENGE algorithms via ``Rscript`` in the Linux terminal. .. tip:: The `online analysis function `_ of scVMAP uses this script command. 1.3.3.3 Create formatted directories """"""""""""""""""""""""""""""""""""" Create a root path: ``/project/scVMAP``. Other path details: - ``/project/scVMAP/code``: Path for the ``integration.R`` and ``run.R`` files. - ``/project/scVMAP/result``: Path for saving intermediate files and result files. - ``/project/scVMAP/scATAC``: This path stores the input scATAC-seq data in RDS format. File example: `scATAC-seq-example `_. - ``/project/scVMAP/variant``: This path stores the input phenotype data in BED or TXT format. File example: `Trait-example `_. 1.3.3.4 Execute command """"""""""""""""""""""""""""""""""""" The command is as follows: :: /bin/Rscript $basePath/code/run.R $basePath $identifier $scFile $variantFile $genome $layer For example: :: /bin/Rscript /project/scVMAP/code/run.R /project/scVMAP 7627190552 26286db074_GSE139369_ELM_sim_0.7_ATAC.rds 4880d7db5c_BBJ_RBC_64.bed hg19 counts Parameter description: .. py:function:: core_process($basePath, $identifier, $scFile, $variantFile, $genome, $layer); Command-line Parameter Description :param string $basePath: The root path for processing. :param string $identifier: A unique ID used as the filename to save the results. :param string $scFile: The filename of the scATAC-seq data. :param string $variantFile: The filename of the trait file. :param string $genome: The reference genome (hg19 or hg38). :param string $layer: The name of the layer for the counts matrix in the RDS file, e.g., "counts". :rtype: None 1.4 Differential gene activity and pathway enrichment analysis across cell types -------------------------------------------------------------------------------- Please see `scVMAP-reproducibility-SnapATAC2 `_ for the detailed workflow code. 1.4.1 Differential activity genes of cell types ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ We used SnapATAC2 to calculate gene activity data for single cells. Then, we calculated differentially active genes across cell types using SCANPY. Here is the SnapATAC2 code content: .. code-block:: python # Create the cell by gene activity matrix # The `adata` represents the result preprocessed by SnapATAC2. selected_list = np.array(list(adata.var["selected"])) selected_adata = adata[:, selected_list] gene_matrix = snap.pp.make_gene_matrix(selected_adata, genome_anno) # normalize sc.pp.filter_genes(gene_matrix, min_cells=min_cells) sc.pp.normalize_total(gene_matrix) sc.pp.log1p(gene_matrix) sc.external.pp.magic(gene_matrix, solver="approximate") Here is the SCANPY code content: .. code-block:: python def get_difference_genes( adata: AnnData, cluster: str, method: _Method = "wilcoxon", cell_anno: Optional[DataFrame] = None, diff_genes_file: Optional[str] = None ) -> AnnData: import scanpy as sc # add cell annotation information adata.obs = add_cluster_info(adata.obs, cell_anno, cluster) if "log1p" not in adata.uns_keys(): adata.uns["log1p"] = {} if "base" not in adata.uns["log1p"].keys(): adata.uns["log1p"].update({"base": None}) # gene ul.log(__name__).info("Rank genes for characterizing groups.") with warnings.catch_warnings(): warnings.simplefilter("ignore") sc.tl.rank_genes_groups(adata=adata, groupby=cluster, method=method, use_raw=False) # get difference genes for each `cluster` diff_genes = adata.uns['rank_genes_groups']['names'] # gene names gene_list: list = list(adata.var.index) gene_list.sort() gene_dict: dict = dict(zip(gene_list, range(len(gene_list)))) # obs cluster_info: DataFrame = adata.obs.copy().groupby([cluster], as_index=False).size() cluster_info.index = cluster_info[cluster].astype(str) cluster_list: list = cluster_info.index.tolist() cluster_list.sort() _shape_ = (adata.shape[1], cluster_info.shape[0]) diff_genes_score_matrix: matrix_data = np.zeros(_shape_) diff_genes_p_value_matrix: matrix_data = np.zeros(_shape_) diff_genes_adjusted_p_value_matrix: matrix_data = np.zeros(_shape_) diff_genes_log2_fold_change_matrix: matrix_data = np.zeros(_shape_) del _shape_ # cluster for _cluster_ in cluster_list: ul.log(__name__).info(f"Obtaining differentially expressed genes for `cluster` ({_cluster_}).") # obtain cluster difference gene data _cluster_data_: DataFrame = sc.get.rank_genes_groups_df(adata, group=_cluster_) _cluster_index_: int = cluster_list.index(_cluster_) # Add data value for _gene_name_, _score_, _p_value_, _adjusted_p_value_, _log2_fold_change_ in zip( _cluster_data_["names"], _cluster_data_["scores"], _cluster_data_["pvals"], _cluster_data_["pvals_adj"], _cluster_data_["logfoldchanges"] ): _gene_index_: int = gene_dict[_gene_name_] diff_genes_score_matrix[_gene_index_, _cluster_index_] = 0 if np.isnan(_score_) else _score_ diff_genes_p_value_matrix[_gene_index_, _cluster_index_] = 1 if np.isnan(_p_value_) else _p_value_ diff_genes_adjusted_p_value_matrix[_gene_index_, _cluster_index_] = 1 if np.isnan(_adjusted_p_value_) else _adjusted_p_value_ diff_genes_log2_fold_change_matrix[_gene_index_, _cluster_index_] = 0 if np.isnan(_log2_fold_change_) else _log2_fold_change_ del _cluster_data_, _cluster_index_ set_inf_value(diff_genes_score_matrix) set_inf_value(diff_genes_p_value_matrix) set_inf_value(diff_genes_adjusted_p_value_matrix) set_inf_value(diff_genes_log2_fold_change_matrix) diff_genes_p_value_matrix[diff_genes_p_value_matrix == 0] = np.min(diff_genes_p_value_matrix[diff_genes_p_value_matrix != 0]) diff_genes_adjusted_p_value_matrix[diff_genes_adjusted_p_value_matrix == 0] = np.min(diff_genes_adjusted_p_value_matrix[diff_genes_adjusted_p_value_matrix != 0]) # create diff_genes_adata: AnnData = AnnData(diff_genes_score_matrix, obs=adata.var, var=cluster_info) diff_genes_adata.layers["p_value"] = diff_genes_p_value_matrix diff_genes_adata.layers["adjusted_p_value"] = diff_genes_adjusted_p_value_matrix diff_genes_adata.layers["log2_fold_change"] = diff_genes_log2_fold_change_matrix # Add diff_genes diff_genes_adata.uns["diff_genes"] = diff_genes if diff_genes_file is not None: save_h5ad(diff_genes_adata, diff_genes_file) return diff_genes_adata 1.4.2 Gene pathway enrichment analysis ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ We performed gene pathway enrichment analysis (GPEA) using the top ``50`` genes ranked by ``scores`` within the differentially activity genes of each cell type. This pipeline was implemented using the GSEApy package, with annotated gene sets including Gene Ontology, KEGG pathways, and GWAS Catalog. .. code-block:: python # -*- coding: UTF-8 -*- import warnings from typing import Optional, Union, Literal import numpy as np import pandas as pd from tqdm import tqdm from anndata import AnnData from pandas import DataFrame from .. import util as ul from ..util import check_adata_get, to_dense __name__: str = "preprocessing_gsea" _Datasets = Optional[Literal['Human', 'Mouse', 'Yeast', 'Fly', 'Fish', 'Worm']] def gsea_enrichr( gene_list: list[str], gene_sets: Union[list[str], set] = ( "GO_Biological_Process_2023", "GO_Cellular_Component_2023", "GO_Molecular_Function_2023", "GWAS_Catalog_2023", "KEGG_2016" ), organism: _Datasets = "human", is_verbose: bool = True, output_dir: Optional[str] = None ) -> DataFrame: import gseapy as gp # noinspection PyTypeChecker names = gp.get_library_name(organism) if not set(gene_sets).issubset(set(names)): ul.log(__name__).error(f"The set of the {gene_sets} needs to include `gp.get_library_name(organism)`") raise ValueError(f"The set of the {gene_sets} needs to include `gp.get_library_name(organism)`") if output_dir is not None: ul.file_method(__name__).makedirs(output_dir) if is_verbose: ul.log(__name__).info("GSEA enrichr.") # noinspection PyTypeChecker gsea = gp.enrichr(gene_list=gene_list, gene_sets=list(gene_sets), organism=organism, outdir=output_dir) return gsea.results def get_gene_enrichment( adata: AnnData, top_number: int = 50, threshold: float = 0.01, layer: Optional[str] = None, is_order_or_lt: bool = True, is_top: bool = True, gene_sets: Union[list[str], set] = ( "GO_Biological_Process_2023", "GO_Cellular_Component_2023", "GO_Molecular_Function_2023", "GWAS_Catalog_2023", "KEGG_2016" ), organism: _Datasets = "human", output_dir: Optional[str] = None, ) -> DataFrame: ul.log(__name__).info("Gene enrichment analysis.") # get data new_adata = check_adata_get(adata, layer=layer) enrichr_data_list = [] cluster_list: list[str] = list(new_adata.var.index) gene_list: list = list(new_adata.obs_names) if is_top and len(gene_list) < top_number: ul.log(__name__).warning(f"The number of parameters `top_number` ({top_number}) is greater than the number of genes ({len(gene_list)}).") top_number = len(gene_list) # Add data for cluster in tqdm(cluster_list): # get index _values_ = to_dense(new_adata[:, cluster].X, is_array=True).flatten() if is_top: _index_ = np.argsort(_values_)[0:top_number] if is_order_or_lt else np.argsort(_values_)[-top_number:] _cluster_gene_list_: list[str] = list(np.array(gene_list)[_index_]) else: _index_ = _values_ <= threshold if is_order_or_lt else _values_ >= threshold _cluster_gene_list_: list[str] = list(np.array(gene_list)[_index_]) if len(_cluster_gene_list_) == 0: ul.log(__name__).warning(f"The gene list for cluster {cluster} is empty.") continue with warnings.catch_warnings(): warnings.simplefilter("ignore") enrichr_data: DataFrame = gsea_enrichr( gene_list=_cluster_gene_list_, gene_sets=gene_sets, organism=organism, is_verbose=False, output_dir=output_dir ) enrichr_data["cluster"] = cluster enrichr_data_list.append(enrichr_data) del _values_, _index_, _cluster_gene_list_ data = pd.concat(enrichr_data_list, axis=0) return data