1. scVMAP process

scVMAP: https://bio.liclab.net/scvmap/

scVMAP reproducibility: https://github.com/YuZhengM/scvmap_reproducibility

scVMAP tutorial: https://scvmap.readthedocs.io/en/latest/

scVMAP front-end: https://github.com/YuZhengM/scvmap_web

scVMAP back-end: https://github.com/YuZhengM/scvmap

scVMAP API: https://bio.liclab.net/scvmap_service/scvmap.html

scVMAP gathers and provides scATAC-seq datasets with established cell type labels (from scATAC-Ref, GreenleafLab and PlaqView) and cell type labels annotated by singleR (from scBlood), along with detailed fine-mapping data (from CAUSALdb2, FinnGen, UKBB and BBJ). scVMAP, by employing the SCAVENGE and g-chromVAR methods, assigns trait-relevant scores at single-cell resolution, uncovering biologically relevant cell types.

The core original intention and purpose of scVMAP is to provide a resource of trait-relevant significant cell types, offering researchers interested in this objective a convenient and fast retrieval interface. It uses widely recognized methods to provide reference for key variants, TFs, and genes involved in relevant potential enrichment mechanisms, offering researchers a direction for reference, reducing their time, and making a modest contribution to the scientific research community.

Throughout the process, data undergoes standardization, unified processing, and comprehensive analysis. The main contents are as follows:

Single-cell samples:
- Differential gene activity analysis of cell types
- Differential TF activity analysis of cell types
- High-scoring gene pathway enrichment analysis of cell types
Traits or diseases:
- LD-based MAGMA analysis (Trait-relevant genes)
- HOMER motif enrichment analysis (Trait-relevant TFs)
- MAGMA-based trait-relevant gene pathway enrichment analysis
Integrated analysis:
- ATAC-based Cicero analysis (Trait-relevant genes)
- ATAC-based GimmeMotifs analysis (Trait-relevant genes)
- Gene hub regulatory network analysis
- TF hub regulatory network analysis
- Integrated regulatory network analysis

The scVMAP database primarily provides the following functions: Data-browse, Search, Analysis, Statistics, Download, Contact, and Online analysis. These functionalities streamline the process of studying single-cell genetic regulation, enabling users to easily explore and interpret biological mechanisms in specific contexts.

In conclusion, (scVMAP, https://bio.liclab.net/scvmap/) is a user-centric database that facilitates the exploration and interpretation of single-cell genetic data through structured displays and dynamic visualizations. It offers intuitive workflows, customizable parameters and comprehensive data accessibility.

Tip

This section provides a comprehensive explanation of the entire pipeline for the scVMAP platform, encompassing the stages of data acquisition, curation, analysis, and the ultimate platform realization.

1.1 Data processing methods

We provide the software and version information required for data processing in the scVMAP platform, along with the corresponding code to reproduce the results.

Method	Version	Link
g-chromVAR	0.3.2	https://github.com/YuZhengM/scvmap_reproducibility/tree/main/R
SCAVENGE	1.0.2	https://github.com/YuZhengM/scvmap_reproducibility/tree/main/R
LiftOver		https://github.com/YuZhengM/scvmap_reproducibility/tree/main/variant/liftOver
SnapATAC2	2.7.1	https://github.com/YuZhengM/scvmap_reproducibility/tree/main/scATAC/SnapATAC2
SCANPY	1.11.5	https://github.com/YuZhengM/scvmap_reproducibility/tree/main/scATAC/SnapATAC2
GSEApy	1.1.9	https://github.com/YuZhengM/scvmap_reproducibility/tree/main/scATAC/SnapATAC2
chromVAR	1.24.0	https://github.com/YuZhengM/scvmap_reproducibility/tree/main/scATAC/chromVAR
Cicero	1.22.0	https://github.com/YuZhengM/scvmap_reproducibility/tree/main/scATAC/cicero
MAGMA	1.10	https://github.com/YuZhengM/scvmap_reproducibility/tree/main/variant/magma
BEDTools	2.31.1	https://github.com/YuZhengM/scvmap_reproducibility/tree/main/variant/interaction
HOMER	5.1	https://github.com/YuZhengM/scvmap_reproducibility/tree/main/variant/homer
GimmeMotifs	0.18	https://github.com/YuZhengM/scvmap_reproducibility/tree/main/variant/gimme

1.2 Data collection and curation

The scVMAP platform collects scATAC-seq data and fine-mapping result data.

1.2.1 Collection of scATAC-seq data

The collected scATAC-seq data mainly comes from scATAC-Ref, scBlood, GreenleafLab, and PlaqView sources.

A total of 183 single-cell sample data were collected, with detailed information as follows:

Download url: sample_info_with_age_sex_drug.txt

Note

When preprocessing scATAC-seq data through SnapATAC2, only the min_tsse parameter varies for different samples (Excluding input files and reference genome information).

1.2.1.1 scATAC-Ref

Source url: https://bio.liclab.net/scATAC-Ref/

We downloaded human data from scATAC-Ref and retained a total of 28 scATAC-seq datasets. We integrated all GSM samples from GSE195460 in scATAC-Ref to create the dataset “sample_id_30 (GSE195460_diabetic_kidney)”.

Detailed processing steps:

Download the RDS files.
Standardize the filenames to the format: <sample_label>_ATAC.rds.
Generate the metadata files: barcodes.tsv, matrix.mtx, and peaks.bed.
Generate the <sample_label>_fragments.tsv.gz file.
Perform preprocessing using the SnapATAC2 software with the following specific parameters:

min_num_fragments = 200,
bin_size = 500,
min_tsse = <min_tsse>,
need_features = 500000,
min_cells = 5,

data = snap.pp.import_data(
    fragment_file="<sample_label>_fragments.tsv.gz",
    chrom_sizes=<genome_anno>,
    file=<h5ad_file>,
    min_num_fragments=min_num_fragments,
    sorted_by_barcode=False
)

# the standard procedures to add tile matrices, select features, and identify doublets
snap.metrics.tsse(data, <genome_anno>)

snap.pp.filter_cells(data, min_tsse=min_tsse)
snap.pp.add_tile_matrix(data, bin_size=bin_size)
snap.pp.select_features(data, n_features=need_features)

snap.pp.scrublet(data, features=features)
snap.pp.filter_doublets(data)

Please see scVMAP-reproducibility-SnapATAC2 for the detailed workflow code.

Note

The cell type labels were annotated based on the original publications of each scATAC-seq sample and are highly reliable.

For all single-cell samples except “sample_id_20” (Alzheimer’s Disease (AD)), we utilized the uniform manifold approximation and projection (UMAP) coordinates from their original collection sources for clustering. Due to the poor clustering performance of the original UMAP coordinates for “sample_id_20”, we processed the binarized sparse counts matrix using SnapATAC2, converted it to a term frequency-inverse document frequency (TF-IDF) matrix, and subsequently extracted representative low-dimensional features through latent semantic indexing (LSI). Finally, we performed UMAP on this low-dimensional feature matrix to obtain the coordinates for “sample_id_20”.

1.2.1.2 scBlood

Source url: https://bio.liclab.net/scBlood/

We downloaded human data from scBlood and retained a total of 152 scATAC-seq datasets.

The processing pipeline is identical to that of scATAC-Ref.

Note

The cell type labels were annotated with SingleR. Their reliability should be treated as indicative.

1.2.1.3 GreenleafLab

Source url: https://github.com/GreenleafLab/MPAL-Single-Cell-2019

We downloaded a scATAC-seq dataset for PBMC.

The processing pipeline is identical to that of scATAC-Ref.

1.2.1.4 PlaqView

Source url: https://www.plaqview.com/

We downloaded a scATAC-seq dataset for coronary artery disease (CAD).

The processing pipeline is identical to that of scATAC-Ref.

1.2.1.5 Summary

Here are the specific parameter settings for min_tsse.

Sample ID	min_tsse
sample_id_1-30	8
sample_id_31-183	5

The scATAC-seq data is obtained through download page.

Cell type annotations were directly assigned from their original articles, whereas the scATAC-seq samples obtained from scBlood were annotated using the SingleR software.

Besides cell type annotation, we also performed annotation for age, sex, and drug resistance, involving 24, 19, and 2 samples, respectively. It can be viewed via the browser page.

1.2.2 Collection of trait data

FINEMAP fine-mapping result download url (15805): trait_info.xlsx

SuSiE fine-mapping result download url (79): trait_info_susie.xlsx

1.2.2.1 FINEMAP

The collected FINEMAP fine-mapping result data comes from CAUSALdb2, UKBB, FinnGen, and BJJ sources.

Source ID	Source name	Source author, year	Source description	PMID	Source genome	Trait count	Filter count	Source link	Source version
source_id_1	CAUSALdb	Jianhua Wang et al., 2024	CAUSALdb fine-mapping results	31691819	hg19	15038	14417	http://www.mulinlab.org/causaldb/index.html	Release 2.1
source_id_2	UKBB	Wang, Q.S. et al., 2021	UKBB fine-mapping results	—	hg19	94	94	https://www.finucanelab.org/data	Release 1.1
source_id_3	FinnGen	Kurki, M.I. et al., 2023	FinnGen fine-mapping results	36653562	hg38	1234	1215	https://www.finngen.fi/en/access_results	R11
source_id_4	BBJ	Kanai, M. et al., 2021	BBJ fine-mapping results	—	hg19	79	79	https://humandbs.dbcls.jp/en/hum0197-v18#hum0197.v5.gwas.v1	—

For each trait, we retained the variants with a causal variant probability value (PP) > 0.001 calculated by FINEMAP. As a result, we retained 15,805 traits from the initial 16,445 traits and used them for subsequent analysis.

1.2.2.2 SuSiE

The collected 79 SuSiE fine-mapping results (PP > 0.001) come from BJJ source.

1.2.2.3 Summary

The trait data is obtained through download page.

To harmonize genomic coordinates between variants and scATAC-seq data, we performed LiftOver to convert variant positions to match the reference genome version used in single-cell analysis. Next, we manually categorized them into a broad array of classifications, including diseases, indicators, drugs, chemical compounds, health care, treatments, therapies and symptoms. For the disease category, we annotated diseases according to ICD-10, encompassing all 22 major disease categories, and further subclassifying them into more than 250 specific subcategories to provide an intuitive and convenient reference.

Please see scVMAP-reproducibility-Trait for the detailed workflow code.

It can be viewed via the browser page.

Tip

Currently, SuSiE fine-mapping results can only be viewed through the “Data-browse” interface. Other interfaces, such as “Search” and “Analysis”, do not retrieve SuSiE data and are limited to FINEMAP results. Support for SuSiE data in these interfaces will be added in a future update. In addition, we will gradually expand the data of SuSiE method and other fine mapping methods.

1.3 Variant-function-mapping at single cell resolution

Our preferred method for calculating trait relevance scores (TRSs) is SCAVENGE. Since the SCAVENGE method utilizes the g-chromVAR approach, scVMAP supports both g-chromVAR and SCAVENGE for computing TRSs.

Specific process code: scVMAP-R

1.3.1 g-chromVAR

Building upon the original g-chromVAR codebase, we have addressed the issues caused by NA/INF values and refined the corresponding implementation.

1.3.2 SCAVENGE

The values of some parameter settings.

Hyper parameter	Value
Dimensionality of latent semantic indexing (LSI)	30
Number of neighbors in mutual k-nearest neighbors (M-kNN)	30
Restart value for random walk	0.05

1.3.3 Code for using scVMAP

We have consolidated the SCAVENGE workflow into a single command. You can view the code here: scVMAP-exec_R_code

The specific implementation process is as follows:

1.3.3.1 Install R packages

Refer to file scVMAP-install_R_packages for installation.

1.3.3.2 Download files

The following two files need to be downloaded through link scVMAP-exec_R_code:

integration.R: A file for g-chromVAR and SCAVENGE algorithm code.
run.R: An R script for executing the g-chromVAR and SCAVENGE algorithms via Rscript in the Linux terminal.

Tip

The online analysis function of scVMAP uses this script command.

1.3.3.3 Create formatted directories

Create a root path: /project/scVMAP.

Other path details:

/project/scVMAP/code: Path for the integration.R and run.R files.

/project/scVMAP/result: Path for saving intermediate files and result files.

/project/scVMAP/scATAC: This path stores the input scATAC-seq data in RDS format. File example: scATAC-seq-example.

/project/scVMAP/variant: This path stores the input phenotype data in BED or TXT format. File example: Trait-example.

1.3.3.4 Execute command

The command is as follows:

/bin/Rscript $basePath/code/run.R $basePath $identifier $scFile $variantFile $genome $layer

For example:

/bin/Rscript /project/scVMAP/code/run.R /project/scVMAP 7627190552 26286db074_GSE139369_ELM_sim_0.7_ATAC.rds 4880d7db5c_BBJ_RBC_64.bed hg19 counts

Parameter description:

core_process($basePath, $identifier, $scFile, $variantFile, $genome, $layer);

Command-line Parameter Description

Parameters:

$basePath (string) – The root path for processing.
$identifier (string) – A unique ID used as the filename to save the results.
$scFile (string) – The filename of the scATAC-seq data.
$variantFile (string) – The filename of the trait file.
$genome (string) – The reference genome (hg19 or hg38).
$layer (string) – The name of the layer for the counts matrix in the RDS file, e.g., “counts”.

Return type:

None

1.4 Differential gene activity and pathway enrichment analysis across cell types

Please see scVMAP-reproducibility-SnapATAC2 for the detailed workflow code.

1.4.1 Differential activity genes of cell types

We used SnapATAC2 to calculate gene activity data for single cells. Then, we calculated differentially active genes across cell types using SCANPY.

Here is the SnapATAC2 code content:

# Create the cell by gene activity matrix
# The `adata` represents the result preprocessed by SnapATAC2.
selected_list = np.array(list(adata.var["selected"]))
selected_adata = adata[:, selected_list]
gene_matrix = snap.pp.make_gene_matrix(selected_adata, genome_anno)

# normalize
sc.pp.filter_genes(gene_matrix, min_cells=min_cells)
sc.pp.normalize_total(gene_matrix)
sc.pp.log1p(gene_matrix)
sc.external.pp.magic(gene_matrix, solver="approximate")

Here is the SCANPY code content:

def get_difference_genes(
    adata: AnnData,
    cluster: str,
    method: _Method = "wilcoxon",
    cell_anno: Optional[DataFrame] = None,
    diff_genes_file: Optional[str] = None
) -> AnnData:

    import scanpy as sc

    # add cell annotation information
    adata.obs = add_cluster_info(adata.obs, cell_anno, cluster)

    if "log1p" not in adata.uns_keys():
        adata.uns["log1p"] = {}

    if "base" not in adata.uns["log1p"].keys():
        adata.uns["log1p"].update({"base": None})

    # gene
    ul.log(__name__).info("Rank genes for characterizing groups.")
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        sc.tl.rank_genes_groups(adata=adata, groupby=cluster, method=method, use_raw=False)

    # get difference genes for each `cluster`
    diff_genes = adata.uns['rank_genes_groups']['names']

    # gene names
    gene_list: list = list(adata.var.index)
    gene_list.sort()
    gene_dict: dict = dict(zip(gene_list, range(len(gene_list))))

    # obs
    cluster_info: DataFrame = adata.obs.copy().groupby([cluster], as_index=False).size()
    cluster_info.index = cluster_info[cluster].astype(str)

    cluster_list: list = cluster_info.index.tolist()
    cluster_list.sort()

    _shape_ = (adata.shape[1], cluster_info.shape[0])
    diff_genes_score_matrix: matrix_data = np.zeros(_shape_)
    diff_genes_p_value_matrix: matrix_data = np.zeros(_shape_)
    diff_genes_adjusted_p_value_matrix: matrix_data = np.zeros(_shape_)
    diff_genes_log2_fold_change_matrix: matrix_data = np.zeros(_shape_)
    del _shape_

    # cluster
    for _cluster_ in cluster_list:
        ul.log(__name__).info(f"Obtaining differentially expressed genes for `cluster` ({_cluster_}).")
        # obtain cluster difference gene data
        _cluster_data_: DataFrame = sc.get.rank_genes_groups_df(adata, group=_cluster_)
        _cluster_index_: int = cluster_list.index(_cluster_)

        # Add data value
        for _gene_name_, _score_, _p_value_, _adjusted_p_value_, _log2_fold_change_ in zip(
            _cluster_data_["names"],
            _cluster_data_["scores"],
            _cluster_data_["pvals"],
            _cluster_data_["pvals_adj"],
            _cluster_data_["logfoldchanges"]
        ):
            _gene_index_: int = gene_dict[_gene_name_]
            diff_genes_score_matrix[_gene_index_, _cluster_index_] = 0 if np.isnan(_score_) else _score_
            diff_genes_p_value_matrix[_gene_index_, _cluster_index_] = 1 if np.isnan(_p_value_) else _p_value_
            diff_genes_adjusted_p_value_matrix[_gene_index_, _cluster_index_] = 1 if np.isnan(_adjusted_p_value_) else _adjusted_p_value_
            diff_genes_log2_fold_change_matrix[_gene_index_, _cluster_index_] = 0 if np.isnan(_log2_fold_change_) else _log2_fold_change_

        del _cluster_data_, _cluster_index_

    set_inf_value(diff_genes_score_matrix)
    set_inf_value(diff_genes_p_value_matrix)
    set_inf_value(diff_genes_adjusted_p_value_matrix)
    set_inf_value(diff_genes_log2_fold_change_matrix)

    diff_genes_p_value_matrix[diff_genes_p_value_matrix == 0] = np.min(diff_genes_p_value_matrix[diff_genes_p_value_matrix != 0])
    diff_genes_adjusted_p_value_matrix[diff_genes_adjusted_p_value_matrix == 0] = np.min(diff_genes_adjusted_p_value_matrix[diff_genes_adjusted_p_value_matrix != 0])

    # create
    diff_genes_adata: AnnData = AnnData(diff_genes_score_matrix, obs=adata.var, var=cluster_info)
    diff_genes_adata.layers["p_value"] = diff_genes_p_value_matrix
    diff_genes_adata.layers["adjusted_p_value"] = diff_genes_adjusted_p_value_matrix
    diff_genes_adata.layers["log2_fold_change"] = diff_genes_log2_fold_change_matrix

    # Add diff_genes
    diff_genes_adata.uns["diff_genes"] = diff_genes

    if diff_genes_file is not None:
        save_h5ad(diff_genes_adata, diff_genes_file)

    return diff_genes_adata

1.4.2 Gene pathway enrichment analysis

We performed gene pathway enrichment analysis (GPEA) using the top 50 genes ranked by scores within the differentially activity genes of each cell type.

This pipeline was implemented using the GSEApy package, with annotated gene sets including Gene Ontology, KEGG pathways, and GWAS Catalog.

# -*- coding: UTF-8 -*-

import warnings
from typing import Optional, Union, Literal

import numpy as np
import pandas as pd
from tqdm import tqdm
from anndata import AnnData
from pandas import DataFrame

from .. import util as ul
from ..util import check_adata_get, to_dense

__name__: str = "preprocessing_gsea"

_Datasets = Optional[Literal['Human', 'Mouse', 'Yeast', 'Fly', 'Fish', 'Worm']]


def gsea_enrichr(
    gene_list: list[str],
    gene_sets: Union[list[str], set] = (
        "GO_Biological_Process_2023",
        "GO_Cellular_Component_2023",
        "GO_Molecular_Function_2023",
        "GWAS_Catalog_2023",
        "KEGG_2016"
    ),
    organism: _Datasets = "human",
    is_verbose: bool = True,
    output_dir: Optional[str] = None
) -> DataFrame:

    import gseapy as gp

    # noinspection PyTypeChecker
    names = gp.get_library_name(organism)

    if not set(gene_sets).issubset(set(names)):
        ul.log(__name__).error(f"The set of the {gene_sets} needs to include `gp.get_library_name(organism)`")
        raise ValueError(f"The set of the {gene_sets} needs to include `gp.get_library_name(organism)`")

    if output_dir is not None:
        ul.file_method(__name__).makedirs(output_dir)

    if is_verbose:
        ul.log(__name__).info("GSEA enrichr.")

    # noinspection PyTypeChecker
    gsea = gp.enrichr(gene_list=gene_list, gene_sets=list(gene_sets), organism=organism, outdir=output_dir)
    return gsea.results


def get_gene_enrichment(
    adata: AnnData,
    top_number: int = 50,
    threshold: float = 0.01,
    layer: Optional[str] = None,
    is_order_or_lt: bool = True,
    is_top: bool = True,
    gene_sets: Union[list[str], set] = (
        "GO_Biological_Process_2023",
        "GO_Cellular_Component_2023",
        "GO_Molecular_Function_2023",
        "GWAS_Catalog_2023",
        "KEGG_2016"
    ),
    organism: _Datasets = "human",
    output_dir: Optional[str] = None,
) -> DataFrame:
    ul.log(__name__).info("Gene enrichment analysis.")
    # get data
    new_adata = check_adata_get(adata, layer=layer)

    enrichr_data_list = []

    cluster_list: list[str] = list(new_adata.var.index)
    gene_list: list = list(new_adata.obs_names)

    if is_top and len(gene_list) < top_number:
        ul.log(__name__).warning(f"The number of parameters `top_number` ({top_number}) is greater than the number of genes ({len(gene_list)}).")
        top_number = len(gene_list)

    # Add data
    for cluster in tqdm(cluster_list):

        # get index
        _values_ = to_dense(new_adata[:, cluster].X, is_array=True).flatten()

        if is_top:
            _index_ = np.argsort(_values_)[0:top_number] if is_order_or_lt else np.argsort(_values_)[-top_number:]
            _cluster_gene_list_: list[str] = list(np.array(gene_list)[_index_])
        else:
            _index_ = _values_ <= threshold if is_order_or_lt else _values_ >= threshold
            _cluster_gene_list_: list[str] = list(np.array(gene_list)[_index_])

        if len(_cluster_gene_list_) == 0:
            ul.log(__name__).warning(f"The gene list for cluster {cluster} is empty.")
            continue

        with warnings.catch_warnings():
            warnings.simplefilter("ignore")
            enrichr_data: DataFrame = gsea_enrichr(
                gene_list=_cluster_gene_list_,
                gene_sets=gene_sets,
                organism=organism,
                is_verbose=False,
                output_dir=output_dir
            )

        enrichr_data["cluster"] = cluster
        enrichr_data_list.append(enrichr_data)

        del _values_, _index_, _cluster_gene_list_

    data = pd.concat(enrichr_data_list, axis=0)

    return data