JASPAR 2022 motif discovery pipeline and motif annotation

Summary

JASPAR (1) is an open-access database of curated, non-redundant transcription factor (TF)-binding profiles stored as position frequency matrices (PFMs) for TFs across multiple species in six taxonomic groups: Vertebrates, Plants, Fungi, Insects and Urochordata.

For JASPAR 2022 release we curated motifs derived from the following sources:

Human TF dimers motifs derived from SELEX (2)
Human TF motifs detected by SELEX in free DNA (3)
Human TF motifs detected by SELEX in nucleosomal DNA (3)
Human Zinc finger TF motifs derived from ChIP-seq (4)
Human Zinc finger TF motifs derived from ChIP-exo (4)
Ciona robusta motifs derived from SELEX (5)
Yeast motifs derived from ChIP-exo (6)

Motif discovery pipeline

Software required

Regulatory sequences analysis tools RSAT (7)
R >= 3.6
Perl TFBS modules: http://tfbs.genereg.net/
snakemake: https://snakemake.readthedocs.io/en/stable/index.html
pdfunite

Configuration file

Set the following variables in the config file:

Software variables

Assuming that you are in the JASPAR_2022_motif_discovery_and_curation_pipeline folder:

bin : bin
python : python3.6
RSAT : /lsc/rsat

You can see where RSAT program are using the following command: echo $RSAT

Organism specific variables

data_folder : This is the folder where the input file provided as narrowPeak must be placed. See Expected folder structure for more information about the folder and file names.
genome_fasta: The path to the .fa file with the organism’s genome sequence.
genome_size: the path to the .chrom.sizes file.
TF_Experiment_map: The path to the experiment table. The pipeline expects at least three columns in the following order: 1) Experiment ID, 2) Condition/Cell-type, and 3) TF.
out_dir: The path to results folder where all the analysis will be stored, one folder per experiment. Don’t forget to indicate the genome in the path. Example: ReMap2020/Hs_hg38

Dowload genomes and chromosome size file

sacCer3 (UCSC)

Download genome from UCSC through command-line:

## Download the twoBitToFa tool for format conversion
## It is not necessary to download twoBitToFa every time you download a genome
wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/twoBitToFa 
chmod +x twoBitToFa

## Download chromosome sizes file
wget http://hgdownload.cse.ucsc.edu/goldenPath/sacCer3/bigZips/sacCer3.chrom.sizes

## Download the 2bit version of sacCer3 genome
wget http://hgdownload.cse.ucsc.edu/goldenPath/sacCer3/bigZips/sacCer3.2bit

## Get the fasta file
./twoBitToFa sacCer3.2bit sacCer3.fa

Once the .fa and .chrom.sizes were downloaded, set the genome_fasta and genome_size variables in the corresponding config.yaml file

Launching the pipeline

Expected folder structure

The Snakefile will launch a motif analysis + centrality enrichment test for each discovered motif of each dataset (experiment). To launch the pipeline we only required a narrowPeak file that must be located in the ‘data_folder’ specified in the config file (see configuration file section).

Example:

cd data_folder

.
├── ENCSR000BST_GATA3_MCF-7
│   └── ENCSR000BST_GATA3_MCF-7_peaks.narrowPeak
└── ENCSR000ATT_FOXA1_MCF-7
  └── ENCSR000ATT_FOXA1_MCF-7_peaks.narrowPeak

Every EXPERIMENT_FOLDER in the data folder must contain a narrowPeak file with the same name and the suffix _peaks, see tree above. Example:

EXPERIMENT_FOLDER = ENCSR000BST_GATA3_MCF-7
narrowPeak        = ENCSR000BST_GATA3_MCF-7_peaks.narrowPeak

This is the data structure required by the pipeline.

Launching the snakemake workflow

There are two main task/steps in this pipeline:

Step 1: motif discovery. The program RSAT peak-motifs (8) is called, this program runs four motif discovery algorithms (over-represented k-mers, over-represented spaced-k-mers, positionally biased k-mers, k-mers enriched in central windows) to discover significant motifs on each set of ChIP-seq peaks. This can be launched by running the Snakefile, you can parallelize the process using the argument –cores.

The Snakefile contain a series of rules, for the motif discovery steps, it searches for datasets in narrowPeak format.

Step 2: selecting centrally enriched motifs. Note that given the ChIP-seq dataset quality and the number of peaks may vary among the datasets, many datasets may not produce any significant motif. A priori, the number of discovered motifs is unknown, so we use checkpoints within the Snakefile.

Selecting config file

This pipeline is generic for any narrowPeak file. In order to use a single pipeline for data coming in multiple input format, we use multiple configuration files (folder config_files) containing the configuration files for multiple organisms and multiple ChIP-seq repositories (e.g., ReMap (9), GTRD (10), ChIP-atlas (11), CistromeDB (12)).

In order to run the pipeline, users must enter one of the following analysis IDs:

ReMap 2020 Homo sapiens hg38 : ReMap2020_Human
ChIP-exo S cerevisiae peaks from ChExMix : ChExMix_sacCer3

snakemake --config analysis_id=ReMap2022_Human [--cores X]

All the snakemake rules in the pipeline, except the first one (extract_peak_summits) are the same. The reason is because the input genomic coordinates could vary across the source and may require to parse the table.

In the previous chunk, the argument –config analysis_id=ReMap2022_Human indicates to the pipeline to use the config file for ReMap 2020 Homo sapiens hg38, and run its respective extract_peak_summits rule.

Troubleshooting

rule RSAT_peakmotifs_per_exp

RSAT peak-motifs none motif found: verify that the input NarrowPeak is not empty.
RSAT local-word-analysis python version: this program is written in python and called within RSAT peak-motifs. In case that the default python environment is 3.5, local-word-analysis will not run, therefore the solution is modify directly the line in the $RSAT/perl-scripts/peak-motifs script adding “python2.7” before the program is called.
RSAT position-analysis founds highly repetitive/uninformative motifs: The default interval for position-analysis is 50nt, which is too large for the JASPAR pipeline (where the peaks have a size of 101), this may produce non-relevant artifact motifs. By changing the interval to 25 the program may found the expected TF motif. Note that this change was already set in the config file.

rule annotate_best_centrimo_experiment

Verify that the variables in the awk command correspond to the correct fields in your experiment table.

rule best_centrimo_experiment_logo

Input files not found: this may occur when the Experiment_TF ID contains certain characters as: (, ), [, ], ., /, " "(one or more spaces). Before launching the pipeline, verify that none of the folders in your data folder contains such characters, the simplest option is remove them or substitute them by ‘-’.

Memory requirements

If you launch the Snakefile using multiple cores (See Launching the snakemake* workflow* section), each core will use ~1GB of memory in your computer.

Collecting and preparing peaks datasets

Collecting and preparing ChIP-exo peaks derived from ChExMix in yeast (sacCer3)

ChIP-exo datasets were kindly provided by Shaun Mahony derived from the ChExMix peak-caller (6).

The ChIP-exo peaks were parsed to the required format using the following script:

Command:

peaks_in_folder : directory containing the ChExMix peaks
peaks_out_folder : directory that will containg the parsed ChExMix peaks
suffix : suffix for the putput files (*_peaks.narrowPeak*)


## For ChIP-seq derived motifs
Rscript ChExMix_sacCer3/pre-process_scripts/Preprocess_chexmix_peaks.R \
--peaks_in_folder ChExMix_sacCer3/data/yep-peaks/ \
--peaks_out_folder ChExMix_sacCer3/data/Yeast_SacCer3 \
--suffix '_peaks.narrowPeak'

This script will pre-process the ChExMix peaks in the format required by this pipeline and will also produce the TF - Experiment ID table that is required by this pipeline to annotate the motifs.

Required software:

R packages:
- data.table
- dplyr
- optparse

Collecting and preparing motifs for curation

Pre-processing of human zinc finger motifs from Dogan 2020

The ChIP-seq and ChIP-exo motif libraries were downloaded from the supplementary material from (4).

Each motif comes in a separated file in cis-bp motif format. The file name corresponds to their uniprot ID.

Run the following command to map the correct TF name from the uniprot ID and convert the motifs from cis-bp to transfac format, the latter allows to add annotations and comments to the motifs.

Download and unzip the each zip file in a separated folder.

Command:

motif_folder : directory containing the cis-bp motifs
logo_folder : directory where the logos will be stored
motif_source : prefix for naming the result transfac file containing all the motifs


## For ChIP-seq derived motifs
Rscript Zinc_Fingers_C2H2_Dogan_2020/pre-process/Rename_C2H2_motifs.R \
--motif_folder Zinc_Fingers_C2H2_Dogan_2020/data/Zinc_fingers_C2H2_ChIP-seq \
--logo_folder Zinc_Fingers_C2H2_Dogan_2020/data/Zinc_fingers_C2H2_ChIP-seq/Logos \
--motif_source ChIP-seq


## For ChIP-exo derived motifs
Rscript Zinc_Fingers_C2H2_Dogan_2020/pre-process/Rename_C2H2_motifs.R \
--motif_folder Zinc_Fingers_C2H2_Dogan_2020/data/Zinc_fingers_C2H2_ChIP-exo \
--logo_folder Zinc_Fingers_C2H2_Dogan_2020/data/Zinc_fingers_C2H2_ChIP-exo/Logos \
--motif_source ChIP-exo

Required software:

RSAT
R packages:
- data.table
- dplyr
- optparse
- UniprotR

Pre-processing of Ciona robusta motifs derived from SELEX

The motifs were downloaded from the Aniseed database, in the download section.

Download the following files:

Selex seq section (SELEX-seq best round 6-mer enrichments 4.5 MB)
SELEX-seq table S1-S3, Nitta et al. 2019 2.7 MB

The first file contains the best PFM of each TF. The second file contains the annotation of the TFs (IDs, name, family, interprot ID), we extracted the first sheet and removed the first 2 lines.

Command:

input.motif.folder : directory containing the Aniseed motifs
output.motif.folder : directory where the logos and pfms will be stored
annotation.table : Aniseed annotation table
organism : Ciona_robusta (suffix to name output files)


## For Aniseed (Ciona robusta) SELEX motifs
Rscript Aniseed/pre-process_scripts/Rename_Aniseed_motifs.R \
--input.motif.folder Aniseed/data/Best_round/pfm_best_round \
--output.motif.folder Aniseed/data/Best_round \
--annotation.table Aniseed/data/Best_round/SELEX_project_TableS1_S3_Nitta_et_al_2019.csv \
--organism ChIP-Ciona_robusta

Required software:

RSAT
R packages:
- data.table
- dplyr
- optparse

Pre-processing of CAP and NCAP SELEX motifs

These motifs were directly taken from the RSAT website, in the motif databases section.

Pre-processing of human TF dimer SELEX motifs

These motifs were directly taken from the RSAT website, in the motif databases section.

After manually curation

Trim low-informative positions

Once the motifs were manually curated, in many cases is useful to trim non-informative positions, we provide a script to manually trim the positions.

Required software:

R packages:
- universalmotif
- optparse

Supported motif formats:

jaspar

Other motifs formats will be added if they are required in the pipeline.

Parameters:

  -m : (--input_motif)       Path to input motif file. (Mandatory)
  -n : (--format)            Input motif format. (Mandatory)
  -o : (--output_directory)  Output directory to export the trimmed motifs. If not indicated, the trimmed motifs are exported in the same folder with the extension *.trimmed*
  -b : (--both)              Trim b nucleotides in both sides 
  -l : (--left)              Trim l nucleotides in left side
  -r : (--right)             Trim r nucleotides in right side
  -f : (--from)              Keep nucleotides starting from f position 
  -t : (--to)                Keep nucleotides until t position

Example:

Rscript bin/Motif_Friseur.R         \
  -m Doc/ZNF506.jaspar              \
  -n jaspar                         \
  -o examples/results/Motif_friseur \
  -f 4                              \
  -t 12

Trimming the ZNF506 motif from JASPAR, the trimmed motif contains the columns 4 to 12.

Contact

Any questions related to this pipeline, please contact the following persons:

References

1. Fornes,O., Castro-mondragon,J.A., Khan,A., Lee,R.V.D., Zhang,X., Richmond,P.A., Modi,B.P., Correard,S., Gheorghe,M. and Santana-garcia,W. et al. (2020) JASPAR 2020 : update of the open-access database of transcription factor binding profiles ˇ c. 48, 87–92.

2. Jolma,A., Yin,Y., Nitta,K.R., Dave,K., Popov,A., Taipale,M., Enge,M., Kivioja,T., Morgunova,E. and Taipale,J. (2015) DNA-dependent formation of transcription factor pairs alters their binding specificity. Nature, 527, 384–8.

3. Zhu,F., Farnung,L., Kaasinen,E., Sahu,B., Yin,Y., Wei,B., Dodonova,S.O., Nitta,K.R., Morgunova,E. and Taipale,M. et al. (2018) The interaction landscape between transcription factors and the nucleosome. Nature.

4. Dogan,B., Kailasam,S., Corchado,A.H., Nikpoor,N. and Hamed,S. (2020) A domain-resolution map of in vivo DNA binding reveals the regulatory consequences of somatic mutations in zinc finger transcription factors. bioRxiv.

5. Nitta,K.R., Vincentelli,R., Jacox,E., Cimino,A., Ohtsuka,Y., Sobral,D., Satou,Y., Cambillau,C. and Lemaire,P. (2019) High-throughput protein production combined with high- throughput selex identifies an extensive atlas of ciona robusta transcription factor dna-binding specificities. In Vincentelli,R. (ed), High-throughput protein production and purification: Methods and protocols. Springer New York, New York, NY, pp. 487–517.

6. Yamada,N., Rossi,M.J., Farrell,N., Pugh,B.F. and Mahony,S. (2020) NAR Breakthrough Article Alignment and quantification of ChIP-exo crosslinking patterns reveal the spatial organization of protein – DNA complexes. Nucleic Acids Research, 10.1093/nar/gkaa618.

7. Nguyen,N.T.T., Contreras-Moreira,B., Castro-Mondragon,J.A., Santana-Garcia,W., Ossio,R., Robles-Espinoza,C.D., Bahin,M., Collombet,S., Vincens,P. and Thieffry,D. et al. (2018) RSAT 2018: Regulatory sequence analysis tools 20th anniversary. Nucleic Acids Research, 46, W209–W214.

8. Thomas-Chollier,M., Herrmann,C., Defrance,M., Sand,O., Thieffry,D. and Van Helden,J. (2012) RSAT peak-motifs: Motif analysis in full-size ChIP-seq datasets. Nucleic Acids Research, 40, 1–9.

9. Chenevy,J., Zacharie,M., Mestdagh,M., Rosnet,T., Douida,A., Rhalloussi,W., Lopez,F. and Ballester,B. (2020) ReMap 2020 : a database of regulatory regions from an integrative analysis of Human and Arabidopsis DNA-binding sequencing experiments. Nucleic Acids Research, 48, 180–188.

10. Yevshin,I., Sharipov,R. and Kolpakov,F. (2019) GTRD : a database on gene transcription regulation –– 2019 update. Nucleic Acids Research, 47, 100–105.

11. Oki,S., Ohta,T., Shioi,G., Hatanaka,H. and Ogasawara,O. (2018) ChIP-Atlas : a data-mining suite powered by full integration of public ChIP-seq data. 10.15252/embr.201846255.

12. Zheng,R., Wan,C., Mei,S., Qin,Q., Wu,Q., Sun,H., Chen,C.-H., Brown,M., Zhang,X. and Meyer,C.A. et al. (2018) Cistrome Data Browser: expanded datasets and new tools for gene regulatory analysis. Nucleic Acids Research, 10.1093/nar/gky1094.