Preparing raw sequencing data
-----------------------------


Matching number of Dam-only and Dam-protein samples
===================================================

In the directory containing config/workflow create a directory called reads:

.. code-block:: console
    
    $ cd /path/to/analysis
    $ mdkir -p reads

Data files from each group of biological replicates should be placed into a unique folder, e.g.:

.. code-block:: console
    
    reads/
    ├── exp1
    │   ├── Dam.fastq.gz
    │   └── Piwi.fastq.gz
    ├── exp2
    │   ├── Dam.fastq.gz
    │   └── Piwi.fastq.gz
    └── exp3
        ├── Dam.fastq.gz
        └── Piwi.fastq.gz

.. note::
    
    Single-end fastq files should always end with `fastq.gz`, while paired-end reads should end with *\_R1\_001.fastq.gz/\_R2\_001.fastq.gz*.

    Also, the Dam only control should always be called Dam.fastq.gz or Dam_R1_001.fastq.gz and Dam_R2_001.fastq.gz for single-end and paired-end reads, respectively.


Non-matching number of Dam-only and Dam-protein samples
=======================================================

In some cases the number of non-Dam and Dam samples might not match. In this case, place all the read files together in the reads directory as follows:

.. code-block:: console
    
    reads/
    ├── Dam_1.fastq.gz
    ├── Dam_2.fastq.gz
    ├── Piwi_1.fastq.gz
    ├── Piwi_2.fastq.gz
    └── Piwi_3.fastq.gz

When `DamMapper` is run is this case, it will create directories in reads/ for each Dam-only sample matching all non-Dam samples. Symlinks will be created in these directories to the original files in reads/:

.. code-block:: console

    reads/
    ├── Dam_1.fastq.gz
    ├── Dam_2.fastq.gz
    ├── Piwi_1.fastq.gz
    ├── Piwi_2.fastq.gz
    ├── Piwi_3.fastq.gz
    ├── repl_1
    │   ├── Dam.fastq.gz -> /mnt/4TB_SSD/analyses/DamID/test/reads/Dam_1.fastq.gz
    │   └── Piwi.fastq.gz -> /mnt/4TB_SSD/analyses/DamID/test/reads/Piwi_1.fastq.gz
    ├── repl_2
    │   ├── Dam.fastq.gz -> /mnt/4TB_SSD/analyses/DamID/test/reads/Dam_2.fastq.gz
    │   └── Piwi.fastq.gz -> /mnt/4TB_SSD/analyses/DamID/test/reads/Piwi_1.fastq.gz
    ├── repl_3
    │   ├── Dam.fastq.gz -> /mnt/4TB_SSD/analyses/DamID/test/reads/Dam_1.fastq.gz
    │   └── Piwi.fastq.gz -> /mnt/4TB_SSD/analyses/DamID/test/reads/Piwi_2.fastq.gz
    ├── repl_4
    │   ├── Dam.fastq.gz -> /mnt/4TB_SSD/analyses/DamID/test/reads/Dam_2.fastq.gz
    │   └── Piwi.fastq.gz -> /mnt/4TB_SSD/analyses/DamID/test/reads/Piwi_2.fastq.gz
    ├── repl_5
    │   ├── Dam.fastq.gz -> /mnt/4TB_SSD/analyses/DamID/test/reads/Dam_1.fastq.gz
    │   └── Piwi.fastq.gz -> /mnt/4TB_SSD/analyses/DamID/test/reads/Piwi_3.fastq.gz
    ├── repl_6
    │   ├── Dam.fastq.gz -> /mnt/4TB_SSD/analyses/DamID/test/reads/Dam_2.fastq.gz
    │   └── Piwi.fastq.gz -> /mnt/4TB_SSD/analyses/DamID/test/reads/Piwi_3.fastq.gz
    └── sample_matrix.csv

`sample_matrix.csv`` file contains a log of which file was symlinked to which directory:

.. code-block:: console

    dir
    "['reads/repl_1', 'reads/Dam_1.fastq.gz', 'reads/Piwi_1.fastq.gz']"
    "['reads/repl_2', 'reads/Dam_2.fastq.gz', 'reads/Piwi_1.fastq.gz']"
    "['reads/repl_3', 'reads/Dam_1.fastq.gz', 'reads/Piwi_2.fastq.gz']"
    "['reads/repl_4', 'reads/Dam_2.fastq.gz', 'reads/Piwi_2.fastq.gz']"
    "['reads/repl_5', 'reads/Dam_1.fastq.gz', 'reads/Piwi_3.fastq.gz']"
    "['reads/repl_6', 'reads/Dam_2.fastq.gz', 'reads/Piwi_3.fastq.gz']"


Sample meta data and analysis settings
--------------------------------------

The config/ directory contains `samples.csv` with sample meta data as follows:

+-----------+----------+-----------+
| sample    | genotype | treatment |
+===========+==========+===========+
|Piwi       | WT       | None      |
+-----------+----------+-----------+
|Dam        | WT       | None      |
+-----------+----------+-----------+

`config.yaml` in the same directory contains the settings for the analysis:

.. code-block:: yaml
    
    genome: dm6
    ensembl_genome_build: 110
    plasmid_fasta: none # Path to plasmid fasta file with sequences to be removed
    fusion_genes: 
        genes: FBgn0004872 # Ensembl gene IDs for genes to be masked from the fasta file
        feature_to_mask: "exon" # Gene feature to mask from the fasta file (exon or gene)
    damidseq_pipeline:
        normalization: kde # kde, rpm or rawbins
        binsize: 300
        extra: "" # extra argument for damidseq_pipeline
    quantile_normalisation:
        apply: True
        extra: "" # extra argument for quantile_normalization
    deeptools:
        bamCoverage: # bam to bigwig conversion for QC
            binSize: 10
            normalizeUsing: RPKM
            extra: ""
    matrix: # Settings for computeMatrix
        mode: scale-regions # scale-regions or reference-point
        referencePoint: TSS # TSS, TES, center (only for reference-point mode)
        regionBodyLength: 6000
        upstream: 3000
        downstream: 3000
        binSize: 100
        averageTypeBins: mean
        regionsFileName: "" # BED or GTF file(s) with regions of interest (optional, whole genome if not specified)
        no_whole_genome: False # If True, will omit whole genome as region and only use regionsFileName(s)
        extra: "" # Any additional parameters for computeMatrix
    plotHeatmap:
        interpolationMethod: auto
        plotType: lines # lines, fill, se, std
        colorMap: viridis # https://matplotlib.org/2.0.2/users/colormaps.html
        alpha: 1.0
        extra: "" 
    peak_calling_perl:
        run: True
        iterations: 5 # N argument
        fdr: 0.01
        fraction: 0 # Fraction of random fragments to consider per iteration
        min_count: 2 # Minimum number of reads to consider a peak
        min_quantile: 0.95 # Minimum quantile for considering peaks
        step: 0.01 # Stepping for quantiles
        unified_peaks: max # Method for calling peak overlaps. 'min': call minimum overlapping peak area. 'max': call maximum overlap as peak
        extra: "" 
    peak_calling_macs3:
        run: False
        mode: narrow
        qvalue: 0.05 # for narrow peaks
        broad_cutoff: 0.1 # for broad peaks
        extra: ""
    consensus_peaks:
        max_size: 10 # Maximum size of peaks to be extended
        extend_by: 40 # Number of bp to extend peaks on either side
        keep: 2 # Minimum number peaks that must overlap to keep
        enrichment_analysis: 
        run: True # Perform enrichment analysis
        dbs: ["GO_Molecular_Function_2018","GO_Biological_Process_2018","KEGG_2019"]
        terms: 10 # Number of terms to plot
    resources: # computing resources
        trim:
            cpu: 8
            time: 60
        fastqc:
            cpu: 4
            time: 60
        damid:
            cpu: 24
            time: 720
            tmpdir: /tmp
        index:
            cpu: 40
            time: 60
        deeptools:
            cpu: 8
            time: 90
        plotting:
            cpu: 2
            time: 20


A lot of the DamID signal can come from the plasmids that are used to express the Dam-POIs, and this can skew the analysis.

To prevent this, two approaches are available:

1.  The genes (Ensembl gene IDs) fused to Dam can be set in config.yaml["fusion_genes] (separated by commas if multiple plasmids are used). This will mask the features set in config > fusion_genes > feature_to_mask (exons or gene) of these genes in the fasta file that will be used to build the Bowtie2 index, hence excluding these regions from the analysis. 

.. note::
    
    To disable this function set the value of config.yaml["fusion_genes"] to "".

2. If a plasmid is used that for example also uses an endogenous promoter besides the Dam fusion proteins, one can set a path to a fasta file containg all the plasmid sequences in config.yaml[""]. Trimmed reads are first aligned to these sequences, and the resulting non-aligning reads will then be processed as normal.

It is recommended to store this file in a directory called resources within the analysis folder (this folder will also contain all other non-experimental files such as fasta and gtf files).

.. note::
    
    To disable this function set the value of config.yaml["plasmid_fasta"] to none.


Configuration of Snakemake
--------------------------

Running Snakemake can entail quite a few command line flags. To make this easier these can be set in a global profile that is defined in a user-specific configuration directory in order to simplify this process.

.. code-block:: yaml
    
    cores: 40
    latency-wait: 20
    use-conda: True
    use-apptainer: True
    keep-going: False
    rerun-incomplete: True
    printshellcmds: True
    show-failed-logs: True


When running on a slurm-based HPC, the following lines should be included in `config.yaml`:

.. code-block:: yaml
    
    executor: slurm
    jobs: 100
    apptainer-args: "--bind '/parent_dir/of/analysis'" # if analysis in not in /home/$USER
    local-cores: 4 # Limit core usage for local rules
    default-resources:
            slurm_partition: icelake
            slurm_account: <ACCOUNT>


Some system have limited space allocated to /tmp, which can be problematic when using Apptainer. Add the following line to ~/.bashrc to set a different temporary directory location:

.. code-block:: console
    
    export APPTAINER_TMPDIR=~/path/to/tmpdir


Dry-run of the analysis
-----------------------

Before running the actual analyis with your own data, a dry-run can be performed:

.. code-block:: console
    
    $ snakemake -np


Visualization of the workflow
-----------------------------

To visualize the workflow run (this command excludes the target rule from the rule graph):

.. code-block:: console
    
    $ mkdir -p images
    $ snakemake --forceall --rulegraph | grep -v '\-> 0\|0\[label = \"all\"' | dot -Tpng > images/rule_graph.png

.. figure:: images/rule_graph.png
    :align: center
    :width: 1000

    Rule graph of the Snakemake workflow.


Running the analysis
--------------------

After a successful dry-run, the actual analysis can be initiated as follows:

.. code-block:: console
    
    $ snakemake --profile /home/user/.config/snakemake/profile


.. note::
    
    Do not use `~` in the path to the config file directory. Use the full path instead.


Report of the results
---------------------

When the analysis has finished succesfully, an HTML report can be created as follows:

.. code-block:: console
    
    $ snakemake --report report.html

This report will contain run time information for the Snakemake rules, as well as figures generated by the workflow, and the code used to create these.


Archive of the analysis
-----------------------

To archive the whole analysis, the following command can be used:

.. code-block:: console
    
    $ snakemake --archive /path/to/archive.tar.gz

This process will archive all code and configuration files under Git version control, include all input files in the archive, and incorporate the software packages from each specified Conda environment. The result is a self-contained workflow archive that can be re-executed on a clean machine with only Conda and Snakemake installed.

.. note::

    Please note that the archive is platform-specific. For instance, if it is created on a Linux system, it will run on any Linux version newer than the minimum version supported by the Conda packages at the time of archiving.