DENTIST: Test Data
This directory contains all the data and commands required to produce the results presented in the manuscript:
Arne Ludwig, Martin Pippel, Gene Myers, Michael Hiller. DENTIST – close assembly gaps with high confidence. In preparation.
Index of Files
Naming Conventions
Filename | Meaning |
---|---|
dentist | DENTIST |
pbjelly | PBJelly |
finisher_sc | FinisherSC |
lr_gapcloser | LR_gapcloser |
arrow | PacBio GenomicConsensus (quiver or arrow) |
Directory Structure
./data
Ground-truth and test assemblies with matching read data (simulated and real). The data is grouped by ground-truth assembly, which generally differs between the simulated and the real read sets (except for C. anna).
Structure of ./data/*/
assembly-reference.{fasta,dam}
: Ground-truth assembly in FASTA and Dazzler’s DAM format, respectively.

assembly-reference.mapped-raw.bed, .assembly-reference.mapped.{anno,data}
: Masks of the mapped contigs in BED and Dazzler format, respectively.

assembly-test.{fasta,dam}
: Test assembly with “copied gaps” (see mapped contigs above).

closable-gaps.assembly-reference.reads-simulated-pb.json
: Report on closable gaps.

reads-simulated-pb.{fasta,db,mapping.csv}
: Simulated read data and sample locations (mapping) generated with:

        simulator \
            -m25000 -s12500 -e.13 -r$(<seed) -Mreads-simulated.mapping.csv \
            assembly-reference.dam | \
        tee reads-simulated-pb.fasta | \
        fasta2DB -i reads-simulated-pb.db

reads-real-pb.{fasta,db}, reads-real-pb/*.bam
: Real PacBio read data in FASTA and Dazzler DB format as well as raw sequencing data in BAM format.

reads-real-onp.{fasta,dam}
: Real ONT read data in FASTA and Dazzler DAM format (yes, the `onp` in the file name is a typo).

seed
: Seed for the random number generator of the read simulator.
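The layout above can be checked before starting a run. The following is a minimal sketch, not part of the repository: `missing_files` is a hypothetical helper that prints every listed file name that does not exist in a given dataset directory. Which files a dataset actually provides depends on whether it holds simulated or real reads, so the list of names to check is left to the caller.

```shell
#!/bin/sh
# Sketch (hypothetical helper, not shipped with this repository):
# print every expected file name that is absent from a dataset directory.
# Usage: missing_files DIR NAME...
missing_files() {
    dir="$1"
    shift
    for name in "$@"; do
        # -e is true for any existing path (regular file or directory)
        [ -e "$dir/$name" ] || printf '%s\n' "$name"
    done
}
```

For example, `missing_files ./data/d_melanogaster assembly-reference.fasta assembly-test.fasta reads-simulated-pb.fasta seed` prints nothing when all four files are present.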
List of subfolders
- ./data/d_melanogaster: Drosophila melanogaster with simulated reads. This contains the simulated reads for the coverage series as well.
- ./data/d_melanogaster_pacbio: Drosophila melanogaster with real PacBio reads.
- ./data/a_thaliana: Arabidopsis thaliana with simulated reads.
- ./data/a_thaliana_pacbio: Arabidopsis thaliana with real PacBio reads.
- ./data/c_anna: Calypte anna with simulated and real PacBio reads.
- ./data/h_sapiens: Homo sapiens with simulated reads.
- ./data/h_sapiens_real: Homo sapiens with real PacBio and ONT reads.
./results
Gap-closed assemblies and results of the automatic evaluation.
./source
Scripts and workflow files required to run the gap closing software and analysis.
Executing the Tools
DENTIST
    cd ./source/dentist

    COMPARISON_DATASETS=(
        d_melanogaster/simulated-pb
        d_melanogaster_pacbio/real-pb
        a_thaliana/simulated-pb
        a_thaliana_pacbio/real-pb
        c_anna/simulated-pb
        c_anna/real-pb
        h_sapiens/simulated-pb
        h_sapiens_real/real-pb
        h_sapiens_real/real-onp
    )

    # runs for comprehensive comparison
    for DATASET in "${COMPARISON_DATASETS[@]}"
    do
        SKIP_LACHECK=1 ./snakemake_dentist.sh "$DATASET" \
            -p --profile=slurm --restart-times=2
        ../check-result.sh dentist "$DATASET" all
    done

    # runs for scaffolding analysis
    for DATASET in "${COMPARISON_DATASETS[@]}"
    do
        SKIP_LACHECK=1 ./snakemake_dentist.sh --base-config=scaffolding \
            "$DATASET" \
            -p --profile=slurm --restart-times=2
    done

    COVERAGE_DATASETS=(
        d_melanogaster/simulated-pb-{5,6,7,8,9,10,12,14,16,18,20,25,30,40,50,60,70,80,90,100}x
    )

    # runs for coverage analysis
    for DATASET in "${COVERAGE_DATASETS[@]}"
    do
        SKIP_LACHECK=1 ./snakemake_dentist.sh "$DATASET" \
            -p --profile=slurm --restart-times=2
        ../check-result.sh dentist "$DATASET" all
    done
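The `COVERAGE_DATASETS` array relies on Bash brace expansion to generate one dataset name per coverage value. As a small illustration (it only inspects the array and runs none of the tools), the sketch below shows what the expansion produces; it requires Bash, since plain POSIX `sh` has neither arrays nor brace expansion.

```shell
#!/usr/bin/env bash
# Brace expansion yields one array element per coverage value.
COVERAGE_DATASETS=(
    d_melanogaster/simulated-pb-{5,6,7,8,9,10,12,14,16,18,20,25,30,40,50,60,70,80,90,100}x
)

# Number of datasets in the coverage series.
echo "${#COVERAGE_DATASETS[@]} datasets"    # prints "20 datasets"

# First and last element of the series.
LAST_INDEX=$(( ${#COVERAGE_DATASETS[@]} - 1 ))
echo "${COVERAGE_DATASETS[0]}"              # prints "d_melanogaster/simulated-pb-5x"
echo "${COVERAGE_DATASETS[$LAST_INDEX]}"    # prints "d_melanogaster/simulated-pb-100x"
```

To run only part of the series, the list inside the braces can be shortened, for example to `{10,20,50}x`.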
PBJelly
    cd ./source/pbjelly

    COMPARISON_DATASETS=(
        d_melanogaster
        d_melanogaster_pacbio
        a_thaliana
        a_thaliana_pacbio
        c_anna
        c_anna_pacbio
        h_sapiens
        h_sapiens_real
    )

    for DATASET in "${COMPARISON_DATASETS[@]}"
    do
        ./pipeline.sh "$DATASET"
        ../check-result.sh pbjelly "$DATASET" all
    done
FinisherSC
    cd ./source/finisher_sc

    COMPARISON_DATASETS=(
        d_melanogaster
        d_melanogaster_pacbio
    )

    for DATASET in "${COMPARISON_DATASETS[@]}"
    do
        snakemake --configfile="config/$DATASET.yml" \
            -p --profile=slurm --restart-times=2
        ../check-result.sh finisher_sc "$DATASET" all
    done
LR_gapcloser
    cd ./source/lr_gapcloser

    COMPARISON_DATASETS=(
        d_melanogaster
        d_melanogaster_pacbio
        a_thaliana
        a_thaliana_pacbio
        c_anna
        c_anna_pacbio
        h_sapiens
        h_sapiens_real
    )

    for DATASET in "${COMPARISON_DATASETS[@]}"
    do
        snakemake --configfile="$DATASET.yml" \
            -p --profile=slurm --restart-times=2
        ../check-result.sh lr_gapcloser "$DATASET" all
    done
PacBio GenomicConsensus
    cd ./source/arrow

    COMPARISON_DATASETS=(
        d_melanogaster_pacbio
        a_thaliana_pacbio
        c_anna_pacbio
        h_sapiens_real
    )

    for DATASET in "${COMPARISON_DATASETS[@]}"
    do
        snakemake --configfile="config/$DATASET.yml" \
            -p --profile=slurm --restart-times=2
        ../check-result.sh arrow "$DATASET" all
    done