Benchmark (dorado)
1. Introduction
This benchmark measures the computation time for basecalling nanopore sequencing waveform data with Dorado on hardware equipped with eight NVIDIA H100 SXM 80GB GPUs, and compares it with an existing V100-based environment.
Dorado is a basecaller developed for Oxford Nanopore's DNA sequencers. A basecaller is software that converts waveform data obtained by measuring DNA molecules into base sequences. This transformation employs machine learning models, and GPUs are used as accelerators to rapidly and accurately convert several terabytes of waveform data into base sequences.
2. Verification GPU Environment
To compare the effective application performance, evaluations were conducted on one node equipped with H100 GPUs and one equipped with V100 GPUs. The H100-equipped node was a bare-metal server on Sakura Internet's high-performance PHY infrastructure (henceforth referred to as high-performance PHY), while the V100-equipped node was a Thin compute node (Type 2b) of the NIG supercomputer system (henceforth referred to as NIG igt). Both environments were used as single-node configurations; no multi-node application performance evaluations were conducted.
To avoid performance discrepancies due to differences in GPU drivers and other system configurations, the software environment settings from the NIG igt were adopted as verification conditions, and the same setup was replicated on the high-performance PHY.
The hardware and software environments of the nodes used for the verification are shown in Table 1.
Table 1: Verification Node Configuration
| | NIG igt | high-performance PHY |
|---|---|---|
| Hardware Configuration | | |
| CPU (Total Cores) | Intel Xeon Gold 6136 3.0GHz x 2 (24 cores) | Intel Xeon Platinum 8480 2.0GHz x 2 (112 cores) |
| Memory | DDR4 384GB | DDR5 2.0TB |
| GPU (FP64) | NVIDIA V100 SXM2 16GB (7.8 TFlops) x 4 | NVIDIA H100 SXM5 80GB (33.5 TFlops) x 8 |
| GPU Interconnect | NVLink Hybrid Cube Mesh | NVSwitch Fabric |
| System Disk | NVMe SSD 1.6TB x 1 | NVMe SSD 960GB x 2 (RAID1 configuration) |
| Data Disk | NVMe SSD 3.2TB x 1 | NVMe SSD 7.68TB x 4 |
| Software Configuration | | |
| OS | Ubuntu Server 22.04 LTS | Ubuntu Server 22.04 LTS |
| GPU Driver | 530.30.02 | 530.30.02 |
| CUDA | 12.1 | 12.1 |
| Fabric Manager | N/A | UP |
| Singularity CE | 4.0.0 | 4.0.0 |
2.1 Hardware Configuration
Regarding GPUs, the NIG igt node uses four V100 SXM2 16GB GPUs (Volta GV100 architecture) interconnected via NVLink Hybrid Cube Mesh. In contrast, the high-performance PHY node uses eight H100 SXM5 80GB GPUs (Hopper GH100 architecture) interconnected through NVSwitch, offering high-speed interconnects between the GPUs.
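The interconnect configuration described above can be checked directly on each node with nvidia-smi; the commands below are a minimal sketch of such a check and were not part of the original measurement procedure.

nvidia-smi topo -m          # print the GPU-to-GPU and GPU-to-CPU connection matrix (NVLink/NVSwitch/PCIe)
nvidia-smi nvlink --status  # report the state and per-link speed of each GPU's NVLink connections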
2.2 Software Configuration
The software stack, including the OS distribution, GPU drivers, and management tools, was configured to match as closely as possible between the NIG igt and high-performance PHY environments. This was done to minimise any discrepancies, apart from those related to the application implementation itself.
Given that NIG igt is a managed cluster, changes to drivers that are closely tied to the OS environment can be complex. Thus, the high-performance PHY, being a bare-metal server with greater flexibility, was adapted to match the NIG igt software configuration.
The software environment was set up on the Singularity container platform, which is widely used in actual field research, ensuring consistency with real-world workflows.
3. Storage Environment
Dorado requires a storage environment capable of accommodating sufficiently large datasets with high-speed read and write capabilities. Both NIG igt and high-performance PHY feature high-speed NVMe SSDs in their local storage, which were used as storage areas for application input and output.
The storage configurations of the NIG igt and high-performance PHY nodes, along with I/O performance measurements from sequential fio commands, are shown in Table 2.
Of the measurement targets, /tmp resides on the system-disk NVMe SSD and /data on the data-disk NVMe SSD. On NIG igt there was little difference in performance between /tmp and /data, whereas on high-performance PHY, /data was approximately seven times faster than /tmp.
fio Command
fio --name=test --direct=1 --filename=/data/test --rw=readwrite --bs=1m --size=80G
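To obtain the two rows per node in Table 2, the same job is simply pointed at each mount point in turn; the loop below is a minimal sketch of that procedure, not necessarily the exact sequence of invocations used.

# Run the identical sequential read/write job against the system disk (/tmp) and the data disk (/data).
for target in /tmp /data; do
  fio --name=test --direct=1 --filename=${target}/test --rw=readwrite --bs=1m --size=80G
done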
Table 2: Storage Configuration and Sequential I/O Performance
| Node | Measurement Target | READ | WRITE |
|---|---|---|---|
| NIG igt | /tmp | 533MiB/s | 531MiB/s |
| NIG igt | /data | 607MiB/s | 605MiB/s |
| high-performance PHY | /tmp | 145MiB/s | 153MiB/s |
| high-performance PHY | /data | 1052MiB/s | 1111MiB/s |
4. Performance Evaluation
Performance evaluations were conducted on both high-performance PHY and NIG igt using several versions of Dorado, employing Docker images (nanoporetech/dorado) for the software environment.
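No exact container invocation is recorded here, so the commands below are only a minimal sketch of how the nanoporetech/dorado Docker image can be run under Singularity with GPU access; the image tag and file names are assumptions.

# Convert the Docker image into a Singularity image file (the "latest" tag is an assumption).
singularity pull dorado.sif docker://nanoporetech/dorado:latest
# Run Dorado inside the container; --nv exposes the host NVIDIA GPUs and driver to the container.
singularity exec --nv dorado.sif dorado --version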
Dorado provides basecalling functionality as well as epigenetic modification detection. Three calculation modes were evaluated:
- without modification calling: A computational mode that does not determine epigenetic modification states.
- with methylation calling 5mCG (hereafter referred to as with 5mCG calling): A computational mode that considers only the 5mCG epigenetic modification state.
- with methylation calling 5mCG_5hmCG (hereafter referred to as with 5mCG/5hmCG calling): A computational mode that considers both the 5mCG and 5hmCG epigenetic modification states.
The input data used for the benchmark were FAST5 files similar to those used in the basecaller benchmark conducted on Amazon. The dataset was split into 584 files, each approximately 1GB to 1.7GB in size, amounting to 765GB in total.
4.1 Execution of Dorado
The three computation modes were executed with the following Dorado commands (shown here for v0.5.1). The pod5/ directory contains the files downloaded from the aforementioned benchmark and converted from FAST5 to pod5 format.
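The FAST5 to pod5 conversion itself is not shown above; the commands below are a minimal sketch using the pod5 command-line tool, assuming the downloaded FAST5 files are stored under a fast5/ directory (exact options may differ between pod5 versions).

# Install the pod5 conversion tool (assumes a working Python/pip environment).
pip install pod5
# Convert each FAST5 file into a corresponding pod5 file under the pod5/ directory used below.
mkdir -p pod5
pod5 convert fast5 fast5/*.fast5 --output pod5/ --one-to-one fast5/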
without modification calling
dorado basecaller \
dna_r10.4.1_e8.2_400bps_hac@v3.5.2 \
pod5/ \
--verbose | \
samtools view --threads 8 -O BAM -o output-wo.bam
with 5mCG calling
dorado basecaller \
dna_r10.4.1_e8.2_400bps_hac@v3.5.2 \
pod5/ \
--modified-bases 5mCG \
--verbose | \
samtools view --threads 8 -O BAM -o output-5mCG.bam
with 5mCG/5hmCG calling
dorado basecaller \
dna_r10.4.1_e8.2_400bps_hac@v4.0.0 \
pod5/ \
--modified-bases 5mCG_5hmCG \
--verbose | \
samtools view --threads 8 -O BAM -o output-5mCG-5hmCG.bam
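As a sanity check after each run, the resulting BAM files can be validated with standard samtools subcommands; this is a minimal sketch and was not part of the measured analysis time.

# Verify that each output BAM is structurally intact (a non-zero exit status indicates a problem).
samtools quickcheck output-wo.bam output-5mCG.bam output-5mCG-5hmCG.bam
# Summarise the read counts of one of the outputs.
samtools flagstat -@ 8 output-wo.bam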
4.2 Analysis Time by Node and Version
The analysis times for each node and Dorado version are shown in Table 3. As a reference, results from the Amazon benchmark are also included for p4d.24xlarge (A100x8) and p3.8xlarge (V100x4). Dorado v0.2.4 with 5mCG calling failed during analysis because the GPU was not utilised and the system ran out of memory, so it is marked as N.A. in the table. Dorado v0.2.4 with 5mCG/5hmCG calling was not executed, since the result was expected to be close to that of 5mCG calling and priority was given to obtaining results for the other parameters; this is indicated by "-" in the table.
On high-performance PHY, the analysis was approximately 8 to 15 times faster in the without modification mode, and about 4 to 6 times faster in other modes, compared to NIG igt.
On high-performance PHY, the without modification mode was about 1.3 times faster in v0.3.0 and later than in v0.2.4, and with 5mCG calling was about 1.25 times faster in v0.5.0 and later than in v0.3.0. For with 5mCG/5hmCG calling, however, no significant performance differences were observed between versions.
For NIG igt, the trend differed: analysis time increased from v0.5.0 onwards in all modes. This likely reflects optimisations targeted at newer GPUs such as the H100, which may not benefit the V100s in the NIG igt system.
When comparing the same versions on high-performance PHY, without modification calling was found to be at least twice as fast as other computation modes.
GPU usage during analysis with Dorado v0.5.1 for each mode is shown in Figures 1 to 3, and CPU usage in Figures 4 to 6. On high-performance PHY, the differences in GPU and CPU usage across the modes suggest that, in the modes other than without modification calling, processing switches frequently between the GPU and the CPU to handle the epigenetic modification calculations.
In contrast, for NIG igt, no significant differences were observed across the computation modes. This indicates that other factors are likely limiting the performance, and further investigation is needed.
Finally, comparing against the Amazon benchmark results for the same Dorado version (v0.2.4), high-performance PHY completes the without modification analysis in about half the time of p4d.24xlarge, and NIG igt is approximately 1.2 times faster than p3.8xlarge, which has the same GPU configuration.
Table 3: Comparison of Analysis Speed by Node and Dorado Version
| GPU | Dorado Version | Analysis Time (minutes) w/o modification calling | Analysis Time (minutes) w/ 5mCG calling | Analysis Time (minutes) w/ 5mCG/5hmCG calling |
|---|---|---|---|---|
| high-performance PHY (H100x8) | v0.2.4 | 24.30 | N.A. | - |
| | v0.3.0 | 18.03 | 50.05 | 45.97 |
| | v0.5.0 | 18.66 | 39.99 | 42.74 |
| | v0.5.1 | 15.97 | 42.80 | 42.48 |
| NIG igt (V100x4) | v0.2.4 | 211.13 | 215.06 | 224.93 |
| | v0.3.0 | 211.05 | 214.86 | 221.89 |
| | v0.5.0 | 244.70 | 267.30 | 232.26 |
| | v0.5.1 | 242.08 | 263.06 | 231.39 |
| Amazon Benchmark p4d.24xlarge (A100x8) | v0.2.4 | 48.00 | 48.00 | 54.00 |
| Amazon Benchmark p3.8xlarge (V100x4) | v0.2.4 | 258.00 | 264.00 | 282.00 |
Figure 1: GPU Usage during Analysis with Dorado v0.5.1 (without modification calling)
Figure 2: GPU Usage during Analysis with Dorado v0.5.1 (with 5mCG calling)
Figure 3: GPU Usage during Analysis with Dorado v0.5.1 (with 5mCG/5hmCG calling)
Figure 4: CPU Usage during Analysis with Dorado v0.5.1 (without modification calling)
Figure 5: CPU Usage during Analysis with Dorado v0.5.1 (with 5mCG calling)
Figure 6: CPU Usage during Analysis with Dorado v0.5.1 (with 5mCG/5hmCG calling)
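Utilisation traces such as those in Figures 1 to 6 can be collected with standard monitoring tools; the commands below are a minimal sketch sampling once per second, not the exact logging setup used to produce the figures.

# Log per-GPU utilisation and memory usage once per second to a CSV file.
nvidia-smi --query-gpu=timestamp,index,utilization.gpu,memory.used --format=csv -l 1 > gpu_usage.csv &
# Log overall CPU utilisation once per second (mpstat is provided by the sysstat package).
mpstat 1 > cpu_usage.log &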
5. Conclusion
This study demonstrates that computing nodes equipped with the latest H100 GPUs can significantly accelerate genome analysis with Dorado. However, because the high-performance PHY environment was available only for a limited period, some aspects remain unresolved, and further investigations are planned in collaboration with the relevant parties.
6. About this Article
The findings presented here were generated based on measurements conducted in the fiscal year 2023 as part of the joint research project "Research on the Lifecycle Design of Large-Scale Research Data" (2020–) between the National Institute of Genetics (NIG) DDBJ Centre and Sakura Internet Inc.
7. Article Details and Related Link
Article Creation Date: 6th June 2024
7.1 Project Members
- Tomoya Tanjo1, 2
- Shun Nogawa3
- Kentaro Yamamoto3
- Manabu Ishii3
- Tatsuro Ohta1, 4
- Fumikazu Konishi5
- Tsuyoshi Hachiya3
- Osamu Ogasawara1
1: National Institute of Genetics, Bioinformation and DDBJ Center
2: Joint Support-Center for Data Science Research, BioData Science Initiative
3: Genome Analytics Japan Inc.
4: Chiba University, Institute for Advanced Academic Research
5: Sakura Internet Inc.

