Thursday, June 19, 2025

Impact of mdadm -c, --chunk on random read/write performance and disk space utilization.

It's not entirely clear what this means in the context of mdadm, but it appears to be the minimum I/O size of the RAID block device. Regardless, I did some random read/write tests with various chunk sizes using seekmark. mdadm RAID creation parameters -- 

mdadm -C /dev/md/test -l 5 --home-cluster=xxx --homehost=any -z 10G -p left-symmetric -x 0 -n 3 -c 512K|64K --data-offset=8K -N xxxx -k resync 

XFS format parameters -- 

mkfs.xfs -m rmapbt=0,reflink=0

Seekmark commands -- 

seekmark -i $((32*1024)) -t 1 -s 1000 -f /mnt/archive/test-write
seekmark -i $((64*1024)) -t 1 -s 1000 -f /mnt/archive/test-write
seekmark -i $((128*1024)) -t 1 -s 1000 -f /mnt/archive/test-write
seekmark -i $((256*1024)) -t 1 -s 1000 -f /mnt/archive/test-write

Results (all figures are in seeks/sec) -- 

Random read size    512K chunks    64K chunks
seekmark 32K        163.64         145.33
seekmark 64K        153.89         133.40
seekmark 128K       145.77         121.04
seekmark 256K       130.16         99.60

Therefore, for some reason 512K chunks win even for small reads.

For 32K writes, I was getting around 53 seeks/s with 512K chunks and 49 seeks/s with 64K chunks, so here too the larger chunk size wins by a small margin (and maybe there's no difference at all).

Large chunk sizes also win on disk space utilization when used with the same underlying XFS filesystem. For the test, 400,000 4K-sized files were created. At a 4K chunk size 1.9G of space was used, and at a 16K chunk size 1.8G was used.
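
For reference, a rough sketch of how such a space test can be reproduced (the mount point and file names here are illustrative, not the exact ones I used) -- 

mkdir -p /mnt/archive/smallfiles
for i in $(seq 1 400000); do head -c 4096 /dev/urandom > /mnt/archive/smallfiles/f$i; done
df -h /mnt/archive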

Tuesday, June 10, 2025

mdadm (RAID 5) performance under different parity layouts (-p --parity --layout)

While the performance of right-asymmetric, left-asymmetric, right-symmetric and left-symmetric is roughly the same, parity-last and parity-first are strikingly fast for reads.

Tests were done on a RAID 5 setup over 3 USB hard drives, each with 10TB capacity. Each HDD is capable of 250+ MB/s simultaneously (therefore the USB link is not saturated).
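
For reference, the layout and chunk size of an existing array can be verified with mdadm (the array name below is a placeholder) -- 

mdadm --detail /dev/md/NAME | grep -Ei 'layout|chunk'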

The optimal chunk size for right-asymmetric, left-asymmetric, right-symmetric and left-symmetric starts at 32KB, where sequential read speeds are around 475MB/s. At 256KB and 512KB chunks, read speeds improve slightly to around 483MB/s. Below 32KB chunks, read speeds suffer significantly; I get 120MB/s reads at 4K chunks. Write speeds are around 480MB/s even for 4KB chunks and remain the same up to 512KB chunks (no tests were done beyond this size).

With parity-last/first you can afford a lower chunk size with the same read performance. For example, at 16K chunks I was getting writes of 488MB/s and reads of 478MB/s. However, the lowest chunk size with the best performance was 32K, where I was getting 490MB/s writes and 506MB/s reads. The performance remained the same up to a 512K chunk size. Therefore, in a 3-disk RAID 5 setup, parity-last/first gives optimal performance at a lower chunk size (compared to other parity layouts), which sounds like it must be a good deal; however, as per the other tests done, neither a lower chunk size nor parity-last/first is a good idea.

The problem with parity-last/first is that writes do not scale beyond 2 data disks (i.e. 3 disks in total), which was a RAID 4 problem, and parity-last/first IS a RAID 4 layout. In theory only random writes should fail to scale and sequential writes should be unaffected, but it seems writes do not scale even sequentially. Synthetic tests were done by starting a VM in qemu with 5 block devices, each of which was throttled to 5MB/s. These are the tests done (with 5 disks) -- 

Create the qemu storage --
qemu-img create -f qcow2 -o lazy_refcounts=on RAID5-test-storage1.qcow2 20G
qemu-img create -f qcow2 -o lazy_refcounts=on RAID5-test-storage2.qcow2 20G
qemu-img create -f qcow2 -o lazy_refcounts=on RAID5-test-storage3.qcow2 20G
qemu-img create -f qcow2 -o lazy_refcounts=on RAID5-test-storage4.qcow2 20G
qemu-img create -f qcow2 -o lazy_refcounts=on RAID5-test-storage5.qcow2 20G

Launch qemu -- 

qemu-system-x86_64 \
  -machine accel=kvm,kernel_irqchip=on,mem-merge=on \
  -drive file=template_trixie.raid5.qcow2,id=centos,if=virtio,media=disk,cache=unsafe,aio=threads,index=0 \
  -drive file=RAID5-test-storage1.qcow2,id=storage1,if=virtio,media=disk,cache=unsafe,aio=threads,index=1,throttling.bps-total=$((5*1024*1024)) \
  -drive file=RAID5-test-storage2.qcow2,id=storage2,if=virtio,media=disk,cache=unsafe,aio=threads,index=2,throttling.bps-total=$((5*1024*1024)) \
  -drive file=RAID5-test-storage3.qcow2,id=storage3,if=virtio,media=disk,cache=unsafe,aio=threads,index=3,throttling.bps-total=$((5*1024*1024)) \
  -drive file=RAID5-test-storage4.qcow2,id=storage4,if=virtio,media=disk,cache=unsafe,aio=threads,index=4,throttling.bps-total=$((5*1024*1024)) \
  -drive file=RAID5-test-storage5.qcow2,id=storage5,if=virtio,media=disk,cache=unsafe,aio=threads,index=5,throttling.bps-total=$((5*1024*1024)) \
  -vnc [::1]:0 \
  -device e1000,id=ethnet,netdev=primary,mac=52:54:00:12:34:56 \
  -netdev tap,ifname=veth0,script=no,downscript=no,id=primary \
  -m 1024 -smp 12 -daemonize -monitor pty -serial pty > /tmp/vm0_pty.txt

mdadm parameters for parity-last/first --

mdadm -C /dev/md/bench -l 5 --home-cluster=archive10TB --homehost=any -z 1G -p parity-last -x 0 -n 5 -c 512K --data-offset=8K -N tempRAID -k resync /dev/disk/by-path/virtio-pci-0000:00:0{5..9}.0

mdadm parameters for left-symmetric --

mdadm -C /dev/md/bench -l 5 --home-cluster=archive10TB --homehost=any -z 1G -p left-symmetric -x 0 -n 5 -c 512K --data-offset=8K -N tempRAID -k resync /dev/disk/by-path/virtio-pci-0000:00:0{5..9}.0

Write test --
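# Each 'tee /dev/stdout' stage duplicates the stream, multiplying the effective /dev/urandom throughput so the data source is not the bottleneck of the write test.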
cat /dev/urandom | tee /dev/stdout | tee /dev/stdout| tee /dev/stdout| tee /dev/stdout| tee /dev/stdout| tee /dev/stdout | tee /dev/stdout| tee /dev/stdout| tee /dev/stdout | dd of=/dev/md/bench bs=1M count=100 oflag=direct iflag=fullblock

Read test --
dd if=/dev/md/bench of=/dev/null bs=1M count=100 iflag=direct
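
As a sanity check that the throttling was in effect, each member device on its own should read at roughly the configured 5MB/s; for example (the path is one of the throttled virtio members used above) -- 

dd if=/dev/disk/by-path/virtio-pci-0000:00:05.0 of=/dev/null bs=1M count=50 iflag=direct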

For the writes, I was getting 10MB/s with parity-last and 13.4MB/s with left-symmetric (34% higher).

For reads, I was getting 21.8MB/s with parity-last and 27.6MB/s with left-symmetric.

Therefore it seems left-symmetric was scaling better in every way.

To ensure nothing was wrong with the test setup, I repeated the same test for parity-last/first with 3 disks instead and I was getting 10.7MB/s writes and 10.7MB/s reads.

With this I come to the conclusion that parity-last/first scales for writes to at best 2 data disks, even in the best case. Granted, I was getting a little extra read speed with left-symmetric on 5 disks (more than the theoretical maximum of around 20MB/s), but why exactly that happened is beyond my understanding.

As for why a smaller chunk size is not a good idea, I'll write about that in another blog post.

Wednesday, May 28, 2025

Re-writing HDDs to avoid bitrot/degradation under Linux.

Over the years, your archival HDDs are susceptible to bitrot. You have to re-write them at regular intervals to prevent that. You can use dd for that --

dd if=/dev/sdX of=/dev/sdX bs=1M conv=notrunc iflag=fullblock

This is even resilient to power failures (I tested that in a VM).
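
If the rewrite gets interrupted, it can also be resumed from an offset instead of starting over; a minimal sketch (the offset of 100000 1M blocks is purely illustrative) -- 

dd if=/dev/sdX of=/dev/sdX bs=1M skip=100000 seek=100000 conv=notrunc iflag=fullblock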

Tuesday, April 29, 2025

ffmpeg: Audio/video out of sync when a frame rate limit is set using -r.

If you've specified -r as an input option (i.e. before -i), you may want to try moving it after the input and just before -vcodec to resolve the issue. With this change the input is not frame-rate limited, but the encoding is.
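
For illustration, a hedged example (the file names and codecs are placeholders, not my actual command) -- 

Problematic form, -r applies to the input --
ffmpeg -r 30 -i input.mkv -vcodec libx264 -acodec copy output.mkv

Fixed form, -r moved after the input and before -vcodec, so only the encoding is limited to 30 fps --
ffmpeg -i input.mkv -r 30 -vcodec libx264 -acodec copy output.mkv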

Ext4 vs xfs (with and without rmapbt) massive small file operations benchmark

Methodology

/mnt/tmpfs/ contains trimmed Linux sources; large files were removed to bring the total size down to 5GB. /mnt/tmpfs/ is a tmpfs filesystem.

The following are the benchmarks done --
Copy operation --
time cp -a /mnt/tmpfs/* /mnt/temp/
Cold search --
time find /mnt/temp/ -iname '*a*' > /dev/null
Warm search --
time for i in {a..j}; do find /mnt/temp/ -iname "*$i*" > /dev/null; done
read all files in an alphabetic way (cold) --
time find /mnt/temp/ -type f | xargs -d $'\n' -r -P 100 -n 300 -L 300 cat > /dev/null
read all files in an alphabetic way (warm) --
time find /mnt/temp/ -type f | xargs -d $'\n' -r -P 100 -n 300 -L 300 cat > /dev/null
Write a certain small value to all files alphabetically (also check the CPU utilization of the script; a rough shell equivalent of the script is sketched after this list) --
cd /mnt/temp/
find /mnt/temp/ -type f > /tmp/flist.txt
dd if=/dev/urandom of=/tmp/write_data bs=1K count=6
time write_mulitple_files.rb /tmp/flist.txt /tmp/write_data
Delete dir tree --
time rm -rf /mnt/temp/*
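
write_mulitple_files.rb is a small Ruby helper; assuming it simply overwrites each listed file with the contents of /tmp/write_data, a rough shell equivalent (for illustration only, not what was benchmarked) would be -- 

while IFS= read -r f; do
    cat /tmp/write_data > "$f"
done < /tmp/flist.txt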

HDD benchmarks

mount and mkfs options

mount parameters for xfs --
mount -o logbufs=8,logbsize=256k,noquota,noatime

mount parameters for ext4 -- 
mount -o noatime,data=writeback,journal_async_commit,inode_readahead_blks=32768,max_batch_time=10000000,i_version,noquota,delalloc
nodelalloc was removed since bigalloc was removed.
ext4 is optimized for both small and large files; it shouldn't make a difference in performance.

format parameters for xfs and ext4 -- 
mkfs.ext4 -g 256 -G 4 -J size=100 -m 1 -O none,extent,flex_bg,has_journal,large_file,^uninit_bg,dir_index,dir_nlink,^sparse_super,^sparse_super2 -i 4096
bigalloc had to be removed because of the large number of inodes (expect worse performance with larger files, which this benchmark does not cover).
 
mkfs.xfs -f -m rmapbt=0,reflink=0

Results -- 

Test                                                    ext4          xfs
Create/copy                                             0m27.925s     0m21.857s
Cold search                                             0m0.157s      0m0.081s
Warm search                                             0m1.509s      0m0.752s
Read all files in an alphabetic way (cold, parallel)    0m0.253s      0m0.239s
Read all files in an alphabetic way (warm, parallel)    0m0.252s      0m0.238s
Write a certain small value to all files (parallel)     11m41.727s    11m43.711s
Delete dir tree                                         0m1.161s      0m1.086s

Conclusion -- 

Despite rmapbt (which improves performance with small files) being disabled in XFS, XFS is faster than ext4 in most tests. If this ext4 filesystem (which is tuned for small files rather than large ones) is used for operations on large files, expect lower performance.

SSD benchmarks

mount and mkfs options

blkdiscard done before each benchmark.
 
 
mount parameters for xfs --
mount -o logbufs=8,logbsize=256k,noquota,noatime

mount parameters for ext4 -- 
mount -o noatime,data=writeback,journal_async_commit,inode_readahead_blks=32768,max_batch_time=10000000,i_version,noquota,delalloc
nodelalloc was removed since bigalloc was removed.
ext4 is optimized for both small and large files; it shouldn't make a difference in performance.

format parameters for xfs and ext4 -- 
mkfs.ext4 -g 256 -G 4 -J size=100 -m 1 -O none,extent,flex_bg,has_journal,large_file,^uninit_bg,dir_index,dir_nlink,^sparse_super,^sparse_super2 -i 4096
bigalloc had to be removed because of the large number of inodes (expect worse performance with larger files, which this benchmark does not cover).
 
xfs with no rmapbt --
mkfs.xfs -f -m rmapbt=0,reflink=0

xfs with rmapbt -- 
mkfs.xfs -f -m rmapbt=1,reflink=0

Results -- 

ext4 --
    Copy operation --
    time cp -a /mnt/tmpfs/* /mnt/temp/
        real    0m48.826s
        user    0m0.204s
        sys     0m3.005s
        
        real    0m48.290s
        user    0m0.246s
        sys     0m2.898s

    Cold search --
    time find /mnt/temp/ -iname '*a*' > /dev/null
        real    0m0.172s
        user    0m0.074s
        sys     0m0.097s
        
        real    0m0.169s
        user    0m0.064s
        sys     0m0.105s
        
    Warm search --
    time for i in {a..j}; do find /mnt/temp/ -iname "*$i*" > /dev/null; done
        real    0m1.616s
        user    0m0.536s
        sys     0m1.075s
        
        real    0m1.651s
        user    0m0.615s
        sys     0m1.031s
        
    read all files in an alphabetic way (cold) --
    time find /mnt/temp/ -type f | xargs -d $'\n' -r -P 100 -n 300 -L 300 cat > /dev/null
    real    0m0.444s
    user    0m0.227s
    sys     0m2.850s
    
    real    0m0.402s
    user    0m0.271s
    sys     0m2.793s
    
    read all files in an alphabetic way (warm) --
    time find /mnt/temp/ -type f | xargs -d $'\n' -r -P 100 -n 300 -L 300 cat > /dev/null
    real    0m0.407s
    user    0m0.230s
    sys     0m2.851s
    
    real    0m0.402s
    user    0m0.223s
    sys     0m2.845s
    
    Write a certain small value to all files alphabetically (also check the CPU utilization of the script) --
    cd /mnt/temp/
    find -type f > /tmp/flist.txt
    dd if=/dev/urandom of=/tmp/write_data bs=1K count=6
    time /home/de/small/docs/Practice/Software/ruby/write_mulitple_files.rb /tmp/flist.txt /tmp/write_data
    real    9m59.305s
    user    9m53.748s
    sys     0m51.903s
    
    real    9m38.867s
    user    9m33.476s
    sys     0m49.930s
    
    Delete dir tree --
    time rm -rf /mnt/temp/*
    real    0m0.824s
    user    0m0.021s
    sys     0m0.743s
    
    real    0m0.820s
    user    0m0.038s
    sys     0m0.718s
xfs rmapbt=0 --
    Copy operation --
    time cp -a /mnt/tmpfs/* /mnt/temp/
    real    0m14.851s
    user    0m0.298s
    sys     0m3.860s
    
    Cold search --
    time find /mnt/temp/ -iname '*a*' > /dev/null
    real    0m0.082s
    user    0m0.054s
    sys     0m0.027s
    
    
    Warm search --
    time for i in {a..j}; do find /mnt/temp/ -iname "*$i*" > /dev/null; done
    real    0m0.694s
    user    0m0.511s
    sys     0m0.179s
    
    read all files in an alphabetic way (cold) --
    time find /mnt/temp/ -type f | xargs -d $'\n' -r -P 100 -n 300 -L 300 cat > /dev/null
    real    0m0.389s
    user    0m0.277s
    sys     0m2.680s
    
    
    read all files in an alphabetic way (warm) --
    time find /mnt/temp/ -type f | xargs -d $'\n' -r -P 100 -n 300 -L 300 cat > /dev/null
    real    0m0.388s
    user    0m0.256s
    sys     0m2.705s

    
    Write a certain small value to all files alphabetically (also check the CPU utilization of the script) --
    cd /mnt/temp/
    find /mnt/temp/ -type f > /tmp/flist.txt
    dd if=/dev/urandom of=/tmp/write_data bs=1K count=6
    time /home/de/small/docs/Practice/Software/ruby/write_mulitple_files.rb /tmp/flist.txt /tmp/write_data
    real    10m45.878s
    user    10m40.476s
    sys     0m7.636s
    
    Delete dir tree --
    time rm -rf /mnt/temp/*
    real    0m1.181s
    user    0m0.030s
    sys     0m0.482s
xfs rmapbt=1 --
    Copy operation --
    time cp -a /mnt/tmpfs/* /mnt/temp/
    real    0m2.883s
    user    0m0.159s
    sys     0m2.556s

    
    Cold search --
    time find /mnt/temp/ -iname '*a*' > /dev/null
    real    0m0.082s
    user    0m0.049s
    sys     0m0.033s
    
    Warm search --
    time for i in {a..j}; do find /mnt/temp/ -iname "*$i*" > /dev/null; done
    real    0m0.700s
    user    0m0.480s
    sys     0m0.216s
    
    read all files in an alphabetic way (cold) --
    time find /mnt/temp/ -type f | xargs -d $'\n' -r -P 100 -n 300 -L 300 cat > /dev/null
    real    0m0.389s
    user    0m0.218s
    sys     0m2.752s
    
    read all files in an alphabetic way (warm) --
    time find /mnt/temp/ -type f | xargs -d $'\n' -r -P 100 -n 300 -L 300 cat > /dev/null
    real    0m0.389s
    user    0m0.229s
    sys     0m2.739s
    
    Write a certain small value to all files alphabetically (also check the CPU utilization of the script) --
    cd /mnt/temp/
    find /mnt/temp/ -type f > /tmp/flist.txt
    dd if=/dev/urandom of=/tmp/write_data bs=1K count=6
    time /home/de/small/docs/Practice/Software/ruby/write_mulitple_files.rb /tmp/flist.txt /tmp/write_data
    real    8m53.297s
    user    8m48.394s
    sys     0m9.786s
    
    Delete dir tree --
    time rm -rf /mnt/temp/*
    real    0m2.373s
    user    0m0.024s
    sys     0m0.498s

Conclusion -- 

When comparing xfs rmapbt=1 and xfs rmapbt=0, rmapbt=1 wins on average (but not by a large margin).

When comparing xfs rmapbt=1 and ext4, xfs wins by a large margin.

Monday, April 28, 2025

Debian trixie vs Gentoo benchmark.

Recently I came across this benchmark which, although old, is laughable (if you don't know why, I suggest you either read up more about machine code or remain a happy Ubuntu user) because of the inaccurate benchmarking method with regard to Gentoo.

Also, around this time I had just installed Debian trixie (still in testing) on another machine and realized that the versions of various applications in their repositories were strikingly similar. So I decided to also do a casual benchmark which, although not that accurate, is FAR more accurate than that Phoronix benchmark.

Openssl (higher is better) -- 

Firefox https://browserbench.org/Speedometer2.1/ (higher is better) -- 

CPU and real run time of various CPU-intensive applications (lower is better) -- 

xz real and CPU time taken (lower is better) -- 

bash script benchmark results (lower is better) -- 


The machine is a Ryzen 5 PRO 2600 -- an old machine (x86_64-v3 instruction set). The contrast should be even higher with newer processors, especially x86_64-v4 (AVX-512) ones, because binary distributions (except Clear Linux) are optimized for the x86_64 baseline, which is three generations behind the latest. In short, you're not fully utilizing your shiny new x86_64-v4 processor unless you use Gentoo. In this regard, even Windows is better off, because its hefty 'minimum requirement' just for running the OS implies they can compile binaries above the baseline x86_64 instruction set.
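
As an aside, on systems with a reasonably new glibc (2.33+) you can check which x86_64 micro-architecture levels your CPU is recognized to support via the dynamic loader (the loader path may differ per distribution) -- 

/lib64/ld-linux-x86-64.so.2 --help | grep 'x86-64-v'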

As of now, I'm not able to get Chromium to run on Gentoo because the GPU of the machine has been blacklisted by Chrome. It works on an Intel platform though.

Many of the applications may use assembly code. These applications perform the same regardless of the optimizations applied by GCC. Common examples include openssl, various video codec libraries, prime95, etc., but I'm not entirely sure how much assembly they're using; this is why I chose sparsely used algorithms in openssl for benchmark purposes, since developers are less likely to put effort into a less-used algorithm.
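
For example, something along these lines benchmarks a comparatively less common cipher (the exact algorithms and options I used are in script.sh from the download below; this invocation is only illustrative) -- 

openssl speed -seconds 10 -evp camellia-256-cbc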

Many applications are not bottlenecked by the CPU, even though it may seem so; that's because they put more stress on memory speed than on the CPU. Even when memory is the bottleneck, CPU utilization is reported as 100% because of how closely the memory and the CPU work together. An example is compression workloads. In these benchmarks there will not be much of a difference.

imagemagick's compare was able to run on all 12 CPUs on Debian, but only 2 CPUs on Gentoo. As a result, I limited the benchmark to 2 CPUs; however, in this configuration, Debian's build of imagemagick took double the time of Gentoo's. Because of the large difference, I really doubt this is due to optimization differences between the two builds. For larger images, Gentoo's build is able to use all 12 CPUs, but since that was taking too much time (for both Debian and Gentoo) I abandoned it.

Package versions of Gentoo -- 

imagemagick-7.1.1.38-r2

bash-5.2_p37

openssl-3.3.3

firefox-128.8.0

ffmpeg-6.1.2-r1

xz-utils-5.6.4-r1

grep-3.11-r1

gcc - 14.2.1_p20250301 (all packages were built using this version. CFLAGS in make.conf were -march=znver1 --param=l1-cache-line-size=64 --param=l1-cache-size=32 --param=l2-cache-size=512 -fomit-frame-pointer -floop-interchange -floop-strip-mine -floop-block -fgraphite-identity -ftree-loop-distribution -O3 -pipe -flto=1 -fuse-linker-plugin -ffat-lto-objects -fno-semantic-interposition, however a few packages (like firefox) filter many of these CFLAGS out).

Package versions for Debian -- 

imagemagick-7.1.1.43+dfsg1-1

bash-5.2.37-1.1+b2

openssl-3.4.1-1

firefox-128.9.0esr-2

ffmpeg-7.1.1-1+b1

xz-utils-5.8.1-1

grep-3.11-4

gcc-14.2

The Debian system is a fresh install, while the Gentoo installation dates from 2009; over the years the same installation has been migrated/replicated across multiple machines. Debian was installed on a pendrive while Gentoo was installed on an SSD; of course, disk I/O was monitored during the benchmark and only the CPU was the bottleneck (there was no I/O wait). All data for the benchmark was loaded from an external HDD (here too, disk I/O was not the bottleneck).

For the source of the benchmark, download from here. These are its contents -- 

script.sh -- The script which was run for the benchmark.

ff-bench_debian.png/ff-bench_gentoo.png -- Screenshots of the Firefox benchmark (which, of course, the script did not run).

benchmark_results_debian.txt/result_gentoo.txt -- output of script.sh

shell_bench_Result_gentoo.txt/shell_bench_Result_debian.txt -- Output of shell-bench.sh on Gentoo/Debian.

shell-bench.sh -- Grep and bash benchmark script.