Thursday, June 19, 2025

Impact of mdadm -c, --chunk on random read/write performance and disk space utilization.

It is not entirely clear what this means in the context of mdadm, but it must be the minimum I/O size of the RAID block device. Regardless, I did some random read/write tests with seekmark using various chunk sizes. mdadm RAID creation parameters --

mdadm -C /dev/md/test -l 5 --home-cluster=xxx --homehost=any -z 10G -p left-symmetric -x 0 -n 3 -c 512K|64K --data-offset=8K -N xxxx -k resync 
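
The array was created once with -c 512K and once with -c 64K. After creation, the chunk size actually in use can be double-checked (a quick sanity check, not part of the benchmark itself) --

mdadm --detail /dev/md/test | grep -i 'chunk size'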

XFS format parameters -- 

mkfs.xfs -m rmapbt=0,reflink=0
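
The complete format-and-mount step would look roughly like this (the device path and the /mnt/archive mount point are taken from the other commands in this post; the rest is assumed) --

mkfs.xfs -f -m rmapbt=0,reflink=0 /dev/md/test
mkdir -p /mnt/archive
mount /dev/md/test /mnt/archive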

Seekmark commands -- 

seekmark -i $((32*1024)) -t 1 -s 1000 -f /mnt/archive/test-write
seekmark -i $((64*1024)) -t 1 -s 1000 -f /mnt/archive/test-write
seekmark -i $((128*1024)) -t 1 -s 1000 -f /mnt/archive/test-write
seekmark -i $((256*1024)) -t 1 -s 1000 -f /mnt/archive/test-write
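
The same four runs, expressed as one loop over the I/O sizes --

for io in 32 64 128 256; do
    seekmark -i $((io*1024)) -t 1 -s 1000 -f /mnt/archive/test-write
done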

Random read results, in seeks/sec --

I/O size    512K chunks    64K chunks
32K         163.64         145.33
64K         153.89         133.40
128K        145.77         121.04
256K        130.16         99.60

So, somewhat unexpectedly, 512K chunks win even for small random reads.

For 32K random writes, I was getting around 53 seeks/s with 512K chunks and 49 seeks/s with 64K chunks, so here too the larger chunk size wins by a small margin (and maybe there is no real difference at all).

For disk space utilization, the larger chunk size also wins when used with the same underlying XFS filesystem. For this test, 400000 files of 4K each were created. At a 4K chunk size 1.9G of space was used, while at a 16K chunk size 1.8G was used.
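
A sketch of how such a space test can be done (the directory name and data source here are illustrative, not the exact script used) --

mkdir -p /mnt/archive/space-test
for i in $(seq 1 400000); do
    head -c 4096 /dev/urandom > /mnt/archive/space-test/file-$i
done
df -h /mnt/archive    # compare the Used column between the 4K-chunk and 16K-chunk arrays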

Tuesday, June 10, 2025

mdadm (RAID 5) performance under different parity layouts (-p --parity --layout)

While the performance of right-asymmetric, left-asymmetric, right-symmetric and left-symmetric is roughly the same, parity-last and parity-first seem strikingly fast for reads.

Tests were done on a RAID 5 setup over 3 USB hard drives, each with 10TB capacity. Each HDD is capable of 250+ MB/s even when all are read simultaneously (so the USB link is not saturated).
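
The per-drive figure can be verified with a plain sequential read from each member; /dev/sdX below is a placeholder for each USB HDD (running one dd per drive at the same time confirms the USB link keeps up) --

dd if=/dev/sdX of=/dev/null bs=1M count=4096 iflag=direct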

The optimal chunk size for right-asymmetric, left-asymmetric, right-symmetric and left-symmetric starts at 32KB, where sequential read speeds are around 475MB/s. At 256KB and 512KB chunks, the read speeds improve slightly to around 483MB/s. Below 32KB chunks, read speeds suffer significantly; at 4K chunks I get 120MB/s reads. Write speeds are around 480MB/s even for 4KB chunks and remain the same up to 512KB chunks (no tests were done beyond this size).
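
The exact commands behind these numbers are not reproduced here, but a direct-I/O dd along these lines gives the same kind of measurement (/dev/md/archive and /mnt/archive are assumed names) --

dd if=/dev/md/archive of=/dev/null bs=1M count=10240 iflag=direct                    # sequential read from the array
dd if=/dev/zero of=/mnt/archive/seqtest bs=1M count=10240 oflag=direct conv=fsync    # sequential write through XFS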

With parity-last/first you can afford a lower chunk size with the same read performance. For example, at 16K chunks I was getting 488MB/s writes and 478MB/s reads. However, the lowest chunk size with the best results was 32K, where I was getting 490MB/s writes and 506MB/s reads. The performance remained the same up to a 512K chunk size. Therefore, in a 3-disk RAID-5 setup, parity-last/first gives optimal performance at a lower chunk size than the other parity layouts, which sounds like a good deal; however, as per other tests, neither a lower chunk size nor parity-last/first is actually a good idea.

The problem with parity-last/first is that writes do not scale beyond 2 data disks (i.e. 3 disks in total), which was a RAID-4 problem -- and parity-last/first IS a RAID-4 layout. In theory only random writes should fail to scale (every write hits the dedicated parity disk), and sequential full-stripe writes should not be affected, but it seems writes do not scale even sequentially. Synthetic tests were done by starting a VM in qemu with 5 block devices, each throttled to 5MB/s. These are the tests done (with 5 disks) --

Create qemu storage --
qemu-img create -f qcow2 -o lazy_refcounts=on RAID5-test-storage1.qcow2 20G
qemu-img create -f qcow2 -o lazy_refcounts=on RAID5-test-storage2.qcow2 20G
qemu-img create -f qcow2 -o lazy_refcounts=on RAID5-test-storage3.qcow2 20G
qemu-img create -f qcow2 -o lazy_refcounts=on RAID5-test-storage4.qcow2 20G
qemu-img create -f qcow2 -o lazy_refcounts=on RAID5-test-storage5.qcow2 20G

Launch qemu -- 

qemu-system-x86_64 \
  -machine accel=kvm,kernel_irqchip=on,mem-merge=on \
  -drive file=template_trixie.raid5.qcow2,id=centos,if=virtio,media=disk,cache=unsafe,aio=threads,index=0 \
  -drive file=RAID5-test-storage1.qcow2,id=storage1,if=virtio,media=disk,cache=unsafe,aio=threads,index=1,throttling.bps-total=$((5*1024*1024)) \
  -drive file=RAID5-test-storage2.qcow2,id=storage2,if=virtio,media=disk,cache=unsafe,aio=threads,index=2,throttling.bps-total=$((5*1024*1024)) \
  -drive file=RAID5-test-storage3.qcow2,id=storage3,if=virtio,media=disk,cache=unsafe,aio=threads,index=3,throttling.bps-total=$((5*1024*1024)) \
  -drive file=RAID5-test-storage4.qcow2,id=storage4,if=virtio,media=disk,cache=unsafe,aio=threads,index=4,throttling.bps-total=$((5*1024*1024)) \
  -drive file=RAID5-test-storage5.qcow2,id=storage5,if=virtio,media=disk,cache=unsafe,aio=threads,index=5,throttling.bps-total=$((5*1024*1024)) \
  -vnc [::1]:0 \
  -device e1000,id=ethnet,netdev=primary,mac=52:54:00:12:34:56 \
  -netdev tap,ifname=veth0,script=no,downscript=no,id=primary \
  -m 1024 -smp 12 -daemonize -monitor pty -serial pty > /tmp/vm0_pty.txt
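
Once the VM is up, it is worth confirming inside the guest that the five throttled disks are visible and actually capped at about 5MB/s; a quick check of that sort (not part of the original test run) --

lsblk -o NAME,SIZE                                                                        # the five 20G virtio disks should show up
dd if=/dev/disk/by-path/virtio-pci-0000:00:05.0 of=/dev/null bs=1M count=50 iflag=direct  # should report roughly 5MB/s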

mdadm parameters for parity-last/first --

mdadm -C /dev/md/bench -l 5 --home-cluster=archive10TB --homehost=any -z 1G -p parity-last -x 0 -n 5 -c 512K --data-offset=8K -N tempRAID -k resync /dev/disk/by-path/virtio-pci-0000:00:0{5..9}.0

mdadm parameters for left-symmetric --

mdadm -C /dev/md/bench -l 5 --home-cluster=archive10TB --homehost=any -z 1G -p left-symmetric -x 0 -n 5 -c 512K --data-offset=8K -N tempRAID -k resync /dev/disk/by-path/virtio-pci-0000:00:0{5..9}.0
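
Since the arrays are created with -k resync, it makes sense to let the initial resync finish before benchmarking, otherwise it competes with the test I/O; for example --

cat /proc/mdstat            # shows resync progress
mdadm --wait /dev/md/bench  # blocks until the resync/recovery completes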

Write test --
cat /dev/urandom | tee /dev/stdout | tee /dev/stdout| tee /dev/stdout| tee /dev/stdout| tee /dev/stdout| tee /dev/stdout | tee /dev/stdout| tee /dev/stdout| tee /dev/stdout | dd of=/dev/md/bench bs=1M count=100 oflag=direct iflag=fullblock

Read test --
dd if=/dev/md/bench of=/dev/null bs=1M count=100 iflag=direct

For writes, I was getting 10MB/s with parity-last and 13.4MB/s with left-symmetric (34% higher).

For reads, I was getting 21.8MB/s with parity-last and 27.6MB/s with left-symmetric.

Therefore it seems left-symmetric was scaling better in every way.

To ensure nothing was wrong with the test setup, I repeated the same test for parity-last/first with 3 disks instead and I was getting 10.7MB/s writes and 10.7MB/s reads.

With this I come to the conclusion that parity-last/first scales for writes to at best 2 data disks, even in the best case. I admit I was getting a bit more read speed than expected with left-symmetric on 5 disks (theoretically it should top out at around 20MB/s), but why exactly that happened is beyond my understanding.
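
For reference, the 20MB/s figure is just the naive ceiling with five disks throttled to 5MB/s each: on a rotating layout a sequential read pulls from all five disks, but one chunk in every five is parity, so useful data arrives at no more than 5 x 5MB/s x 4/5 = 20MB/s; for sequential writes one fifth of everything written is parity, so again at most 4 x 5MB/s = 20MB/s of data can land. That is why the measured 27.6MB/s read is the surprising number here.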

As for why a smaller chunk size is not a good idea, I'll write about that in another blog post.