Thursday, June 19, 2025

Impact of mdadm -c, --chunk on random read/write performance and disk space utilization.

It's not entirely clear what this means in the context of mdadm, but it appears to be the minimum I/O size of the RAID block device. Regardless, I did some random read/write tests with various chunk sizes using seekmark. mdadm RAID creation parameters -- 

mdadm -C /dev/md/test -l 5 --home-cluster=xxx --homehost=any -z 10G -p left-symmetric -x 0 -n 3 -c 512K|64K --data-offset=8K -N xxxx -k resync 

XFS format parameters -- 

mkfs.xfs -m rmapbt=0,reflink=0

Seekmark commands -- 

seekmark -i $((32*1024)) -t 1 -s 1000 -f /mnt/archive/test-write
seekmark -i $((64*1024)) -t 1 -s 1000 -f /mnt/archive/test-write
seekmark -i $((128*1024)) -t 1 -s 1000 -f /mnt/archive/test-write
seekmark -i $((256*1024)) -t 1 -s 1000 -f /mnt/archive/test-write

Results (all figures are in seeks/sec) -- 

Random read size    512K chunks    64K chunks
seekmark 32K        163.64         145.33
seekmark 64K        153.89         133.40
seekmark 128K       145.77         121.04
seekmark 256K       130.16         99.60

Therefore, for some reason 512K chunks win even for small reads.

For 32K writes, I was getting around 53 seeks/s with 512K chunks and 49 seeks/s with 64K chunks, so here too the larger chunk size wins by a small margin (and maybe there's no difference at all).

Large chunk sizes also win on disk space utilization when used with the same underlying XFS filesystem. For the test, 400,000 4K-sized files were created. At a 4K chunk size 1.9G of space was used, and at a 16K chunk size 1.8G was used.
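
For reference, a rough sketch of how such a space test can be reproduced (the mount point and file names here are illustrative, not the exact ones I used) -- 

mkdir -p /mnt/archive/smallfiles
for i in $(seq 1 400000); do head -c 4096 /dev/urandom > /mnt/archive/smallfiles/f$i; done
df -h /mnt/archive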

Tuesday, June 10, 2025

mdadm (RAID 5) performance under different parity layouts (-p --parity --layout)

While the performance of right-asymmetric, left-asymmetric, right-symmetric and left-symmetric is roughly the same, parity-last and parity-first are strikingly fast for reads.

Tests were done on a RAID 5 setup over 3 USB hard drives, each with 10TB capacity. Each HDD is capable of 250+ MB/s simultaneously (therefore the USB link is not saturated).
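
For reference, the layout and chunk size of an existing array can be verified with mdadm (the array name below is a placeholder) -- 

mdadm --detail /dev/md/NAME | grep -Ei 'layout|chunk'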

The optimal chunk size for right-asymmetric, left-asymmetric, right-symmetric and left-symmetric starts at 32KB, where sequential read speeds are around 475MB/s. At 256KB and 512KB chunks, read speeds improve slightly to around 483MB/s. Below 32KB chunks, read speeds suffer significantly; I get 120MB/s reads at 4K chunks. Write speeds are around 480MB/s even for 4KB chunks and remain the same up to 512KB chunks (no tests were done beyond this size).

With parity-last/first you can afford a lower chunk size with the same read performance. For example, at 16K chunks I was getting writes of 488MB/s and reads of 478MB/s. However, the lowest chunk size with the best performance was 32K, where I was getting 490MB/s writes and 506MB/s reads. The performance remained the same up to a 512K chunk size. Therefore, in a 3-disk RAID 5 setup, parity-last/first gives optimal performance at a lower chunk size (compared to other parity layouts), which sounds like it must be a good deal; however, as per the other tests done, neither a lower chunk size nor parity-last/first is a good idea.

The problem with parity-last/first is that writes do not scale beyond 2 data disks (i.e. 3 disks in total), which was a RAID 4 problem, and parity-last/first IS a RAID 4 layout. In theory only random writes should fail to scale and sequential writes should be unaffected, but it seems writes do not scale even sequentially. Synthetic tests were done by starting a VM in qemu with 5 block devices, each of which was throttled to 5MB/s. These are the tests done (with 5 disks) -- 

Create the qemu storage --
qemu-img create -f qcow2 -o lazy_refcounts=on RAID5-test-storage1.qcow2 20G
qemu-img create -f qcow2 -o lazy_refcounts=on RAID5-test-storage2.qcow2 20G
qemu-img create -f qcow2 -o lazy_refcounts=on RAID5-test-storage3.qcow2 20G
qemu-img create -f qcow2 -o lazy_refcounts=on RAID5-test-storage4.qcow2 20G
qemu-img create -f qcow2 -o lazy_refcounts=on RAID5-test-storage5.qcow2 20G

Launch qemu -- 

qemu-system-x86_64 \
  -machine accel=kvm,kernel_irqchip=on,mem-merge=on \
  -drive file=template_trixie.raid5.qcow2,id=centos,if=virtio,media=disk,cache=unsafe,aio=threads,index=0 \
  -drive file=RAID5-test-storage1.qcow2,id=storage1,if=virtio,media=disk,cache=unsafe,aio=threads,index=1,throttling.bps-total=$((5*1024*1024)) \
  -drive file=RAID5-test-storage2.qcow2,id=storage2,if=virtio,media=disk,cache=unsafe,aio=threads,index=2,throttling.bps-total=$((5*1024*1024)) \
  -drive file=RAID5-test-storage3.qcow2,id=storage3,if=virtio,media=disk,cache=unsafe,aio=threads,index=3,throttling.bps-total=$((5*1024*1024)) \
  -drive file=RAID5-test-storage4.qcow2,id=storage4,if=virtio,media=disk,cache=unsafe,aio=threads,index=4,throttling.bps-total=$((5*1024*1024)) \
  -drive file=RAID5-test-storage5.qcow2,id=storage5,if=virtio,media=disk,cache=unsafe,aio=threads,index=5,throttling.bps-total=$((5*1024*1024)) \
  -vnc [::1]:0 \
  -device e1000,id=ethnet,netdev=primary,mac=52:54:00:12:34:56 \
  -netdev tap,ifname=veth0,script=no,downscript=no,id=primary \
  -m 1024 -smp 12 -daemonize -monitor pty -serial pty > /tmp/vm0_pty.txt

mdadm parameters for parity-last/first --

mdadm -C /dev/md/bench -l 5 --home-cluster=archive10TB --homehost=any -z 1G -p parity-last -x 0 -n 5 -c 512K --data-offset=8K -N tempRAID -k resync /dev/disk/by-path/virtio-pci-0000:00:0{5..9}.0

mdadm parameters for left-symmetric --

mdadm -C /dev/md/bench -l 5 --home-cluster=archive10TB --homehost=any -z 1G -p left-symmetric -x 0 -n 5 -c 512K --data-offset=8K -N tempRAID -k resync /dev/disk/by-path/virtio-pci-0000:00:0{5..9}.0

Write test --
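# Each 'tee /dev/stdout' stage duplicates the stream, multiplying the effective /dev/urandom throughput so the data source is not the bottleneck of the write test.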
cat /dev/urandom | tee /dev/stdout | tee /dev/stdout| tee /dev/stdout| tee /dev/stdout| tee /dev/stdout| tee /dev/stdout | tee /dev/stdout| tee /dev/stdout| tee /dev/stdout | dd of=/dev/md/bench bs=1M count=100 oflag=direct iflag=fullblock

Read test --
dd if=/dev/md/bench of=/dev/null bs=1M count=100 iflag=direct
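
As a sanity check that the throttling was in effect, each member device on its own should read at roughly the configured 5MB/s; for example (the path is one of the throttled virtio members used above) -- 

dd if=/dev/disk/by-path/virtio-pci-0000:00:05.0 of=/dev/null bs=1M count=50 iflag=direct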

For the writes, I was getting 10MB/s with parity-last and 13.4MB/s with left-symmetric (34% higher).

For reads, I was getting 21.8MB/s with parity-last and 27.6MB/s with left-symmetric.

Therefore it seems left-symmetric was scaling better in every way.

To ensure nothing was wrong with the test setup, I repeated the same test for parity-last/first with 3 disks instead and I was getting 10.7MB/s writes and 10.7MB/s reads.

With this I come to the conclusion that parity-last/first scales for writes to at best 2 data disks, even in the best case. Granted, I was getting a little extra read speed with left-symmetric on 5 disks (more than the theoretical maximum of around 20MB/s), but why exactly that happened is beyond my understanding.

As for why a smaller chunk size is not a good idea, I'll write about that in another blog post.

Wednesday, May 28, 2025

Re-writing HDDs to avoid bitrot/degradation under Linux.

Over the years, your archival HDDs are susceptible to bitrot. You have to re-write them at regular intervals to prevent that. You can use dd for that --

dd if=/dev/sdX of=/dev/sdX bs=1M conv=notrunc iflag=fullblock

This is even resilient to power failures (I tested that in a VM).
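
If the rewrite gets interrupted, it can also be resumed from an offset instead of starting over; a minimal sketch (the offset of 100000 1M blocks is purely illustrative) -- 

dd if=/dev/sdX of=/dev/sdX bs=1M skip=100000 seek=100000 conv=notrunc iflag=fullblock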

Tuesday, April 29, 2025

ffmpeg: Audio/video out of sync when a frame rate limit is set using -r.

If you've specified -r as an input option (i.e. before -i), you may want to try moving it after the input and just before -vcodec to resolve the issue. With this change the input is not frame-rate limited, but the encoding is.
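
For illustration, a hedged example (the file names and codecs are placeholders, not my actual command) -- 

Problematic form, -r applies to the input --
ffmpeg -r 30 -i input.mkv -vcodec libx264 -acodec copy output.mkv

Fixed form, -r moved after the input and before -vcodec, so only the encoding is limited to 30 fps --
ffmpeg -i input.mkv -r 30 -vcodec libx264 -acodec copy output.mkv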

Ext4 vs xfs (with and without rmapbt) massive small file operations benchmark

Methodology

/mnt/tmpfs/ contains trimmed Linux sources; large files were removed to bring the total size down to 5GB. /mnt/tmpfs/ is a tmpfs filesystem.

The following are the benchmarks done --
Copy operation --
time cp -a /mnt/tmpfs/* /mnt/temp/
Cold search --
time find /mnt/temp/ -iname '*a*' > /dev/null
Warm search --
time for i in {a..j}; do find /mnt/temp/ -iname "*$i*" > /dev/null; done
read all files in an alphabetic way (cold) --
time find /mnt/temp/ -type f | xargs -d $'\n' -r -P 100 -n 300 -L 300 cat > /dev/null
read all files in an alphabetic way (warm) --
time find /mnt/temp/ -type f | xargs -d $'\n' -r -P 100 -n 300 -L 300 cat > /dev/null
Write a certain small value to all files alphabetically (also check the CPU utilization of the script; a rough shell equivalent of the script is sketched after this list) --
cd /mnt/temp/
find /mnt/temp/ -type f > /tmp/flist.txt
dd if=/dev/urandom of=/tmp/write_data bs=1K count=6
time write_mulitple_files.rb /tmp/flist.txt /tmp/write_data
Delete dir tree --
time rm -rf /mnt/temp/*
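
write_mulitple_files.rb is a small Ruby helper; assuming it simply overwrites each listed file with the contents of /tmp/write_data, a rough shell equivalent (for illustration only, not what was benchmarked) would be -- 

while IFS= read -r f; do
    cat /tmp/write_data > "$f"
done < /tmp/flist.txt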

HDD benchmarks

mount and mkfs options

mount parameters for xfs --
mount -o logbufs=8,logbsize=256k,noquota,noatime

mount parameters for ext4 -- 
mount -o noatime,data=writeback,journal_async_commit,inode_readahead_blks=32768,max_batch_time=10000000,i_version,noquota,delalloc
nodelalloc was removed since bigalloc was removed.
ext4 is optimized for both small and large files; it shouldn't make a difference in performance.

format parameters for xfs and ext4 -- 
mkfs.ext4 -g 256 -G 4 -J size=100 -m 1 -O none,extent,flex_bg,has_journal,large_file,^uninit_bg,dir_index,dir_nlink,^sparse_super,^sparse_super2 -i 4096
bigalloc had to be removed because of the large number of inodes (expect worse performance with larger files, which this benchmark does not cover).
 
mkfs.xfs -f -m rmapbt=0,reflink=0

Results -- 

Test                                                    ext4          xfs
Create/copy                                             0m27.925s     0m21.857s
Cold search                                             0m0.157s      0m0.081s
Warm search                                             0m1.509s      0m0.752s
Read all files in an alphabetic way (cold, parallel)    0m0.253s      0m0.239s
Read all files in an alphabetic way (warm, parallel)    0m0.252s      0m0.238s
Write a certain small value to all files (parallel)     11m41.727s    11m43.711s
Delete dir tree                                         0m1.161s      0m1.086s

Conclusion -- 

Despite rmapbt (which improves performance with small files) being disabled in XFS, XFS is faster than ext4 in most tests. If this ext4 filesystem (which is tuned for small files rather than large ones) is used for operations on large files, expect lower performance.

SSD benchmarks

mount and mkfs options

blkdiscard done before each benchmark.
 
 
mount parameters for xfs --
mount -o logbufs=8,logbsize=256k,noquota,noatime

mount parameters for ext4 -- 
mount -o noatime,data=writeback,journal_async_commit,inode_readahead_blks=32768,max_batch_time=10000000,i_version,noquota,delalloc
nodelalloc was removed since bigalloc was removed.
ext4 is optimized for both small and large files; it shouldn't make a difference in performance.

format parameters for xfs and ext4 -- 
mkfs.ext4 -g 256 -G 4 -J size=100 -m 1 -O none,extent,flex_bg,has_journal,large_file,^uninit_bg,dir_index,dir_nlink,^sparse_super,^sparse_super2 -i 4096
bigalloc had to be removed because of the large number of inodes (expect worse performance with larger files, which this benchmark does not cover).
 
xfs with no rmapbt --
mkfs.xfs -f -m rmapbt=0,reflink=0

xfs with rmapbt -- 
mkfs.xfs -f -m rmapbt=1,reflink=0

Results -- 

ext4 --
    Copy operation --
    time cp -a /mnt/tmpfs/* /mnt/temp/
        real    0m48.826s
        user    0m0.204s
        sys     0m3.005s
        
        real    0m48.290s
        user    0m0.246s
        sys     0m2.898s

    Cold search --
    time find /mnt/temp/ -iname '*a*' > /dev/null
        real    0m0.172s
        user    0m0.074s
        sys     0m0.097s
        
        real    0m0.169s
        user    0m0.064s
        sys     0m0.105s
        
    Warm search --
    time for i in {a..j}; do find /mnt/temp/ -iname "*$i*" > /dev/null; done
        real    0m1.616s
        user    0m0.536s
        sys     0m1.075s
        
        real    0m1.651s
        user    0m0.615s
        sys     0m1.031s
        
    read all files in an alphabetic way (cold) --
    time find /mnt/temp/ -type f | xargs -d $'\n' -r -P 100 -n 300 -L 300 cat > /dev/null
    real    0m0.444s
    user    0m0.227s
    sys     0m2.850s
    
    real    0m0.402s
    user    0m0.271s
    sys     0m2.793s
    
    read all files in an alphabetic way (warm) --
    time find /mnt/temp/ -type f | xargs -d $'\n' -r -P 100 -n 300 -L 300 cat > /dev/null
    real    0m0.407s
    user    0m0.230s
    sys     0m2.851s
    
    real    0m0.402s
    user    0m0.223s
    sys     0m2.845s
    
    Write a certain small value to all files alphabetically (also check the CPU utilization of the script) --
    cd /mnt/temp/
    find -type f > /tmp/flist.txt
    dd if=/dev/urandom of=/tmp/write_data bs=1K count=6
    time /home/de/small/docs/Practice/Software/ruby/write_mulitple_files.rb /tmp/flist.txt /tmp/write_data
    real    9m59.305s
    user    9m53.748s
    sys     0m51.903s
    
    real    9m38.867s
    user    9m33.476s
    sys     0m49.930s
    
    Delete dir tree --
    time rm -rf /mnt/temp/*
    real    0m0.824s
    user    0m0.021s
    sys     0m0.743s
    
    real    0m0.820s
    user    0m0.038s
    sys     0m0.718s
xfs rmapbt=0 --
    Copy operation --
    time cp -a /mnt/tmpfs/* /mnt/temp/
    real    0m14.851s
    user    0m0.298s
    sys     0m3.860s
    
    Cold search --
    time find /mnt/temp/ -iname '*a*' > /dev/null
    real    0m0.082s
    user    0m0.054s
    sys     0m0.027s
    
    
    Warm search --
    time for i in {a..j}; do find /mnt/temp/ -iname "*$i*" > /dev/null; done
    real    0m0.694s
    user    0m0.511s
    sys     0m0.179s
    
    read all files in an alphabetic way (cold) --
    time find /mnt/temp/ -type f | xargs -d $'\n' -r -P 100 -n 300 -L 300 cat > /dev/null
    real    0m0.389s
    user    0m0.277s
    sys     0m2.680s
    
    
    read all files in an alphabetic way (warm) --
    time find /mnt/temp/ -type f | xargs -d $'\n' -r -P 100 -n 300 -L 300 cat > /dev/null
    real    0m0.388s
    user    0m0.256s
    sys     0m2.705s

    
    Write a certain small value to all files alphabetically (also check the CPU utilization of the script) --
    cd /mnt/temp/
    find /mnt/temp/ -type f > /tmp/flist.txt
    dd if=/dev/urandom of=/tmp/write_data bs=1K count=6
    time /home/de/small/docs/Practice/Software/ruby/write_mulitple_files.rb /tmp/flist.txt /tmp/write_data
    real    10m45.878s
    user    10m40.476s
    sys     0m7.636s
    
    Delete dir tree --
    time rm -rf /mnt/temp/*
    real    0m1.181s
    user    0m0.030s
    sys     0m0.482s
xfs rmapbt=1 --
    Copy operation --
    time cp -a /mnt/tmpfs/* /mnt/temp/
    real    0m2.883s
    user    0m0.159s
    sys     0m2.556s

    
    Cold search --
    time find /mnt/temp/ -iname '*a*' > /dev/null
    real    0m0.082s
    user    0m0.049s
    sys     0m0.033s
    
    Warm search --
    time for i in {a..j}; do find /mnt/temp/ -iname "*$i*" > /dev/null; done
    real    0m0.700s
    user    0m0.480s
    sys     0m0.216s
    
    read all files in an alphabetic way (cold) --
    time find /mnt/temp/ -type f | xargs -d $'\n' -r -P 100 -n 300 -L 300 cat > /dev/null
    real    0m0.389s
    user    0m0.218s
    sys     0m2.752s
    
    read all files in an alphabetic way (warm) --
    time find /mnt/temp/ -type f | xargs -d $'\n' -r -P 100 -n 300 -L 300 cat > /dev/null
    real    0m0.389s
    user    0m0.229s
    sys     0m2.739s
    
    Write a certain small value to all files alphabetically (also check the CPU utilization of the script) --
    cd /mnt/temp/
    find /mnt/temp/ -type f > /tmp/flist.txt
    dd if=/dev/urandom of=/tmp/write_data bs=1K count=6
    time /home/de/small/docs/Practice/Software/ruby/write_mulitple_files.rb /tmp/flist.txt /tmp/write_data
    real    8m53.297s
    user    8m48.394s
    sys     0m9.786s
    
    Delete dir tree --
    time rm -rf /mnt/temp/*
    real    0m2.373s
    user    0m0.024s
    sys     0m0.498s

Conclusion -- 

When comparing xfs rmapbt=1 and xfs rmapbt=0, rmapbt=1 wins on average (but not by a large margin).

When comparing xfs rmapbt=1 and ext4, xfs wins by a large margin.

Monday, April 28, 2025

Debian trixie vs Gentoo benchmark.

Recently I came across this benchmark which, although old, is laughable (if you don't know why, I suggest you either read up more about machine code or remain a happy Ubuntu user) because of the inaccurate benchmarking method with regard to Gentoo.

Also, around this time I had just installed Debian trixie (still in testing) on another machine and realized that the versions of various applications in their repositories were strikingly similar. So I decided to also do a casual benchmark which, although not that accurate, is FAR more accurate than that Phoronix benchmark.

Openssl (higher is better) -- 

Firefox https://browserbench.org/Speedometer2.1/ (higher is better) -- 

CPU and real run time of various CPU-intensive applications (lower is better) -- 

xz real and CPU time taken (lower is better) -- 

bash script benchmark results (lower is better) -- 


The machine is a Ryzen 5 PRO 2600 -- an old machine (x86_64-v3 instruction set). The contrast should be even higher with newer processors, especially x86_64-v4 (AVX-512) ones, because binary distributions (except Clear Linux) are optimized for the x86_64 baseline, which is three generations behind the latest. In short, you're not fully utilizing your shiny new x86_64-v4 processor unless you use Gentoo. In this regard, even Windows is better off, because its hefty 'minimum requirement' just for running the OS implies they can compile binaries above the baseline x86_64 instruction set.
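
As an aside, on systems with a reasonably new glibc (2.33+) you can check which x86_64 micro-architecture levels your CPU is recognized to support via the dynamic loader (the loader path may differ per distribution) -- 

/lib64/ld-linux-x86-64.so.2 --help | grep 'x86-64-v'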

As of now, I'm not able to get Chromium to run on Gentoo because the GPU of the machine has been blacklisted by Chrome. It works on an Intel platform though.

Many of the applications may use assembly code. These applications perform the same regardless of the optimizations applied by GCC. Common examples include openssl, various video codec libraries, prime95, etc., but I'm not entirely sure how much assembly they're using; this is why I chose sparsely used algorithms in openssl for benchmark purposes, since developers are less likely to put effort into a less-used algorithm.
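
For example, something along these lines benchmarks a comparatively less common cipher (the exact algorithms and options I used are in script.sh from the download below; this invocation is only illustrative) -- 

openssl speed -seconds 10 -evp camellia-256-cbc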

Many applications are not bottlenecked by the CPU, even though it may seem so; that's because they put more stress on memory speed than on the CPU. Even when memory is the bottleneck, CPU utilization is reported as 100% because of how closely the memory and the CPU work together. An example is compression workloads. In these benchmarks there will not be much of a difference.

imagemagick's compare was able to run on all 12 CPUs on Debian, but only 2 CPUs on Gentoo. As a result, I limited the benchmark to 2 CPUs; however, in this configuration, Debian's build of imagemagick took double the time of Gentoo's. Because of the large difference, I really doubt this is due to optimization differences between the two builds. For larger images, Gentoo's build is able to use all 12 CPUs, but since that was taking too much time (for both Debian and Gentoo) I abandoned it.

Package versions of Gentoo -- 

imagemagick-7.1.1.38-r2

bash-5.2_p37

openssl-3.3.3

firefox-128.8.0

ffmpeg-6.1.2-r1

xz-utils-5.6.4-r1

grep-3.11-r1

gcc - 14.2.1_p20250301 (all packages were built using this version. CFLAGS in make.conf were -march=znver1 --param=l1-cache-line-size=64 --param=l1-cache-size=32 --param=l2-cache-size=512 -fomit-frame-pointer -floop-interchange -floop-strip-mine -floop-block -fgraphite-identity -ftree-loop-distribution -O3 -pipe -flto=1 -fuse-linker-plugin -ffat-lto-objects -fno-semantic-interposition, however a few packages (like firefox) filter many of these CFLAGS out).

Package versions for Debian -- 

imagemagick-7.1.1.43+dfsg1-1

bash-5.2.37-1.1+b2

openssl-3.4.1-1

firefox-128.9.0esr-2

ffmpeg-7.1.1-1+b1

xz-utils-5.8.1-1

grep-3.11-4

gcc-14.2

The Debian system is a fresh install, while the Gentoo installation dates from 2009; over the years the same installation has been migrated/replicated across multiple machines. Debian was installed on a pendrive while Gentoo was installed on an SSD; of course, disk I/O was monitored during the benchmark and only the CPU was the bottleneck (there was no I/O wait). All data for the benchmark was loaded from an external HDD (here too, disk I/O was not the bottleneck).

For the source of the benchmark, download from here. These are its contents -- 

script.sh -- The script which was run for the benchmark.

ff-bench_debian.png/ff-bench_gentoo.png -- Screenshots of the Firefox benchmark (which, of course, the script did not run).

benchmark_results_debian.txt/result_gentoo.txt -- output of script.sh

shell_bench_Result_gentoo.txt/shell_bench_Result_debian.txt -- Output of shell-bench.sh on Gentoo/Debian.

shell-bench.sh -- Grep and bash benchmark script.