Tuesday, April 29, 2025

ffmpeg: Audio/video out of sync when a frame rate limit is set using -r.

 If you've specified -r before the input (-i), you may like to try moving it so that it sits just before -vcodec to resolve the issue. With this change, the input is not frame-rate limited, but the encoding is.
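
A minimal sketch of the two placements, assuming a 30 fps target, libx264 and placeholder file names (adjust to your actual codec and rate):

# -r before -i is an input option: the source is re-timed and audio/video can drift
ffmpeg -r 30 -i input.mp4 -vcodec libx264 -acodec copy out_desync.mp4
# -r after -i and before -vcodec is an output option: only the encode is frame limited
ffmpeg -i input.mp4 -r 30 -vcodec libx264 -acodec copy out_ok.mp4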

Ext4 vs xfs (with and without rmapbt): massive small-file operations benchmark

 Methodology

/mnt/tmpfs/ contains trimmed Linux sources. Large files were removed to bring the total size down to 5 GB. /mnt/tmpfs/ is a tmpfs filesystem.

The following benchmarks were run --
Copy operation --
time cp -a /mnt/tmpfs/* /mnt/temp/
Cold search --
time find /mnt/temp/ -iname '*a*' > /dev/null
Warm search --
time for i in {a..j}; do find /mnt/temp/ -iname "*$i*" > /dev/null; done
read all files in an alphabetic way (cold) --
time find /mnt/temp/ -type f | xargs -d $'\n' -r -P 100 -n 300 -L 300 cat > /dev/null
read all files in an alphabetic way (warm) --
time find /mnt/temp/ -type f | xargs -d $'\n' -r -P 100 -n 300 -L 300 cat > /dev/null
Write a certain small value to all files alphabetically (also check the script's CPU utilization; a shell sketch of an equivalent write loop follows this list) --
cd /mnt/temp/
find /mnt/temp/ -type f > /tmp/flist.txt
dd if=/dev/urandom of=/tmp/write_data bs=1K count=6
time write_mulitple_files.rb /tmp/flist.txt /tmp/write_data
Delete dir tree --
time rm -rf /mnt/temp/*
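
write_mulitple_files.rb is a small Ruby script that is not reproduced here. As a rough idea of what the write step does, an equivalent (hypothetical, sequential) shell loop is shown below; the actual script performs the writes in parallel, which is why its CPU utilization is interesting:

# overwrite every file listed in /tmp/flist.txt with the same 6 KiB blob
while IFS= read -r f; do
    cat /tmp/write_data > "$f"
done < /tmp/flist.txt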

HDD benchmarks

mount and mkfs options

mount parameters for xfs --
mount -o logbufs=8,logbsize=256k,noquota,noatime

mount parameters for ext4 -- 
mount -o noatime,data=writeback,journal_async_commit,inode_readahead_blks=32768,max_batch_time=10000000,i_version,noquota,delalloc
nodelalloc was removed since bigalloc was removed.
ext4 is optimized for small + large files. It shouldn't make a difference in performance.

format parameters for xfs and ext4 -- 
mkfs.ext4 -g 256 -G 4 -J size=100 -m 1 -O none,extent,flex_bg,has_journal,large_file,^uninit_bg,dir_index,dir_nlink,^sparse_super,^sparse_super2 -i 4096
bigalloc had to be removed because of the large number of inodes (expect worse performance with larger files, which this benchmark does not cover).
 
mkfs.xfs -f -m rmapbt=0,reflink=0
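
For reference, the complete preparation for one xfs run would look roughly like this; /dev/sdX1 is a placeholder, not the actual partition used (the ext4 run is analogous, using the mkfs.ext4 and mount lines above):

mkfs.xfs -f -m rmapbt=0,reflink=0 /dev/sdX1
mount -o logbufs=8,logbsize=256k,noquota,noatime /dev/sdX1 /mnt/temp/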

Results -- 

ext4 --
Create/copy --
0m27.925s
Cold search --
0m0.157s
Warm search --
0m1.509s
read all files in an alphabetic way (cold) (parallel) --
0m0.253s
read all files in an alphabetic way (warm) (parallel) --
0m0.252s
Write a certain small value to all files alphabetically in parallel --
11m41.727s
Delete dir tree --
0m1.161s

xfs --
Create/copy --
0m21.857s
Cold search --
0m0.081s
Warm search --
0m0.752s
read all files in an alphabetic way (cold) (parallel) --
0m0.239s
read all files in an alphabetic way (warm) (parallel) --
0m0.238s
Write a certain small value to all files alphabetically in parallel --
11m43.711s
Delete dir tree --
0m1.086s

Conclusion -- 

Despite rmapbt, which helps small-file performance, being disabled in XFS, XFS is faster than ext4 in most tests. If this ext4 configuration (which, with bigalloc removed, favours small files) is used for operations on large files, expect lower performance.

SSD benchmarks

mount and mkfs options

blkdiscard was run before each benchmark.
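
i.e. something like the following, with /dev/sdX as a placeholder for the SSD under test:

blkdiscard /dev/sdX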
 
 
mount parameters for xfs --
mount -o logbufs=8,logbsize=256k,noquota,noatime

mount parameters for ext4 -- 
mount -o noatime,data=writeback,journal_async_commit,inode_readahead_blks=32768,max_batch_time=10000000,i_version,noquota,delalloc
nodelalloc was removed since bigalloc was removed.
ext4 is optimized for small + large files. It shouldn't make a difference in performance.

format parameters for xfs and ext4 -- 
mkfs.ext4 -g 256 -G 4 -J size=100 -m 1 -O none,extent,flex_bg,has_journal,large_file,^uninit_bg,dir_index,dir_nlink,^sparse_super,^sparse_super2 -i 4096
bigalloc had to be removed because of the large number of inodes (expect worse performance with larger files, which this benchmark does not cover).
 
xfs with no rmapbt --
mkfs.xfs -f -m rmapbt=0,reflink=0

xfs with rmapbt -- 
mkfs.xfs -f -m rmapbt=1,reflink=0

Results -- 

ext4 --
    Copy operation --
    time cp -a /mnt/tmpfs/* /mnt/temp/
        real    0m48.826s
        user    0m0.204s
        sys     0m3.005s
        
        real    0m48.290s
        user    0m0.246s
        sys     0m2.898s

    Cold search --
    time find /mnt/temp/ -iname '*a*' > /dev/null
        real    0m0.172s
        user    0m0.074s
        sys     0m0.097s
        
        real    0m0.169s
        user    0m0.064s
        sys     0m0.105s
        
    Warm search --
    time for i in {a..j}; do find /mnt/temp/ -iname "*$i*" > /dev/null; done
        real    0m1.616s
        user    0m0.536s
        sys     0m1.075s
        
        real    0m1.651s
        user    0m0.615s
        sys     0m1.031s
        
    read all files in an alphabetic way (cold) --
    time find /mnt/temp/ -type f | xargs -d $'\n' -r -P 100 -n 300 -L 300 cat > /dev/null
    real    0m0.444s
    user    0m0.227s
    sys     0m2.850s
    
    real    0m0.402s
    user    0m0.271s
    sys     0m2.793s
    
    read all files in an alphabetic way (warm) --
    time find /mnt/temp/ -type f | xargs -d $'\n' -r -P 100 -n 300 -L 300 cat > /dev/null
    real    0m0.407s
    user    0m0.230s
    sys     0m2.851s
    
    real    0m0.402s
    user    0m0.223s
    sys     0m2.845s
    
    Write a certain small value to all files alphabetically (check for CPU utilization too of the script) --
    cd /mnt/temp/
    find -type f > /tmp/flist.txt
    dd if=/dev/urandom of=/tmp/write_data bs=1K count=6
    time /home/de/small/docs/Practice/Software/ruby/write_mulitple_files.rb /tmp/flist.txt /tmp/write_data
    real    9m59.305s
    user    9m53.748s
    sys     0m51.903s
    
    real    9m38.867s
    user    9m33.476s
    sys     0m49.930s
    
    Delete dir tree --
    time rm -rf /mnt/temp/*
    real    0m0.824s
    user    0m0.021s
    sys     0m0.743s
    
    real    0m0.820s
    user    0m0.038s
    sys     0m0.718s
xfs rmapbt=0 --
    Copy operation --
    time cp -a /mnt/tmpfs/* /mnt/temp/
    real    0m14.851s
    user    0m0.298s
    sys     0m3.860s
    
    Cold search --
    time find /mnt/temp/ -iname '*a*' > /dev/null
    real    0m0.082s
    user    0m0.054s
    sys     0m0.027s
    
    
    Warm search --
    time for i in {a..j}; do find /mnt/temp/ -iname "*$i*" > /dev/null; done
    real    0m0.694s
    user    0m0.511s
    sys     0m0.179s
    
    read all files in an alphabetic way (cold) --
    time find /mnt/temp/ -type f | xargs -d $'\n' -r -P 100 -n 300 -L 300 cat > /dev/null
    real    0m0.389s
    user    0m0.277s
    sys     0m2.680s
    
    
    read all files in an alphabetic way (warm) --
    time find /mnt/temp/ -type f | xargs -d $'\n' -r -P 100 -n 300 -L 300 cat > /dev/null
    real    0m0.388s
    user    0m0.256s
    sys     0m2.705s

    
    Write a certain small value to all files alphabetically (check for CPU utilization too of the script) --
    cd /mnt/temp/
    find /mnt/temp/ -type f > /tmp/flist.txt
    dd if=/dev/urandom of=/tmp/write_data bs=1K count=6
    time /home/de/small/docs/Practice/Software/ruby/write_mulitple_files.rb /tmp/flist.txt /tmp/write_data
    real    10m45.878s
    user    10m40.476s
    sys     0m7.636s
    
    Delete dir tree --
    time rm -rf /mnt/temp/*
    real    0m1.181s
    user    0m0.030s
    sys     0m0.482s
xfs rmapbt=1 --
    Copy operation --
    time cp -a /mnt/tmpfs/* /mnt/temp/
    real    0m2.883s
    user    0m0.159s
    sys     0m2.556s

    
    Cold search --
    time find /mnt/temp/ -iname '*a*' > /dev/null
    real    0m0.082s
    user    0m0.049s
    sys     0m0.033s
    
    Warm search --
    time for i in {a..j}; do find /mnt/temp/ -iname "*$i*" > /dev/null; done
    real    0m0.700s
    user    0m0.480s
    sys     0m0.216s
    
    read all files in an alphabetic way (cold) --
    time find /mnt/temp/ -type f | xargs -d $'\n' -r -P 100 -n 300 -L 300 cat > /dev/null
    real    0m0.389s
    user    0m0.218s
    sys     0m2.752s
    
    read all files in an alphabetic way (warm) --
    time find /mnt/temp/ -type f | xargs -d $'\n' -r -P 100 -n 300 -L 300 cat > /dev/null
    real    0m0.389s
    user    0m0.229s
    sys     0m2.739s
    
    Write a certain small value to all files alphabetically (check for CPU utilization too of the script) --
    cd /mnt/temp/
    find /mnt/temp/ -type f > /tmp/flist.txt
    dd if=/dev/urandom of=/tmp/write_data bs=1K count=6
    time /home/de/small/docs/Practice/Software/ruby/write_mulitple_files.rb /tmp/flist.txt /tmp/write_data
    real    8m53.297s
    user    8m48.394s
    sys     0m9.786s
    
    Delete dir tree --
    time rm -rf /mnt/temp/*
    real    0m2.373s
    user    0m0.024s
    sys     0m0.498s

Conclusion -- 

When comparing xfs rmapbt=1 and xfs rmapbt=0, rmapbt=1 wins on average (but not by a large margin).

When comparing xfs rmapbt=1 and ext4, xfs wins by a large margin.

Monday, April 28, 2025

Debian trixie vs Gentoo benchmark.

Recently I came across this benchmark, which, although old, is laughable (if you don't know why, I suggest you either read up more about machine code or remain a happy Ubuntu user) because of the inaccurate benchmarking method with regard to Gentoo.

Also at this time I had just installed Debian trixie (still in testing) on another machine and realized that the versions of various applications in its repositories were strikingly similar. So I decided to also do a casual benchmark which, although not that accurate, is FAR more accurate than that Phoronix benchmark.

 Openssl (higher the better) -- 

 Firefox https://browserbench.org/Speedometer2.1/ (higher the better) -- 

 CPU and real run time of various CPU intensive applications (lower the better) -- 

 xz real and CPU time taken (lower the better) -- 

bash script benchmark results (lower the better) -- 


The machine is a Ryzen 5 PRO 2600, which is an old machine (x86_64-v3 instruction set). The contrast should be even greater with newer processors, especially x86_64-v4 (AVX-512) ones, because binary distributions (except Clear Linux) are optimized for the baseline x86_64 instruction set, which is 3 generations behind the latest. In short, you're not fully utilizing your shiny new x86_64-v4 processor unless you use Gentoo. In these matters, even Windows is better off, because its hefty 'minimum requirement' just for running the OS implies they can compile binaries above the baseline x86_64 instruction set.
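
For the curious, the x86_64 level supported by your CPU and toolchain can be checked like this (not part of the benchmark; the first command needs glibc 2.33 or newer, and the exact ld.so path may differ per distribution):

/lib64/ld-linux-x86-64.so.2 --help | grep -E 'x86-64-v[234]'
gcc -march=native -Q --help=target | grep -- '-march='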

As of now, I'm not able to get Chromium to run on Gentoo because the GPU of this machine has been blacklisted by Chrome. It works on an Intel platform though.

Many of the applications may use hand-written assembly code. These applications perform the same regardless of the optimizations applied by GCC. Common examples include openssl, various video codec libraries, prime95 etc., but I'm not entirely sure how much assembly they're using; this is the reason why I chose sparsely used algorithms in openssl for benchmarking, since developers are less likely to put effort into hand-optimizing a less-used algorithm.
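
For example, a sparsely used cipher can be timed with openssl's built-in benchmark; camellia here is only an illustration, the exact algorithms used are in script.sh:

openssl speed -evp camellia-256-cbc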

Many applications are not bottlenecked by the CPU even though it may seem so; that's because they put more stress on memory speed than on the CPU. Even when memory is the bottleneck, CPU utilization is reported as 100% because of how closely the memory and CPU work together. Compression workloads are an example. In these benchmarks, there will not be much of a difference for such workloads.
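
One way to tell the two apart (not something this post does) is to look at instructions per cycle with perf; low IPC at 100% CPU utilization usually means the cores are stalling on memory. A sketch, with somefile as a placeholder:

perf stat -e cycles,instructions,cache-misses xz -9 -c somefile > /dev/null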

imagemagick's compare was able to run on all 12 CPUs on Debian, but only on 2 CPUs on Gentoo. As a result, I limited the benchmark to 2 CPUs; however, in this configuration Debian's build of imagemagick took double the time of Gentoo's. Because of the large difference, I really doubt this is due to the optimization differences between the 2 builds. For larger images, Gentoo's build is able to use all 12 CPUs, but since it was taking too much time (for both Debian and Gentoo), I abandoned it.
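
For reference, restricting ImageMagick to 2 threads can be done either through its resource limit or the MAGICK_THREAD_LIMIT environment variable (file names below are placeholders):

compare -limit thread 2 a.png b.png diff.png
MAGICK_THREAD_LIMIT=2 compare a.png b.png diff.png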

Package versions of Gentoo -- 

imagemagick-7.1.1.38-r2

bash-5.2_p37

openssl-3.3.3

firefox-128.8.0

ffmpeg-6.1.2-r1

xz-utils-5.6.4-r1

grep-3.11-r1

gcc - 14.2.1_p20250301 (all packages were built using this version. CFLAGS in make.conf were -march=znver1 --param=l1-cache-line-size=64 --param=l1-cache-size=32 --param=l2-cache-size=512 -fomit-frame-pointer -floop-interchange -floop-strip-mine -floop-block -fgraphite-identity -ftree-loop-distribution -O3 -pipe -flto=1 -fuse-linker-plugin -ffat-lto-objects -fno-semantic-interposition; however, a few packages (like firefox) filter many of these CFLAGS out).
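
The system-wide flags can be confirmed with emerge --info, and the flags recorded for an individual package can be inspected in the VDB (standard Portage paths assumed):

emerge --info | grep -i cflags
cat /var/db/pkg/www-client/firefox-*/CFLAGS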

Package versions for Debian -- 

imagemagick-7.1.1.43+dfsg1-1

bash-5.2.37-1.1+b2

openssl-3.4.1-1

firefox-128.9.0esr-2

ffmpeg-7.1.1-1+b1

xz-utils-5.8.1-1

grep-3.11-4

gcc-14.2

The Debian system is a fresh install, while the Gentoo installation dates from 2009. Over the years, the same installation has been migrated/replicated across multiple machines. Debian was installed on a pendrive while Gentoo was installed on an SSD; of course disk I/O was watched during the benchmark and only the CPU was the bottleneck (there was no I/O wait). All data for the benchmark was loaded from an external HDD (here too disk I/O was not the bottleneck).

The sources for the benchmark can be downloaded from here. These are its contents -- 

script.sh -- The script which was run for the benchmark.

ff-bench_debian.png/ff-bench_gentoo.png -- Screenshot of FF benchmark (which of course the script did not run).

benchmark_results_debian.txt/result_gentoo.txt -- output of script.sh

shell_bench_Result_gentoo.txt/shell_bench_Result_debian.txt -- Output of shell-bench.sh on Gentoo/Debian.

shell-bench.sh -- Grep and bash benchmark script.