A few days ago, I managed to port RustPython to BeOS, using w2c2 to convert the wasi binary of RustPython to C, and then compiling that C into a BeOS binary. (Does anyone know a way to get a cross-compiler to BeOS? I could not build with optimisations on a VM.) [Here](https://0x0.st/Hk_d.be.7z) is my prebuilt BeOS binary.
I am currently considering moving a framework I built from Python to Rust to make it faster and to take advantage of Rust's safety features. However, one of my requirements is that users can still write Python code, so I was thinking about using RustPython for that. I have been doing basic experiments, but I would like to ask whether anyone has done this before and what limitations you ran into along the way. I have read somewhere that RustPython now seems to support pip packages, but I am not sure about the limitations there either.
Earlier this year, I took a month to reexamine my coding habits and rethink some past design choices. I hope to rewrite and improve my FOSS libraries this year, and I needed answers to a few questions first. Perhaps some of these questions will resonate with others in the community, too.
- Are coroutines viable for high-performance work?
- Should I use SIMD intrinsics for clarity or drop to assembly for easier library distribution?
- Has hardware caught up with vectorized scatter/gather in AVX-512 & SVE?
- How do secure enclaves & pointer tagging differ on Intel, Arm, & AMD?
- What's the throughput gap between CPU and GPU Tensor Cores (TCs)?
- How costly are misaligned memory accesses & split-loads, and what gains do non-temporal loads/stores offer?
- Which parts of the standard library hit performance hardest?
- How do error-handling strategies compare overhead-wise?
- What's the compile-time vs. run-time trade-off for lazily evaluated ranges?
- What practical, non-trivial use cases exist for meta-programming?
- How challenging is Linux Kernel bypass with io_uring vs. POSIX sockets?
- How close are we to effectively using Networking TS or heterogeneous Executors in C++?
- What are best practices for propagating stateful allocators in nested containers, and which libraries support them?
These questions span from micro-kernel optimizations (nanoseconds) to distributed systems (micro/millisecond latencies). Rather than tackling them all in one post, I compiled my explorations into a repository—extending my previous Google Benchmark tutorial (<https://ashvardanian.com/posts/google-benchmark>)—to serve as a sandbox for performance experimentation.
Some fun observations:
- Compilers now vectorize 3x3x3 and 4x4x4 single/double precision multiplications well! The smaller one is ~60% slower despite 70% fewer operations, outperforming my vanilla SSE/AVX and coming within 10% of AVX-512.
- Nvidia TCs vary dramatically across generations in numeric types, throughput, tile shapes, thread synchronization (thread/quad-pair/warp/warp-groups), and operand storage. Post-Volta, manual PTX is often needed (as intrinsics lag), though the new TileIR (introduced at GTC) promises improvements for dense linear algebra kernels.
- The AI wave drives CPUs and GPUs to converge in mat-mul throughput & programming complexity. It took me a day to debug TMM register initialization, and SME is equally odd. Sierra Forest packs 288 cores/socket, and AVX10.2 drops 256-bit support for 512-bit... I wonder if discrete Intel GPUs are even needed, given CPU advances?
- In common floating-point ranges, scalar sine approximations can be up to 40x faster than standard implementations, even without SIMD. It's a bit hand-wavy, though; I wish more projects documented error bounds and had 1 & 3.5 ULP variants like Sleef.
- Meta-programming tools like CTRE can outperform typical RegEx engines by 5x and simplify building parsers compared to hand-crafted FSMs.
- DPDK/SPDK and io_uring were once clearly distinct in complexity and performance, but the gap is narrowing. While pre-5.5 io_uring can boost UDP throughput by 4x on loopback IO, newer zero-copy and concurrency optimizations remain challenging.
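To make the sine observation above a bit more concrete, here is a minimal sketch of the kind of scalar approximation I mean. It is not the code from the repository: `sin_maclaurin7` and `max_abs_error` are hypothetical names, the polynomial is a plain degree-7 Maclaurin expansion with no range reduction, and the only thing it shows is how to state an error bound next to the approximation. It says nothing about speed on its own.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical helper: degree-7 Maclaurin polynomial for sin(x),
//   x - x^3/3! + x^5/5! - x^7/7!
// evaluated with Horner's scheme. Assumes |x| is small (roughly within
// [-pi/2, pi/2]); there is no range reduction, so accuracy degrades quickly
// outside that interval.
inline double sin_maclaurin7(double x) {
    double const x2 = x * x;
    return x * (1.0 + x2 * (-1.0 / 6.0 + x2 * (1.0 / 120.0 + x2 * (-1.0 / 5040.0))));
}

// Hypothetical helper: largest absolute deviation from std::sin over
// `steps + 1` evenly spaced samples in [lo, hi].
inline double max_abs_error(double lo, double hi, int steps) {
    double worst = 0.0;
    for (int i = 0; i <= steps; ++i) {
        double const x = lo + (hi - lo) * i / steps;
        worst = std::max(worst, std::fabs(sin_maclaurin7(x) - std::sin(x)));
    }
    return worst;
}
```

Since the series alternates, the truncation error is bounded by the first dropped term, x^9/9!. That is about 1.6e-4 at pi/2 and about 3.1e-7 at pi/4, so how well the input range is constrained matters a lot. Calling `max_abs_error` over the range you actually care about is a cheap way to document that bound alongside the function.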
The repository is loaded with links to favorite CppCon lectures, GitHub snippets, and tech blog posts. Recognizing that many high-level concepts are handled differently across languages, I've also started porting examples to Rust & Python in separate repos. Coroutines look bad everywhere :(
Overall, this research project was rewarding! Most questions found answers in code — except pointer tagging and secure enclaves, which still elude me in public cloud. I'd love to hear from others, especially on comparing High-Level Synthesis for small matrix multiplications on FPGAs versus hand-written VHDL/Verilog for integral types. Let me know if you have ideas for other cool, obscure topics to cover!
Hi HN, we are building an open-source operating system extension for fully decentralized orchestration of robot swarms.
This first beta version allows you to create fully decentralized robot swarms. The system will set up a wireless mesh network and run a p2p networking stack on top of it, such that nodes can interact with each other through various abstractions using our SDKs (Rust, Python, TypeScript) or a CLI.
We hope this is a step toward better inter-robot communication (and a fun project if you have a few Raspberry Pis lying around).
Our mesh network is created by B.A.T.M.A.N.-adv, and we’ve combined this with optimized decentralized algorithms. For a user, it becomes very easy to write decentralized applications involving several peers, since we’ve abstracted away much of the complexity. Our system currently offers several orchestration primitives: Key-Value Store, Pub-Sub, Discovery, Request-Response, Mesh Inspection, Debug Services, and more.
Internally, everything except the SDKs is written in Rust, building on top of libp2p. We use gRPC to communicate between the SDKs and the CLI, so libraries for other languages are possible, and we welcome contributions (or feedback).
A C++ SDK and a ROS package that should feel natural to roboticists are in the works. We also want to support collaborative SLAM and a distributed task queue soon.