NVIDIA: 32 threads per warp - all threads in a warp execute the same instruction in lockstep (SIMT)
AMD: 64 threads per wavefront on GCN and CDNA; RDNA defaults to 32 (wave32) and also supports 64 - same lockstep model as an NVIDIA warp, just with wider groups
Apple Silicon: 32 threads per SIMD group
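
The group width decides how many warps, wavefronts, or SIMD groups a block of threads occupies, and how many lanes sit idle when the block size is not a multiple of that width. A minimal sketch of that arithmetic follows, using the widths listed above; the dictionary labels and function names are hypothetical, chosen for illustration, and are not vendor API names.

```python
# SIMD group widths taken from the list above. The keys are illustrative
# labels only, not identifiers from any vendor API.
SIMD_WIDTH = {
    "nvidia_warp": 32,
    "amd_wave64": 64,
    "amd_wave32": 32,
    "apple_simdgroup": 32,
}


def groups_needed(block_threads: int, width: int) -> int:
    """Number of warps/wavefronts/SIMD groups a block of threads occupies."""
    # Integer ceiling division: a partially filled group still counts as one.
    return (block_threads + width - 1) // width


def idle_lanes(block_threads: int, width: int) -> int:
    """Lanes left unused in the last, partially filled group."""
    return groups_needed(block_threads, width) * width - block_threads


if __name__ == "__main__":
    block = 100  # example block size, deliberately not a multiple of 32 or 64
    for name, width in SIMD_WIDTH.items():
        print(name, groups_needed(block, width), idle_lanes(block, width))
```

A block of 100 threads fills four 32-wide groups or two 64-wide groups; in both cases 28 lanes go unused, which is why block sizes are usually chosen as a multiple of the group width.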