The busy delay is incurred after each and every write to the YM data port, not just KeyON/OFF.
I worked with @StephenHorn on the box16 emulator project while he was refactoring that emulator to use the new BSD-licensed YMFM core library. I've learned quite a bit about how this chip works.
R38 has several inaccuracies in the YM implementation - the core it uses does not emulate the busy behavior at all. You can bang away on it all day long and it won't skip a beat. The library just handles it. Furthermore, I don't think the IRQ functionality is emulated, and I also think the timer behavior might not be implemented either - I'll go re-test this and let y'all know...
Box16 has extended the YM support in several ways - it supports YM IRQs as well as busy flag behavior. Some of this has been a matter of the group's interpretation of other example code and/or the official YM2151 application manual. In other words, the current behavior in Box16 is our best guess as to the real behavior. Stephen has done an excellent job integrating this code and dealing with a colossal time synchronization nightmare, by the way. Kudos to his hard work! One of his decisions was to leave these "enhancements" disabled by default to maintain functional parity with the official emulator. They are activated by command-line arguments.
Interestingly, when testing the accuracy improvements, one of my VGM playback routines went nuts and played an entire song back in less than 1.5 seconds. It turns out that the original VGM data stream contains commands that enable YM IRQs. My player just used WAI as a poor man's IRQ handler to wait for VSYNC, so it mistook these YM IRQs for VBLANK IRQs. The player was built to read the busy flag though, and even with strict enforcement of the busy state it doesn't drop any updates, so that part seems to work properly. After removing the IRQ enable messages from the VGM, it plays back perfectly on Box16 even when it enforces busy flags.
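For anyone who hasn't written one, the busy-flag check the player relies on looks roughly like the sketch below. It's Python against a toy chip model, not real X16 code. FakeYM and safe_write are made-up names, the single write(reg, value) call glosses over the separate address and data ports, and the model ASSUMES that a write made while busy is simply dropped, which is exactly the part nobody has confirmed on real hardware. The only chip fact it leans on is that the busy flag is bit 7 of the status byte.

```python
BUSY_BIT = 0x80  # status bit 7 is the busy flag on the YM2151


class FakeYM:
    """Toy chip model (hypothetical). ASSUMPTION: the chip stays busy for a
    fixed number of status reads after each accepted write, and any write
    that arrives while busy is dropped."""

    def __init__(self, busy_reads=3):
        self.busy_reads = busy_reads
        self.remaining = 0      # status reads left until the chip is idle
        self.accepted = []      # writes the chip took
        self.dropped = []       # writes that arrived while busy

    def read_status(self):
        if self.remaining > 0:
            self.remaining -= 1
            return BUSY_BIT
        return 0

    def write(self, reg, value):
        if self.remaining > 0:
            self.dropped.append((reg, value))
        else:
            self.accepted.append((reg, value))
            self.remaining = self.busy_reads


def safe_write(chip, reg, value, max_polls=1000):
    """Poll the busy flag, then write. Returns False if the flag never clears."""
    for _ in range(max_polls):
        if not (chip.read_status() & BUSY_BIT):
            chip.write(reg, value)
            return True
    return False


# Example register/value pairs, chosen only for illustration.
chip = FakeYM()
for reg, value in [(0x20, 0xC7), (0x08, 0x78), (0x08, 0x00)]:
    safe_write(chip, reg, value)
```

The max_polls cap is just a safety net so the loop can't hang forever if the flag never clears. On this model, polled writes all land, while back-to-back blind writes lose the second one.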
The big question is how will this work on the real thing? As I said above, the behavior is currently our best guess based on other sources, none of which is real hardware.
One thing I can say is that @SlithyMatt's Chase Vault game works on real HW, and the game's YM routine uses busy flag reads to ensure that it does not write to the chip when it's busy - so that's a good sign. Beyond that, I've sent a basic read test program to Kevin, as the YM reading functionality has never been subjected to testing on X16 hardware the way writing has been. So far, the "hello world" results are interesting. Kevin's only been able to test at 2 and 4 MHz, which the math says should work just fine on YM. 8 MHz is the really important one, but as the system is currently experiencing other 8 MHz-related problems, Kevin wasn't able to test at that speed.
In short - the test program successfully read IRQ status flags w/o any errors at 2 and 4 MHz. Interestingly, the busy flag read test never saw the flag set, but I think there might've been a bug in my code.
I'm personally convinced that the write delay specification is actually a little bit different in reality than you might expect from reading the application manual. That is, I strongly suspect that the "64 YM clock delay" is not a minimum but a maximum. The data sheet definitely has other errata, and I'm convinced this is just another example.
I have a real YM to test with, but don't have everything lined up enough to be able to put it on a breadboard and do real testing just yet. If I can get that done, I'm definitely going to poke around with the real chip's busy state flag to see what makes it tick.
YM and write performance / overhead:
As for writing performance with the YM, I'd like to point out that in my experience with generating byte streams of PSG writes and FM writes, the FM music data streams are much smaller than the PSG streams. This is completely to be expected, because the FM chip does so much of the modulation in hardware that you don't need to update it nearly as often to get decent-sounding music. PSGs must be constantly modified for every little thing, be it pulse width modulation, vibrato, etc. - all of which is done in hardware on the FM chips. Consequently, you end up performing SIGNIFICANTLY fewer writes to an FM chip than you do to a PSG over the course of a tune. Still, sitting around for ~144 clock cycles waiting on the chip to finish chewing the last bite is not ideal.
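To show where a number like ~144 comes from, here's the arithmetic as a quick Python sketch. It takes the "64 YM clock" busy window from the application manual and ASSUMES the YM is clocked at 3.579545 MHz. That clock value is my assumption, not something stated above, so plug in the real figure if it differs. busy_cpu_cycles is just an illustrative name.

```python
# ASSUMPTION: YM2151 input clock on the X16 (not stated in this thread).
YM_CLOCK_HZ = 3_579_545
# Busy window length in YM clocks, per the application manual.
BUSY_YM_CLOCKS = 64


def busy_cpu_cycles(cpu_hz, ym_hz=YM_CLOCK_HZ, ym_clocks=BUSY_YM_CLOCKS):
    """CPU cycles that elapse during one YM busy window."""
    return ym_clocks * cpu_hz / ym_hz


for mhz in (2, 4, 8):
    cycles = busy_cpu_cycles(mhz * 1_000_000)
    print(f"{mhz} MHz CPU: about {cycles:.1f} CPU cycles per YM write")
```

With that assumed clock the 8 MHz case comes out at a bit over 143 cycles, which is in the same ballpark as the ~144 quoted above. The cost scales linearly with CPU speed, so the wait at 2 and 4 MHz is a quarter and a half of that.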
I believe one way to handle YM writes may be to queue up the writes into a ring buffer, and have an IRQ that empties the ring buffer. You can set the YM's timers to run a little slower than the actual rate the YM could drain the buffer in order to cut down on the IRQ overhead. This would have the effect of spreading the load evenly throughout a frame if that's what you'd like to do. Kevin has confirmed that the YM2151's IRQ line is indeed connected to the system, so this should be doable on the real system.
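Here's a rough Python sketch of that ring buffer idea, just to show the shape of it. YMWriteQueue, on_timer_irq and the ym_write callback are all made-up names, and a real version would be 65C02 assembly hooked to the actual IRQ vector. The sketch sends one queued write per timer IRQ, on the assumption that the timer period is set longer than the busy window so the chip is always ready by the time the handler runs.

```python
class YMWriteQueue:
    """Fixed-size ring buffer of (register, value) pairs (hypothetical helper).
    One slot is kept empty so that head == tail always means 'empty'."""

    def __init__(self, size=64):
        self.buf = [None] * size
        self.size = size
        self.head = 0   # next slot to fill (main program side)
        self.tail = 0   # next slot to drain (IRQ handler side)

    def push(self, reg, value):
        nxt = (self.head + 1) % self.size
        if nxt == self.tail:
            return False            # full: caller has to retry later
        self.buf[self.head] = (reg, value)
        self.head = nxt
        return True

    def pop(self):
        if self.tail == self.head:
            return None             # empty
        item = self.buf[self.tail]
        self.tail = (self.tail + 1) % self.size
        return item


def on_timer_irq(queue, ym_write):
    """Run once per YM timer IRQ: send at most one queued write.
    Returns True if a write was sent, False if the queue was empty."""
    item = queue.pop()
    if item is None:
        return False
    ym_write(*item)
    return True


# Usage: queue two example writes, then simulate IRQs until the queue is empty.
sent = []
queue = YMWriteQueue(size=4)
queue.push(0x20, 0xC7)
queue.push(0x08, 0x78)
while on_timer_irq(queue, lambda reg, value: sent.append((reg, value))):
    pass
```

Because only the main program moves head and only the handler moves tail, the two sides don't fight over the same index, which is the usual reason to pick a ring buffer for this kind of producer/IRQ-consumer split.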