Just one question. If I'm running a local model, can I use something other than a context-free grammar? Does it make sense to have something more general, or would it just be too slow?
I guess the only hard constraint is that there's no backtracking, right? So that previously emitted tokens aren't wasted.
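To make the "more general than a CFG" idea concrete, here is a minimal sketch, not taken from any particular runtime. It assumes the constraint is written as an ordinary function that answers "can this prefix still be extended to a valid output?". The language a^n b^n c^n is used because it is not context-free. The names `viable_prefix` and `allowed_next` are made up for illustration.

```python
def viable_prefix(s: str) -> bool:
    """True if s can still be extended to some a^n b^n c^n with n >= 1."""
    i, n = 0, len(s)
    counts = []
    # Consume a run of a's, then b's, then c's, counting each run.
    for letter in "abc":
        run = 0
        while i < n and s[i] == letter:
            run += 1
            i += 1
        counts.append(run)
    if i != n:
        return False  # letters out of order, or an unknown character
    a, b, c = counts
    if b > a:
        return False  # more b's than a's can never be repaired
    if c > 0 and (b != a or c > a):
        return False  # c's may only start once b's match a's
    return True


def allowed_next(prefix: str, vocab: list[str]) -> list[str]:
    """Tokens that keep the output viable, so nothing ever has to be undone."""
    return [tok for tok in vocab if viable_prefix(prefix + tok)]


if __name__ == "__main__":
    print(allowed_next("aabb", ["a", "b", "c", "bc", "cc"]))
```

The check runs once per candidate token per step, so the cost scales with vocabulary size times the cost of the predicate. That is where a fully general constraint can get slow compared with a precompiled grammar.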
But getting that grammar right, and making it work with the correct Jinja template so that thinking mode is enabled and the thinking is parsed back out, was a lot more work than I expected.
It's something that is implemented by the thing that runs the model (e.g. llama.cpp) rather than by the model itself.
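Roughly, the runtime masks out the tokens the constraint forbids before it picks the next one. A toy sketch of one greedy decoding step, assuming made-up logits and a hypothetical `is_viable` check (this is not llama.cpp's actual code or API):

```python
import math


def constrained_greedy_step(logits, vocab, prefix, is_viable):
    """Pick the highest-scoring token that keeps prefix + token valid."""
    # Forbidden tokens get -inf so they can never win the argmax.
    masked = [
        score if is_viable(prefix + tok) else -math.inf
        for tok, score in zip(vocab, logits)
    ]
    best = max(range(len(vocab)), key=masked.__getitem__)
    if masked[best] == -math.inf:
        raise ValueError("no token keeps the output valid")
    return vocab[best]


if __name__ == "__main__":
    vocab = ["yes", "no", "maybe", "</s>"]
    logits = [1.0, 0.5, 3.0, 0.0]  # toy scores; the model "prefers" maybe

    # Toy constraint: the output must be a prefix of "yes" or "no".
    def is_viable(text):
        return any(option.startswith(text) for option in ("yes", "no"))

    print(constrained_greedy_step(logits, vocab, "", is_viable))
```

The model still produces its usual scores; only the selection step changes, which is why the same model can be constrained or unconstrained depending on what runs it.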
Note that it is hard to make this work if you turn thinking on, because the grammar gets complicated quickly (I don't recall whether Qwen 0.6B can do thinking).