It can. It's something that is implemented by the thing that runs the model - eg...

nextaccountic · 2026-06-23T03:43:26 1782186206

Just one question. If I'm running a local model, can I do something other than just a context-free grammar? Does it make sense to have something more general, or would it just be too slow?

I guess the only hard constraint is to not have backtracking, right? So that previously emitted tokens aren't wasted.
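
To make the question concrete, here is a minimal sketch (not llama.cpp's API) of constrained decoding driven by an arbitrary prefix checker instead of a grammar. The language a^n b^n c^n is not context-free, yet it can be enforced token by token without backtracking, because the checker only admits tokens that keep the prefix completable. `allowed_tokens`, `fake_scores`, `decode`, the `max_n` cap, and the four-token vocabulary are all hypothetical stand-ins for a real model and tokenizer.

```python
import random

VOCAB = ["a", "b", "c", "<eos>"]  # toy vocabulary (assumption)

def allowed_tokens(prefix, max_n):
    """Tokens that keep `prefix` extendable to some a^n b^n c^n, 1 <= n <= max_n."""
    na, nb, nc = prefix.count("a"), prefix.count("b"), prefix.count("c")
    if nb == 0:                       # still inside the run of a's
        legal = set()
        if na < max_n:
            legal.add("a")
        if na > 0:
            legal.add("b")
        return legal
    if nc == 0 and nb < na:           # b's must catch up with a's
        return {"b"}
    if nc < na:                       # then c's must catch up as well
        return {"c"}
    return {"<eos>"}                  # counts match: only stopping is legal

def fake_scores(rng):
    """Stand-in for model logits: one random score per vocabulary entry."""
    return {tok: rng.random() for tok in VOCAB}

def decode(seed=0, max_n=4):
    rng = random.Random(seed)
    out = []
    while True:
        scores = fake_scores(rng)
        legal = allowed_tokens(out, max_n)
        tok = max(legal, key=lambda t: scores[t])  # mask first, then greedy pick
        if tok == "<eos>":
            return "".join(out)
        out.append(tok)
```

The per-step cost here is whatever the checker costs, so how general the constraint can be is mostly a question of how cheaply "can this prefix still be completed?" can be answered for each candidate token.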

aesthesia · 2026-06-22T05:42:48 1782106968

Thinking shouldn't be too hard to deal with: just let the model generate freely until it hits a </think> token, then do constrained decoding, right?
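
A rough sketch of that two-phase idea, with a toy model standing in for the real one. Everything named here (`fake_scores`, `answer_mask`, `generate`, the vocabulary, the `max_think` budget) is hypothetical and not taken from any inference engine. As an extra simplification, the toy bars `<eos>` during thinking and forces `</think>` once the budget runs out, so every run reaches the constrained phase.

```python
import random

THINK_END = "</think>"
VOCAB = ["hmm", "maybe", THINK_END, "yes", "no", "<eos>"]  # toy vocabulary

def answer_mask(answer_so_far):
    """Hypothetical constraint: the answer is exactly one of yes/no, then stop."""
    return {"yes", "no"} if not answer_so_far else {"<eos>"}

def fake_scores(rng):
    """Stand-in for model logits: one random score per vocabulary entry."""
    return {tok: rng.random() for tok in VOCAB}

def generate(seed=0, max_think=20):
    rng = random.Random(seed)
    thinking, answer = [], []
    in_answer = False
    while True:
        scores = fake_scores(rng)
        if in_answer:
            legal = answer_mask(answer)        # phase 2: constrained decoding
        elif len(thinking) >= max_think:
            legal = {THINK_END}                # budget spent: close the thinking
        else:
            legal = set(VOCAB) - {"<eos>"}     # phase 1: (almost) free generation
        tok = max(legal, key=lambda t: scores[t])
        if tok == "<eos>":
            return thinking, answer
        if tok == THINK_END:
            in_answer = True                   # switch phases, emit nothing
        elif in_answer:
            answer.append(tok)
        else:
            thinking.append(tok)
```

The mask is only consulted after the `</think>` token appears, so the thinking text can contain anything, including tokens that would be illegal in the answer.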

stymaar · 2026-06-22T08:03:54 1782115434

Sure, but does llama.cpp support that?

nl · 2026-06-23T00:22:01 1782174121

It does, and this is how I did it.

But getting that grammar right, and making it work with the right Jinja template so that thinking mode is enabled and the thinking is parsed out, was a lot more work than I expected.