“Many developers use the same context repeatedly across multiple API calls when building AI applications, like when making edits to a codebase or having long, multi-turn conversations with a chatbot,” OpenAI explained, adding that the rationale is to reduce token consumption when sending a request to the LLM.
What this means is that when a new request comes in, the model checks whether part of the incoming prompt matches a previously cached prefix. If it does, that cached portion is reused; otherwise, the full request is processed from scratch.
OpenAI’s new prompt caching capability works on the same fundamental principle, which can help developers save on cost and time.
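As a rough illustration, the sketch below assumes the official OpenAI Python SDK and a long, static system prompt that is repeated verbatim across calls; the model name, prompt text, and token counts are placeholders, and the `cached_tokens` usage field reflects the API as documented around the time of the announcement.

```python
# Minimal sketch of prompt caching with the OpenAI Python SDK (assumptions:
# openai>=1.x installed, OPENAI_API_KEY set, and the shared prefix long
# enough to be eligible for caching).
from openai import OpenAI

client = OpenAI()

# A long, static system prompt reused verbatim across requests.
# In practice this might be a codebase summary or detailed tool instructions.
LONG_SYSTEM_PROMPT = "You are a coding assistant. " + ("Reference material... " * 200)

def ask(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": LONG_SYSTEM_PROMPT},  # shared, cacheable prefix
            {"role": "user", "content": question},              # varies per request
        ],
    )
    usage = response.usage
    # cached_tokens reports how much of the prompt was served from the cache
    # (field may be absent on older SDK versions, hence the defensive getattr).
    details = getattr(usage, "prompt_tokens_details", None)
    cached = getattr(details, "cached_tokens", 0) or 0
    print(f"prompt tokens: {usage.prompt_tokens}, served from cache: {cached}")
    return response.choices[0].message.content

# The first call populates the cache; the second reuses the shared prefix,
# so it should report a nonzero cached_tokens count.
ask("Rename the function `parse_config` to `load_config`.")
ask("Add a docstring to the `load_config` function.")
```

Keeping the static context at the start of the message list and appending only the varying user turn at the end maximises the shared prefix that can be reused across calls.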