Rise of Generative AI
Generative AI (GenAI) has gained significant prominence in the last couple of years. This technological breakthrough has made it possible to integrate innovative use cases into applications across domains. Organizations of all sizes, small and large, have started exploring, experimenting with, and adopting GenAI. Tools, frameworks, and platforms are evolving rapidly and allow application developers to easily integrate GenAI capabilities into their existing applications or to build new, innovative applications around GenAI. There are many options available: GenAI capabilities offered as services, foundation models, open-source models, and model provider platforms that host models.
Despite this favorable environment for accelerating GenAI adoption, organizations consistently face one challenge when working with GenAI. GenAI models are fundamentally next-token generators, so their output is inherently probabilistic and non-deterministic. Other challenges include foundation models lacking certain domain-specific data or being too generic to consistently produce the desired responses. This unpredictability can significantly hinder the adoption of GenAI in enterprise use cases. In the sections below, we look at the various solutions available for obtaining more suitable and reliable responses from GenAI models. We also discuss the scenarios where each solution is applicable, as well as the pros and cons of each approach.
Achieving desired results from GenAI models
The following are the different solutions that enterprise organizations can adopt to achieve desired results from GenAI models.
Prompt Engineering
In the GenAI ecosystem, a prompt is the query or input that the user gives to the model. The model processes the prompt and generates an output, which is the response sent back to the user. Prompting refers to this cycle of giving an input to the model and getting an output back.
Beyond a basic user query, the prompt can take more advanced forms. The user can apply additional techniques while crafting the prompt; this set of techniques is commonly referred to as prompt engineering. Data in the prompt can be parameterized so that prompts can be reused across scenarios. Information can be included in the prompt to instruct the model to respond in a certain manner. The prompt can also include instructions to avoid certain undesired outputs, conditions the response must meet, or guidelines to follow when generating the response.
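As a minimal sketch of prompt parameterization, a reusable template can be filled in with scenario-specific values. The instruction wording and the variable names (`tone`, `question`) below are illustrative, not taken from any particular framework:

```python
from string import Template

# A reusable prompt template with instructions and placeholders.
# The wording and variables are illustrative, not from a specific library.
SUPPORT_PROMPT = Template(
    "You are a customer-support assistant. Respond in a $tone tone.\n"
    "Do not speculate about pricing or legal matters.\n"
    "Question: $question"
)

def build_prompt(question, tone="friendly"):
    """Fill the template so the same prompt structure is reused per query."""
    return SUPPORT_PROMPT.substitute(question=question, tone=tone)

print(build_prompt("How do I reset my password?"))
```

Frameworks such as LangChain offer a similar `PromptTemplate` construct that serves the same purpose with added conveniences.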
- Zero-shot prompting
This is the default and most basic mechanism for interacting with models. The prompt contains the input alone, with no examples of what the model's response should look like.
- Few-shot prompting
In few-shot prompting, along with the input, the prompt also contains one or more crafted examples of the desired output for a given input. These input-output pairs serve as a reference for the model while it generates output for the current input. This technique helps steer the model's behavior and output format simply by providing examples in the input context.
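The contrast between zero-shot and few-shot prompts comes down to how the input context is assembled. In this sketch, the sentiment-classification task and the example reviews are invented for illustration:

```python
# Invented example pairs for a sentiment task; real few-shot examples
# should be representative of the inputs the model will actually see.
EXAMPLES = [
    ("The battery lasts all day, love it.", "positive"),
    ("Arrived broken and support never replied.", "negative"),
]

def zero_shot(text):
    # Input alone, no examples.
    return f"Classify the sentiment of this review:\n{text}\nSentiment:"

def few_shot(text):
    # Prepend input-output pairs so the model can imitate the pattern.
    shots = "\n".join(f"Review: {r}\nSentiment: {s}" for r, s in EXAMPLES)
    return f"{shots}\nReview: {text}\nSentiment:"

print(few_shot("Great value for the price."))
```

The few-shot version nudges the model toward one-word sentiment labels purely through the examples, without any extra instructions.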

- Chain-of-Thought (CoT) prompting
Chain-of-Thought prompting is a great way to improve a model's performance on logical, arithmetic, or reasoning tasks. The technique gets the model to break the task down into a sequence of intermediate logical steps, which increases its effectiveness at generating the right output.
(Figure: a sample model output for standard prompting vs. Chain-of-Thought prompting.)

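As an illustration, a zero-shot CoT prompt simply appends a cue that elicits intermediate reasoning before the answer. The word problem below is invented:

```python
# Invented word problem for illustration.
QUESTION = ("A shop sells pens in packs of 12. If Maya buys 3 packs "
            "and gives away 7 pens, how many does she keep?")

# Standard prompt: asks for the answer directly.
standard_prompt = f"Q: {QUESTION}\nA:"

# Zero-shot-CoT: the trailing cue elicits intermediate reasoning steps.
# A model might then produce: 3 x 12 = 36 pens; 36 - 7 = 29.
cot_prompt = f"Q: {QUESTION}\nA: Let's think step by step."

print(cot_prompt)
```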
CoT can be used along with other prompting techniques, yielding variants such as Zero-shot-CoT and Few-shot-CoT. (Figure: example Zero-shot-CoT and Few-shot-CoT prompts.)

- ReAct prompting
ReAct refers to the combination of reasoning and acting. The model performs multiple cycles of reasoning, acting, and observation to reach the final output. In this mechanism, the model is also given the ability to interact with an external environment (e.g., internet search or Wikipedia), which it uses in the acting step.
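The reason-act-observe cycle can be sketched as a loop. In this toy version, both the "model" and the search tool are hard-coded stand-ins (a real implementation would call an LLM at each step and a real external tool):

```python
def mock_search(query):
    # Stand-in for an external tool such as internet search.
    facts = {"capital of France": "Paris"}
    return facts.get(query, "no result")

def mock_model(transcript):
    # Stand-in for the LLM: a real model would generate the next
    # Thought/Action from the transcript so far.
    if "Observation:" not in transcript:
        return "Thought: I need to look this up.\nAction: search[capital of France]"
    return "Thought: I have the answer.\nFinal Answer: Paris"

def react(question, max_steps=3):
    transcript = f"Question: {question}"
    for _ in range(max_steps):
        step = mock_model(transcript)          # reason
        transcript += "\n" + step
        if "Final Answer:" in step:
            return transcript
        query = step.split("Action: search[")[1].rstrip("]")
        transcript += f"\nObservation: {mock_search(query)}"  # act + observe
    return transcript

print(react("What is the capital of France?"))
```

The transcript accumulates Thought, Action, and Observation lines until the model emits a final answer, which mirrors the trace format popularized by the ReAct paper.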

Apart from these, there are advanced techniques such as self-consistency prompting and Tree of Thoughts.
The biggest advantage of prompt engineering is that it is relatively simple, requires little specialized expertise, and can be adopted easily even by non-technical users. This has led to its widespread adoption. GenAI libraries and frameworks provide constructs like Messages and PromptTemplate that further simplify the creation and reuse of prompts.
The downside of prompting is that it does not add to the model's knowledge, and there are limits to how much a model's behavior can be changed through prompting alone.
Retrieval Augmented Generation (RAG)
The RAG approach has become significantly popular because of its ability to connect models to private enterprise data. At a high level, RAG involves the following steps:
- Vector embeddings are generated for enterprise data held in private data stores. The embeddings are stored in a vector store for querying.
- When a user sends a prompt, an embedding is generated for the query. This is matched against the vector store, and the data that semantically matches the query embedding is retrieved.
- The user prompt is augmented with the retrieved data, and both are sent together to the model as context.
- The model processes the prompt and generates the final response using the additional data provided in the context.
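The retrieve-and-augment steps above can be sketched with toy components: bag-of-words vectors stand in for learned embeddings, a Python list stands in for a vector store, and the documents are invented. A real pipeline would use an embedding model and a vector database:

```python
import math
from collections import Counter

# Invented "enterprise documents"; a list stands in for a vector store.
DOCS = [
    "Employees accrue 20 vacation days per year.",
    "The cafeteria is open from 8am to 3pm on weekdays.",
    "Expense reports must be filed within 30 days.",
]

def embed(text):
    # Toy embedding: word counts instead of a learned vector.
    return Counter(text.lower().rstrip(".").split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, k=1):
    # Rank documents by similarity to the query embedding.
    qv = embed(query)
    ranked = sorted(DOCS, key=lambda d: cosine(qv, embed(d)), reverse=True)
    return ranked[:k]

def augmented_prompt(query):
    # Augment the user prompt with the retrieved data before the model call.
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nAnswer using only the context.\nQuestion: {query}"

print(augmented_prompt("How many vacation days do employees get?"))
```

The augmented prompt, not the bare query, is what gets sent to the model, which is why the model can answer from data it was never trained on.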

The key benefit of the RAG approach is that the model is given knowledge of private enterprise data. It can generate responses based on that data without any fine-tuning or additional training. Because the data stays in the enterprise's own stores and no retraining is involved, this addresses a major enterprise concern around data security. RAG also mitigates hallucinations, where models generate fabricated output in the absence of actual data for the user query; since the prompt is augmented with data retrieved from real sources, hallucinations can be greatly reduced. RAG requires more technical expertise than prompting, but open-source frameworks like LangChain and LlamaIndex have made building RAG-based applications easier, and platforms like Google's Dialogflow and Vertex AI Agent Builder enable creating low-code and no-code RAG agents and applications. As a result, enterprises have adopted this approach for building enterprise search and conversational Q&A applications that leverage private data.
One downside of the RAG approach is that the model still does not have knowledge of the entire data store; it is only aware of what is retrieved from the vector store and added to the context. Models have limits on the number of tokens that can be sent in the context, which caps the amount of data used to generate the response. This also affects processing and costs: model API pricing depends on the number of tokens in the context, and because RAG sends more data with each call, each call costs more. Additionally, the retrieval step and the model's processing of the larger context can add latency to response generation.
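A back-of-the-envelope calculation makes the token-cost tradeoff concrete. The per-token price and token counts below are hypothetical; real pricing varies widely by provider and model:

```python
# Hypothetical price; real per-token pricing varies by provider and model.
PRICE_PER_1K_INPUT_TOKENS = 0.001  # dollars, illustrative only

def call_cost(prompt_tokens, retrieved_tokens=0):
    """Input-side cost of one model call, with or without RAG context."""
    return (prompt_tokens + retrieved_tokens) * PRICE_PER_1K_INPUT_TOKENS / 1000

plain = call_cost(prompt_tokens=200)
with_rag = call_cost(prompt_tokens=200, retrieved_tokens=3000)
print(f"plain: ${plain:.6f}, with RAG context: ${with_rag:.6f}")
```

Under these illustrative numbers the augmented call costs sixteen times the bare prompt, which is why chunking strategy and the number of retrieved passages matter in RAG deployments.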
Fine-Tuning
Fine-tuning refers to the process of taking a generic foundation model and customizing it for a specific task, behavior, or domain. It uses additional, task-specific data and follows a model training process. For example, OpenAI's ChatGPT is a fine-tuned model, trained for chat and conversation tasks on top of their foundation GPT models. Fine-tuning can involve supervised learning and reinforcement learning from human feedback (RLHF). This approach is used when an enterprise has a requirement or use case that cannot be addressed by prompting or RAG, and enough data is available to customize a model for that requirement.
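As a sketch, supervised fine-tuning data is commonly prepared as chat-style records in JSONL (one JSON object per line). The field names below follow a widely used pattern, but the exact schema depends on the platform, and the example conversations are invented:

```python
import json

# Invented chat-style fine-tuning records; check your platform's docs
# for the exact schema it expects.
records = [
    {"messages": [
        {"role": "user", "content": "What documents are needed for a claim?"},
        {"role": "assistant", "content": "A claim form and proof of loss."},
    ]},
    {"messages": [
        {"role": "user", "content": "How long does claim review take?"},
        {"role": "assistant", "content": "Reviews are completed within 10 business days."},
    ]},
]

# Serialize as JSONL, the usual upload format for tuning jobs.
jsonl = "\n".join(json.dumps(r) for r in records)
print(jsonl)
```

Hundreds to thousands of such examples, covering the target task's variations, are typically needed before a tuning job produces consistently better results than the base model.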
The benefit is that a fine-tuned model generates more consistent and better results for the tuned scenarios than a generic foundation model. Because the model is trained on specific domain data, no additional data needs to be sent in the context window, so the cost per model invocation is lower than with the RAG approach.
The downside is that building a fine-tuned model that outperforms foundation models requires a good amount of data, computational resources, and technical expertise, and it involves significant costs in time and resources. Platforms like Google's Vertex AI and Amazon SageMaker AI greatly simplify the fine-tuning process and can be of great help in creating and deploying fine-tuned models.
Pre-Training
Pre-training refers to the standard training process used to create Large Language Models (LLMs) and foundation models such as Gemini from Google and GPT from OpenAI. Building foundation models requires significant expertise, huge volumes of data, and heavy compute resources, and it consumes a tremendous amount of time and money. In certain special cases, large enterprises use pre-training to build custom, domain-specific models instead of using generic ones. For example, BloombergGPT is an LLM built from scratch for finance, and Google's Med-PaLM is an LLM built specifically for the medical domain. Pre-training is adopted by enterprises that build foundation models and is usually not required for those looking to build applications on top of existing GenAI models.
Conclusion

Each of the solutions discussed has its own advantages and disadvantages and is ideal for certain scenarios. The choice of solution therefore depends on the requirements, the expected outcome, and the resources available to achieve it. Organizations should carefully weigh the available solutions against their requirements and plan their GenAI adoption strategy accordingly.