Last updated on September 20th, 2023
Last month, I stumbled upon the OpenAI APIs. Since then, I’ve explored a ton of resources and found plenty of inspiration. That’s why I decided to start with some short courses about LLMs, prompting, and building systems on top of these APIs. This post is my compilation of notes from those resources, as a first step in exploring the topic.
A brief overview of Artificial Intelligence, Machine Learning, and Deep Learning
Deep Learning focuses on the use of artificial neural networks to model and solve complex problems. These networks are inspired by the structure and function of the human brain, consisting of layers of interconnected neurons. DL learns hierarchical feature representations from raw data, which makes it well-suited for tasks such as image recognition, natural language processing, and reinforcement learning.
Machine Learning focuses on the development of algorithms that allow machines to learn from data without being explicitly programmed. ML systems use statistical techniques to analyze large datasets, identify patterns, and build mathematical models that can make predictions or decisions based on new, unseen data. The primary goal of ML is to enable machines to automatically improve their performance and understanding of certain tasks over time without human intervention.
Artificial Intelligence: AI is the broadest concept among the three. It refers to the field of computer science dedicated to creating machines that can perform tasks that would typically require human-like intelligence. These tasks include problem-solving, pattern recognition, understanding natural languages, speech recognition, and decision-making. AI systems can be classified into two types: Narrow AI (or weak AI), which is designed to perform specific tasks, and General AI (or strong AI), which aims to perform any intellectual task a human can do.
LLMs and Generative Artificial Intelligence (GenAI) are two important branches of deep learning. LLMs and GenAI systems typically use “prompts” as text input to guide the model. Prompts provide instructions and context for a specific task to achieve a desired outcome.
The basics of Large Language Models (LLMs)
To put it simply, an LLM is a type of neural network-based model that is trained on massive amounts of data to understand and generate human language. It works by predicting the next word given the context of the preceding words, using mathematical calculations and probability.
Two types of LLMs
Base LLM
The first one is the Base LLM. Basically, this model is trained on a large amount of text data to predict the next word – that is, to figure out which word is most likely to follow a given sequence.
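To make this concrete, here’s a toy sketch of the generation loop in Python. Everything in it is a stand-in, not a real LLM – an actual model computes the next-word distribution with a neural network over a vocabulary of tens of thousands of tokens:

```python
import numpy as np

# Stand-in for a real model: a hard-coded next-word distribution.
# A real LLM computes this from the full context with a neural network.
def next_word_distribution(context: list[str]) -> dict[str, float]:
    return {"that": 0.4, "who": 0.3, "in": 0.2, "and": 0.1}

def generate(prompt: str, max_words: int = 10) -> str:
    words = prompt.split()
    for _ in range(max_words):
        probs = next_word_distribution(words)
        # Sample the next word according to its predicted probability.
        next_word = np.random.choice(list(probs), p=list(probs.values()))
        words.append(str(next_word))
    return " ".join(words)

print(generate("Once upon a time there was a unicorn"))
```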
Therefore, when presented with a prompt like “Once upon a time there was a unicorn”, the model will repeatedly predict one word at a time and come up with a completion such as “that lived in a magical forest with all her unicorn friends”. A limitation of this model shows up when it is prompted with a question like “What is the capital of France?”. In that case, the model might generate something like “What is France’s largest city?” or “What is France’s population?” – likely because lists of quiz questions about France are common on the internet – whereas the expected answer is a direct response like “The capital of France is Paris.”
This is where Instruction-tuned LLMs come into play.
Instruction-Tuned LLM
Instruction-tuned LLMs have been further trained to follow instructions. So how can we make the transition from a Base LLM to an instruction-tuned LLM?
Initially, a Base LLM is trained on an extensive dataset, often involving hundreds of billions of words. The procedure can span several months and requires a powerful supercomputing system.
Then, the model undergoes further refinement through fine-tuning on a smaller dataset containing examples where the model’s output aligns with input instructions. This process helps the model learn to predict the next word when attempting to follow an instruction.
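What do those examples look like? Here’s a hypothetical sketch of a couple of records from an instruction-tuning dataset – the exact field names vary between providers and datasets:

```python
# Hypothetical instruction-tuning records. Formats differ in practice
# (e.g., prompt/completion pairs or chat-style message lists).
instruction_examples = [
    {
        "instruction": "What is the capital of France?",
        "response": "The capital of France is Paris.",
    },
    {
        "instruction": "Summarize the following text in one sentence.",
        "input": "Transformers replace recurrence with self-attention...",
        "response": "Transformers use self-attention instead of recurrence.",
    },
]
```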
To enhance the quality of the LLM’s outputs, human ratings of various LLM outputs are gathered based on criteria like helpfulness, accuracy, and safety. The LLM is then further adjusted to increase the likelihood of generating outputs that receive higher ratings. This is typically accomplished through a technique known as Reinforcement Learning from Human Feedback (RLHF).
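The human feedback is often collected as preference rankings between candidate outputs, which are then used to train a reward model. A hypothetical record might look like this:

```python
# Hypothetical preference record: a labeler ranked two candidate outputs
# for the same prompt. A reward model is trained to score the "chosen"
# output higher than the "rejected" one.
preference_example = {
    "prompt": "Explain RLHF in one sentence.",
    "chosen": "RLHF fine-tunes a model using human preference rankings "
              "so its outputs become more helpful, accurate, and safe.",
    "rejected": "RLHF is a thing with feedback.",
}
```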
Transformer Architecture
LLMs such as GPT-3.5 are trained on large amounts of data and are built on the transformer architecture.
The Transformer is a neural network model built around self-attention mechanisms. It has become popular thanks to its excellent performance in Natural Language Processing.
Look at the graphic below to learn more about transformers’ high-level information processing workflow.
The Transformer model replaces the traditional recurrent and convolutional layers in other deep learning architectures with self-attention mechanisms, which enables it to better capture long-range dependencies and relationships between words in input sequences. This makes it well-suited for tasks such as machine translation, text summarization, and question-answering.
Key components of the Transformer architecture include:
- Encoder-Decoder Architecture: A Transformer contains an encoder and a decoder. Basically, the encoder processes the input sequence and generates a contextualized representation. The decoder uses this representation along with self-attention and cross-attention mechanisms to generate the output sequence.
- Self-Attention Mechanism: computes an attention score for each word in the input sequence with respect to all the other words in the sequence. This allows the model to handle long-range dependencies and to weigh the importance of each word in the context of the others. Consider a translation task with the input sentence: “Why did the banana cross the road?” Here, the attention mechanism provides context around each item in the input sequence: rather than translating word by word starting with “why” (the first word of the sentence), the transformer tries to identify the context that gives each word in the sentence its meaning. Because self-attention relates all the words to each other at once, the encoder and decoder can also process a sequence’s tokens in parallel, which speeds up training. A minimal sketch of the computation follows this list.
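Here is that sketch: scaled dot-product attention in NumPy, the computation at the heart of self-attention. Real transformers add learned linear projections for Q, K, and V, multiple attention heads, and masking:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Core self-attention computation: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # similarity of each word to every other word
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ V               # weighted sum of value vectors

# Toy example: 4 "words", each embedded in 8 dimensions.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
# In a real transformer, Q, K, and V come from learned projections of x.
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (4, 8): a context-aware representation for each word
```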
Transformers analyze a prompt and identify related blocks of text within a context window, which acts as a dynamic memory of the conversation. The larger the context window, the more relationships the LLM can see at once. Compare, for instance, GPT-4 and GPT-4-32K: the latter can process inputs four times larger at a time, but it is slower and more expensive, so use it only when necessary.
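As a practical aside, you can check whether a prompt fits a model’s context window before sending it. A small sketch using OpenAI’s tiktoken tokenizer (assuming it is installed, and using GPT-4’s default 8,192-token window as the limit):

```python
import tiktoken

def fits_context_window(text: str, model: str = "gpt-4", limit: int = 8192) -> bool:
    """Roughly check whether a prompt fits within a model's context window."""
    enc = tiktoken.encoding_for_model(model)
    n_tokens = len(enc.encode(text))
    print(f"{n_tokens} tokens (limit {limit})")
    # Leave room for the reply: the limit covers input + output tokens.
    return n_tokens < limit

fits_context_window("Why did the banana cross the road?")
```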
To summarize, the Transformer is a neural network architecture that enables a model to process sequences of data. It achieves this by understanding the context of each word in a sentence in relation to every other word, which lets the model build a comprehensive understanding of the sentence’s structure and the meaning of its words. The network processes data sequences in parallel and uses attention layers to simulate the focus of attention in the human brain.
Prompt engineering
As mentioned earlier, our objective as prompt engineers is to help a transformer use the self-attention mechanism and the context window more effectively, making it easier for the model to identify the relevant blocks of text. To put it simply, prompt engineering is the process of crafting prompts for LLMs to guide them into generating a desired output. Before delving into the principles of prompting, let’s first look at the components of a prompt.
Prompt components
A prompt consists of several components that help language models understand the task. By breaking a prompt down into its components, you can give LLMs more precise instructions. A combined example follows the component descriptions below.
Role
The Role component tells the model who or what it should pretend to be when completing the task. It helps shape the context and style of the response.
Context
Think of the Context component as the backstory. It provides essential information about the topic at hand or any relevant background details that the LLM should consider when generating a response. However, be mindful not to overload it with unnecessary details, as this can make communication with the LLM more complex. Instead, treat it like an introductory message at the beginning of a meeting – tell it what it needs to know, and don’t overwhelm it with details.
Instruction
The Instruction component is your direct request to the LLM. It specifies the task you want the model to perform. This task can range from summarizing a text to answering a specific question. The key here is to be clear and concise in your instructions. The more precise your request, the better the chance of getting the desired result. Crafting a prompt is more art than science, so don’t be afraid to experiment with different instructions until you achieve the outcome you want.
Data
The Data component includes any relevant data or information that the LLM should consider when generating a response. This could include specific keywords or phrases, documents, or other relevant information. Be cautious not to overwhelm the model with too much data to process.
Output
The Output component defines how you want the LLM to structure its response. It specifies the format and layout, which could include particular formatting requirements.
Example
The Example component is like a blueprint. It provides a sample response to show the LLM what you expect. It ensures the model understands the task and helps it generate the desired response.
Constraints
The Constraints component sets the boundaries within which the LLM should work when crafting its response. This might include length limits, tone requirements, or any other relevant limitations. It’s where you narrow the response to align with your requirements.
Evaluation Criteria
The Evaluation Criteria define the standards for accuracy, specificity, or any other factors relevant to evaluating the response.
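To tie the components together, here’s a hypothetical prompt that uses most of them. The labels and layout are just one possible convention, not a required syntax:

```python
prompt = """
Role: You are an experienced technical editor.

Context: I run a personal blog about machine learning for beginners.

Instruction: Summarize the article below in three bullet points.

Data:
<article text goes here>

Output: A markdown list with exactly three bullets, one sentence each.

Example:
- Transformers process words in parallel using self-attention.

Constraints: Keep the summary under 50 words and avoid jargon.

Evaluation criteria: Each bullet must be factually supported by the article.
"""
```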
Prompting parameters
Prompt parameters are configurable settings that determine how a language model interacts with your prompt. By adjusting these parameters, you can optimize the performance of an LLM, influence the content it generates, and manage its behaviour according to your requirements.
Temperature
The Temperature [0–2 in the OpenAI API] – controls the creativity and randomness of the model’s output (a higher value makes it more creative and random; a lower value makes it more focused and deterministic)
Temperature is like the spice level in a dish. A low temperature (mild spice) produces a more predictable and consistent flavour, while a high temperature (extra spicy) leads to a more diverse and adventurous taste experience.
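Under the hood, temperature rescales the model’s logits before the softmax turns them into probabilities. A minimal NumPy sketch of the effect:

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    scaled = np.array(logits) / temperature
    exp = np.exp(scaled - scaled.max())  # subtract max for numerical stability
    return exp / exp.sum()

logits = [2.0, 1.0, 0.5]
print(softmax_with_temperature(logits, 0.2))  # low temp: sharply peaked, near-deterministic
print(softmax_with_temperature(logits, 1.0))  # high temp: flatter, more random choices
```

(A temperature of exactly 0 would divide by zero; in practice it is typically treated as greedy decoding – always pick the most likely token.)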
Top P
The Top P [0–1] – controls the scope of randomness (a higher value means a larger set of tokens is considered, introducing more randomness and variety; a lower value restricts generation to a more focused and narrow set of tokens)
This parameter is like the variety of ingredients available when making a meal. A higher Top P means you have more items to choose from, leading to more possible recipes and unexpected combinations. A lower Top P narrows your choices, producing more focused and specific dishes.
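Top P is also known as nucleus sampling: keep only the smallest set of tokens whose cumulative probability reaches P, and sample from that set. A minimal NumPy sketch:

```python
import numpy as np

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = np.argsort(probs)[::-1]              # token indices, most to least likely
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1  # number of tokens to keep
    kept = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[kept] = probs[kept]
    return filtered / filtered.sum()             # renormalize over the kept tokens

probs = np.array([0.5, 0.3, 0.1, 0.05, 0.05])
print(top_p_filter(probs, p=0.8))  # only the two most likely tokens survive
```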
Stop Sequences
The stop_sequences [list of strings] parameter is a list of strings or tokens that, when encountered by the model, will cause it to stop generating further text. This helps control the length of the generated content and prevents the model from producing unwanted text.
Frequency Penalty
The frequency_penalty [-2 to 2] controls how strongly the model is discouraged from reusing tokens that already appear frequently in the generated text.
A positive value penalizes tokens in proportion to how often they have already appeared, making the model more likely to choose new or less common words in its output. A negative value does the opposite: it makes the model favour words it has already used frequently.
Think of telling stories to friends. If you constantly repeat the same anecdotes, the frequency penalty would remind you to share new, less frequently told stories to keep your listeners engaged.
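Conceptually, the penalty is subtracted from each token’s logit in proportion to how many times that token has already appeared, before the next token is sampled. A simplified sketch of the idea:

```python
import numpy as np

def apply_frequency_penalty(logits, counts, alpha=0.5):
    """Lower each token's logit in proportion to how often it already appeared."""
    return np.array(logits) - alpha * np.array(counts)

logits = np.array([2.0, 1.5, 1.0])  # scores for tokens A, B, C
counts = np.array([3, 0, 1])        # A appeared 3 times, C once, B never
print(apply_frequency_penalty(logits, counts))  # A is now penalized the most
```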
Presence Penalty
The presence_penalty [-2 to 2] controls whether the model is encouraged to introduce new tokens or to reuse ones that have already appeared in the output.
A positive value imposes a one-time penalty on any token that has already appeared in the current context, regardless of how often; this encourages the model to move on to new, non-redundant content. A negative value makes the model more likely to repeat tokens or phrases that have already appeared, which can be useful if you want the model to emphasize certain points or ideas through repetitive language.
The presence penalty is like choosing topics for a conversation based on what has already been discussed. A positive presence penalty encourages discussing new topics, while a negative presence penalty promotes revisiting previous topics.
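All of these parameters come together in a single API call. Here’s a sketch using the openai Python package (the pre-1.0 interface current at the time of writing); note that the OpenAI API names the stop-sequence parameter `stop`, and all values below are placeholders:

```python
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a short story about a unicorn."},
    ],
    temperature=0.7,        # creativity vs. determinism
    top_p=0.9,              # nucleus sampling cutoff
    stop=["THE END"],       # stop generating when this string appears
    frequency_penalty=0.5,  # discourage repeating frequent tokens
    presence_penalty=0.3,   # encourage introducing new tokens and topics
    max_tokens=200,         # cap the length of the reply
)
print(response["choices"][0]["message"]["content"])
```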
Principles of Prompting
This part contains my notes from this course.
Please refer to the OpenAI documentation and my GitHub repo and try it out for yourself.
References
- From ML Engineering to Prompt Engineering
- What are transformers
- Self-attention mechanism explanation
- AI short courses
- OpenAI cookbook
- Prompt Best practices
Next steps
This post marks my first step into the prompts world. In future posts, I’ll try out OpenAI APIs to build a Telegram chatbot.
Stay tuned! It’s going to be an exciting journey!