Large Language Models

Q) What is a context window?

The context window of a generative AI model refers to the maximum amount of information the model can consider at once when generating a response. It is usually measured in tokens, the sub-word chunks into which words and punctuation are broken.

Examples:

  1. Short Context Window (e.g., 512 tokens)
    • Imagine you’re writing a story, but the model can only “remember” the last 512 tokens (roughly 300-400 words). As you keep adding text, the model starts to “forget” earlier parts of the story, which could lead to inconsistencies or a lack of coherence with the plot introduced earlier.
  2. Longer Context Window (e.g., 4096 tokens)
    • Now, suppose the model can handle 4096 tokens (about 2500-3000 words). In this case, it could keep track of more details from the start of a conversation, a complex story, or a long document, allowing it to generate content that maintains a consistent narrative or answers questions based on previous information.
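
To make those token counts concrete, here is a minimal sketch using the tiktoken tokenizer (an assumption: tiktoken covers OpenAI models, while other model families ship their own tokenizers):

import tiktoken  # pip install tiktoken

# cl100k_base is the encoding used by many recent OpenAI models
enc = tiktoken.get_encoding("cl100k_base")

text = "The context window limits how much text the model can attend to at once."
print(len(enc.encode(text)), "tokens")  # prompt + response together must fit in the window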

Real-World Use:

  • Chat Applications: In a long conversation with a chat model, a larger context window allows the model to recall earlier messages, making its responses relevant and coherent even after several exchanges.
  • Document Summarization: For summarizing a long article, a model with a longer context window can process more of the content at once, producing a more accurate summary.

A larger context window improves the model’s ability to maintain focus and accuracy, but it also requires more computational power.


Q) What is RAG?

RAG (Retrieval-Augmented Generation) is a technique that pairs an LLM with an external knowledge source: before the model generates an answer, the system retrieves relevant documents (typically via a vector search) and inserts them into the prompt as context. RAG is generally superior for surfacing factual information that is not present in the LLM’s training data or that is private, because it dynamically integrates external knowledge without modifying the model’s weights. Fine-tuning, on the other hand, is more suitable for teaching the model specialized tasks or adapting it to a specific domain.
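
Below is a toy sketch of the RAG flow in Python. The documents and the keyword-overlap retriever are purely illustrative stand-ins for the embedding-based vector search a real system would use:

# Toy RAG flow: retrieve the most relevant document, then prepend it to the prompt.
docs = [
    "Our refund policy allows returns within 30 days.",
    "The office is closed on public holidays.",
]

def retrieve(query, docs):
    # Naive keyword-overlap scoring; real systems rank by embedding similarity.
    overlap = lambda d: len(set(query.lower().split()) & set(d.lower().split()))
    return max(docs, key=overlap)

query = "When is the office closed?"
context = retrieve(query, docs)
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(prompt)  # this augmented prompt is what actually gets sent to the LLM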


Q) What’s the temperature parameter of Large Language Models?

The Temperature parameter acts like a “Creativity Knob” for the model. It controls how much randomness is introduced when the model selects the next word in a sentence.

When an LLM predicts the next word, it assigns a probability to every possible word in its vocabulary. Temperature adjusts these probabilities before the final choice is made.
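
To see what that adjustment looks like numerically, here is a small worked sketch of temperature-scaled softmax over a made-up three-word vocabulary (the logit values are illustrative):

import math

logits = {"blue": 4.0, "clear": 2.5, "an": 1.0}  # toy next-word scores

def softmax_with_temperature(logits, T):
    scaled = {w: v / T for w, v in logits.items()}   # divide each logit by the temperature
    z = sum(math.exp(v) for v in scaled.values())    # normalizing constant
    return {w: math.exp(v) / z for w, v in scaled.items()}

print(softmax_with_temperature(logits, 0.2))  # sharply peaked: "blue" dominates
print(softmax_with_temperature(logits, 1.5))  # flatter curve: rarer words gain probability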

  • Low Temperature (0.0 – 0.3)
    • Behavior: The model becomes highly deterministic and focused. It almost always picks the most likely word.
    • Best for: Coding, factual answers, logic puzzles, or tasks requiring consistency.
    • Analogy: A strict accountant who only follows the rulebook.
  • High Temperature (0.7 – 1.0+)
    • Behavior: The model becomes more random and diverse. It flattens the probability curve, giving less common words a chance to be picked.
    • Best for: Creative writing, brainstorming, poetry, or generating unique ideas.
    • Analogy: An improvisational jazz musician trying new riffs.

Example: Completing the phrase “The sky is…”

  Temperature   Likely Output                      Why?
  Low (0.1)     “blue.”                            It picks the statistically highest-probability word.
  High (0.9)    “an infinite canvas of violet.”    It risks picking lower-probability words for flair.

Note: If the temperature is too high (e.g., 2.0), the model often generates nonsense or gibberish because it starts picking completely unrelated words.
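
For example, with LangChain’s ChatOpenAI wrapper (assuming the langchain-openai package is installed), the temperature can be pinned when the model is constructed: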

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model_name="gpt-4o-mini", temperature=0)  # temperature=0 for deterministic, repeatable output