Substack

How Meta Plans To Crank the Dopamine Machine With Infinite AI-Generated Content - The Algorithmic Bridge [Link]

This article discusses AI’s most dangerous potential - its ability to manipulate and addict humans through hyper-targeted entertainment. This trend, spearheaded by companies like Meta, risks reshaping human cognition and agency, raising existential questions about freedom, pleasure, and the future of society.

One good point is that the real "killer robots" are brain-hacking entertainment. A very plausible dystopia involves technology (e.g., AI-driven entertainment) manipulating human attention and cognition for profit. Traditional TV was a prototype of mental manipulation but lacked personalization. Current platforms such as Netflix and TikTok use algorithms to cater to preferences but still feel limited. Future AI will create hyper-personalized content tailored to individual preferences in real time, exploiting human psychology. Meta’s generative AI plans are the next step toward addictive, manipulative entertainment. Meta announced that AI content creators will be designed to enhance engagement on platforms like Facebook and Instagram. Connor Hayes, Meta’s VP for generative AI, explained how AI accounts will create and share engaging content.

The Five Stages of AGI Grief - Marcus on AI [Link]

Marcus uses the framework of the Kübler-Ross model of grief to describe the emotional responses people are having (or will likely have) to the ongoing developments in Artificial General Intelligence (AGI). He argues that many people are not yet facing the reality of AGI and are likely to go through similar stages of grief as it gets closer.

  1. Denial: Many people, including some experts, are still in denial about the possibility and speed of AGI development. They dismiss the progress, underestimate its potential, or claim it's decades away.
  2. Anger: Once denial fades, anger emerges, often directed at those perceived as enabling or hyping AGI. This can be targeted at AI researchers, tech companies, or even the technology itself.
  3. Bargaining: In this stage, people try to find ways to control or mitigate AGI, often through unrealistic expectations or proposed solutions.
  4. Depression: As bargaining fails, a sense of profound unease and hopelessness may set in. This is the realization that AGI could fundamentally change society in ways that are difficult to predict or control, leading to feelings of powerlessness.
  5. Acceptance: This is the final stage, where people begin to accept the reality of AGI and its potential impact. This isn't necessarily cheerful, but it's characterized by a shift from denial and fear towards a more realistic view.

The Taiwan posts - Noahpinion [Link]

Disney Paid Off Trump for a Reason - BIG by Matt Stoller [Link]

Fubo, a sports streaming service, had previously won a preliminary injunction against a joint venture between Disney, Fox, and Warner Bros, arguing that the venture was an illegal merger. However, Fubo's stock wasn't performing well, leading Fubo CEO David Gandler to sell a controlling stake in his company to Disney.

Here are the rationales behind this decision, according to the sources:

  • Fubo's CEO, David Gandler, profited from winning an antitrust suit and then joined forces with one of the very corporations he had been fighting. Instead of remaining an underdog, Fubo will now have Disney's resources, while its leaders imagine it will operate somewhat independently.
  • Disney paid $16 million to settle a defamation suit brought by Trump, a settlement many legal analysts consider questionable, in order to gain credibility with Trump and ensure that government enforcers would not interfere with the deal.
  • Fubo's leaders may be ignoring the risks involved in the merger. They are potentially exhibiting a kind of "malevolent naivete" and airbrushing away their own violation of the law.

The sources suggest that Fubo's leadership may not be considering some of the risks associated with mergers. Mergers carry significant risk, and they can fall apart for a variety of reasons. During the 18-24 months that it takes to clear financing and regulatory hurdles, a company under contract to be sold cannot make significant strategic decisions or investments, while the purchaser can do whatever they want. If the deal falls apart, the company that was to be sold could be in a significantly worse position.

The sources point out that another private litigant could take Fubo's place and sue Disney, Fox, and Warner Bros using the same legal arguments that won Fubo its preliminary injunction; that precedent stands even though Fubo is now under Disney's control. This is evidenced by a letter EchoStar sent to the court stating that it is considering suing along the same lines as Fubo. That may not matter to Disney, since it now controls Fubo, but it should concern Fubo's leadership team, who have essentially bet their company on a violation of the law.

Here's why this could be problematic for Fubo but not necessarily for Disney:

  • Fubo is in a vulnerable position due to the merger agreement. While the deal is pending, Fubo is restricted in its strategic decision-making and investments, effectively putting the company in "limbo". This means Fubo cannot make significant moves to respond to a new lawsuit.
  • Disney, as the purchaser, is not similarly restricted. They can continue to operate as they see fit. They have the resources to handle a new legal challenge.
  • If the merger fails, Fubo will have wasted 18-24 months with the potential for no significant strategic moves. It could end up in a weakened state compared to competitors who were not in a merger process. The company might even become "a limping and probably dead company". Failed mergers can also lead to leadership changes, such as the CEO getting fired.
  • Disney has already taken steps to ensure the deal's success, including a payment to gain credibility with the current administration. While another lawsuit could present a challenge, Disney has the resources and political connections to navigate it, which Fubo does not.
  • The incentive to complete the deal is different for Disney and Fubo. Disney will remain a major player regardless of the deal's outcome. However, Fubo's future is heavily dependent on the merger. This makes Fubo more vulnerable if the deal is challenged.

The rise and fall of "fact-checking" - Silver Bulletin [Link]

The main opinion of this article is that Meta's decision to replace fact-checkers with a community notes system is justifiable because fact-checkers have been politically biased and have not effectively addressed the issue of misinformation.

While the author agrees with Zuckerberg's decision, they also acknowledge that Zuckerberg's motivations may not be high-minded, but rather driven by political pressure and business incentives. Despite that, the author thinks the move is "pointing in the right direction," and agrees with Zuckerberg's claim that fact-checkers have been too politically biased. The author also admits their own biases and that Community Notes is a new program that might also have problems.

US Banks: Profits Surge - App Economy Insights [Link]

CES 2025: AI Takes Over - App Economy Insights [Link]

a16z's big ideas in tech for 2025 - ben lang's notes [Link]

Andreessen Horowitz’s list of big ideas in tech for 2025:

[Image: a16z's big ideas in tech for 2025]

How AI-assisted coding will change software engineering: hard truths - The Pragmatic Engineer [Link]

Great article!

This "70% problem" suggests that current AI coding tools are best viewed as:

  • Prototyping accelerators for experienced developers
  • Learning aids for those committed to understanding development
  • MVP generators for validating ideas quickly

Current tools mostly wait for our commands. But look at newer features like Anthropic's computer use in Claude, or Cline's ability to automatically launch browsers and run tests. These aren't just glorified autocomplete - they're actually understanding tasks and taking initiative to solve problems.

Think about debugging: Instead of just suggesting fixes, these agents can:

  • Proactively identify potential issues
  • Launch and run test suites
  • Inspect UI elements and capture screenshots
  • Propose and implement fixes
  • Validate the solutions work (this could be a big deal)

― The 70% problem: Hard truths about AI-assisted coding - Elevate [Link]

Great pragmatic article! And it's well said in the end: "Software quality was (perhaps) never primarily limited by coding speed... The goal isn't to write more code faster. It's to build better software."

AI tools help experienced developers more than beginners, much as AI helps top biologists more than average ones; the results and efficiency of AI usage depend on the user's domain expertise. The article calls this the 'knowledge paradox'. AI can get the first 70% of a job done quickly, but effort on the final 30% yields diminishing returns - the 'AI learning curve paradox'.
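
To make the propose-act-validate loop described above concrete, here is a minimal sketch (my own illustration, not code from either article) of an agentic debug cycle: run the tests, ask a model for a patch when they fail, apply it, and re-run to validate. `ask_llm` and `apply_patch` are hypothetical placeholders for whatever model API and patching mechanism you use.

```python
import subprocess

def run_tests() -> tuple[bool, str]:
    """Run the project's test suite and return (passed, combined output)."""
    result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr

def ask_llm(prompt: str) -> str:
    """Hypothetical placeholder: call your preferred model and return a unified diff."""
    raise NotImplementedError("wire this to your LLM provider")

def apply_patch(diff: str) -> None:
    """Apply the model-proposed diff to the working tree."""
    subprocess.run(["git", "apply", "-"], input=diff, text=True, check=True)

def agentic_debug(max_iterations: int = 3) -> bool:
    for _ in range(max_iterations):
        passed, output = run_tests()
        if passed:
            return True  # validated: the suite is green
        diff = ask_llm("Tests are failing. Propose a minimal fix as a unified diff:\n" + output)
        apply_patch(diff)  # act, then loop back to re-validate
    return False
```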

o1 isn’t a chat model (and that’s the point) - Latent Space [Link]

  • Provide Extensive Context: Give 10x more context than you think is necessary. This includes details about previous attempts, database schemas, and company-specific information. Think of o1 like a new hire that needs all the relevant information to understand the task. Put the context at the end of your prompt.

    Use tools like voice memos to capture context and paste transcripts. You can also save reusable segments of context for future use. AI assistants within other products can help extract context.

  • Focus on the Desired Output: Instead of telling o1 how to answer, clearly describe what you want the output to be. Let o1 plan and resolve its own steps, leveraging its autonomous reasoning.

  • Define Clear Evaluation Criteria: Develop specific criteria for what constitutes a "good" output so that o1 can evaluate its own output and improve. This moves the LLM-as-Judge into the prompt itself. Ask for one specific output per prompt.

  • Be Explicit About Output Format: o1 often defaults to a report-style output with numbered headings. Be clear if you need complete files or other specific formats.

  • Manage Context and Expect Latency: Since o1 is not a chat model, it will not respond in real time - treat it more like email than instant messaging. Make sure you can manage and see the context you are providing to the model. o1 is better suited to high-latency, long-running tasks.
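
Putting these tips together, here is a small sketch (my own, not from the article) of assembling a report-style brief for o1: state the desired output, give evaluation criteria and an explicit format, and dump the extensive context at the end. The example content is invented for illustration.

```python
def build_o1_brief(goal: str, criteria: list[str], output_format: str, context: str) -> str:
    """Assemble a single prompt: goal first, criteria and format next, context last."""
    criteria_block = "\n".join(f"- {c}" for c in criteria)
    return (
        f"Goal (what I want, not how to get it):\n{goal}\n\n"
        f"A good answer must satisfy these criteria:\n{criteria_block}\n\n"
        f"Output format:\n{output_format}\n\n"
        f"Context (everything potentially relevant, placed at the end):\n{context}"
    )

prompt = build_o1_brief(
    goal="Propose a migration plan from our monolith to services.",
    criteria=["Names concrete milestones", "Flags risky data migrations", "Fits a two-quarter timeline"],
    output_format="One complete markdown document with numbered sections.",
    context="<< paste schemas, prior attempts, constraints, voice-memo transcripts here >>",
)
```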

The Deep Roots of DeepSeek: How It All Began - Recode China AI [Link]

Liang's vision, from his first public interview in May 2023:

AI Development:

  • Liang aims to build AGI, not just improve existing models like ChatGPT.
  • He prioritizes deep research over quick applications, requiring more resources.
  • He sees AI as a way to test ideas about human intelligence, like whether language is key to thought.
  • He plans to share DeepSeek’s results publicly to keep AI accessible and affordable.

Company Culture & Innovation:

  • He hires based on ability, creativity, and passion, preferring fresh graduates for key roles.
  • Employees should have freedom to explore and learn from mistakes.
  • Innovation can't be forced or taught.
  • A shared pace and curiosity drive the team, not strict rules or KPIs.

Competition:

  • Startups can still challenge big companies since AI tech is evolving.
  • No one has a clear lead in AI yet.
  • LLM applications will become easier, creating startup opportunities for decades.
  • AI believers stay in for the long run.
  • Unconventional approaches can be a game-changer.

Resources & Funding:

  • Securing GPUs and a strong engineering team is crucial.
  • Traditional VC funding may not fit DeepSeek’s research-heavy approach.
  • Innovation is costly, and some waste is inevitable.
  • GPUs are a solid investment as they hold value.

Is DeepSeek the new DeepMind? - AI Supremacy [Link]

Implications for the AI Industry:

  • DeepSeek's emergence challenges the dominance of Western AI firms like Google DeepMind, Meta, and OpenAI. The success of DeepSeek suggests that open-source models can outperform proprietary ones. It also calls into question the massive spending on AI infrastructure by Big Tech companies.
  • Its cost-effectiveness is causing enterprises to rethink their AI strategies. The availability of high-performing, cheaper models could disrupt the business model of companies that rely on expensive, proprietary models.
  • Its achievements indicate that China is becoming a leader in AI, particularly in inference-time compute and compute efficiency. This development raises concerns about America's shrinking lead in artificial intelligence.
  • Its open-source approach is seen as essential to keeping AI inclusive and accessible. The ability to run powerful models on a laptop could decentralize AI development and reduce reliance on Big Tech.

Arguments about US vs. China in AI:

  • The article suggests that the U.S. is losing its lead in AI innovation due to its focus on "Tycoon capitalism" and protectionist policies. The U.S. government's export controls on semiconductors, while intended to slow China's progress, may be inadvertently fueling China's self-reliance and innovation.
  • China has advantages in areas such as manufacturing, go-to-market strategies, talent (STEM programs and ML researchers), and patents. China's progress in various overlapping industries creates a "mutually reinforcing feedback loop". The article implies that DeepSeek's culture of empowering workers with autonomy and collaboration stands in sharp contrast to the grueling work schedules, rigid hierarchies, and internal competition common in Chinese tech firms.
  • The article criticizes the massive AI infrastructure projects in the U.S. (dubbed "Project Oracle") as a scheme by the financial elite to control the future of AI. The author argues that these projects prioritize the interests of Big Tech and the financial elite over those of regular citizens and that these AI infrastructure projects are primarily intended to redistribute wealth globally to the elite.

Concerns about AI's Impact:

  • The author acknowledges concerns that AI could lead to wage deflation, particularly in white-collar jobs where AI can automate tasks.
  • It questions the assumption that AI will create more jobs than it displaces, noting that AI coding tools could negatively impact software engineers.
  • It also raises concerns about the potential for misuse of AI, including the use of AI for "authoritarian" control and as a weapon in trade wars. There are also concerns about the potential for backdoors, Trojans, model inversion attacks, sensitive information inference, and automated social engineering via the release of attractive but cheap AI services.

Additional Info:

  • DeepSeek is an offshoot of a quantitative hedge fund, High-Flyer, and is fully funded by them.
  • It is noted for being more transparent about its methods compared to some Western AI firms.
  • Its mission is to "unravel the mystery of Artificial General Intelligence (AGI) with curiosity". They focus on open-source development, research-driven innovation, and making advanced AI accessible to all.

Monopoly Round-Up: China Embarrasses U.S. Big Tech - BIG by Matt Stoller [Link]

  • DeepSeek, a Chinese AI firm, developed cost-effective AI models that rival U.S. models and released them on an open-source basis. This is a significant accomplishment, especially since the U.S. has placed export controls that prevent China from accessing the best chips. DeepSeek's approach focused on efficiency, rather than raw computing power, which challenges the assumption that computing power is the primary competitive barrier in AI. This development is considered embarrassing and threatening to big tech and U.S. security.
  • The U.S. has heavily invested in AI, with tech giants spending billions on data centers and infrastructure, betting that these investments will provide a competitive advantage. However, DeepSeek’s success suggests that this approach may be flawed. The sources suggest that the U.S. strategy of denying top chips to China may also be ineffective.
  • The sources argue that betting on monopolistic national champions is a disastrous national security strategy. It points out that history shows that monopolies are slow to innovate. The U.S. needs to prioritize competition over protecting monopolies. The sources criticize large U.S. tech firms (Meta, Microsoft, Google, Amazon, Apple) for becoming slothful bureaucracies that are not very good at developing and deploying technology.
  • Chinese policy is noted to be more aggressive in forcing competition in some sectors. China's electric vehicle industry is cited as an example of this. The Chinese government's crackdown on its big tech firms and financial sector is also mentioned as a move that has seemingly benefited the economy by driving innovation. The success of companies like ByteDance and DeepSeek is mentioned as evidence of this.
  • The sources highlight that U.S. anti-monopoly laws take too long to take effect, using the example of the Federal Trade Commission's case against Facebook over its acquisitions of Instagram and WhatsApp. That case shows how companies like Facebook acquire and bury innovative competitors rather than compete; the argument is that if Facebook had been broken up, there would be tremendous innovation in social networking.
  • The sources express uncertainty about the future of AI, noting that it might not live up to expectations and that the competitive advantages in AI are not as straightforward as previously thought.

In a rare interview for AnYong Waves, a Chinese media outlet, DeepSeek CEO Liang Wenfeng emphasized innovation as the cornerstone of his ambitious vision:

. . . we believe the most important thing now is to participate in the global innovation wave. For many years, Chinese companies are used to others doing technological innovation, while we focused on application monetization—but this isn’t inevitable. In this wave, our starting point is not to take advantage of the opportunity to make a quick profit, but rather to reach the technical frontier and drive the development of the entire ecosystem.

― 7 Implications of DeepSeek’s Victory Over American AI Companies - The Algorithmic Bridge [Link]

"Every job is a bundle of tasks.

Every new technology wave (including the ongoing rise of Gen AI) attacks this bundle.

New technology may substitute a specific task (Automation) or it may complement a specific task (Augmentation)"

Extend this analogy far enough, and you get this:

Once technology has substituted all tasks in a job bundle, it can effectively displace the job itself.

Of course, there are limits to this logic. This can only be true for a small number of jobs, which involve task execution only.

But most jobs require a lot more than mere task execution.

They require ‘getting things done’. They require achievement of objectives, accomplishment of outcomes.

In other words, most jobs involve goal-seeking.

This is precisely why previous generations of technologies haven’t fully substituted most jobs. They chip away at tasks in the job bundle without really substituting the job entirely.

Humans have retained their seat at the table because of their ability to plan and sequence tasks together to achieve goals.

In most previous instances, technology has augmented humans far more than it has automated entire jobs away.

And that is because humans possess a unique advantage: goal-seeking.

― Slow-burn AI: When augmentation, not automation, is the real threat - Platforms, AI, and the Economics of BigTech [Link]

AI agents are the first instance of technology directly attacking and substituting goals within a role or a team.

In doing so, they directly impact power dynamics within an organization, empowering some roles and weakening others, empowering some teams and weakening others.

― How AI agents rewire the organization - Platforms, AI, and the Economics of BigTech [Link]

This is a brilliant article.

Goal-seeking, for the first time, can be performed by technology.

  1. Scope of the role: Effectively, a goal-seeking AI agent can unbundle a goal from the role. They reduce the scope of the role.
  2. Scope of the team: They displace the role entirely in a team if the team can now achieve the same goal using an AI agent.
  3. Rebundling of roles: Role B is eliminated not because its tasks were fully substituted by technology, nor because its goals were fully substituted by technology, but because the scope of the role no longer justified a separate role.
  4. Reworking power structures: Teams have voting rights on the relevance of Roles. The fewer teams speaking to a role’s contributions, the lower the negotiating power for that role within the organization.
  5. Roles unbundle, teams rebundle: this cycle of unbundling and rebundling across roles and teams is inherent to the organization of work. AI isn’t fundamentally changing goal-seeking and resource allocation. It is merely inserting itself into the organization and re-organization of work.

YouTube and Podcasts

2025 Predictions with bestie Gavin Baker - All-In Podcasts [Link]

Interesting discussions about new year predictions. Here is a summary of the predictions:

Chamath Palihapitiya:

  • Biggest Political Winner: Fiscal conservatives. He believes austerity will reveal waste and fraud in the US government and that this will spill over to state elections.
  • Biggest Political Loser: Progressivism. He predicts a repudiation of class-based identity politics in multiple Western countries.
  • Biggest Business Winner: Dollar-denominated stablecoins, which he believes will grow substantially and challenge the dominance of Visa and Mastercard.
  • Biggest Business Loser: The "MAG 7" companies will see a drawdown in absolute dollars due to high concentration in the indices. He suggests that these companies may not be able to maintain their high valuations, though they are good businesses.
  • Biggest Business Deal: The collapse of traditional auto OEMs and a wave of auto mega-mergers, triggered by Tesla's strong position.
  • Most Contrarian Belief: A banking crisis in a major mainline bank, triggered by the total indebtedness of Pax America and the impact of higher interest rates.
  • Best Performing Asset: Credit Default Swaps (CDS) as an insurance policy against a potential default event.
  • Worst Performing Asset: The software industrial complex, or large, bloated enterprise software companies.
  • Most Anticipated Trend: Small, arcane regulatory changes related to the supplementary leverage ratio that allow the US to kick the debt can down the road.
  • Most Anticipated Media: The enormity of files that will be declassified and released by the Trump administration.
  • Prediction Market: The MAG 7 representation in the S&P 500 shrinks below 30%.

David Friedberg:

  • Biggest Political Winner: Young political candidates, marking a trend of a shift towards younger leaders.
  • Biggest Political Loser: Pro-war neoconservatives. He believes they will lose out to figures like JD Vance and Elon Musk.
  • Biggest Business Winner: Autonomous hardware and robotics, citing the rise of humanoid robots and their applications.
  • Biggest Business Loser: Old defense and aerospace providers, like Boeing and Lockheed Martin. He predicts a shift towards more tech-oriented and rationalized spending in defense. He also thinks Vertical SaaS companies will struggle as AI replaces their services.
  • Biggest Business Deal: Massive funding deals for hardware-based manufacturing buildout in the United States, potentially involving government support.
  • Most Contrarian Belief: A dramatic rise in socialist movements in the United States, fueled by economic inequality and disruption from AI.
  • Best Performing Asset: Chinese tech stocks or ETFs, based on potential deals between the US and China and the strong fundamentals of Chinese tech companies.
  • Worst Performing Asset: Vertical SaaS companies again, as AI replaces their services; also legacy car companies and real estate, because of overbuilding and high debt.
  • Most Anticipated Trend: The announcement of buildout of nuclear power in the United States.
  • Most Anticipated Media: AI Video Games with dynamic story lines
  • Prediction Market: Microsoft, AWS, and Google Cloud Revenue Growth.

Gavin Baker:

  • Biggest Political Winner: Trump and centrism; also Gen X and Elder Millennials.
  • Biggest Political Loser: Putin, due to Europe rearming, which shifts US resources to the Pacific, and Trump's likely tougher stance.
  • Biggest Business Winner: Big businesses that use AI thoughtfully, and the robotics industry, as well as companies that make high bandwidth memory.
  • Biggest Business Loser: Government service providers with over 35% of their revenue coming from the US government. He also thinks enterprise application software will be hurt by AI agents.
  • Biggest Business Deal: A wave of M&A after a period of inactivity and something significant happening with Intel. Also, he thinks independent AI labs will get acquired.
  • Most Contrarian Belief: The US will experience at least one year of greater than 5% real GDP growth due to AI and deregulation. He also thinks frontier AI labs will stop releasing their leading-edge models.
  • Best Performing Asset: Companies that make high bandwidth memory (HBM).
  • Worst Performing Asset: Enterprise application software.
  • Most Anticipated Trend: AI will make more progress per quarter in 2025 than it did per year in 2023 and 2024, due to scaling performance through reasoning, pre-training, and test time compute.
  • Most Anticipated Media: Season 2 of 1923
  • Prediction Market: US Treasury Market Report on Federal Debt in December 2025 above or below $38 trillion
  • UFOs: Believes there is a 25% chance the US government is sitting on knowledge of extraterrestrial life.

Jason Calacanis:

  • Biggest Business Winner: Tesla and Google for AI and Robotics
  • Biggest Business Loser: OpenAI
  • Biggest Business Deal: Partnerships between Amazon, Uber, Tesla, and Waymo for autonomy, delivery, and e-commerce
  • Most Contrarian Belief: OpenAI will lose its lead, stumble in its nonprofit-to-for-profit transition, and become the number four player in AI.
  • Best Performing Asset: MAG 7 stocks
  • Worst Performing Asset: Legacy car companies and Real Estate.
  • Most Anticipated Trend: Exits and DPI will shower down, along with a surge in M&A and IPOs
  • Most Anticipated Media: Legacy media outlets owned by billionaires attempting to steer towards the middle
  • Prediction Market: Over or under 750,000 deportations by Trump in the first year of office

Building Anthropic | A conversation with our co-founders - Anthropic [Link]

WTF is Artificial Intelligence Really? | Yann LeCun x Nikhil Kamath | People by WTF Ep #4 - Nikhil Kamath [Link]

The Next Frontier: Sam Altman on the Future of A.I. and Society - New York Times Events [Link]

LA's Wildfire Disaster, Zuck Flips on Free Speech, Why Trump Wants Greenland [Link]

Text, camera, action! Frontiers in controllable video generation - William (Bill) Peebles [Link]

Best of 2024 in Agents (from #1 on SWE-Bench Full, Prof. Graham Neubig of OpenHands/AllHands) - Latent Space [Link]

The State of AI Startups in 2024 [LS Live @ NeurIPS] - Latent Space [Link]

Best of 2024 in Vision [LS Live @ NeurIPS] - Latent Space [Link]

Red-pilled Billionaires, LA Fire Update, Newsom's Price Caps, TikTok Ban, Jobless MBAs - All-In Podcast [Link]

NVIDIA CEO Jensen Huang Keynote at CES 2025 - NVIDIA [Link]

CES 2025 is the world's biggest tech expo. Each January, CES kicks off the tech year by highlighting everything from groundbreaking gadgets to the processors driving our digital world.

NVIDIA's CES announcements showcased its dominance in the AI chip market while highlighting its bold expansion into emerging, high-growth sectors. By emphasizing robotics, autonomous vehicles, and broader accessibility to AI, NVIDIA demonstrated its commitment to staying central to this wave of innovation.

Highlights:

  1. GeForce RTX 50 Series GPUs

    NVIDIA unveiled its latest GeForce RTX 50 series GPUs, powered by the advanced Blackwell architecture and set to launch in January. These GPUs deliver significant improvements in gaming and AI performance, with the flagship RTX 5090 priced at $1,999 and the RTX 5070 at $549, surpassing the RTX 4090, which debuted at $1,599 in 2022.

    The 50 series also introduces DLSS 4, a cutting-edge Deep Learning Super Sampling technology that employs a transformer-based architecture to generate three AI-rendered frames for every traditionally rendered one, enhancing graphics quality and gaming experiences. NVIDIA partnered with Micron to supply memory chips for these GPUs.

    Although GeForce RTX GPUs contributed only 9% of NVIDIA’s revenue in the October quarter, the company’s primary growth continues to come from its Data Center segment, driven by AI demand.

  2. AI Advancements

    NVIDIA introduced Nemotron, a new family of AI models derived from Meta’s Llama models, including Llama Nemotron Nano, Super, and Ultra, aimed at advancing AI agent capabilities. CEO Jensen Huang projects that the AI agent market could be worth trillions of dollars.

    Additionally, NVIDIA confirmed that its Blackwell AI accelerators are in full production and are being adopted by leading cloud providers and PC manufacturers, further solidifying its position in AI technology.

  3. Robotics and Autonomous Vehicles

    NVIDIA debuted Cosmos, the "world's first physical AI model," designed to advance robotics. Trained on 20 million hours of video, Cosmos is open-licensed on GitHub and integrates seamlessly with NVIDIA’s Omniverse platform to provide physics-based simulations for AI model training in robotics and autonomous systems.

    NVIDIA is partnering with Toyota to develop the automaker's latest autonomous vehicles. Huang sees robotics and autonomous technology as a $1 trillion market opportunity, expecting NVIDIA’s automotive revenue to grow from $4 billion in FY25 to $5 billion in FY26, spanning Data Center and OEM segments.

  4. Project DIGITS

    NVIDIA announced Project DIGITS, a personal AI supercomputer aimed at democratizing access to powerful AI tools. Starting at $3,000, the system features the GB10 Grace Blackwell Superchip, 128GB of unified memory, and up to 4TB of NVMe storage. Users can connect two systems for enhanced processing capabilities.

    Designed for AI researchers and data scientists, Project DIGITS provides a cost-effective solution for building complex AI models without relying on large-scale data center resources.

A non-comprehensive summary of NVIDIA's AI efforts (not a summary of this YouTube video):

  1. AI Compute Hardware:

    This category includes the physical processing units that perform the core calculations for AI models. These are primarily GPUs, but also include specialized CPUs and other accelerators.

    Focus: High-performance, parallel processing, low latency, memory bandwidth, energy efficiency for AI workloads.

    Examples:

    NVIDIA A100, A40, A10, and H100 Tensor Core GPUs; NVIDIA L40 and L4 GPUs; NVIDIA B100 "Blackwell" data center GPU; NVIDIA Grace CPU Superchip; GeForce RTX 30 Series (desktop, Ampere) and GeForce RTX 50 Series (desktop, Blackwell), both relevant for model development; Project DIGITS (personal AI supercomputer).

  2. AI Platforms & Systems:

    This category includes integrated hardware and software solutions designed to simplify the development and deployment of AI applications. It encompasses both edge and data center solutions.

    Focus: Ease of use, scalability, optimized performance for specific AI tasks, deployment solutions.

    Examples:

    NVIDIA DGX A100 system; NVIDIA Jetson AGX Xavier NX; NVIDIA Jetson Orin; NVIDIA Jetson Orin Nano; NVIDIA Omniverse platform.

  3. AI Software & Development Tools:

    This category includes the software libraries, frameworks, and tools that allow developers to build, train, and deploy AI models. It covers both open source and proprietary tools.

    Focus: Developer productivity, model performance, framework support, customization.

    Examples:

    NVIDIA Merlin (software library); NVIDIA NeMo framework; NVIDIA TAO Toolkit.

  4. AI Applications & Solutions:

    This category focuses on specific, industry-focused AI applications built on top of NVIDIA hardware and software.

    Focus: Pre-built solutions, vertical market expertise, end-to-end solutions.

    Examples: Intelligent Video Analytics (IVA), autonomous vehicle solutions, AI-driven healthcare, generative AI.

  5. AI Research and Frameworks

    While related to AI software and development tools, research deserves its own category because much of the research-based tooling and APIs are open source, allowing for community contributions and new technology development.

    Focus: Next-generation tools, advanced research, pushing the limits of AI, new technologies and algorithms.

    Examples: Nemotron; NVIDIA FLARE (Federated Learning Application Runtime Environment); NVIDIA research publications and open-source projects; TensorFlow and PyTorch (with NVIDIA's extensions).

So, my takeaway was entirely different. It was not a commentary on Masa, or Larry, or Sam. I think all of those three companies are, frankly, very good. It was more a comment that you have to be very careful to protect the president's legacy, if I were them, to make sure that the things that get announced are actually further down the technical spectrum and are actually going to be real. Because if they achieve these things, but it costs you a billion dollars and you only hire 50 people, there's going to be a little bit of egg on the face. And so, that was sort of my own takeaway. I think that the things were decoupled. It just seemed more like marketing and sizzle and kind of hastily put together. I think it would be great if OpenAI builds another incredible model, whatever comes after o3, o4, o5. But it's not clear that you have to spend $500 billion to do it. - Chamath Palihapitiya

― Trump's First Week: Inauguration Recap, Executive Actions, TikTok, Stargate + Sacks is Back! - All-In Podcast [Link]

There's a thing called Jevons Paradox, which kind of speaks to this concept. Satya actually tweeted about it. It's an economic concept where, as the cost of a particular use goes down, the aggregate demand for all consumption of that thing goes up. So, the basic idea is that as the price of AI gets cheaper and cheaper, we're going to want to use more and more of it. You might actually get more spending on it in the aggregate. That's right—because more and more applications will become economically feasible. Exactly. That is, I think, a powerful argument for why companies are going to want to continue to innovate on frontier models. You guys are taking a very strong point of view that open source is definitely going to win, that the leading model companies are all going to get commoditized, and therefore, there will be no return on capital—essentially forcing continued innovation on the frontier. - David Sacks

But then there's this dark horse that nobody's talking about—it's called electricity. It's called power. And all these vehicles are electric vehicles. If you said, 'You know, I just did some quick back-of-the-envelope calculations,' if all of the miles in California went to EV ride-sharing, you would need to double the energy capacity of California. Right? Let's not even talk about what it would take to double the energy capacity of the grid and things like that in California. Let's not even go there. Even getting 10% or 20% more capacity is going to be a gargantuan, five-to-ten-year exercise. Look, I live in LA—in a nice area in LA—and we have power outages all the freaking time because the grid is messed up. They're sort of upgrading it as things break. That's literally where we're at in LA, in one of the most affluent neighborhoods. That’s just the reality. So, I think the dark horse, kind of hot take, is combustion engine AVs. Because I don’t know how you can scale AVs really, really massively with the electric grid as it is. - Travis Kalanick

I just wanted to read a message from Brian Yutko, who's the CEO of Wisk, which is building a lot of these autonomous systems. He said: 'First, automatic traffic collision avoidance systems do exist right now. These aircraft will not take control from the pilot to save the aircraft, even if the software and systems on the aircraft know that it’s going to collide. That’s the big flip that needs to happen in aviation—automation can actually kick in and take over, even in piloted aircraft, to prevent a crash. That’s the minimum of where we need to go. Some fighter jets have something called Automatic Ground Collision Avoidance Systems that do exactly this when fighter pilots pass out. It’s possible for commercial aviation as well.' And then, the second thing he said is: 'We need to have better ATC (Air Traffic Control) software and automation. Right now, we use VHF radio communications for safety and critical instructions, and that’s kind of insane. We should be using data links, etc. The whole ATC system runs on 1960s technology. They deserve better software and automation in the control towers—it’s totally ripe for change. The problem is that attempts at reform have failed.' - Chamath Palihapitiya

― DeepSeek Panic, US vs China, OpenAI $40B?, and Doge Delivers with Travis Kalanick and David Sacks - All-In Podcast [Link]

Articles and Blogs

The Art of Leading Teammates - Harvard Business Review [Link]

A Team-Focused Philosophy

  • Put the team first, always, even when facing personal adversity.
  • Show appreciation for unsung colleagues.
  • Set the standard and create a culture of 100% effort.
  • Recognize teammates’ individual psychology and the best ways to motivate them.
  • Understand and complement the style of the formal leader.
  • Recognize and counteract the external forces that can cause selfish behavior.
  • Create opportunities to connect as people outside the office.

What Helps—and What Gets in the Way

  • The emotions and behaviors that define individuals are formed early.
  • Leaders work within a system.
  • It can be hard for individual team leaders to influence change across large organizations.
  • A leader’s style and influence will take time to evolve.

Early adopters of gen AI can eclipse rivals by using it to identify entirely new product opportunities, automate routine decisions and processes, deliver customized professional services, and communicate with customers more quickly and cheaply than was possible with human-driven processes.

Far from being a source of advantage, even in sectors where its impact will be profound, gen AI will be more likely to erode a competitive advantage than to confer one, because its very nature makes new insights and data patterns almost immediately transparent to anyone using gen AI tools.

If you already have a competitive advantage that rivals cannot replicate using gen AI, the technology may amplify the value you derive from that advantage.

Businesses that try to deny the power of gen AI will certainly fail. Those that adopt it will stay in the fight. But at this stage it looks likely that the only ones that will actually win with it will be those that can apply it to amplify the advantages they already have.

― AI Won’t Give You a New Sustainable Advantage - Harvard Business Review [Link]

To prevent each of these problems, ask about the corresponding issue and use the sample questions:

  • Conflating correlation and causation - ask about the approach to determining causality. Sample questions: Was this analysis based on an experiment? If not, are there confounders (variables that affect the independent and dependent variables)? To what extent were they addressed in the analysis?
  • Misjudging the potential magnitude of effects - ask about the sample size and the precision of the results. Sample questions: What was the average effect of the change? What was the sample size and the confidence interval (or range of likely values the true effect would fall into, and the degree to which one is certain it would fall into that range)? How would our course of action change, depending on where the true effect might lie?
  • A disconnect between what is measured and what matters - ask about the outcome measures. Sample questions: What outcomes were measured? Were they broad enough? Did they capture key intended and unintended consequences? Were they tracked for an appropriate period of time? Were all relevant outcomes reported? How do we think they map to broader organizational goals?
  • Misjudging generalizability - ask about the empirical setting and subgroup analysis. Sample questions: How similar is the setting of this study to our business context? Does the context or time period of the analysis make it more or less relevant to our decision? What is the composition of the sample being studied, and how does it influence the applicability of the results? Does the effect vary across subgroups or settings? Does this tell us anything about the generalizability of the results?
  • Overweighting a specific result - ask about broader evidence and further data collection. Sample questions: Are there other analyses that validate the results and approach? What additional data could we collect, and would the benefit of gathering it outweigh the cost of collecting it? How might this change our interpretation of the results?

― Where Data-Driven Decision-Making Can Go Wrong - Harvard Business Review [Link]
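
As a small illustration of the "misjudging the potential magnitude of effects" item above (my own example, not from the article), here is how a point estimate and a 95% confidence interval for an A/B-test lift can be computed with a normal approximation; an interval wide enough to include decision-changing values is a signal to collect more data.

```python
from math import sqrt

def ab_test_ci(conv_a: int, n_a: int, conv_b: int, n_b: int, z: float = 1.96):
    """Difference in conversion rates (B minus A) with a normal-approximation 95% CI."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return diff, (diff - z * se, diff + z * se)

diff, (low, high) = ab_test_ci(conv_a=480, n_a=10_000, conv_b=540, n_b=10_000)
print(f"lift = {diff:.3%}, 95% CI = ({low:.3%}, {high:.3%})")
# If the interval includes effects that would change the decision,
# the single result should not be over-weighted.
```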

Will Psychedelics Propel Your Career? - Harvard Business Review [Link]

Do you want to take a 'trip'? lol

How Scalable Compute Resources Can Boost LLM Performance - HuggingFace [Link]

This blog explains how to scale test-time compute for models like OpenAI's o1 - apply dynamic inference strategies to improve performance without increasing pretraining budgets. These techniques allow smaller models to outperform larger models on tasks such as math problems.
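
As a rough illustration of the simplest of these strategies, best-of-N sampling, here is a sketch (mine, not the blog's code): draw several candidate answers and keep the one a verifier or reward model scores highest. `generate` and `score` are stand-ins for your model and verifier.

```python
from typing import Callable

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              score: Callable[[str, str], float],
              n: int = 16) -> str:
    """Spend extra inference-time compute: sample n candidates, keep the best-scored one."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda answer: score(prompt, answer))
```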

We introduce deliberative alignment, a training paradigm that directly teaches reasoning LLMs the text of human-written and interpretable safety specifications, and trains them to reason explicitly about these specifications before answering. We used deliberative alignment to align OpenAI’s o-series models, enabling them to use chain-of-thought (CoT) reasoning to reflect on user prompts, identify relevant text from OpenAI’s internal policies, and draft safer responses.

― Deliberative alignment: reasoning enables safer language models - OpenAI [Link]

Moravec’s paradox is the observation by artificial intelligence and robotics researchers that, contrary to traditional assumptions, reasoning requires very little computation, but sensorimotor and perception skills require enormous computational resources. The principle was articulated by Hans Moravec, Rodney Brooks, Marvin Minsky, and others in the 1980s.

― Common misconceptions about the complexity in robotics vs AI - Harimus Blog [Link]

Yes, as Yann LeCun mentioned in one of his previous campus lectures, LLM might help but it is not the right solution for robotics. This article made several good points:

  • Sensorimotor Tasks Are More Complex. The source emphasizes that sensorimotor tasks are harder than many people realize. It was once assumed that perception and action were simple compared to reasoning, but this has turned out to be incorrect. This idea is known as Moravec's Paradox.
  • Real-World Interaction is the challenge. Robotics requires robots to interact with a dynamic, chaotic, and complex real world. Tasks that seem simple for humans, like picking up a coffee cup, involve complex, unconscious processes that are hard to program for a robot. Even small changes in the environment can require a complete rewrite of the robot's "move commands". Robots need to break down movements into muscle contractions and forces, which is more complex than it seems.
  • Data requirements are another challenge. LLMs thrive on massive amounts of data, like text and images from the internet, while robotics requires precise, high-quality data that is hard to collect. The variety and precision of the data also matter. Unlike LLMs, where the quantity of data is key, in robotics the quality of the data collected matters more than the quantity.

Regarding the question "do we need better hardware to learn", I think we need a system of sensors that can capture every physical movement of a body and every angle a body can perceive. In terms of a world model, the system needs to be on a larger scale.

OpenAI has created an AI model for longevity science - MIT Technology Review [Link]

OpenAI's success with GPT-4b micro demonstrates the potential of LLMs to go beyond natural language processing and address highly specialized scientific problems. The model's ability to redesign Yamanaka factors to improve their effectiveness by 50x could be a game-changer in stem cell research, accelerating advancements in regenerative medicine. This development highlights a significant milestone in the use of AI for scientific discovery, particularly in the field of protein engineering and regenerative medicine.

A classic pattern in technology economics, identified by Joel Spolsky, is layers of the stack attempting to become monopolies while turning other layers into perfectly-competitive markets which are commoditized, in order to harvest most of the consumer surplus; discussion and examples.

― Laws of Tech: Commoditize Your Complement - [Link]

This is exactly Meta's initial strategy for competing with closed-source AI model businesses - commoditize their complements to increase demand for their own products. And there are more examples mentioned in this article.

  • Core Concept:

    Products have substitutes and complements. A substitute is an alternative product that can be bought if the first product is too expensive. A complement is a product usually bought together with another product. Demand for a product increases when the price of its complements decreases. Companies strategically try to commoditize their complements to increase demand for their own products. Commoditizing a complement means driving its price down to a point where many competitors offer indistinguishable goods. This strategy allows a company to become a quasi-monopolist and divert the majority of the consumer surplus to themselves.

  • How it works:

    A company seeks a chokepoint or quasi-monopoly in a product composed of multiple layers. It dominates one layer of the stack while fostering competition in another layer. This drives down prices in the commoditized layer, increasing overall demand. The company profits from increased demand for its core product while the competitors in the commoditized layer struggle with low margins. The goal is to make a complement free or very cheap, to increase profits elsewhere. This strategy is an alternative to vertical integration.

  • Examples of Commoditization:

    • Microsoft commoditized PC hardware by licensing its OS to many manufacturers, making the PC itself a commodity and increasing demand for MS-DOS.
    • IBM commoditized the add-in market by using off-the-shelf parts and documenting the interfaces, allowing other manufacturers to produce add-on cards for their PCs, which increased the demand for PCs.
    • Netscape open-sourced its web browser to commoditize browsers and increase demand for its server software.
    • Various companies contribute to open-source software to commoditize software and increase demand for hardware and IT consulting services.
    • Sun developed Java to make hardware more of a commodity.
    • The Open Game License (OGL) was created to commoditize the Dungeons and Dragons system and drive sales of core rulebooks.
  • Open Source as a Strategic Weapon:

    Open source can be a way for companies to commoditize their complements. It allows companies to share development costs and compete with dominant players. It can also neutralize advantages held by competitors and shift the focus of competition. Open sourcing can prevent a single company from locking up a technology.

  • Generalization:

    Many products are composed of layers, each necessary but not sufficient for the final product. The final product is valuable, but the distribution of revenue among the different layers is contentious. Commoditizing complements is a way to control the market without vertical integration. The division of revenue is influenced by power plays and market dynamics.

  • Additional Examples:

    The sources list many examples of commoditization in various industries, including hardware vs. software, banks vs. merchants, apps vs. OSes, game portals vs. game devs, telecom vs. users, and many more. The examples illustrate the breadth of this strategy across various tech and non-tech sectors. There are examples of companies commoditizing themselves, such as Stability AI, who commoditized image-generation models and saw little profit themselves.

  • Counter-Examples:

    • Sun Microsystems' strategy of making both hardware and software a commodity was not successful.
    • Some companies, like Apple, try to control both the hardware and software aspects of their products, which goes against the commoditization strategy.
  • Other Factors:

    Antitrust actions can influence companies and prevent them from crushing competitors. Fear of antitrust actions may have stopped Microsoft from crushing Google.

  • Consequences:

    The commoditization of complements can lead to intense competition in certain layers of the tech stack. It can also lead to a concentration of power and revenue in the hands of companies that control key chokepoints.

Reports and Papers

Mixtral of Experts [Link]

Key innovation: Sparse Mixture of Experts (SMoE) with TopK=2.
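
For intuition, here is a schematic top-2 routing layer (a sketch of the general SMoE idea, not Mixtral's actual implementation): a router scores the experts per token, only the top two are evaluated, and their outputs are mixed with softmax-renormalized router weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, dim: int, hidden: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.SiLU(), nn.Linear(hidden, dim))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, dim)
        logits = self.router(x)                                # (tokens, n_experts)
        weights, idx = torch.topk(logits, self.top_k, dim=-1)  # keep 2 experts per token
        weights = F.softmax(weights, dim=-1)                   # renormalize over the chosen 2
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                       # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```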

The state of Generative AI and Machine Learning at the end of 2023 - Intel Tiber AI Studio [Link]

Trends and insights of AI development and deployment in the enterprise - a survey result.

Does Prompt Formatting Have Any Impact on LLM Performance? [Link]

Prompt formats significantly affect LLM performance, with differences as high as 40% observed in code translation tasks for GPT-3.5-turbo. Larger models like GPT-4 demonstrate more resilience to prompt format changes.

JSON format outperformed Markdown in certain tasks, boosting accuracy by 42%. GPT-4 models exhibited higher consistency in responses across formats compared to GPT-3.5 models.
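
To make the finding concrete, here is a toy example (my own, not from the paper) of the same task rendered as a JSON prompt versus a Markdown prompt; the paper's point is that surface changes like this alone can shift accuracy substantially.

```python
import json

task = {
    "role": "You are a code translator.",
    "instruction": "Translate the function below from Python to Java.",
    "code": "def add(a, b):\n    return a + b",
}

# JSON-formatted prompt
json_prompt = json.dumps(task, indent=2)

# Markdown-formatted prompt carrying identical content
markdown_prompt = (
    f"## Role\n{task['role']}\n\n"
    f"## Instruction\n{task['instruction']}\n\n"
    f"## Code\n{task['code']}"
)
```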

Deliberative Alignment: Reasoning Enables Safer Language Models [Link]

Their training methodology has two stages: 1) supervised fine-tuning on (prompt, CoT, output) datasets where CoTs explicitly reference safety policies, 2) high-compute RL using a reward model informed by safety policies, improving reasoning and adherence.

[Figure: deliberative alignment]

Genesis is a comprehensive physics simulation platform designed for general purpose Robotics, Embodied AI, & Physical AI applications. It is simultaneously multiple things:

  • A universal physics engine re-built from the ground up, capable of simulating a wide range of materials and physical phenomena.
  • A lightweight, ultra-fast, pythonic, and user-friendly robotics simulation platform.
  • A powerful and fast photo-realistic rendering system.
  • A generative data engine that transforms user-prompted natural language description into various modalities of data.

― Genesis: A Generative and Universal Physics Engine for Robotics and Beyond [Link]
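
A hello-world sketch of the "pythonic and user-friendly" API, reconstructed from memory of the project's quick-start; treat the exact names and arguments as approximate and check the Genesis repository before relying on them.

```python
import genesis as gs

gs.init(backend=gs.cpu)              # GPU backends are also supported

scene = gs.Scene()                   # a simulation scene
scene.add_entity(gs.morphs.Plane())  # add a ground plane
scene.build()                        # compile the scene

for _ in range(100):
    scene.step()                     # advance the physics simulation
```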

Agents - Julia Wiesinger, Patrick Marlow, and Vladimir Vuskovic - Google [Link]

Agent AI: Surveying the Horizons of Multimodal Interaction [Link]

Foundations of Large Language Models [Link]

Atlas of Gray Matter Volume Differences Across Psychiatric Conditions: A Systematic Review With a Novel Meta-Analysis That Considers Co-Occurring Disorders [Link]

"Gray matter volume (GMV) differences across major mental disorders" refers to variations in the amount or density of gray matter in the brain when comparing individuals with mental disorders to those without. Gray matter consists of neuronal cell bodies, dendrites, and synapses and is essential for processing information, controlling movements, and supporting higher cognitive functions like memory, attention, and decision-making.

Structural Abnormalities: Mental disorders are often associated with changes in the brain's structure. GMV differences can highlight specific brain regions that are smaller, larger, or differently shaped in individuals with mental disorders.

Neurobiological Insights: Identifying GMV changes helps researchers understand the neurobiological basis of mental disorders and how these changes may contribute to symptoms like mood dysregulation, cognitive impairment, or altered behavior.

Target for Interventions: Understanding these differences can inform treatments such as targeted therapies, neurostimulation, or cognitive training to address the affected brain regions.

From Efficiency Gains to Rebound Effects: The Problem of Jevons’ Paradox in AI’s Polarized Environmental Debate [Link]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning [Link]

DeepSeek-R1 is an open-source reasoning model that matches OpenAI-o1 in math, reasoning, and code tasks.

News

NVIDIA Project DIGITS, A Grace Blackwell AI Supercomputer on your desk - NVIDIA [Link]

Constellation inks $1 billion deal to supply US government with nuclear power - Yahoo [Link]

Why 2025 will be the year of AI orchestration [Link]

2025 is anticipated to be the year of AI orchestration for several reasons:

  • In 2024, there was broad experimentation in AI, particularly with agentic use cases. In 2025, these pilot programs, experiments, and new use cases are expected to converge, leading to a greater focus on return on investment.
  • As organizations deploy more AI agents into their workflows, the need for infrastructure to manage them becomes more critical. This includes managing both internal workflows and those that interact with other services.
  • Decision-makers, especially those outside of the technology sector, are seeking tangible results from their AI investments. They are moving beyond experimentation and expect to see a return on their investment in 2025.
  • There will be a greater emphasis on productivity, which involves understanding how multiple agents can be made more effective. This will require a focus on accuracy and achieving higher productivity.
  • Many new orchestration options are emerging to address the limitations of existing tools such as LangChain. Companies are building orchestration layers to manage AI applications. These frameworks are still early in development, and the field is expected to grow.
  • There will be a focus on integrating agents across different systems and platforms, such as AWS's Bedrock and Slack, to allow for the transfer of context between platforms.
  • The emergence of powerful reasoning models like OpenAI's o3 and Google's Gemini 2.0 will make orchestrator agents more powerful.

Perplexity AI makes a bid to merge with TikTok U.S. - CNBC [Link]

OpenAI, Alphabet Inc.’s Google, AI media company Moonvalley and several other AI companies are collectively paying hundreds of content creators for access to their unpublished videos, according to people familiar with the negotiations.

― YouTubers Are Selling Their Unused Video Footage to AI Companies - Bloomberg [Link]

Stavridis says Trump’s plan for Greenland ‘not a crazy idea’ - The Hill [Link]

California’s Wildfire Insurance Catastrophe - WSJ [Link]

Rising premiums and limited coverage options could significantly impact Californians, particularly in wildfire-prone areas. The article calls out state leadership for failing to adapt policies to address climate-related risks effectively.

Our robotics team is focused on unlocking general-purpose robotics and pushing towards AGI-level intelligence in dynamic, real-world settings. Working across the entire model stack, we integrate cutting-edge hardware and software to explore a broad range of robotic form factors. We strive to seamlessly blend high-level AI capabilities with the physical constraints of physical.

― OpenAI has begun building out its robotics team - VentureBeat [Link]

It's surprising because I recall Sam saying in a public interview that OpenAI would not go into hardware, since it would not be as efficient at it as companies with hardware foundations such as Tesla, NVIDIA, and Meta. Now it is hiring its first hardware robotics roles, as announced by Caitlin Kalinowski.

OpenAI’s $500B ‘Stargate Project’ could aid Pentagon’s own AI efforts, official says - Breaking Defense [Link]

This article highlights OpenAI's ambitious Stargate Project and its potential impact on both commercial and government sectors, particularly the U.S. Department of Defense (DoD). Stargate represents a bold step in building the next generation of AI infrastructure, and its success could profoundly influence the future of both private AI development and national security capabilities. The collaboration between industry leaders and government stakeholders will be key to overcoming technical and financial hurdles.

Here are key takeaways:

OpenAI's Stargate Project:

  • Objective: Build $500 billion worth of AI infrastructure, including new data centers and power solutions, primarily aimed at training and operating large AI models.
  • Initial Funding: $100 billion to be deployed immediately, with ongoing development starting in Texas and other potential sites in the U.S.
  • Collaborators: Japan-based SoftBank, Oracle, UAE-based MGX, NVIDIA, Microsoft, and Arm.

DoD Implications:

  • AI Challenges in Defense: The DoD faces significant bottlenecks in computing power to meet the demands of modern AI applications, from battlefield decision-making to intelligence analysis and coordinating multi-domain operations (CJADC2).
  • Reliance on Private Sector: Stargate could provide essential computing power to address the Pentagon's high-tech needs, especially where DoD lacks in-house capacity.
  • Field Applications: Supercomputing resources are essential for training and retraining AI models in dynamic environments, such as battlefield conditions where new inputs may arise.

Challenges:

  • Energy Demands: Generative AI models like ChatGPT consume immense electricity. The DoD must consider scalable and portable power sources, such as compact nuclear plants.
  • Funding Scrutiny: Despite public commitments, questions have been raised about the financial capability of Stargate’s backers, including SoftBank.
  • Technical Constraints: Effective use of AI in military applications depends on robust, secure, and reliable infrastructure to handle high-bandwidth connections and avoid vulnerabilities to jamming.

Political and Economic Context:

  • The Stargate Project was announced at a high-profile White House event, underscoring its perceived importance to national interests.
  • Skepticism from figures like Elon Musk about the financial feasibility of such an enormous project adds to the intrigue surrounding its rollout.

Trump is planning 100 executive orders starting Day 1 on border, deportations and other priorities - AP News [Link]

A new neural-network architecture developed by researchers at Google might solve one of the great challenges for large language models (LLMs): extending their memory at inference time without exploding the costs of memory and compute. Called Titans, the architecture enables models to find and store during inference small bits of information that are important in long sequences.

Titans combines traditional LLM attention blocks with “neural memory” layers that enable models to handle both short- and long-term memory tasks efficiently. According to the researchers, LLMs that use neural long-term memory can scale to millions of tokens and outperform both classic LLMs and alternatives such as Mamba while having many fewer parameters.

― Google’s new neural-net LLM architecture separates memory components to control exploding costs of capacity and compute [Link]

TikTok restoring service after Trump vows to delay ban - AXIOS [Link]

TikTok's response to the Supreme Court decision [Link]

Amazon bought more renewable power last year than any other company - TechCrunch [Link]

Ozempic, Wegovy and other drugs are among 15 selected for Medicare’s price negotiations [Link]

Waymo Finds a Way Around US Restrictions Targeting Chinese Cars [Link]

More Speech and Fewer Mistakes - Meta News [Link]

NVIDIA Cosmos - NVIDIA [Link]

Announcing The Stargate Project - OpenAI [Link]

OpenAI announces the Stargate Project, a \(\$500\) billion effort to create advanced AI infrastructure. The project begins with an immediate \(\$100\) billion deployment for data centers, starting in Texas. It supports OpenAI’s goal of scaling artificial general intelligence (AGI) and training advanced AI models, with a focus on high-value fields like personalized medicine and biotechnology.

NVIDIA GPUs power compute-intensive workloads. Oracle provides high-capacity cloud infrastructure. Microsoft Azure supports scalable distributed AI model training.

Introducing Operator - OpenAI [Link]

It's an AI agent that automates tasks directly in a web browser. You can use Operator to complete repetitive tasks like filling out forms, booking travel, or ordering items online. It uses a new model called Computer-Using Agent (CUA), which combines GPT-4o's vision capabilities with reinforcement learning to interact with graphical user interfaces (GUIs).

Introducing Citations on the Anthropic API - Anthropic [Link]

Happy New Year (/≧▽≦)/

There is a lot to talk about regarding Reinforcement Learning from Human Feedback (RLHF). How about starting with Reinforcement Learning (RL) basics?

Warning: Extremely long article ahead :)

Overview

The process of training a model using reinforcement learning from human feedback (RLHF) involves three key steps, as outlined in the paper titled “Training language models to follow instructions with human feedback” by OpenAI.

instructGPT_overview_RLHF

Reinforcement Learning

Introduction

Reinforcement Learning (RL) is a machine learning approach where an agent learns to make decisions by interacting with an environment to maximize cumulative rewards.

The agent is the decision-maker or learner in the RL framework. It performs actions in the environment and learns from the feedback it receives. The environment represents everything external to the agent that it interacts with. It provides feedback in response to the agent’s actions. The state is a representation of the current situation of the environment as perceived by the agent. An action is a decision or move taken by the agent at each step based on its policy (a mapping from states to actions). The reward is a scalar feedback signal provided by the environment to indicate how good or bad an action was in achieving the agent’s goal.

An RL problem is typically formalized as a Markov Decision Process (MDP), which includes:

  • States (\(S\)): The set of all possible situations in which the agent can find itself.
  • Actions (\(A\)): The set of all possible moves or decisions available to the agent.
  • Transition Dynamics (\(P(s'|s,a)\)): The probability of transitioning to a new state \(s'\) given the current state \(s\) and action \(a\).
  • Rewards (\(R(s,a)\)): The immediate feedback received after taking action \(a\) in state \(s\).
  • Policy (\(\pi(a|s)\)): A strategy that defines how the agent selects actions based on states.

The goal of RL is to find an optimal policy \(\pi^*\) that maximizes cumulative rewards (also called return). This involves balancing short-term rewards with long-term planning using trial-and-error interactions with the environment.

agent_env_rewards_RL_intro

The challenges arising from the nature of the environment and its dynamics are non-stationary environments, stochastic rewards, and random states:

  • In non-stationary environments, the dynamics of the environment (e.g., transition probabilities or reward functions) change over time. This forces RL agents to continuously adapt their policies, which can lead to a drop in performance during the readjustment phase and forgetting previously learned policies.
  • Stochastic rewards occur when the reward function is probabilistic rather than deterministic. This introduces noise into the feedback signal, making it harder for the agent to discern which actions truly lead to higher rewards.
  • Random states refer to situations where the agent’s observations are noisy or partially observable, making it harder to infer the true state of the environment. Such randomness complicates policy learning because the agent may need to rely on memory or belief states (e.g., Partially Observable Markov Decision Processes, POMDPs) to make decisions. It increases the dimensionality and complexity of the state space.

The challenges related to algorithmic design and computational feasibility are:

  • RL algorithms require a significant amount of interaction with the environment to learn effectively, making them data-intensive. Many RL algorithms, particularly model-free methods like policy gradient techniques, require a large number of samples to converge.
  • RL agents face the exploration-exploitation dilemma, where they need to balance trying new actions to discover potentially better rewards (Exploration) and using known actions that yield high rewards (Exploitation).
  • Many RL problems involve enormous state and action spaces, such as games like Go or real-world robotics tasks. The exponential growth of possible states and actions makes it computationally challenging for RL algorithms to find optimal solutions.
  • Poorly designed rewards can lead to unintended behaviors (e.g., an agent exploiting loopholes in the reward structure). Sparse or delayed rewards make it difficult for the agent to associate its actions with outcomes.
  • RL agents often struggle to generalize learned behaviors across different tasks or environments. Agents trained in specific simulations (e.g., driving simulators) may fail to perform well in real-world scenarios due to differences in dynamics, noise, or variability.
  • RL algorithms are highly sensitive to hyperparameter choices (e.g., learning rate, discount factor). Poor tuning can lead to slow convergence or failure to converge at all, making training unpredictable and requiring significant expertise.
  • RL agents often use complex models (e.g., deep neural networks), making their decisions difficult to interpret. This lack of transparency is problematic in safety-critical applications like healthcare or autonomous driving, where understanding the reasoning behind decisions is essential.

Multi-Armed Bandit (MAB)

The multi-armed bandit (MAB) problem is a classic RL problem that exemplifies the exploration-exploitation tradeoff. It provides a simplified framework for decision-making under uncertainty.

Here is a simple scenario to help understand the Multi-Armed Bandit (MAB) problem. Imagine a doctor has three types of prescription drugs to treat a particular disease and \(N\) patients to treat. At the beginning, the doctor has no knowledge about which drug is the most effective. The goal is to identify the best action—the drug that can cure the highest number of patients.

To achieve this goal, we can define action values as:

\[ Q_t(a) = E[R_t \mid A_t = a], \]

where:

  • \(R_t\) is a random variable representing whether a patient is cured (the reward),
  • \(a\) is an action, which in this case corresponds to selecting a specific type of drug for the patients.

The best action is the one that maximizes the expected reward:

\[ a^* = \arg\max_a Q(a). \]

It’s important to note that an expectation, \(E[x]\), is typically calculated as:

\[ E[x] = \sum x p(x), \]

where \(p(x)\) represents the probability distribution of \(x\). However, in real-world scenarios where \(p(x)\) is unknown and data is limited, the expectation can be approximated using sample averages:

\[ E[x] \approx \frac{\sum x}{N}, \]

where \(N\) is the total number of observations of \(x\). This approximation process is known as Monte Carlo estimation.

The action value \(Q_t(a)\) can be estimated by Sample-Average Method using the following formula:

\[ Q_t(a) = \frac{\text{Total rewards received when action } a \text{ was taken before time } t}{\text{Number of times action } a \text{ was taken before time } t}. \]

Mathematically, this can be expressed as:

\[ Q_t(a) = \frac{\sum_{i=1}^{t-1} R_i \cdot I_{A_i = a}}{\sum_{i=1}^{t-1} I_{A_i = a}}, \]

where:

  • \(R_i\) is the reward received at time step \(i\),
  • \(I_{A_i = a}\) is an indicator function that equals 1 if action \(a\) was selected at time step \(i\) and 0 otherwise.

The best action can be selected by the Greedy Approach: \[ a^* = \arg\max_a Q(a). \] In our case, as demonstrated in the diagram below, after each action has been taken 4 times, \(Q_t(1)=0.5, Q_t(2)=0.75, Q_t(3)=0.25\), so the best action determined by \(\arg \max_a Q_t(a)\) is Action \(A_2\) (\(a=2\)).

However, this approach has some drawbacks, such as small sample sizes and non-stationary environments (e.g., patients are in different conditions). An intuitive alternative is to give Action \(A_1\) and Action \(A_3\) more opportunities. This is the Exploration-Exploitation Tradeoff: balancing trying new actions to discover potentially better rewards (exploration) against using known actions that yield high rewards (exploitation).

A better approach is called Epsilon-Greedy Strategy which is a simple yet effective method for addressing the exploration-exploitation tradeoff in RL. It involves:

  1. Exploration: With a probability of \(\epsilon\), the agent chooses a random action, allowing it to explore the environment and gather new information.
  2. Exploitation: With a probability of \(1-\epsilon\), the agent selects the action that has the highest estimated reward (greedy action) based on its current knowledge.

In our case, let \(\epsilon = 20\%\): \(Q_t(1)\) and \(Q_t(3)\) are each given \(10\%\), and \(Q_t(2)\) is given \(80\%\). The next (5th) round of action is decided by randomly sampling \(A_1,A_2,A_3\) with probabilities \(10\%, 80\%, 10\%\). If the sampled action is \(A_1\) and the reward is \(1\), then its action value is updated to \(Q_t'(1) = (0+1+1+0+1)/5 = 0.6\).

doctor_treatment_example

The pseudocode of the Epsilon-Greedy Approach is as follows.

# Initialize
for a = 1 to K:
    Q(a) <- 0    # Estimated value of each arm
    N(a) <- 0    # Number of times each arm has been pulled

# Epsilon-Greedy Algorithm
for t = 1 to num_turns:
    with probability ε:
        A <- randomly select an arm (exploration)
    otherwise:
        A <- select the arm with the highest Q(a) (exploitation)

    # Pull the selected arm and observe reward R
    R <- Reward(A)

    # Update the estimates for the selected arm
    N(A) <- N(A) + 1
    Q(A) <- Q(A) + (1 / N(A)) * (R - Q(A))    # Incremental update formula

Note that there is a math trick in the incremental updates Q(A) <- Q(A) + (1 / N(A)) * (R - Q(A)). \[ \begin{equation} \begin{aligned} Q_{n+1} &= {1\over n}\sum^n_{i=1}R_i \space \text{ : the average of the first n rewards, used at iteration n+1}\\ &= {1\over n}(R_n + \sum^{n-1}_{i=1}R_i)\\ &= {1\over n}(R_n + (n-1){1\over n-1}\sum^{n-1}_{i=1}R_i)\\ &= {1\over n} (R_n + (n-1)Q_n)\\ &= {1\over n}(R_n+ n \times Q_n - Q_n)\\ &= Q_n + {1\over n} (R_n - Q_n) \end{aligned} \end{equation} \] The higher the \(\epsilon\), the more opportunities are given to less-favored actions; in the testbed comparison shown below, a small amount of exploration (e.g., \(\epsilon=0.1\)) achieves a higher long-run average reward than the purely greedy choice (\(\epsilon=0\)).
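For reference, here is a minimal runnable Python version of the same loop; the three cure probabilities and the simulated reward function are made-up stand-ins for the doctor example, not values from the text:

import random

def epsilon_greedy_bandit(num_arms, num_turns, epsilon, reward_fn):
    # Q: estimated value of each arm; N: number of pulls per arm
    Q = [0.0] * num_arms
    N = [0] * num_arms
    for _ in range(num_turns):
        if random.random() < epsilon:
            arm = random.randrange(num_arms)                 # exploration
        else:
            arm = max(range(num_arms), key=lambda a: Q[a])   # exploitation
        reward = reward_fn(arm)
        N[arm] += 1
        Q[arm] += (1.0 / N[arm]) * (reward - Q[arm])         # incremental mean update
    return Q, N

# Toy usage: three "drugs" with assumed cure probabilities
cure_prob = [0.5, 0.75, 0.25]
Q, N = epsilon_greedy_bandit(3, 1000, 0.2,
                             lambda a: float(random.random() < cure_prob[a]))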

epsilon_greedy_method

(Source: Reinforcement Learning by Sutton and Barto, Chapter 2)

Agent

Long Term Goal

The goal of the agent is the long-term reward \[ G_t = R_{t+1}+R_{t+2}+R_{t+3}+... \] so the objective is the expected reward \(E[G_t]\): \[ E[G_t] = E[R_{t+1}+R_{t+2}+R_{t+3}+...] \] There are different types of agent tasks:

  • Episodic Task: Episodic tasks consist of distinct episodes, where each episode has a clear beginning and end. At the end of an episode, the environment resets to a starting state.

  • Continuing Task: Continuing tasks involve ongoing interactions with no natural endpoint. The agent interacts with the environment indefinitely. A key challenge in continuing tasks is that the cumulative reward (\(E[G_t]\)) can become unbounded as time progresses. This makes it difficult to optimize an unbounded objective directly.

    To make the objective bounded, a discount factor (\(\gamma\)) is introduced. The discount factor ensures that more weight is given to immediate rewards while gradually reducing the importance of future rewards. This approach stabilizes the optimization process. \(\gamma \in (0,1)\) is a scalar that determines how much future rewards are discounted compared to immediate rewards. In practice, \(\gamma\) is often set close to 1 (e.g., 0.95 or 0.98), allowing the agent to consider long-term rewards while still prioritizing recent ones.

    The following derivations demonstrate how discounting makes the objective bounded. \[ \begin{equation} \begin{aligned} G_t &= R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+...+\gamma^{k} R_{t+k+1}+ ...\\ &=\sum^{\infty}_{k=0}\gamma^k R_{t+k+1}\\ &\leq \sum^{\infty}_{k=0} \gamma^k \times R_{max} \space \text{ ,where }R_{max} = \max_k\{R_{t+k+1}\}\\ &=R_{max} \sum^{\infty}_{k=0} \gamma^k \\ &= R_{max} {1\over 1-\gamma} < \infty \end{aligned} \end{equation} \] The value of \(\gamma\) influences how far-sighted or short-sighted the agent is. If \(\gamma\) is large, the agent is far-sighted, meaning it prioritizes long-term rewards over immediate ones. If \(\gamma\) is small, the agent is short-sighted, focusing heavily on immediate rewards while ignoring distant future outcomes.

    The cumulative reward can also be written recursively, showing how the current cumulative reward is determined by the next step's reward and the next cumulative reward (a short code check of this recursion is sketched below): \[ \begin{equation} \begin{aligned} G_t &= R_{t+1}+\gamma R_{t+2}+ \gamma^2 R_{t+3}+...\\ &=R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + ...)\\ &=R_{t+1} + \gamma G_{t+1} \end{aligned} \end{equation} \]
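As a quick sanity check of the recursion \(G_t = R_{t+1} + \gamma G_{t+1}\), the short sketch below computes discounted returns backward over a toy episode (the reward numbers are made up):

gamma = 0.9
rewards = [1.0, 0.0, 2.0, 1.0]        # R_1 ... R_T for a toy episode

returns = [0.0] * len(rewards)
G = 0.0
for t in reversed(range(len(rewards))):
    G = rewards[t] + gamma * G        # G_t = R_{t+1} + gamma * G_{t+1}
    returns[t] = G
# returns[0] == sum(gamma**k * r for k, r in enumerate(rewards))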

Policy

The outcome of RL is policy \(\pi\), which is a projection or mapping from input state \(s\) to output action \(a\).

Deterministic Policy: A deterministic policy maps each state to a single, specific action. In other words, given the same state, the agent will always select the same action. Deterministic policies may fail in environments with high uncertainty or noise, as they do not allow for the exploration of alternative actions. \[ \pi(s)=a \] Stochastic Policy: A stochastic policy maps each state to a probability distribution over possible actions. For a given state, the agent selects an action based on this distribution, meaning different actions can be chosen with varying probabilities. It requires learning and maintaining a probability distribution over actions, which can be computationally expensive. \[ \begin{aligned} \pi(a|s) &= P(a|s) \geq 0, \\ &\text{where }P(a|s) \text{ is the probability of selecting action a in state s.}\\ \sum_{a\in A(s)}\pi(a|s)&=1 \end{aligned} \]

Bellman Equations

State-Value Functions and Action-Value Functions

State-Value Functions: denoted as \(V_\pi(s)\), represents the expected cumulative future rewards starting from a particular state \(s\) and following a specific policy \(\pi\) thereafter. It measures the "goodness" of being in the state \(s\), considering the long-term rewards achievable from that state under policy \(\pi\). It does not depend on specific actions but rather on the overall behavior dictated by the policy. \[ \begin{aligned} V_\pi(s)= E_\pi[G_t|S_t=s], \\ \space G_t=\sum^\infty_{k=0}\gamma^kR_{t+k+1} \end{aligned} \] Action-Value Functions: denoted as \(Q_\pi(s,a)\), represents the expected cumulative future rewards starting from the state \(s\), taking action \(a\), and then following a specific policy \(\pi\) thereafter. It measures the “goodness” of taking action \(a\) in state \(s\), considering both immediate rewards and future rewards achievable under the policy \(\pi\). It provides more granular information than \(V(s)\), as it evaluates specific actions rather than just states. \[ Q_\pi(s,a) = E_\pi[G_t|S_t=s, A_t=a] \] The relationship between the state-value function and the action-value function can be expressed using the following formula: \[ V_\pi(s) =\sum_{a \in A} \pi(a|s) Q_\pi (s,a) \] This equation shows that the value of a state under policy \(\pi\), \(V_\pi(s)\), is the expected value of the action-value function \(Q_\pi(s,a)\), weighted by the probability of taking each action \(a\) in state \(s\) according to policy \(\pi(a|s)\).

State-Value Bellman Equation and Action-Value Bellman Equation

State-Value Bellman Equation:

bellman_equation_huggingface

(Source: The Bellman Equation: simplify our value estimation)

The State-Value Bellman Equation can be written in a recursive form. \[ \begin{aligned} V_\pi(s) &= E_\pi(G_t|S_t=s)\\ &= E_\pi(R_{t+1}+\gamma G_{t+1}|S_t=s)\\ &=\sum_a \pi(a|s) \sum_{s'}\sum_r p(s',r|s,a) [r + \gamma V_\pi(s')] \end{aligned} \]

The tree structure below as an example can help understand the recursive property of the State-Value Bellman Equation. Note that an action \(a\) does not necessarily lead to a specific state \(s\), it can result in multiple possible states, each with a certain probability. These probabilities are determined by the environment, which we typically do not have direct access to.

bellman_equation_tree

Action-Value Bellman Equation:

action_value_equation_huggingface

(Source: Two types of value-based methods)

The Action-Value Bellman Equation can be written in a recursive form as well: \[ \begin{aligned} Q_\pi(s,a)&= E_\pi[G_t|S_t=s, A_t=a]\\ &=\sum_{s'}\sum_{r} P(s',r|s,a)[r+\gamma\sum_{a'}\pi(a'|s')Q_\pi(s',a')] \end{aligned} \] The tree structure below as an example can help understand the recursive property of the Action-Value Bellman Equation.

action_value_bellman_equation_tree

The main limitations of Bellman Equation:

  • In real-world problems, the number of states can be extremely large, requiring a separate Bellman equation for each state. This results in a system of simultaneous nonlinear equations due to the presence of the max operator, which can be difficult to solve.
  • Solving Bellman equations often requires iterative methods and can demand significant computational resources. This is particularly true when seeking high-precision approximations over many iterations.
  • In applications like the Bellman-Ford algorithm for finding shortest paths, the presence of negative cycles can pose a problem. If a cycle has a negative total sum of edges, it can lead to an undefined shortest path since iterating through the cycle can indefinitely reduce the path length.
  • The Bellman equation is inherently nonlinear because it involves maximizing over possible actions. This nonlinearity can complicate finding solutions, especially when dealing with large state spaces or complex reward structures.

Policy Iteration

In RL, an optimal policy is a policy that maximizes the expected cumulative reward (or return) for an agent across all states in a Markov Decision Process (MDP). This means that the state-value function \(v_{\pi}(s)\) or the action-value function \(q_{\pi}(s,a)\) under the optimal policy is greater than or equal to that of any other policy for all states and actions. \[ v_{\pi^*}(s) \geq v_{\pi}(s), \forall s \in \text{states}, \text{ for all policies } \pi \] In finite MDPs, at least one optimal policy always exists. However, there may be multiple optimal policies that achieve the same maximum expected return.

Policy Iteration is a dynamic programming algorithm used in RL to compute the optimal policy \(\pi^*\) for a Markov Decision Process (MDP). It alternates between two main steps: policy evaluation and policy improvement, iteratively refining the policy until convergence.

The full policy iteration algorithm pseudocode is in the figure below (Source: Sutton & Barto summary chap 04 - Dynamic Programming):

policy_iteration_pseudocode

Here is a detailed explanation of the policy iteration pseudocode.

Repeat steps until convergence:

  1. Policy evaluation: keep current policy \(\pi\) fixed, find value function \(V(\cdot)\).

    Iterate Bellman update until values converge:

    \[ V(s) \leftarrow \sum_{s',r}p(s',r|s,\pi(s))[r+\gamma V(s')] \] The Bellman operator computes future rewards but discounts them by multiplying with \(\gamma\). This ensures that differences in value functions become smaller with each iteration. In other words, Bellman shrinks distances. To see it mathematically,

    The Bellman operator for the state-value function under a fixed policy \(\pi\) is defined as \[ V^{\pi}(s) = r(s, \pi(s)) + \gamma \sum_{s'} P(s'|s,\pi(s))V(s') \] This operator updates the value function by combining the immediate reward and the discounted future rewards.

    We compute the difference after applying the operator: \[ \Big|V^{\pi}_1(s)-V^{\pi}_2(s)\Big| = \Big|r(s,\pi(s))+\gamma\sum_{s'} P (s'|s,\pi(s))V_1(s')-\Big[r(s,\pi(s))+\gamma\sum_{s'} P (s'|s,\pi(s))V_2(s')\Big]\Big| \] Simplifying by canceling out the immediate rewards, we get: \[ \Big|V^{\pi}_1(s)-V^{\pi}_2(s)\Big| = \gamma \Big|\sum_{s'} P (s'|s,\pi(s))\Big(V_1(s')-V_2(s')\Big)\Big| \] Since \(\gamma<1\), the difference between \(V_1^{\pi}(s)\) and \(V_2^{\pi}(s)\) is always smaller than the difference between \(V_1(s')\) and \(V_2(s')\). Because the Bellman operator shrinks distances, it is a contraction mapping and follows the contraction mapping property.

    In summary, Policy evaluation is a contraction mapping for a fixed policy \(\pi\). Policy evaluation converges because it applies a contraction mapping repeatedly to compute the value function for a fixed policy.

  2. Policy improvement: find the best action for \(V(\cdot)\) via one-step lookahead.

    During policy improvement, the current policy \(\pi\) is updated by selecting actions that maximize the expected return for each state \(s\).

    \(\pi(s) \leftarrow \arg \max_a \sum_{s',r}p(s',r|s,a)[r+\gamma V(s')]\)

    The intuition behind this: \(V^{\pi}(s)\) measures how good it is to start from the state \(s\) and follow policy \(\pi\). By improving the actions selected by the policy, we ensure that the agent transitions into states with higher expected cumulative rewards. This iterative process ensures that each new policy improves upon or equals the previous one in terms of total expected rewards.

Overall, the idea of Policy Iteration can be demonstrated in the diagram below. The evaluation process usually takes a long time while the improvement process is usually fast.

policy_iteration_eval_impro_demo
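To make the evaluation/improvement loop above concrete, here is a minimal Python sketch of policy iteration for a small, fully known MDP; the transition table `P` and reward table `R` are assumed toy inputs rather than anything given in the text:

import numpy as np

def policy_iteration(P, R, gamma=0.9, tol=1e-8):
    """P[s][a] = list of (prob, next_state) pairs; R[s][a] = expected immediate reward."""
    n_states, n_actions = len(P), len(P[0])
    V = np.zeros(n_states)
    policy = [0] * n_states                      # deterministic policy: state -> action

    def q(s, a):                                 # one-step lookahead value
        return R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])

    while True:
        # 1. Policy evaluation: iterate the Bellman update for the fixed policy
        while True:
            delta = 0.0
            for s in range(n_states):
                v_new = q(s, policy[s])
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < tol:
                break
        # 2. Policy improvement: greedy one-step lookahead w.r.t. the evaluated V
        stable = True
        for s in range(n_states):
            best_a = max(range(n_actions), key=lambda a: q(s, a))
            if best_a != policy[s]:
                policy[s] = best_a
                stable = False
        if stable:
            return policy, V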

The Generalized Policy Iteration (GPI) method can speed up the policy evaluation process. In GPI, policy evaluation process can be approximate or partial (e.g., only a few iterations instead of full convergence). Policy improvement can also be done incrementally or partially. GPI speeds up the process of finding an optimal policy by relaxing the strict requirements of full convergence in policy evaluation.

generalized_policy_iteration

Monte Carlo Method

Monte Carlo methods are specifically designed for episodic tasks, where each episode eventually terminates, regardless of the actions taken by the agent. An episode refers to a complete sequence of interactions between an agent and the environment, starting from an initial state and ending in a terminal state.

Monte Carlo for Policy Evaluation

Value function and policy updates occur only after an episode is complete. By waiting for the end of an episode, Monte Carlo methods ensure that all rewards following a state are included in the return calculation. Monte Carlo methods learn purely from sampled experience (episodes), without requiring knowledge of transition probabilities or reward models. Episodes allow Monte Carlo methods to handle stochastic environments by averaging returns across multiple episodes.

Here is an example of calculating the state-value functions and action-value functions by the Monte Carlo method once 2 episodes are completed. Given states \(S=[A,B,C,D,E], A=[1,2,3]\), and two episodes \(E_1,E_2\), (Note: \(A:(1,0.4)\) means state \(A\), action \(1\), and reward \(0.4\)) \[ \begin{aligned} &E_1 = \{A:(1,0.4), B:(2,0.5), A:(2,0.6), C:(2,0.1), B:(3,0.8), E:()\}\\ &E_2 = \{B:(2,0.5), A:(1,0.6), C:(2,0.3), A:(1,0.3), C:(2,0.8), E:()\} \end{aligned} \]

  1. State-Value Functions Calculation

    We can calculate \(V(A),V(B),V(C),V(D),V(E)\).

    e.g., there are 4 sequences starting from state \(A\), so the state value function is: \[ \begin{aligned} V(A) &= [(0.4+\gamma 0.5 + \gamma^2 0.6 + \gamma^3 0.1 + \gamma^4 0.8)\\ &+(0.6+\gamma 0.1 + \gamma^2 0.8)\\ &+(0.6+\gamma 0.3 + \gamma^2 0.3 + \gamma^3 0.8)\\ &+(0.3+\gamma 0.8)] / 4 \end{aligned} \]

  2. Action-Value Functions Calculation

    We can calculate \(Q(A,1),Q(B,2),\cdots\).

    e.g., there are three sequences starting from state \(A\) with action \(1\), so the action value function is \[ \begin{aligned} Q(A,1) &= [(0.4+\gamma 0.5 + \gamma^2 0.6 + \gamma^3 0.1+ \gamma^4 0.8)\\ &+(0.6+\gamma 0.3 + \gamma^2 0.3 + \gamma^3 0.8)\\ &+(0.3+\gamma 0.8)]/3 \end{aligned} \]

The pseudocode for the above Monte Carlo for Policy Evaluation is as follows (Source: Reinforcement Learning by Sutton and Barto, Chapter 5):

montecarlo_statevalue

As part of the algorithm, it loops for each step of episode from the end \(T-1\) to the beginning of the episode. This allows for dynamic programming where some values can be stored and do not need to be re-calculated (see a simple demonstration below).

montecarlo_algo_dynamic_program
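A small sketch of this backward pass, using the (state, action, reward) episode format from the example above; averaging returns at every visit of a state gives an every-visit Monte Carlo estimate of \(V(s)\):

from collections import defaultdict

def mc_state_values(episodes, gamma=0.9):
    """Every-visit Monte Carlo estimate of V(s) from completed episodes.
    Each episode is a list of (state, action, reward) triples."""
    returns = defaultdict(list)
    for episode in episodes:
        G = 0.0
        for state, action, reward in reversed(episode):   # loop from T-1 down to 0
            G = reward + gamma * G                         # reuse G_{t+1} (dynamic programming)
            returns[state].append(G)
    return {s: sum(g) / len(g) for s, g in returns.items()}

# Toy usage with the two episodes from the example (terminal state E omitted)
E1 = [("A", 1, 0.4), ("B", 2, 0.5), ("A", 2, 0.6), ("C", 2, 0.1), ("B", 3, 0.8)]
E2 = [("B", 2, 0.5), ("A", 1, 0.6), ("C", 2, 0.3), ("A", 1, 0.3), ("C", 2, 0.8)]
V = mc_state_values([E1, E2], gamma=0.9)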

Monte Carlo for Policy Improvement

\[ \pi_{k+1}(s)=\arg \max_a q_{\pi_k}(s,a) \]

Here is an example of updating the policy by the action-value function. Given states \(S=[A,B,C,D,E], A=[1,2,3], \gamma=0.5\), and an episode \(E\), (Note: \(A:(1,0.7)\) means state \(A\), action \(1\), and reward \(0.7\)) \[ E = \{A:(1,0.7), B:(1,0.4), A:(3,1.5), C:(2,0.1), B:(3,0.7), A:(1,0.3)\}\ \] Through dynamic programming, the cumulative rewards are \(G_5(A,1)=0.3\), \(G_4(B,3)=0.7+0.5*0.3=0.85\), \(G_3(C,2)=0.1+0.5*0.85\approx 0.52\), \(G_2(A,3)=1.5+0.52*0.5=1.76\), \(G_1(B,1)=0.4+1.76*0.5=1.28\), \(G_0(A,1)=0.7+1.28*0.5=1.34\). We can maintain three lists to make the algorithm work:

  • Return matrix \(Returns(S,A)\), dimension \((S, A)\): It stores cumulative reward values \(G(S=s,A=a)\). One cell can store multiple values as the number of episodes increases.
  • \(Q(S,A)\) matrix: It's initialized as random numbers at the beginning. Updated whenever the return matrix is updated. \(Q(S,A)\) is the average value of the corresponding \(Returns(S,A)\).
  • \(\pi(s)\) list: It is updated by assigning probability \(1-\epsilon\) to the action with the highest \(Q\) value in each state (according to the updated \(Q(S,A)\) matrix) and spreading \(\epsilon\) over the remaining actions. This implements the Epsilon Greedy Algorithm.

The final updating result of the above example is in the diagram below.

mc_action_value_update_res

The pseudocode for the above Monte Carlo for Policy Improvement with action-value function is as follows (Source: Reinforcement Learning by Sutton and Barto, Chapter 5):

monte_carlo_action_value
mc_control_epsilon_pi

Main Limitation

  • Policy updates occur only after an episode is completed, which can slow down learning compared to methods like Temporal Difference (TD) learning that update incrementally after each step.
  • MC methods do not use bootstrapping (i.e., they do not update value estimates based on other estimates). While this avoids bias, it also means MC methods cannot leverage intermediate value estimates, leading to slower convergence.

Temporal Difference Learning

TD learning focuses on estimating the value function of a given policy by updating value estimates based on the difference between successive predictions, rather than waiting for an entire episode to conclude.

Given \[ \begin{aligned} V(S_t) &\leftarrow V(S_t) + \alpha \Big[G_t - V(S_t)\Big]\\ G_t &= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... = R_{t+1} + \gamma G_{t+1}\\ V_{\pi}(s) &= E_\pi[G_t|S_t=s] = E_\pi\Big[R_{t+1} + \gamma G_{t+1}| S_t=s\Big] = E_\pi\Big[R_{t+1} + \gamma V_\pi(S_{t+1})| S_t=s\Big] \end{aligned} \] we can derive the core update of TD learning: \[ V(S_t) \leftarrow V(S_t) + \alpha [R_{t+1} + \gamma V(S_{t+1}) - V(S_t)] \] The pseudocode of TD learning is as follows. (Source: Sutton & Barto summary chap 06 - Temporal Difference Learning)

td_learning_for_v
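A minimal tabular TD(0) sketch corresponding to this update; `env_reset`, `env_step`, and `policy` are assumed stand-ins for the environment and the fixed policy being evaluated:

from collections import defaultdict

def td0_evaluation(env_reset, env_step, policy, num_episodes, alpha=0.1, gamma=0.95):
    """Tabular TD(0): V(S_t) <- V(S_t) + alpha * [R_{t+1} + gamma * V(S_{t+1}) - V(S_t)]."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        state = env_reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env_step(state, action)
            target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (target - V[state])     # TD update after every step
            state = next_state
    return V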

From table to function approximation

Main limitations of table based methods:

  • As the number of state variables increases, the size of the state space grows exponentially.
  • Storing a value table becomes impractical for large or continuous state/action spaces due to memory limitations.
  • Table-based methods treat each state (or state-action pair) independently. They cannot generalize knowledge from one part of the state space to another, requiring every state to be visited multiple times for accurate learning.
  • In large state spaces, it is unlikely that an agent will visit all relevant states/actions frequently enough to converge to an optimal policy within a reasonable time frame.
  • Table-based methods are well-suited for small problems but fail to scale to real-world applications such as autonomous driving, robotics, or complex games like Go or StarCraft.

From tabular to parametric functions:

Fit a parametric function to approximate the value function \(V(s)\), which maps states \(s\) to their corresponding value estimates. \[ f(s,\theta) \approx V(s) \] where \[ f(s,\theta)=w^Ts+b \] To optimize this approximation, we minimize the mean squared error (MSE) loss between the observed value \(v_\pi(s)\) (e.g., a Monte Carlo return) and the prediction \(\hat{v}(s,{w})\). \(\mu(s)\) is the probability distribution over states. This loss ensures that the predicted values are as close as possible to the observed values. \[ \ell = \min \sum_s \mu(s) \Big[v_\pi(s)-\hat{v}(s,{w})\Big]^2 \] The optimal \({w}\) that minimizes the loss can be found by batch Gradient Descent (\(\eta\) is the learning rate). \[ w \leftarrow w - \eta \nabla \ell(w) \] where \[ \begin{aligned} \nabla \ell(w) &= \nabla \sum_s \mu(s) \Big[v_\pi(s) - \hat{v}(s,{w})\Big]^2\\ &= \sum_s \mu(s) \nabla \Big[v_\pi(s) - \hat{v}(s,{w})\Big]^2\\ &= -2 \sum_s \mu(s) \Big[v_\pi(s) - \hat{v}(s,{w})\Big]\nabla \hat{v}(s,{w}) \end{aligned} \] From Gradient Descent to Stochastic Gradient Descent:

While batch gradient descent computes gradients over the entire dataset (all states), this can be computationally expensive for large-scale problems. Instead, stochastic gradient descent (SGD) updates the parameters incrementally using one observation at a time. Given observations \((S_1, v_\pi(S_1)), (S_2, v_\pi(S_2)), (S_3, v_\pi(S_3)), ...\), SGD performs updates as follows (\(\alpha\) is learning rate). \[ \begin{aligned} {w}_2 &= {w}_1 + \alpha \Big[v_\pi(S_1) - \hat{v}(S_1,{w_1})\Big] \nabla \hat{v}(S_1,{w}_1)\\ {w}_3 &= {w}_2 + \alpha \Big[v_\pi(S_2) - \hat{v}(S_2,{w_2})\Big] \nabla \hat{v}(S_2,{w}_2)\\ &\cdots \end{aligned} \] The algorithm of Gradient Monte Carlo is as follows. (Source: Reinforcement Learning by Sutton and Barto)

gradient_montecarlo_v
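A minimal sketch of this Gradient Monte Carlo procedure with a linear approximator \(\hat{v}(s,w) = w^\top x(s) + b\); the feature map `features` and the (state, return) samples are assumed inputs prepared elsewhere:

import numpy as np

def gradient_mc(samples, features, dim, alpha=0.01):
    """SGD on the MSE between Monte Carlo returns and a linear value estimate.
    samples: iterable of (state, G) pairs, where G is the observed return."""
    w = np.zeros(dim)
    b = 0.0
    for state, G in samples:
        x = features(state)                 # feature vector for the state
        v_hat = w @ x + b
        error = G - v_hat
        w += alpha * error * x              # gradient of v_hat w.r.t. w is x
        b += alpha * error                  # gradient of v_hat w.r.t. b is 1
    return w, b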

PPO Prior

Average Reward

The average reward is an alternative to the commonly used discounted reward framework. It measures the long-term average reward per time step under a given policy, making it particularly suitable for continuing tasks (non-episodic problems) where there is no natural endpoint or terminal state.

The average reward framework is particularly useful for continuing tasks, where:

  • The task does not have a natural termination point (e.g., robot navigation, server optimization, or industrial control systems).
  • The agent operates indefinitely, and evaluating its performance based on long-term behavior (rather than episodic returns) is more meaningful.

The average reward for a policy is defined as: \[ r(\pi)=\lim_{h \rightarrow \infty} {1\over h}\sum^h_{t=1} E\Big[R_t | S_0, A_{0:t-1} \sim \pi\Big] \] This simple example shows how average reward is calculated:

average_reward_policy

Differential Return

The differential return in RL is a concept that arises in the average reward framework, particularly for continuing tasks. It measures the cumulative deviation of rewards from the long-term average reward rate, \(r(\pi)\) , under a given policy \(\pi\).

Differential return aligns with the goal of maximizing long-term performance in continuing tasks by focusing on deviations from steady-state behavior. Unlike discounted returns, differential return does not rely on a discount factor \(\gamma\). This avoids biases introduced by choosing an arbitrary discount factor. It is particularly well-suited for tasks with no natural termination, such as robotics or industrial control systems.

The differential return at time step \(t\) is defined as: \[ G_t = R_{t+1} - r(\pi)+R_{t+2} - r(\pi)+R_{t+3} - r(\pi)+\cdots \] Then the value functions can be rewritten as \[ \begin{aligned} v_\pi(s) = \sum_a \pi(a|s)\sum_{r,s'} p(s',r|s,a) \Big[r-r(\pi) +v_{\pi}(s')\Big]\\ q_\pi(s,a) = \sum_{r,s'} p(s',r|s,a) \Big[r-r(\pi) +\sum_{a'}\pi(a'|s')q_{\pi}(s',a')\Big] \end{aligned} \] Algorithms like Gradient Monte Carlo can be rewritten using this differential return.

Policy Gradient and REINFORCE

Objective of policy gradient: \[ \begin{aligned} J(\theta) &= \sum_{s\in S}d^{\pi}(s)V^\pi(s) = \sum_{s\in S} d^\pi(s) \sum_{a\in A} \pi_\theta(a|s) Q^\pi(s,a)\\ d^\pi(s) &= \lim_{t\rightarrow \infty}P(s_t=s|s_0,\pi_\theta) \rightarrow \text{ converges (Markov property)}\\ &\max J(\theta): \\ &\max \sum_{s\in S} d^\pi(s) V^\pi(s) \implies \theta \leftarrow \theta + \nabla_\theta J(\theta) \end{aligned} \] The policy gradient theorem allows us to compute the gradient of the expected return with respect to the parameters of a parameterized policy, enabling optimization through gradient ascent. \[ \begin{aligned} \nabla_\theta J(\theta) &= \nabla _\theta\Big[\sum_{s\in S} d^\pi(s) \sum_{a\in A} \pi_\theta(a|s) Q^\pi(s,a)\Big]\\ &\propto \sum_{s\in S} d^\pi(s) \sum_{a\in A} Q^\pi(s,a) \nabla _\theta\pi_\theta(a|s)\\ &\implies \theta \leftarrow \theta + \eta\Big[\sum_{s\in S} d^\pi(s) \sum_{a\in A} Q^\pi(s,a) \nabla _\theta\pi_\theta(a|s) \Big] \end{aligned} \] Since Monte Carlo estimation involves sampling, the gradient needs to be expressed as an expectation. The gradient derived above can therefore be rewritten as follows, supporting sample-based gradient estimation. (Recall: \((\ln x)' = 1/x\)) \[ \begin{aligned} \nabla_\theta J(\theta) &\propto \sum_{s\in S} d^\pi(s) \sum_{a\in A} Q^\pi(s,a) \nabla _\theta\pi_\theta(a|s)\\ &=\sum_{s\in S}d^\pi(s) \sum_{a\in A} \pi_\theta(a|s) Q^\pi(s,a) {\nabla_{\theta} \pi_\theta(a|s)\over \pi_\theta(a|s)}\\ &=E_\pi\Big[Q^\pi(s,a)\nabla_\theta \ln\pi_\theta(a|s)\Big]\\ (&=E_{s \sim \pi, a \sim \pi_\theta(a|s)}\Big[Q^\pi(s,a)\nabla_\theta \ln\pi_\theta(a|s)\Big])\\ \end{aligned} \] Given the above theorems, the REINFORCE algorithm (Monte Carlo Policy Gradient) is defined as follows. (Source: Sutton & Barto summary chap 13 - Policy Gradient Methods)

montecarlo_policy_gradient_pi

A differentiable policy ensures that small changes in \(\theta\) result in smooth changes in the action probabilities \(\pi(a|s,\theta)\). This is crucial for stable and efficient learning. The softmax function is commonly used to parameterize policies in discrete action spaces. The softmax function is smooth and differentiable, enabling gradient-based optimization. Softmax ensures that all actions have non-zero probabilities, promoting exploration during training.

Here the differentiable policy parameterization \(\pi(a|s,{\theta})\) can be defined by \[ \pi(a|s,\theta) = {\exp(h(s,a,\theta)) \over \sum_{b\in A} \exp(h(s,b,\theta))} \] where \(h(s,a,\theta)=w_a^Ts+b_a\) is a linear (or, more generally, non-linear) function representing the preference for action \(a\). The denominator normalizes the probabilities so that they sum to 1.

The log of the softmax function has a convenient derivative that simplifies gradient computation: \[ \nabla_\theta\ln \pi(a,s,\theta) = \nabla_\theta h(s,a,\theta) - \sum_b \pi(b|s,\theta) \nabla_\theta h(s,b,\theta) \]
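To see how these pieces fit together, here is a compact REINFORCE sketch with a linear softmax policy. It omits the \(\gamma^t\) weighting applied in the full algorithm shown in the figure above, and `run_episode` is an assumed helper that rolls out the current policy and returns a list of (feature vector, action, reward) steps:

import numpy as np

def softmax(h):
    h = h - h.max()                       # subtract max for numerical stability
    e = np.exp(h)
    return e / e.sum()

def reinforce(run_episode, n_actions, n_features, episodes=1000, alpha=0.01, gamma=0.99):
    """REINFORCE with a linear softmax policy pi(a|s) proportional to exp(W[a] @ x(s))."""
    W = np.zeros((n_actions, n_features))

    def policy(x):
        return softmax(W @ x)             # action probabilities for feature vector x

    for _ in range(episodes):
        trajectory = run_episode(policy)  # [(features, action, reward), ...]
        G = 0.0
        for x, a, r in reversed(trajectory):
            G = r + gamma * G             # Monte Carlo return from this step onward
            probs = policy(x)
            # grad of log pi(a|x) w.r.t. W[b] is (1[b == a] - pi(b|x)) * x
            grad = -np.outer(probs, x)
            grad[a] += x
            W += alpha * G * grad         # gradient ascent on J(theta)
    return W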

Main Limitations of REINFORCE

  • REINFORCE requires complete episodes to compute the return for each state, as it relies on Monte Carlo estimates of the expected return. This makes it unsuitable for continuing tasks (non-episodic environments) where there is no clear terminal state.
  • The gradient estimates in REINFORCE have high variance because they depend on the total return from sampled episodes, which can vary significantly due to stochasticity in the environment and policy.
  • Since REINFORCE updates the policy only after completing an episode, it does not make use of intermediate data. This results in poor sample efficiency, requiring a large number of episodes to learn effectively.
  • Unlike Temporal Difference (TD) methods, REINFORCE does not use bootstrapping (i.e., it does not update value estimates based on other estimates). It relies solely on complete returns from episodes.
  • The algorithm is highly sensitive to the choice of the learning rate. A poorly chosen learning rate can lead to divergence or extremely slow convergence.

Advantage Function

Advantage function employs the idea of differential return. In the REINFORCE algorithm, with advantage function, policy gradient can be re-written as \[ \begin{aligned} \nabla_\theta J(\theta) &=E_\pi\Big[Q^\pi(s,a)\nabla_\theta \ln\pi_\theta(a|s)\Big]\\ &=E_\pi\Big[A^\pi(s,a)\nabla_\theta \ln\pi_\theta(a|s)\Big]\\ \end{aligned} \] where \[ \begin{aligned} A^{\pi}(s,a) &= Q^\pi(s,a)-V^\pi(s)\\ V^\pi(s) &= \sum_{a\in A}\pi(a|s) Q(s,a) \end{aligned} \]

Off-Policy Policy Gradient

Off-policy policy gradient methods allow learning a target policy while using data generated from a different behavior policy. By reusing past experiences and learning from suboptimal actions, off-policy methods can significantly improve sample efficiency. Off-policy learning allows for better exploration strategies since it can incorporate data from various policies, including exploratory ones.

The policy gradient estimate is defined as \[ \nabla_\theta J(\theta) = E_\beta\Big[{\pi_\theta(a|s)\over \beta(a|s)} Q^\pi(s,a) \nabla _\theta \ln \pi_\theta (a|s)\Big] \] \(\beta(a|s)\) refers to the behavior policy that generates the data used for training. The behavior policy is not necessarily the optimal policy we want to learn (the target policy \(\pi(a|s)\)). Instead, it can be any policy that provides useful exploration of the state-action space. \({\pi_\theta(a|s)\over \beta(a|s)}\) is the importance sampling weight.

Trust Region Policy Optimization (TRPO)

TRPO is an advanced policy gradient method in RL designed to optimize policies while ensuring stable and reliable updates. It addresses some of the limitations of traditional policy gradient methods by incorporating a trust region constraint that limits how much the policy can change in a single update. The difference between REINFORCE and TRPO is that TRPO uses the off-policy policy gradient and the advantage function, as well as a constraint on the Kullback-Leibler (KL) divergence between the old and new policies.

Recall that REINFORCE's objective of policy gradient is: \[ J(\theta) = \sum_{s\in S} d^\pi(s) \sum_{a\in A} \pi_\theta(a|s) Q^\pi(s,a) \] The derivation of TRPO's objective of policy gradient is: \[ \begin{aligned} J(\theta) &=\sum_{s \in S} d^{\pi}(s) \sum_{a\in A} (\pi_\theta(a|s) \hat{A}_{\theta_{old}}(s,a))\\ &=\sum_{s\in S} d^{\pi_{\theta_{old}}} \sum_{a\in A} (\beta(a|s) {\pi_\theta(a|s)\over \beta(a|s)} \hat{A}_{\theta_{old}}(s,a))\\ &=E_{s\sim d^{\pi_{\theta_{old}}}, a \sim \beta}\Big[{\pi_\theta(a|s)\over \beta(a|s)} \hat{A}_{\theta_{old}}(s,a)\Big]\\ &=E_{s\sim d^{\pi_{\theta_{old}}}, a \sim \pi_{\theta_{old}}}\Big[{\pi_\theta(a|s)\over \pi_{\theta_{old}}(a|s)} \hat{A}_{\theta_{old}}(s,a)\Big]\\ E_{s\sim d^{\pi_{\theta_{old}}}}&\Big[D_{KL}\Big(\pi_{\theta_{old}}(\cdot | s)||\pi_\theta(\cdot|s)\Big)\Big] \leq \delta \end{aligned} \] The TRPO constrained optimization is defined as \[ \begin{aligned} &\max E_{s\sim d^{\pi_{\theta_{old}}}, a \sim \pi_{\theta_{old}}}\Big[{\pi_\theta(a|s)\over \pi_{\theta_{old}}(a|s)} \hat{A}_{\theta_{old}}(s,a)\Big]\\ & s.t. E_{s\sim d^{\pi_{\theta_{old}}}}\Big[D_{KL}\Big(\pi_{\theta_{old}}(\cdot | s)||\pi_\theta(\cdot|s)\Big)\Big] \leq \delta \end{aligned} \] One of the main limitations of TRPO is that the constrained optimization problem can be computationally intensive, especially for large state and action spaces.

PPO

PPO Objective

To address the computational expense of the constrained optimization in TRPO, researchers introduced the CLIP objective in policy gradient methods. The CLIP objective simplifies the optimization process while maintaining stable policy updates. Below are the TRPO objective and its corresponding CLIP version: \[ \begin{aligned} J^{TRPO}(\theta) &= E[r(\theta)\hat{A}_{\theta_{old}}(s,a)]\\ J^{CLIP}(\theta) &= E[\min(r(\theta)\hat{A}_{\theta_{old}}(s,a), \text{clip}(r(\theta),1-\epsilon, 1+\epsilon) \hat{A}_{\theta_{old}}(s,a))] \end{aligned} \] where \[ r(\theta) = {\pi_{\theta}(a|s) \over \pi_{\theta_{old}}(a|s)} \] and \(\text{clip}(\cdot)\) restricts the ratio to the interval \([1-\epsilon, 1+\epsilon]\): if \(r(\theta) > 1+\epsilon\), the clipped value is \(1+\epsilon\); if \(r(\theta) < 1-\epsilon\), the clipped value is \(1-\epsilon\).

The CLIP objective ensures that the policy ratio \(r(\theta)\) does not deviate too far from 1 (the old policy), thereby limiting large updates to the policy. The term \(J^{CLIP}(\theta)\) takes the minimum of the “unclipped” objective and the “clipped” version, which removes the incentive to move \(r(\theta)\) outside the clipping range and makes the surrogate a pessimistic (lower) bound on the unclipped objective.
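A minimal NumPy sketch of this clipped surrogate, given per-sample probability ratios and advantage estimates (both assumed to be precomputed by the training loop), might look like:

import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """J_CLIP = E[min(r * A, clip(r, 1-eps, 1+eps) * A)], to be maximized."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.mean(np.minimum(unclipped, clipped))

# Toy usage with made-up numbers; optimizers minimize, so the loss is the negative objective
ratio = np.array([1.3, 0.7, 1.05])
advantage = np.array([0.5, -0.2, 1.0])
loss_to_minimize = -ppo_clip_objective(ratio, advantage)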

The Proximal Policy Optimization (PPO) algorithm (Paper: Proximal Policy Optimization Algorithms) extends the CLIP objective by incorporating additional terms for value function optimization and entropy regularization. The full PPO objective is defined as: \[ J^{PPO}(\theta) = E[J^{CLIP}(\theta) - c_1(V_\theta(s)-V_{target})^2 + c_2 H(s,\pi_\theta(\cdot))] \] where

  • \(-(V_\theta(s) - V_{target})^2\) is the negative mean squared error (MSE), which we aim to maximize. It minimizes the difference between the predicted value function \(V_\theta(s)\) and the target value \(V_{target}\). The coefficient \(c_1\) controls the tradeoff between policy optimization and value function fitting.
  • \(H(s,\pi_\theta(\cdot))\) represents the entropy of the policy. Maximizing entropy encourages exploration by preventing premature convergence to deterministic policies. The coefficient \(c_2\) determines the weight of this entropy term.

Here is a pseudocode of PPO-Clip Algorithm (Source: OpenAI Spinning Up - Proximal Policy Optimization)

ppo_clip_algo

PPO Usage

State, Action, and Reward in the Context of LLMs

In the context of LLMs, the components of reinforcement learning are defined as follows:

  1. State: The state corresponds to the input prompt or context provided to the language model. It represents the scenario or query that requires a response.
  2. Action: The action is the output generated by the language model, i.e., the response or continuation of text based on the given state (prompt).
  3. Reward: The reward is a scalar value that quantifies how well the generated response aligns with human preferences or task objectives. It is typically derived from a reward model trained on human feedback.
  4. Policy: A policy refers to the strategy or function that maps a given state (input prompt and context) to an action (the next token or sequence of tokens to generate). The policy governs how the LLM generates responses and is optimized to maximize a reward signal, such as alignment with human preferences or task-specific objectives.

Steps of RLHF Using PPO

The RLHF process using PPO involves three main stages:

  1. Training a Reward Model: A reward model is trained to predict human preferences based on labeled data. Human annotators rank multiple responses for each prompt, and this ranking data is used to train the reward model in a supervised manner. The reward model learns to assign higher scores to responses that align better with human preferences.

  2. Fine-Tuning the LLM with PPO: After training the reward model, PPO is used to fine-tune the LLM. The steps are as follows:

    1. Initialize Policies: Start with a pre-trained LLM as both the policy model (actor) and optionally as the critic for value estimation.

      • The actor is the language model that generates responses (actions) based on input prompts (states).

        For example: Input: “Explain quantum mechanics.” Output: “Quantum mechanics is a branch of physics that studies particles at atomic and subatomic scales.”

      • The critic is typically implemented as a value function, which predicts how good a particular response (action) is in terms of achieving long-term objectives. This model predicts a scalar value for each token or sequence, representing its expected reward or usefulness.

        For example:

        Input: “Explain quantum mechanics.” → “Quantum mechanics is…” Output: A value score indicating how well this response aligns with human preferences or task objectives.

      • Both the actor and critic can be initialized from the same pre-trained LLM weights to leverage shared knowledge from pretraining. However, their roles diverge during fine-tuning: The actor focuses on generating responses. The critic focuses on evaluating those responses.

    2. Collect Rollouts: Interact with the environment by sampling prompts from a dataset. Generate responses (actions) using the current policy. Compute rewards for these responses using the trained reward model.

    3. Compute Advantage Estimates: Use rewards from the reward model and value estimates from the critic to compute advantages: \[ \hat{A}(s, a) = R_t + \gamma V(s_{t+1}) - V(s_t), \] where \(R_t\) is the reward from the reward model.

    4. Optimize Policy with PPO Objective: Optimize the policy using PPO's clipped surrogate objective: \[ J^{CLIP}(\theta) = \mathbb{E}\left[\min\left(r(\theta)\hat{A}(s, a), \text{clip}(r(\theta), 1-\epsilon, 1+\epsilon)\hat{A}(s, a)\right)\right], \] where \(r(\theta) = {\pi_\theta(a|s) \over \pi_{\theta_{old}}(a|s)}\) is the probability ratio between the new and old policies.

    5. Update Value Function: Simultaneously update the value function by minimizing mean squared error between predicted values and rewards: \[ \mathcal{L}_{\text{value}} = \mathbb{E}\left[(V_\theta(s) - R_t)^2\right]. \]

    6. Repeat: Iterate over multiple epochs until convergence, ensuring stable updates by clipping policy changes.

  3. Evaluation: Evaluate the fine-tuned LLM on unseen prompts to ensure it generates outputs aligned with human preferences. Optionally, collect additional human feedback to further refine both the reward model and policy.

The following diagram summarizes the high-level RLHF process with PPO, from preference data creation, to training a reward model, to using the reward model in an RL loop to fine-tune the LLM.

PPO_RLHF_flowchart

The following workflow chart illustrates the more detailed training process of RLHF with PPO. (Source: Secrets of RLHF in Large Language Models Part I: PPO)

RLHF_training_realworld

RLHF Training Tricks

There are practical challenges that arise during RLHF training. These challenges stem from the inherent complexities of RL, especially when applied to aligning LLMs with human preferences. Therefore, tricks are essential for addressing the practical limitations of RLHF, ensuring the training process remains efficient, stable, and aligned with human preferences while minimizing the impact of inherent challenges in RL systems. (Source: Secrets of RLHF in Large Language Models Part I: PPO)

RLHF_training_tricks

DPO

Bradley-Terry and Plackett-Luce Reward Model

The Bradley-Terry (BT) model is a probabilistic model used to compare pairwise preferences. It assumes that each item (e.g., a response or completion) has an intrinsic quality score, and the probability of one item being preferred over another depends on the relative difference in their scores.

Mathematically, the probability of option \(y_1\) being preferred over option \(y_2\) is given by: \[ P(y_1 \succ y_2|x) = {\exp(r(x,y_1)) \over \exp(r(x,y_1)) + \exp(r(x,y_2))} \] The loss of this reward model is \[ L_R(r_{\phi},D) = -E_{(x,y_w,y_l)\sim D} \Big[\log \sigma\Big(r_{\phi}(x,y_w) - r_{\phi}(x,y_l)\Big)\Big] \] However, the BT model has some limitations:

  • It assumes transitivity in preferences (if \(A>B\) and \(B>C\), then \(A >C\)), which may not always hold in real-world data.
  • It only handles pairwise comparisons and does not naturally extend to rankings involving more than two items.
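Before moving on to rankings, here is a minimal sketch of the pairwise Bradley-Terry loss above, assuming a reward model has already scored the chosen and rejected responses for a batch of preference pairs (the scores below are made up):

import numpy as np

def bradley_terry_loss(r_chosen, r_rejected):
    # -log sigmoid(r(x, y_w) - r(x, y_l)), averaged over the batch;
    # logaddexp(0, -m) computes log(1 + exp(-m)) in a numerically stable way
    margin = np.asarray(r_chosen, dtype=float) - np.asarray(r_rejected, dtype=float)
    return np.mean(np.logaddexp(0.0, -margin))

# Toy usage: reward-model scores for three preference pairs
loss = bradley_terry_loss([1.2, 0.3, 2.0], [0.4, 0.5, 1.1])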

The Plackett-Luce (PL) model generalizes the Bradley-Terry model to handle rankings of multiple items, not just pairwise comparisons. It models the probability of a ranking as a sequence of choices. The first-ranked item is chosen based on its relative worth compared to all other items. The second-ranked item is chosen from the remaining items, and so on.

Mathematically, for a ranking \(i_1\succ i_2 \succ ... \succ i_J\), the probability is given by: \[ P(i_1\succ i_2 \succ ... \succ i_J ) = \prod^J_{j=1}{\alpha_{i_j}\over \sum^J_{k=j} \alpha_{i_k}} \] where \(\alpha_{i_j}\) is the worth or quality score of item \(i_j\). The denominator normalizes over all remaining items at each step.

The PL model has several advantages over the BT model:

  • Unlike BT, which only works with pairwise comparisons, PL can handle rankings involving multiple items.
  • PL can accommodate partial rankings (e.g., ranking only the top n items), making it more versatile in scenarios where full rankings are unavailable.
  • When human feedback involves ranking multiple responses rather than just picking one as better, PL captures this richer information better than BT.

DPO Objective

The main reason RLHF with PPO is hard is that it requires a lot of machinery beyond the policy model itself: a separate reward model, a critic, and an RL training loop, even though the policy model is ultimately all we need. DPO (Direct Preference Optimization) is a novel alternative to traditional RLHF for fine-tuning LLMs. It simplifies the RLHF process by eliminating the need for complex reward models and RL algorithms. Instead, DPO reframes the problem of aligning LLMs with human preferences as a classification problem using human-labeled preference data.

The idea of DPO and the difference between DPO and PPO are shown in the figure below. (Source: Direct Preference Optimization: Your Language Model is Secretly a Reward Model)

DPO_idea

Recall the Bradley-Terry reward model: \[ \begin{aligned} P(y_1 \succ y_2|x) &= {\exp(r(x,y_1)) \over \exp(r(x,y_1)) + \exp(r(x,y_2))}\\ L_R(r_{\phi},D) &= -E_{(x,y_w,y_l)\sim D} \Big[\log \sigma\Big(r_{\phi}(x,y_w) - r_{\phi}(x,y_l)\Big)\Big] \end{aligned} \] The RLHF objective is defined as follows. Keep in mind that whether DPO or PPO is used, the objective is always this one. \[ \max_{\pi_\theta} E_{x \sim D, y \sim \pi_\theta(y|x)}\Big[r_{\phi}(x,y) - \beta D_{KL}\big[\pi_\theta(y|x) || \pi_{ref}(y|x)\big]\Big] \] where \(\beta D_{KL}\big[\pi_\theta(y|x) || \pi_{ref}(y|x)\big]\) is a regularization term. When applying RL to NLP, regularization is needed; otherwise, RL would explore every possible behavior and discover reward-hacking shortcuts that deviate from good language modeling.

The optimal policy \(\pi_r(y|x)\) that maximizes the objective is \[ \begin{aligned} \pi_r(y|x) &= {1\over Z(x)}\pi_{ref}(y|x)\exp\Big({1\over \beta}r(x,y)\Big)\\ Z(x) &= \sum_y \pi_{ref}(y|x) \exp\Big({1\over \beta}r(x,y)\Big) \end{aligned} \] where \(\pi_r(y|x)\) is a probability distribution.

Based on this optimal policy, we can derive the reward function for the optimal policy \[ r(x,y)=\beta \log{\pi_r(y|x)\over \pi_{ref}(y|x)} + \beta \log Z(x) \] If we put this reward function in the Bradley-Terry model, we obtain the probability of \(y_1\) being preferred to \(y_2\). \[ \begin{aligned} P^*(y_1 \succ y_2|x) &= {\exp(r^*(x,y_1)) \over \exp(r^*(x,y_1)) + \exp(r^*(x,y_2))}\\ &={\exp(\beta \log{\pi^*_r(y_1|x)\over \pi_{ref}(y_1|x)} + \beta \log Z(x)) \over \exp(\beta \log{\pi^*_r(y_1|x)\over \pi_{ref}(y_1|x)} + \beta \log Z(x)) + \exp(\beta \log{\pi^*_r(y_2|x)\over \pi_{ref}(y_2|x)} + \beta \log Z(x))}\\ &={1\over 1+\exp\Big(\beta \log {\pi^*(y_2|x) \over \pi_{ref}(y_2|x)} - \beta\log {\pi^*(y_1|x)\over \pi_{ref}(y_1|x)}\Big)}\\ &=\sigma\Big(\beta \log {\pi^*(y_1|x)\over \pi_{ref}(y_1|x)} - \beta \log {\pi^*(y_2|x)\over \pi_{ref}(y_2|x)}\Big)\\ \end{aligned} \] With this probability, we have DPO's objective function below. We can optimize this loss function by Maximum Likelihood Estimation: \[ L_{DPO}(\pi_\theta; \pi_{ref}) = -E_{(x,y_w,y_l) \sim D} \Big[\log \sigma \Big(\beta \log {\pi_{\theta}(y_w|x)\over \pi_{ref}(y_w|x)} - \beta \log {\pi_{\theta}(y_l|x)\over \pi_{ref}(y_l|x)}\Big)\Big] \] Key Ideas of DPO Objective:

  • DPO's objective aims to increase the likelihood of generating preferred responses over less preferred ones. By focusing directly on preference data, DPO eliminates the need to first fit a reward model that predicts scalar rewards based on human preferences. This simplifies the training pipeline and reduces computational overhead.
  • Value functions exist to help reduce the variance of the reward model. In DPO, the value function is not involved because DPO does not rely on a traditional RL framework, such as Actor-Critic methods. Instead, DPO directly optimizes the policy using human preference data as a classification task, skipping the intermediate steps of training a reward model or estimating value functions.
  • DPO was originally designed to work with pairwise preference data; however, recent advancements and adaptations have extended its applicability to ranking preference data as well (e.g., RankDPO).

The DPO paper provides the detailed steps for deriving the gradient of the DPO objective: (Source: Direct Preference Optimization: Your Language Model is Secretly a Reward Model)

dpo_gradients_derivation_paper

A simplified version of the DPO gradient, for better intuition, is written as follows. Intuitively, when the implicit reward of the dispreferred response \(\hat{r}_{\theta}(x, y_l)\) exceeds that of the preferred response \(\hat{r}_{\theta}(x, y_w)\), i.e., the model ranks the pair incorrectly, the example is weighted more heavily and the gradient takes a larger step during optimization. Conversely, when the model already ranks the pair correctly, the objective makes only a small adjustment.

DPO_gradient_simple_version

DPO Usage

Here's how DPO is applied step by step:

1. Initial Setup and Supervised Fine-Tuning (SFT): Begin by fine-tuning a pre-trained LLM using supervised learning on a dataset that is representative of the tasks the model will perform. This step ensures the model has a strong foundation in the relevant domain, preparing it for preference-based optimization.

2. Collect Preference Data: Gather human feedback in the form of pairwise preferences or rankings. Annotators evaluate responses generated by the model and indicate which ones they prefer. Construct a dataset of prompts and corresponding preferred and less-preferred responses.

3. Iterative Rounds of DPO

  • Sampling and Annotation: In each round, sample a set of responses from the model for given prompts. Collect new preference annotations based on these samples, allowing for dynamic updates to the preference dataset. (Public preference datasets work as well; both off-policy and on-policy data can be used.)

  • Preference Optimization: Use DPO to adjust the model's outputs based on collected preference data:

  • Model Update: Fine-tune the model using this loss function to increase the likelihood of generating preferred responses.

4. Evaluation and Iteration

  • Performance Assessment: After each round, evaluate the model’s performance on new prompts to ensure it aligns with human preferences. Use feedback from these evaluations to inform subsequent rounds of sampling and optimization.

  • Iterative Refinement: Continue this loop process over multiple rounds, iteratively refining the model's alignment with human preferences through continuous sampling and preference optimization.

DPO Performance

(Source: Direct Preference Optimization: Your Language Model is Secretly a Reward Model)

DPO_performance_paper

DPO Objective Pseudocode

\[ L_{DPO}(\pi_\theta; \pi_{ref}) = -E_{(x,y_w,y_l) \sim D} \Big[\log \sigma \Big(\beta \log {\pi_{\theta}(y_w|x)\over \pi_{ref}(y_w|x)} - \beta \log {\pi_{\theta}(y_l|x)\over \pi_{ref}(y_l|x)}\Big)\Big)\Big] \]

import torch.nn.functional as F

def dpo_loss(pi_logps, ref_logps, yw_idxs, yl_idxs, beta):
    """
    pi_logps: policy logprobs, shape (B,)
    ref_logps: reference model logprobs, shape (B,)
    yw_idxs: preferred completion indices in [0, B-1], shape (T,)
    yl_idxs: dispreferred completion indices in [0, B-1], shape (T,)
    beta: temperature controlling strength of KL penalty

    Each pair of (yw_idxs[i], yl_idxs[i]) represents the
    indices of a single preference pair.
    """

    pi_yw_logps, pi_yl_logps = pi_logps[yw_idxs], pi_logps[yl_idxs]
    ref_yw_logps, ref_yl_logps = ref_logps[yw_idxs], ref_logps[yl_idxs]

    pi_logratios = pi_yw_logps - pi_yl_logps
    ref_logratios = ref_yw_logps - ref_yl_logps

    losses = -F.logsigmoid(beta * (pi_logratios - ref_logratios))
    rewards = beta * (pi_logps - ref_logps).detach()

    return losses, rewards

DPO Variants

The key area of research involves developing variants of DPO and conducting theoretical analyses to understand its limitations and potential improvements. This includes exploring different loss functions or optimization strategies that can be applied within the DPO framework.

Main Difficulties in RLHF

Data Collection

In practice, collecting human feedback in the form of a preference dataset is a slow, manual process that must be repeated whenever alignment criteria change. Annotating preference data also becomes harder as models grow more capable, because distinguishing between outputs becomes more nuanced and subjective.

Reward Hacking

Reward hacking is a common problem in reinforcement learning, where the agent learns to exploit the system by maximizing its reward through actions that deviate from the intended goal. In the context of RLHF, reward hacking occurs when training settles in an unintended region of the loss landscape. In this scenario, the model generates responses that achieve high reward scores, but these responses may fail to be meaningful or useful to the user.

In PPO, reward hacking occurs when the model exploits flaws or ambiguities in the reward model to achieve high rewards without genuinely aligning with human intentions. This is because PPO relies on a learned reward model to guide policy updates, and any inaccuracies or biases in this model can lead to unintended behaviors being rewarded. PPO is particularly vulnerable to reward hacking if the reward model is not robustly designed or if it fails to capture the true objectives of human feedback. The iterative nature of PPO, which involves continuous policy updates based on reward signals, can exacerbate this issue if not carefully managed.

DPO avoids explicit reward modeling by directly optimizing policy based on preference data. However, it can still encounter issues similar to reward hacking if the preference data is biased or if the optimization process leads to overfitting specific patterns in the data that do not generalize well. While DPO does not suffer from reward hacking in the traditional sense (since it lacks a separate reward model), it can still find biased solutions that exploit out-of-distribution responses or deviate from intended behavior due to distribution shifts between training and deployment contexts.

  • The article "Reward Hacking in Reinforcement Learning" by Lilian Weng discusses how reward hacking occurs when a RL agent exploits flaws or ambiguities in the reward function to achieve high rewards without genuinely learning the intended task. It highlights that in RLHF for language models, reward hacking is a critical challenge, as models might learn to exploit unit tests or mimic biases to achieve high rewards, which can hinder real-world deployment.
  • The research "Scaling Laws for Reward Model Overoptimization" explores how optimizing against reward models trained to predict human preferences can lead to overoptimization, hindering the actual objective.
    1. Impact of Policy Model Size: Holding the RM size constant, experiments showed that larger policy models exhibited similar overoptimization trends as smaller models, despite achieving higher initial gold scores. This implies that their higher performance on gold rewards does not lead to excessive optimization pressure on the RM.
    2. Relationship with RM Data Size: Data size had a notable effect on RM performance and overoptimization. Models trained on fewer than ~2,000 comparison labels showed near-chance performance, with limited improvement in gold scores. Beyond this threshold, all RMs, regardless of size, benefited from increased data, with larger RMs showing greater improvements in gold rewards compared to smaller ones.
    3. Scaling Laws for RM Parameters and Data Size: Overoptimization patterns scaled smoothly with both RM parameter count and data size. Larger RMs demonstrated better alignment with gold rewards and less susceptibility to overoptimization when trained on sufficient data, indicating improved robustness.
    4. Proxy vs. Gold Reward Trends: For small data sizes, proxy reward scores deviated significantly from gold reward scores, highlighting overoptimization risks. As data size increased, the gap between proxy and gold rewards narrowed, reducing overoptimization effects.

Note that the KL divergence term in the RLHF objective is intended to prevent the policy from deviating too much from a reference model, thereby maintaining stability during training. However, it does not fully prevent reward hacking. Reward hacking occurs when an agent exploits flaws or ambiguities in the reward model to achieve high rewards without genuinely aligning with human intentions. The KL divergence penalty does not correct these flaws in the reward model itself, meaning that if the reward model is misaligned, the agent can still find ways to exploit it. KL does not directly address whether the actions align with the true objectives or desired outcomes.

QLoRA is never as simple as a single line of code. Let's start from the scaling law...

Neural Scaling Law

In the context of LLMs, a scaling law refers to an empirical relationship that describes how the performance of a model changes as key resources—such as model size (number of parameters), dataset size, and computational power—are scaled up or down. These laws provide insights into how increasing these factors impacts model accuracy, efficiency, and generalization capabilities.

scalinglaw_3charts

Compute, dataset size, and model size are not independent of each other. Data size and model size together determine compute. The paper "Algorithmic progress in language models" came up with a rule \(C=6ND\) where \(C\) is compute, \(N\) is model size, and \(D\) is data size.
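As a back-of-the-envelope illustration, this rule can be applied directly (the model and data sizes below are hypothetical, chosen only for the arithmetic):

# C = 6 * N * D: approximate training FLOPs from model size and token count
N = 70e9        # parameters (hypothetical)
D = 1.4e12      # training tokens (hypothetical)
C = 6 * N * D
print(f"{C:.3e} FLOPs")  # ~5.9e23 FLOPs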

According to the paper "Scaling Laws for Neural Language Models" and "Training Compute-Optimal Large Language Models", below are the key takeaways of scaling laws for LLMs:

  • Performance Improvement: Model performance often improves predictably with increases in size, data, and compute, typically following a power-law relationship.
  • Diminishing Returns: Beyond certain thresholds, the benefits of scaling diminish, meaning further increases in resources yield smaller performance gains.
  • Trade-offs: Effective scaling requires balancing resources like model parameters and training data. For example, the "Chinchilla scaling law" highlights that increasing data size can sometimes yield better results than merely increasing model size in compute-constrained settings.

These observations are critical for LLM research:

  • Guidance for Optimization: Scaling laws help researchers allocate resources efficiently and predict the outcomes of scaling efforts, guiding both model design and training strategies. For example, within fixed computational constraints and limited training duration, scaling laws provide a principled approach to determining the optimal model size that minimizes test loss.

  • Predicting model performance: As demonstrated in GPT-4 Technical Report, by fitting the scaling law to the loss of smaller models, the loss of a bigger model can be predicted accurately.

    gpt4_report_scalinglaw

The scaling law overlooks a critical practical consideration, which can lead to misconceptions. While it suggests that larger models yield better performance, in reality, the primary compute bottleneck lies in inference rather than training. Training compute is often less constrained because training time can be extended, but deployment costs are significantly higher. From a practical standpoint, a more efficient approach is to train a smaller model for an extended period, as this substantially reduces inference compute requirements.

Quantization

Background

As the scaling laws suggest, when training an LLM, reducing the number of parameters is probably not the best way to save computational resources. Luckily, neural nets are robust to low precision, which means lowering the numerical precision generally does not hurt model performance much.

In the GTC March 2024 keynote, NVIDIA CEO Jensen Huang stated that NVIDIA has achieved a 1000x increase in compute power over the past 8 years, faster than Moore's law. Notice that the chart reports TFLOPS at FP8 precision for 2022 and TFLOPS at FP4 precision for 2024. This is a bit of a trick, because it is easier to reach higher TFLOPS at lower precision, and it reflects a broader hardware trend toward achieving higher throughput at low precision.

nvidia_moorelaw

Data Structure - FP32

The IEEE 754 single-precision floating-point format (FP32) represents a 32-bit number in binary form. It is used for approximating real numbers. The FP32 format has three components:

  1. Sign bit (1 bit): Indicates whether the number is positive (0) or negative (1).
  2. Exponent (8 bits): Encodes the exponent, biased by 127 to allow both positive and negative exponents.
  3. Mantissa or Fraction (23 bits): Represents the significant digits of the number.

The formula for FP32 is as follows: \[ \text{Value} = (-1)^{\text{Sign}} \times (1.\text{Mantissa}) \times 2^{\text{Exponent}-127} \] Following this formula, we can calculate FP32 number. Below is an example:

fp32_example
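For a concrete check, here is a small sketch that unpacks the bit fields of an FP32 value and reconstructs it with the formula above (valid for normalized numbers only; the value -6.25 is an arbitrary example):

import struct

def decode_fp32(x):
    """Split an FP32 value into sign, exponent, and mantissa bits and
    reconstruct it via (-1)^sign * (1.mantissa) * 2^(exponent - 127)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    value = (-1) ** sign * (1 + mantissa / 2**23) * 2 ** (exponent - 127)
    return sign, exponent, mantissa, value

print(decode_fp32(-6.25))  # (1, 129, 4718592, -6.25)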

FP32 provides a wider range and higher precision, making it suitable for tasks requiring numerical accuracy, such as training large-scale deep learning models. FP16, with its lower precision, is optimized for speed and memory efficiency. It is particularly effective for inference tasks or mixed-precision training when paired with FP32 for critical calculations.

However, the overflow problem of FP16 arises due to its limited range of representable values. FP16 has a maximum representable value of 65,504 (\(2^{15} \times (2 - \epsilon)\)), which is much smaller compared to FP32's maximum value of approximately \(3.4 \times 10^{38}\). When computations produce results exceeding this range, an overflow occurs, and the value is replaced by infinity (\(\pm \infty\)). Overflow in FP16 can occur during operations like matrix multiplications or summations in deep learning if the intermediate values exceed the maximum representable range. For example, scaling large tensors or performing high-magnitude computations without normalization can easily result in overflow when using FP16. Overflow leads to loss of numerical accuracy and can destabilize training processes in machine learning. It also affects applications like image processing or scientific simulations where precision and stability are critical.
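A minimal demonstration of this overflow behavior using NumPy half precision (the exact warning behavior may vary by NumPy version):

import numpy as np

print(np.finfo(np.float16).max)     # 65500.0, the largest representable FP16 value
x = np.float16(60000.0)
print(x * np.float16(2.0))          # inf -- the product exceeds the FP16 range
print(np.float16(70000.0))          # inf -- cannot even be represented directly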

There are some strategies to mitigate this overflow problem:

  • Use mixed-precision training. FP16 is used for most computations but critical operations (e.g., gradient accumulation) are performed in FP32 to prevent overflow.
  • Normalize inputs and intermediate values to keep them within the representable range of FP16.
  • Use alternative formats like BF16, which have a larger dynamic range while maintaining reduced precision.

Google Brain's BF16 uses the same number of exponent bits as FP32 (8 bits), giving it a much larger dynamic range compared to FP16. This means BF16 can represent very large and very small numbers similar to FP32, avoiding the underflows and overflows that are common in FP16. Converting from FP32 to BF16 is straightforward because both formats share the same exponent size; the conversion simply truncates the mantissa from 23 bits to 7 bits. BF16 uses only 16 bits per value, reducing memory usage by half compared to FP32. This allows larger batch sizes and models to fit into limited GPU or TPU memory without sacrificing as much numerical range as FP16 does.

Recently, people have started discussing NVIDIA’s FP8 formats (E4M3 and E5M2) as alternatives to BF16 because of their potential to significantly reduce computational and memory costs while maintaining competitive performance in large-scale machine learning tasks. E4M3 offers higher precision, making it suitable for inference and forward-pass computations where precision is critical. E5M2 provides a wider dynamic range, making it ideal for backward-pass computations during training where large gradients can occur. This flexibility allows FP8 to adapt to different stages of training more effectively than BF16.

NVIDIA’s H100 GPUs are specifically designed to support FP8 with optimized Tensor Cores, achieving up to 9x faster training and 30x faster inference compared to previous-generation GPUs using FP16 or BF16. The Hopper architecture dynamically manages precision transitions (e.g., between FP8 and higher-precision formats like FP32), ensuring stability without manual intervention. "Balancing Speed and Stability: The Trade-offs of FP8 vs. BF16 Training in LLMs" shows that FP8 can deliver similar convergence behavior and accuracy as BF16 in many LLM tasks, with minimal degradation in performance. For inference, FP8 quantization (e.g., E4M3 for KV cache) has been shown to minimally impact accuracy while significantly improving memory efficiency.

However, FP8 comes with challenges such as occasional instability during training (e.g., loss spikes) and sensitivity in certain tasks like code generation or mathematical reasoning. As a result, training LLMs with FP8 precision remains an active area of research and exploration.

summary_5_precision
| Feature | IEEE 754 FP32 | IEEE 754 FP16 | Google BF16 | NVIDIA FP8 E4M3 | NVIDIA FP8 E5M2 |
|---|---|---|---|---|---|
| Bit Width | 32 bits | 16 bits | 16 bits | 8 bits | 8 bits |
| Sign Bit | 1 bit | 1 bit | 1 bit | 1 bit | 1 bit |
| Exponent Bits | 8 bits (bias = 127) | 5 bits (bias = 15) | 8 bits (bias = 127) | 4 bits (bias = 7) | 5 bits (bias = 15) |
| Mantissa Bits | 23 bits | 10 bits | 7 bits | 3 bits | 2 bits |
| Dynamic Range | \(\pm(2^{-126} \text{ to } 2^{127})\) | \(\pm(2^{-14} \text{ to } 2^{15})\) | \(\pm(2^{-126} \text{ to } 2^{127})\) | \(\pm(2^{-6} \text{ to } 2^{7})\) | \(\pm(2^{-14} \text{ to } 2^{15})\) |
| Precision | ~7 decimal digits | ~3.3 decimal digits | ~2.3 decimal digits | Lower precision | Lower precision |
| Memory Usage | High | Medium | Medium | Low | Low |
| Performance | Slower | Faster than FP32 | Faster than FP32 | Much faster than FP16/BF16 | Much faster than FP16/BF16 |
| Applications | Training requiring high precision | Inference or mixed-precision training | Mixed-precision training and inference | Optimized for inference | Optimized for training and inference |

Current LLM Training Method in FP8

A training approach has been developed to leverage FP8's efficiency for specific operations while maintaining numerical stability and precision with BF16 for critical components of the model.

During the training process, FP8 is utilized exclusively for computations within the MLP layers, while BF16 is employed for other components of the Transformer architecture, such as Attention, Activation, and Layer Normalization. Both weights and gradients are maintained in BF16 precision.

  • In the forward pass, weights in BF16 are converted to FP8 (E4M3) for matrix multiplications within the MLP layers. Once the computation is completed, the results are immediately converted back to BF16.

  • In the backward pass, gradients in BF16 are temporarily converted to FP8 (E5M2) when passing through the MLP layers. After the computations are performed, the results are promptly converted back to BF16.

fp8_transformer_current

Even when FP8 is used, RAM savings may not be as significant during training because high precision gradients and weights must be maintained in memory to ensure model stability and convergence. The primary benefit of FP8 lies in its ability to reduce memory usage during inference, where weights can be stored in FP8 format, significantly decreasing the memory footprint compared to higher precision formats like FP16 or BF16. Despite this, FP8 is still utilized during training because it allows for faster computations due to its lower precision. This results in accelerated training processes and improved efficiency, especially on hardware optimized for FP8 operations, such as NVIDIA’s H100 GPUs.

Quantization Process

The process of quantization in LLMs refers to a model compression technique that maps high-precision values (e.g., FP32) to lower-precision representations (e.g., INT8 or FP8).

Here is an example of a simple step-by-step quantization from FP16 to INT4:

  1. Range Calculation: Determine the range of FP16 values for the weights or activations. This is typically defined by the minimum and maximum values (\([min, max]\)) in the data.

  2. Scale Factor and Zero-Point Computation: Compute a scaling factor (S) that maps the FP16 range to the INT4 range (\([-8, 7]\) for signed INT4 or \([0, 15]\) for unsigned INT4). Optionally, calculate a zero-point (Z) to handle asymmetric quantization, where zero in FP16 does not align with zero in INT4.

    The formula for quantization is: \[ x_q = \text{round}\left(\frac{x}{S} + Z\right) \] where \(x_q\) is the quantized INT4 value, \(x\) is the original FP16 value, \(S\) is the scaling factor, and \(Z\) is the zero-point.

  3. Quantization: Map each FP16 value to its corresponding INT4 representation using the computed scale factor and zero-point. This step reduces precision but compresses the data significantly.
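A minimal NumPy sketch of these three steps for asymmetric quantization to unsigned INT4 (the tensor values here are made up for illustration):

import numpy as np

x = np.array([-0.8, -0.1, 0.0, 0.3, 1.2], dtype=np.float16)  # hypothetical FP16 values

# 1. Range calculation
x_min, x_max = float(x.min()), float(x.max())

# 2. Scale factor and zero-point for unsigned INT4 ([0, 15])
qmin, qmax = 0, 15
S = (x_max - x_min) / (qmax - qmin)
Z = round(qmin - x_min / S)

# 3. Quantize, then dequantize to inspect the rounding error
x_q = np.clip(np.round(x / S + Z), qmin, qmax).astype(np.int8)
x_dq = (x_q.astype(np.float32) - Z) * S
print(x_q)   # e.g. [ 0  5  6  8 15]
print(x_dq)  # approximate reconstruction of the original values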

There are different types of quantization:

  • Asymmetric Quantization vs. Symmetric Quantization
  • Uniform Quantization vs Non-uniform Quantization

Quant in General Matrix Multiply (GEMM)

Quantized matrices are stored in memory in their compressed form. During matrix multiplication, these matrices are dequantized back to higher precision (e.g., FP16 or FP32) to perform computations. This process balances memory efficiency with computational precision.

Quantization can be applied at different levels of granularity, which determines how scaling factors are assigned and used. The "SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models" paper introduced several quantization granularity techniques, including per-tensor quantization, per-token quantization, and per-channel quantization:

  1. Per-Tensor Quantization: A single scaling factor is applied to the entire tensor (e.g., a weight matrix or activation matrix). It is highly memory-efficient since only one scaling factor needs to be stored. But it is not recommended in practice because outlier values can dominate the scaling factor, leading to significant quantization errors for the rest of the tensor.
  2. Per-Channel Quantization: Each channel (e.g., each column of a weight matrix or each feature map in activations) has its own scaling factor. Commonly used for weight matrices in neural networks. It mitigates the impact of outliers by isolating them within individual channels, improving quantization accuracy compared to per-tensor methods. But it can introduce computational overhead during dequantization due to the need for multiple scaling factors.
  3. Per-Token Quantization: Each token's activations are assigned a unique scaling factor. Typically used for activations in transformer models. It captures token-specific variations in activations, leading to better precision for tasks with dynamic token distributions. Per-token quantization can be computationally expensive and slower because it requires more scaling factors and additional computations.
  4. Group-Wise Quantization (GWQ): this method groups multiple channels or tokens together and applies a shared scaling factor across the group. It reduces computational overhead compared to per-channel or per-token quantization while maintaining finer granularity than per-tensor methods. It's often used for both weights and activations to strike a balance between accuracy and efficiency.
quant_granularity
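To make the granularity trade-off concrete, here is a small sketch contrasting per-tensor and per-channel scaling factors for a weight matrix (symmetric INT8-style scales; the weight values, including the outlier column, are hypothetical):

import numpy as np

W = np.array([[ 0.02, -0.5,  4.0],
              [ 0.01,  0.4, -3.0]], dtype=np.float32)  # last column is an outlier

# Per-tensor: a single scale for the whole matrix -- the outlier column dominates
scale_tensor = np.abs(W).max() / 127.0

# Per-channel: one scale per column, so small-valued channels keep precision
scale_channel = np.abs(W).max(axis=0) / 127.0

print(scale_tensor)   # ~0.0315
print(scale_channel)  # ~[0.00016 0.0039 0.0315]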

QLoRA

Fine-Tuning Cost

Comparing cost of full fine tuning, LoRA fine tuning, and QLoRA fine tuning:

finetune_cost
| | Full Finetuning | LoRA | QLoRA |
|---|---|---|---|
| Weight | 16 bits | 16 bits | 4 bits |
| Weight Gradient | 16 bits | ~0.4 bits | ~0.4 bits |
| Optimizer State | 64 bits | ~0.8 bits | ~0.8 bits |
| Adapter Weights | / | ~0.4 bits | ~0.4 bits |
| Total | 96 bits per parameter | 17.6 bits per parameter | 5.2 bits per parameter |

QLoRA's Contributions

Paper: QLoRA: Efficient Finetuning of Quantized LLMs

4-bit NormalFloat Quantization

4-bit NormalFloat Quantization adopts the idea of quantile quantization, an information-theoretic method that maps values based on quantiles of the weight distribution. It is a form of data compression in which data is quantized (reduced to a smaller set of discrete values) so as to minimize the information entropy of the result, essentially a lossy minimum-entropy encoding: the most efficient representation possible while still introducing some loss of information.

To compute the quantile function for 4-bit NormalFloat (NF4) quantization, the process involves mapping cumulative probabilities to quantization levels optimized for normally distributed data. The quantile function is the inverse of the cumulative distribution function (CDF). For example, as shown in the description, if the probability of \(x < 1.2\) is 0.9, then 1.2 is the corresponding quantile for a cumulative probability of 0.9.

quantile_function

With this quantile function, the probability range from 0 to 1 is divided into 16 equal-sized buckets, as 4 bits can represent \(2^4 = 16\) distinct values. The steps are as follows:

  1. Divide the Probability Range: The range of cumulative probabilities \([0, 1]\) is divided into 16 equal intervals or "buckets." These intervals represent equal portions of the probability mass.
  2. Apply the Quantile Function: For each bucket's cumulative probability value (e.g., \(p_i = \frac{i}{16}\), where \(i \in [1, 15]\)), the corresponding quantile value is computed using the inverse CDF of a standard normal distribution (\(\Phi^{-1}(p_i)\)).
  3. Normalize Quantiles: The resulting quantiles are normalized to fit within a predefined range, typically \([-1, 1]\). This ensures that all quantization levels are symmetrically distributed around zero and fall within a compact range suitable for efficient representation.
  4. Assign NF4 Values: The normalized quantiles become the 16 discrete values used by NF4 to represent weights or activations in a compressed format. These values are spaced closer together near zero (where most of the normal distribution's probability mass lies) and farther apart at the extremes, optimizing precision where it matters most.
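A simplified sketch of this recipe using SciPy's inverse normal CDF; note that the actual NF4 implementation in bitsandbytes uses a slightly more involved construction (with an offset so that zero is represented exactly), so the levels below are only approximate:

import numpy as np
from scipy.stats import norm

# 1-2. Evenly spaced cumulative probabilities (avoiding 0 and 1, where the quantile diverges)
probs = np.linspace(0, 1, 16 + 2)[1:-1]
quantiles = norm.ppf(probs)

# 3. Normalize to [-1, 1]
levels = quantiles / np.abs(quantiles).max()
print(np.round(levels, 4))  # 16 levels, denser near zero, sparser at the extremes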

Double Quantization

Double Quantization (DQ) as introduced in paper "QLoRA: Efficient Finetuning of Quantized LLMs" is a memory optimization technique that quantizes the quantization constants themselves to further reduce the memory footprint of LLMs. It involves two quantization steps:

  1. The first quantization involves quantizing the original weights of the model into 4-bit NormalFloat (NF4) format. Weights are divided into small blocks (e.g., 64 elements per block), and each block is scaled by a quantization constant (also known as a scaling factor). This constant ensures that the range of values in each block fits within the representable range of NF4. The quantized weights and their corresponding quantization constants are stored. However, these constants (usually in FP32) can add significant memory overhead.

    To calculate the memory overhead: for a block size of 64, storing a 32 bit quantization constant for each block adds \(32/64=0.5\) bits per parameter on average.

  2. The second quantization aims to reduce the memory overhead caused by storing quantization constants. Those quantization constants \(c^{FP32}_2\) are further quantized into 8-bit floating-point values (FP8) with a larger block size (e.g., 256 elements per block). This is a symmetric quantization where the mean of the first-level constants \(c^{FP32}_2\) is subtracted to center their distribution around zero. This reduces their memory footprint while maintaining sufficient precision for scaling operations. Additionally, another set of quantization constants \(c^{FP32}_1\) is introduced to scale these second-level quantized values.

    To calculate the memory savings: after double quantization, the memory footprint per parameter for scaling factors is reduced from \(32/64=0.5\) bits to \(8/64 + 32/(64\times 256)=0.127\) bits per parameter. This results in saving \(0.5-0.127=0.373\) bits per parameter.

double_quant
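The overhead arithmetic above can be checked directly (block sizes follow the QLoRA defaults of 64 weights for the first level and 256 constants for the second level):

# First level: one FP32 constant per 64-weight block
first_level = 32 / 64                      # 0.5 bits per parameter

# Double quantization: FP8 constants per 64-weight block,
# plus one FP32 constant per 256 first-level constants
double_quant = 8 / 64 + 32 / (64 * 256)    # ~0.127 bits per parameter

print(first_level, double_quant, first_level - double_quant)  # 0.5  0.127  0.373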

The authors of the paper "QLoRA: Efficient Finetuning of Quantized LLMs" compared LLaMA models with different 4-bit data types. They show that the NormalFloat data type significantly improves bit-for-bit accuracy compared to regular 4-bit floats. While Double Quantization only leads to minor gains, it allows for more fine-grained control over the memory footprint, which makes it possible to fit models of certain sizes (33B/65B) into certain GPUs (24/48GB). These empirical results show that using FP8 for the second-level quantization does not degrade model performance, making it an effective trade-off between precision and memory efficiency.

nf4_compare_result

Paged Optimizers

As described in the QLoRA paper, paged optimizers are a memory management innovation that leverages NVIDIA Unified Memory to handle the memory spikes that occur during gradient checkpointing or when processing large mini-batches with long sequence lengths. NVIDIA Unified Memory allows seamless memory sharing between the GPU and CPU. When the GPU runs out of memory during training, optimizer states (e.g., gradients, momentum, or scaling factors) are paged out (evicted) to CPU RAM. These states are paged back into GPU memory only when needed for computations like gradient updates.

Forward and Backward Implementation

Forward:

qlora_forward

Backward:

qlora_backward

QLoRA Usage

QLoRA utilizes bitsandbytes for quantization and is seamlessly integrated with Hugging Face's PEFT and transformers libraries, making it user-friendly. To explore the implementation further, let's dive into the QLoRA code and examine the train() function in qlora.py.

def train():
    hfparser = transformers.HfArgumentParser((
        ModelArguments, DataArguments, TrainingArguments, GenerationArguments
    ))
    model_args, data_args, training_args, generation_args, extra_args = \
        hfparser.parse_args_into_dataclasses(return_remaining_strings=True)
    training_args.generation_config = transformers.GenerationConfig(**vars(generation_args))
    args = argparse.Namespace(
        **vars(model_args), **vars(data_args), **vars(training_args)
    )
    print(args)

    checkpoint_dir, completed_training = get_last_checkpoint(args.output_dir)
    if completed_training:
        print('Detected that training was already completed!')

    model, tokenizer = get_accelerate_model(args, checkpoint_dir)

    ......

The get_accelerate_model() function initializes your model and is a crucial component of implementing QLoRA. Notably, within the AutoModelForCausalLM.from_pretrained() method, it loads the quantization configuration through BitsAndBytesConfig. This setup ensures that the model weights are automatically quantized.

def get_accelerate_model(args, checkpoint_dir):

    if torch.cuda.is_available():
        n_gpus = torch.cuda.device_count()
    if is_ipex_available() and torch.xpu.is_available():
        n_gpus = torch.xpu.device_count()

    max_memory = f'{args.max_memory_MB}MB'
    max_memory = {i: max_memory for i in range(n_gpus)}
    device_map = "auto"

    # if we are in a distributed setting, we need to set the device map and max memory per device
    if os.environ.get('LOCAL_RANK') is not None:
        local_rank = int(os.environ.get('LOCAL_RANK', '0'))
        device_map = {'': local_rank}
        max_memory = {'': max_memory[local_rank]}

    if args.full_finetune: assert args.bits in [16, 32]

    print(f'loading base model {args.model_name_or_path}...')
    compute_dtype = (torch.float16 if args.fp16 else (torch.bfloat16 if args.bf16 else torch.float32))
    model = AutoModelForCausalLM.from_pretrained(
        args.model_name_or_path,
        cache_dir=args.cache_dir,
        load_in_4bit=args.bits == 4,
        load_in_8bit=args.bits == 8,
        device_map=device_map,
        max_memory=max_memory,
        quantization_config=BitsAndBytesConfig(
            load_in_4bit=args.bits == 4,
            load_in_8bit=args.bits == 8,
            llm_int8_threshold=6.0,
            llm_int8_has_fp16_weight=False,
            bnb_4bit_compute_dtype=compute_dtype,
            bnb_4bit_use_double_quant=args.double_quant,
            bnb_4bit_quant_type=args.quant_type,
        ),
        torch_dtype=(torch.float32 if args.fp16 else (torch.bfloat16 if args.bf16 else torch.float32)),
        trust_remote_code=args.trust_remote_code,
        use_auth_token=args.use_auth_token
    )
    ......

In addition to setting up necessary components such as the tokenizer, train() offers a LoRA option alongside full fine-tuning. This requires setting up a LoRA config and calling the get_peft_model function from the peft package.

def train():
    ......
    if not args.full_finetune:
        model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=args.gradient_checkpointing)

    if not args.full_finetune:
        if checkpoint_dir is not None:
            print("Loading adapters from checkpoint.")
            model = PeftModel.from_pretrained(model, join(checkpoint_dir, 'adapter_model'), is_trainable=True)
        else:
            print(f'adding LoRA modules...')
            modules = find_all_linear_names(args, model)
            config = LoraConfig(
                r=args.lora_r,
                lora_alpha=args.lora_alpha,
                target_modules=modules,
                lora_dropout=args.lora_dropout,
                bias="none",
                task_type="CAUSAL_LM",
            )
            model = get_peft_model(model, config)
    ......

Not every layer is quantized. QLoRA only quantizes the linear projection layers. Some layers, such as Layer Norm, are sensitive to precision, so they are kept in higher precision.

I finally got time to have some deep dives. Happy Christmas!

RAM Usage During Training

Training large-scale machine learning models, e.g., LLMs, requires significant compute resources. Here's a breakdown of the memory (RAM) usage at various stages of the classic training process, based on the pseudocode below:

model = Model()
optimizer = Adam(model.parameters())

for batch, (X, y) in enumerate(dataloader):
    # Compute prediction and loss
    pred = model(X)
    loss = loss_fn(pred, y)

    # Backpropagation
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Key Components of Memory Usage

  1. Model Parameters: These are the trainable weights of the model, which need to be stored in memory throughout the training process. The size is proportional to the number of parameters in the model.

  2. Model Gradients: Gradients for each parameter are computed during backpropagation and stored temporarily for the optimizer to update the weights.

  3. Optimizer States: Optimizers like Adam maintain additional states, including:

    • First-order momentum: Tracks the moving average of gradients.

    • Second-order momentum: Tracks the moving average of squared gradients.

    • Both momentum terms have the same size as the model gradients.

  4. Activations: Activation outputs from the forward pass are stored for use during backpropagation, where they are needed to compute gradients via the chain rule. The memory required for activations can be substantial, especially as batch size increases. While the size of parameters, gradients, and optimizer states remains constant, activation memory scales directly with batch size.

  5. Other Overheads: Temporary buffers and memory fragmentation during computation also contribute to RAM usage.

Memory Calculation Examples

  1. Gradients and Parameters:

    For a 70B model, using 32-bit floating-point precision (FP32): \[ 70\times10^9\times4 \text{ bytes}\times2 \approx 521.5\text{ GiB} \] This accounts for the weights and their corresponding gradients.

  2. Optimizer State:

    The Adam optimizer requires two additional states (first- and second-order momentum), each the same size as the gradients: \[ 70\times10^9\times4 \text{ bytes}\times2 \approx 521.5\text{ GiB} \]

  3. Activations:

    For a 70B model with a hidden size of 8192, 80 layers, and FP32 precision, each token's activation memory is roughly: \[ 8192\times80\times4\times12 \text{ bytes/token}\approx 30\text{ MB/token} \]
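These estimates can be reproduced with a few lines (70B parameters, FP32, hidden size 8192, 80 layers; the factor of 12 for activation bytes per token per layer is a rough rule of thumb):

params = 70e9
bytes_fp32 = 4

weights_and_grads = params * bytes_fp32 * 2      # parameters + gradients
adam_states = params * bytes_fp32 * 2            # first and second moments
print(weights_and_grads / 2**30)                 # ~521.5 GiB
print(adam_states / 2**30)                       # ~521.5 GiB

hidden, layers = 8192, 80
activation_per_token = hidden * layers * bytes_fp32 * 12
print(activation_per_token / 2**20)              # ~30 MiB per token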

Simple Strategies for Reducing Memory Usage

  1. Activation Checkpointing: Instead of storing all activation outputs, recompute activations during backpropagation as needed. This significantly reduces activation memory at the cost of additional compute time.
  2. Mixed Precision Training (FP16): Use 16-bit floating-point precision (FP16) instead of FP32 for model weights, gradients, and activations. This halves the memory requirements without substantial accuracy loss when done correctly.
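Both strategies map onto standard PyTorch utilities. Below is a minimal sketch under the assumption of a toy model and a CUDA device, using torch.utils.checkpoint for activation recomputation and torch.autocast with a GradScaler for FP16 mixed precision (a recent PyTorch version is assumed):

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# toy model and batch, just for illustration
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024)).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(8, 1024, device="cuda")

# 1. Activation checkpointing: intermediate activations are not stored;
#    they are recomputed during the backward pass.
y = checkpoint(model, x, use_reentrant=False)

# 2. Mixed precision: run the forward pass in FP16, keep master weights and
#    optimizer states in FP32, and scale the loss to avoid gradient underflow.
scaler = torch.cuda.amp.GradScaler()
with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = model(x).float().pow(2).mean()
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()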

LoRA

Adapters

The original adapter was introduced in 2019 in the paper "Parameter-Efficient Transfer Learning for NLP". It's a small, additional module added to a pre-trained model to adapt it to a new task without significantly changing the original model parameters.

adapter_architecture

Adapters generally reduce training latency compared to full fine-tuning because only a small number of parameters (those within the adapter modules) are updated during training. This reduction in trainable parameters leads to lower computational overhead and faster convergence in many cases. Additionally, adapters allow for larger batch sizes due to reduced memory usage, which can further accelerate training.

However, adapter layers increase inference latency because they are added sequentially and cannot be parallelized. This issue becomes more pronounced with small batch sizes or when using sharded models, such as GPT-2. Techniques like layer pruning or multi-task settings can mitigate but not completely eliminate this latency.

As shown in the experiment results below, the added inference latency can be significant (Source: LoRA paper):

adapter_experiment_results

LoRA Basics

LoRA (Low-Rank Adaptation) was introduced by a Microsoft team in 2021 in the paper LoRA: Low-Rank Adaptation of Large Language Models. The main idea of LoRA is to enable efficient fine-tuning of large pre-trained models by introducing low-rank trainable matrices into the model’s architecture, while keeping the original model weights frozen. This approach significantly reduces the number of trainable parameters and computational requirements compared to full fine-tuning, without compromising performance.

lora_diagram

LoRA approximates weight updates in neural networks using low-rank matrix factorization. Instead of updating the full weight matrix \(W\), it introduces two smaller trainable matrices \(A\) and \(B\) of sizes \((r \times d)\) and \((d \times r)\). These matrices have far fewer parameters, since their rank \(r\) is much smaller than the dimensions of \(W\). Instead of training \(\Delta W\), LoRA trains the parameters in \(A\) and \(B\). In formula form: \[ h=W_0x + \Delta Wx = W_0x + BAx \] where \(W_0\) is the original pretrained weight matrix of size \((d\times d)\), which is frozen during training; \(\Delta W\), also of size \((d \times d)\), is computed as \(BA\); and \(x\) is an input vector of dimension \(d\).

At the start of the training process, the matrix \(A\) is randomly initialized from a normal distribution \(\mathcal{N}(0, \sigma^2)\), while the matrix \(B\) is initialized as a zero matrix. In the first training step, this setup gives \(BA = 0\) and therefore \(h = W_0x\). This initialization ensures that training starts exactly from the pretrained model's behavior, avoiding any abrupt deviation at the beginning.
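To make the parameter savings concrete, here is a tiny sketch of the LoRA update and the resulting parameter counts (the dimensions are made up for illustration):

import torch

d, r = 4096, 8
W0 = torch.randn(d, d)              # frozen pretrained weight
A = torch.randn(r, d) * 0.01        # initialized from a small normal distribution
B = torch.zeros(d, r)               # initialized to zero, so BA = 0 at the start
x = torch.randn(d)

h = W0 @ x + B @ (A @ x)            # h = W0 x + BA x

print(d * d)                        # 16,777,216 parameters in the full update
print(r * d + d * r)                # 65,536 trainable parameters in A and B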

LoRA is a groundbreaking method with a lot of benefits:

  • Parameter Efficiency: By training only the low-rank matrices, LoRA reduces the number of updated parameters resulting in lower memory usage and faster training.
  • Frozen Pre-trained Weights: The original pre-trained weights remain unchanged, preserving the model’s general-purpose knowledge and avoiding catastrophic forgetting.
  • No Inference Latency Overhead: Unlike adapters, LoRA does not add additional layers to the model. The low-rank matrices can be merged back into the original weight matrix after fine-tuning, ensuring no additional inference latency.
  • Versatility: LoRA can be applied to various architectures (e.g. transformers) and tasks, making it a flexible solution for adapting large models like GPT-3 or RoBERTa to specific use cases.

LoRA Usage

The Microsoft developers of LoRA created a Python package called loralib to facilitate the use of LoRA. With this library, any linear layer implemented as nn.Linear() can be replaced by lora.Linear(). This is possible because LoRA is designed to work with any layer involving matrix multiplication. The lora.Linear() module introduces a pair of low-rank adaptation matrices, which are used to modify the original weight matrix by applying a low-rank decomposition.

# ===== Before =====
# layer = nn.Linear(in_features, out_features)
# ===== After ======
import loralib as lora
# Add a pair of low-rank adaptation matrices with rank r=16
layer = lora.Linear(in_features, out_features, r=16)

Before training the model, all non-LoRA parameters should be frozen and only the LoRA matrices set as trainable. The training loop can then run as usual.

import loralib as lora
model = BigModel()
# This sets requires_grad to False for all parameters without the string "lora_" in their names
lora.mark_only_lora_as_trainable(model)
# Training loop
for batch in dataloader:
    ...

When saving model checkpoints during LoRA fine-tuning, only the LoRA-specific parameters need to be saved, not the entire large pre-trained model. This results in significantly smaller checkpoint files and more efficient storage.

# ===== Before =====
# torch.save(model.state_dict(), checkpoint_path)
# ===== After =====
torch.save(lora.lora_state_dict(model), checkpoint_path)

Implementation of LoRA - lora.Linear()

Let's take a deep dive into the lora.Linear() source code:

The lora.Linear class builds upon torch.nn.Linear(). It retains the original weight matrix \(W\) as initialized in nn.Linear.__init__(self, in_features, out_features), and introduces two additional LoRA matrices: self.lora_A and self.lora_B. The matrix self.lora_A has shape (r, in_features), while self.lora_B has shape (out_features, r). These matrices adapt the original weight matrix through a low-rank decomposition.

class Linear(nn.Linear, LoRALayer):
    # LoRA implemented in a dense layer
    def __init__(
        self,
        in_features: int,
        out_features: int,
        r: int = 0,
        lora_alpha: int = 1,
        lora_dropout: float = 0.,
        fan_in_fan_out: bool = False, # Set this to True if the layer to replace stores weight like (fan_in, fan_out)
        merge_weights: bool = True,
        **kwargs
    ):
        nn.Linear.__init__(self, in_features, out_features, **kwargs)
        LoRALayer.__init__(self, r=r, lora_alpha=lora_alpha, lora_dropout=lora_dropout,
                           merge_weights=merge_weights)

        self.fan_in_fan_out = fan_in_fan_out
        # Actual trainable parameters
        if r > 0:
            self.lora_A = nn.Parameter(self.weight.new_zeros((r, in_features)))
            self.lora_B = nn.Parameter(self.weight.new_zeros((out_features, r)))
            self.scaling = self.lora_alpha / self.r
            # Freezing the pre-trained weight matrix
            self.weight.requires_grad = False
        self.reset_parameters()
        if fan_in_fan_out:
            self.weight.data = self.weight.data.transpose(0, 1)

In the forward() function, it implements \(h=W_0x + \Delta Wx = W_0x+ BAx\).

There is a flag variable called self.merged which is used to indicate whether the LoRA update has been merged into the weights (for inference) or kept separate (for training). Recall that keeping the original weight matrix unchanged during LoRA training is a key feature of LoRA: the pre-trained weights are frozen and, instead, small low-rank matrices are trained to approximate the updates.

  • During inference, if merge_weights is set to True, the low-rank updates self.lora_B @ self.lora_A are added directly to the frozen pre-trained weights (self.weight). This avoids the need for separate computations of LoRA updates during forward passes, improving efficiency.
  • During training, if merge_weights is enabled and weights were previously merged, the updates are subtracted from self.weight to revert it to its original frozen state. This ensures that gradients are not incorrectly computed on the merged weights.
class Linear(nn.Linear, LoRALayer):
    # LoRA implemented in a dense layer
    def __init__(
        self,
        in_features: int,
        out_features: int,
        r: int = 0,
        lora_alpha: int = 1,
        lora_dropout: float = 0.,
        fan_in_fan_out: bool = False, # Set this to True if the layer to replace stores weight like (fan_in, fan_out)
        merge_weights: bool = True,
        **kwargs
    ):

        ......

    def train(self, mode: bool = True):
        def T(w):
            return w.transpose(0, 1) if self.fan_in_fan_out else w
        nn.Linear.train(self, mode)
        if mode:
            if self.merge_weights and self.merged:
                # Make sure that the weights are not merged
                if self.r > 0:
                    self.weight.data -= T(self.lora_B @ self.lora_A) * self.scaling
                self.merged = False
        else:
            if self.merge_weights and not self.merged:
                # Merge the weights and mark it
                if self.r > 0:
                    self.weight.data += T(self.lora_B @ self.lora_A) * self.scaling
                self.merged = True

    def forward(self, x: torch.Tensor):
        def T(w):
            return w.transpose(0, 1) if self.fan_in_fan_out else w
        if self.r > 0 and not self.merged:
            result = F.linear(x, T(self.weight), bias=self.bias)
            result += (self.lora_dropout(x) @ self.lora_A.transpose(0, 1) @ self.lora_B.transpose(0, 1)) * self.scaling
            return result
        else:
            return F.linear(x, T(self.weight), bias=self.bias)

    ......

This is a collection of some common useful LangChain (v.0.3.3) practices based on my coding experience so far.

LLM Application Development Landscape

Nowadays, LLM applications can be classified into the following categories.

  1. Simple LLM Calls

    Applications where LLMs are used directly to answer questions or perform tasks without additional layers of complexity. The focus is on generating responses to prompts or queries. These are straightforward implementations, often used for tasks like content generation, question answering, or summarization.

    Real-world examples:

    • ChatGPT for Q&A: Users input questions, and the model directly generates answers.
    • Copywriting Tools: Applications like Jasper AI create marketing content, blogs, or product descriptions based on user inputs.
  2. Vectorstores (RAG)

    Vectorstores are used in Retrieval-Augmented Generation (RAG) applications, where relevant information is retrieved from a database of embeddings (vectorized representations of text) to enhance the LLM's responses. This allows the LLM to work with domain-specific or proprietary knowledge not contained in its training data.

    Real-world examples:

    • Chatbots for Enterprises: A customer support chatbot retrieves relevant product documentation or FAQs stored in a vectorstore to provide accurate responses.
    • Search-Augmented Systems: Google Bard integrates real-time information retrieval to provide up-to-date and contextually relevant responses.
  3. Agents

    Agents are LLM-driven systems that execute tasks autonomously or semi-autonomously based on input instructions. They can make decisions, interact with APIs, and manage workflows. Agents often use reasoning frameworks like ReAct (Reasoning and Acting) to decide what steps to take next.

    Real-world examples:

    • Zapier AI Assistant: Automates workflows by taking instructions, analyzing data, and executing API calls or actions across platforms.
    • LangChain Agents: Used for multi-step tasks such as filling out forms, managing databases, or performing calculations.
  4. Agents + Vectorstores

    This combines the reasoning and decision-making capabilities of agents with the data retrieval abilities of vectorstores. These systems can autonomously fetch relevant knowledge from vectorstores and execute tasks, enabling advanced applications like AutoGPT. The integration provides both reasoning depth and domain-specific accuracy.

    Real-world examples:

    • AutoGPT: An open-source agent that can generate business plans by researching topics, retrieving relevant information, and autonomously completing subtasks.
    • GPT Engineer: Helps developers by retrieving relevant programming resources and autonomously generating code, debugging, or improving software projects.

Chaining a Simple Prompt

from dotenv import load_dotenv
from langchain.prompts.prompt import PromptTemplate
from langchain_openai import ChatOpenAI
from langchain_ollama import ChatOllama
from langchain_core.output_parsers import StrOutputParser

load_dotenv()

# define information to be incorporated to the prompt template
information = """
Elon Reeve Musk (/ˈiːlɒn/; EE-lon; born June 28, 1971) is a businessman and investor.
He is the founder, chairman, CEO, and CTO of SpaceX; angel investor, CEO, product architect and former chairman of Tesla, Inc.; owner, chairman and CTO of X Corp.; founder of the Boring Company and xAI; co-founder of Neuralink and OpenAI; and president of the Musk Foundation.
He is the wealthiest person in the world, with an estimated net worth of US$232 billion as of December 2023, according to the Bloomberg Billionaires Index, and $254 billion according to Forbes, primarily from his ownership stakes in Tesla and SpaceX.
A member of the wealthy South African Musk family, Elon was born in Pretoria and briefly attended the University of Pretoria before immigrating to Canada at age 18, acquiring citizenship through his Canadian-born mother.
Two years later, he matriculated at Queen's University at Kingston in Canada. Musk later transferred to the University of Pennsylvania, and received bachelor's degrees in economics and physics.
He moved to California in 1995 to attend Stanford University. However, Musk dropped out after two days and, with his brother Kimbal, co-founded online city guide software company Zip2.
The startup was acquired by Compaq for $307 million in 1999, and, that same year Musk co-founded X.com, a direct bank. X.com merged with Confinity in 2000 to form PayPal.
In October 2002, eBay acquired PayPal for $1.5 billion, and that same year, with $100 million of the money he made, Musk founded SpaceX, a spaceflight services company.
In 2004, he became an early investor in electric vehicle manufacturer Tesla Motors, Inc. (now Tesla, Inc.). He became its chairman and product architect, assuming the position of CEO in 2008.
In 2006, Musk helped create SolarCity, a solar-energy company that was acquired by Tesla in 2016 and became Tesla Energy. In 2013, he proposed a hyperloop high-speed vactrain transportation system.
In 2015, he co-founded OpenAI, a nonprofit artificial intelligence research company.
The following year, Musk co-founded Neuralink—a neurotechnology company developing brain–computer interfaces—and the Boring Company, a tunnel construction company.
In 2022, he acquired Twitter for $44 billion. He subsequently merged the company into newly created X Corp. and rebranded the service as X the following year.
In March 2023, he founded xAI, an artificial intelligence company.
"""

# create a prompt template
template = """
Given the information {information} about a person, please create:
1. A short summary
2. Two interesting facts about the person.
"""

# incorporate information into prompt
summary_prompt_template = PromptTemplate(
    input_variables=["information"], template=template
)

# create an llm
llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo")
# llm = ChatOllama(model="llama3")

# create a chain
chain = summary_prompt_template | llm | StrOutputParser()

# prompt the model
response = chain.invoke(input={"information": information})

print(response)

Parsing the Output with a Customized Format

Using PydanticOutputParser and user defined output data structure.

from langchain.output_parsers import PydanticOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field, validator
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo")


# define your desired data structure
class Joke(BaseModel):
    setup: str = Field(description="question to set up a joke")
    punchline: str = Field(description="answer to resolve the joke")

    # add custom validation logic easily with Pydantic
    @validator("setup")
    def question_ends_with_question_mark(cls, field):
        if field[-1] != "?":
            raise ValueError("Badly formed question!")
        return field


# set up a parser + inject instructions into the prompt template
parser = PydanticOutputParser(pydantic_object=Joke)

# create a prompt with query and instruction
prompt = PromptTemplate(
    template="Answer the user query.\n{format_instructions}\n{query}\n",
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

# a query intended to prompt a language model to populate the data structure
prompt_and_model = prompt | llm
output = prompt_and_model.invoke({"query": "Tell me a joke."})
parser.invoke(output)

Data Ingestion to Pinecone Vectorstore (RAG)

Using TextLoader, CharacterTextSplitter, OpenAIEmbeddings, and Pinecone vector database.

Please refer to LangChain text splitter techniques, text split by character, and text embedding models for more details.

import os
from dotenv import load_dotenv
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore

load_dotenv()

# load data
loader = TextLoader("doc1.txt")
document = loader.load()

# split data
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(document)

# create embedding
embeddings = OpenAIEmbeddings(openai_api_key=os.environ.get("OPENAI_API_KEY"))

# ingest data to vector db
PineconeVectorStore.from_documents(texts, embeddings, index_name=os.environ['INDEX_NAME'])
print("finish")

Data Retrieval from Pinecone Vectorstore (RAG)

langchain-ai/retrieval-qa-chat is a ChatPromptTemplate ensuring answers are based solely on the context.

import os

from dotenv import load_dotenv
from langchain import hub
from langchain_core.prompts import PromptTemplate
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_pinecone import PineconeVectorStore
from langchain.chains.retrieval import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

load_dotenv()

embeddings = OpenAIEmbeddings()
llm = ChatOpenAI()

# build user query prompt
query = "what is Pinecone in machine learning?"
chain = PromptTemplate.from_template(template=query) | llm

# connect to the existing Pinecone index as a vectorstore
vectorstore = PineconeVectorStore(
    index_name=os.environ["INDEX_NAME"], embedding=embeddings
)

# create a retrieval qa prompt
retrieval_qa_chat_prompt = hub.pull("langchain-ai/retrieval-qa-chat")

# create a prompt chain
combine_docs_chain = create_stuff_documents_chain(llm, retrieval_qa_chat_prompt)

# create retrieval chain
retrieval_chain = create_retrieval_chain(
    retriever=vectorstore.as_retriever(), combine_docs_chain=combine_docs_chain
)

# execute the retrieval
result = retrieval_chain.invoke(input={"input": query})

print(result)

Customized retrieval prompt:

RunnablePassthrough is used to pass through arguments from one step to the next. It allows us to pass on the user's question to the prompt and model.

import os

from dotenv import load_dotenv
from langchain_core.prompts import PromptTemplate
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_pinecone import PineconeVectorStore
from langchain_core.runnables import RunnablePassthrough

load_dotenv()

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

embeddings = OpenAIEmbeddings()
llm = ChatOpenAI()

query = "what is Pinecone in machine learning?"
chain = PromptTemplate.from_template(template=query) | llm

vectorstore = PineconeVectorStore(
    index_name=os.environ["INDEX_NAME"], embedding=embeddings
)


template = """
Use the following pieces of context to answer the question at the end.
If you don't know the answer, you can say "I don't know". Don't make up an answer.
Use three sentences maximum and keep the answer short and to the point.
Always say "thanks for the question" before answering the question.

{context}

Question: {question}

Answer:
"""

custom_rag_prompt = PromptTemplate.from_template(template=template)
rag_chain = (
    {"context": vectorstore.as_retriever() | format_docs, "question": RunnablePassthrough()}
    | custom_rag_prompt
    | llm
)

res = rag_chain.invoke(query)
print(res)

Chat with a PDF (RAG with FAISS)

Using PyPDFLoader, CharacterTextSplitter, OpenAIEmbeddings, and FAISS local vector database.

Please refer to the PDF loader, the LangChain FAISS vectorstore, and the FAISS documentation for more details.

import os
from dotenv import load_dotenv
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import CharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, OpenAI
from langchain_community.vectorstores import FAISS
from langchain.chains.retrieval import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain import hub

# load OPENAI_API_KEY from .env, following the earlier examples
load_dotenv()

pdf_path = "react.pdf"
loader = PyPDFLoader(file_path=pdf_path)
documents = loader.load()
text_splitter = CharacterTextSplitter(
    chunk_size=1000, chunk_overlap=30, separator="\n"
)
docs = text_splitter.split_documents(documents=documents)

embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(docs, embeddings)
vectorstore.save_local("faiss_index_react")

new_vectorstore = FAISS.load_local(
    "faiss_index_react", embeddings, allow_dangerous_deserialization=True
)

retrieval_qa_chat_prompt = hub.pull("langchain-ai/retrieval-qa-chat")
combine_docs_chain = create_stuff_documents_chain(
    OpenAI(), retrieval_qa_chat_prompt
)
retrieval_chain = create_retrieval_chain(
    new_vectorstore.as_retriever(), combine_docs_chain
)

res = retrieval_chain.invoke({"input": "Give me the gist of ReAct in 3 sentences"})
print(res["answer"])

Create a ReAct Agent

Using langchain-ai/react-agent-template to build a ReAct prompt.

from dotenv import load_dotenv
from langchain import hub
from langchain_openai import ChatOpenAI
from langchain.agents import create_react_agent, AgentExecutor
from langchain_experimental.tools import PythonREPLTool

load_dotenv()

# create an instruction
instructions = """You are an agent designed to write and execute python code to answer questions.
You have access to a python REPL, which you can use to execute python code.
If you get an error, debug your code and try again.
Only use the output of your code to answer the question.
You might know the answer without running any code, but you should still run the code to get the answer.
If it does not seem like you can write code to answer the question, just return "I don't know" as the answer.
"""

# use a ReAct prompt template
base_prompt = hub.pull("langchain-ai/react-agent-template")
prompt = base_prompt.partial(instructions=instructions)

# use a tool that can execute python code
tools = [PythonREPLTool()]

# define a ReAct agent
agent = create_react_agent(
    prompt=prompt,
    llm=ChatOpenAI(temperature=0, model="gpt-4o-mini"),
    tools=tools,
)

# create the ReAct agent executor
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

# execute the ReAct agent
agent_executor.invoke(
    input={
        "input": """generate and save in current working directory a QRcode
        that point to https://jokerdii.github.io/di-blog, you have qrcode package installed already"""
    }
)

agent_executor.invoke(
    input={
        "input": """generate and save in current working directory a synthetic csv dataset
        with 1000 rows and 2 columns that is about Amazon product description and price."""
    }
)

Using a LangChain Agent for Tasks

create_csv_agent returns an AgentExecutor able to perform operations on CSV files.

from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain_experimental.agents.agent_toolkits import create_csv_agent


load_dotenv()

# create a CSV agent from LangChain
csv_agent = create_csv_agent(
    llm=ChatOpenAI(temperature=0, model="gpt-4o-mini"),
    path="episode_info.csv",
    verbose=True,
)

# execute the agent
csv_agent.invoke(
    input={"input": "how many columns are there in file episode_info.csv"}
)
csv_agent.invoke(
    input={
        "input": "print the seasons by ascending order of the number of episodes they have."
    }
)

Creating a ReAct Agent with Multiple Agents Provided as Tools

from typing import Any

from dotenv import load_dotenv
from langchain import hub
from langchain_core.tools import Tool
from langchain_openai import ChatOpenAI
from langchain.agents import (
    create_react_agent,
    AgentExecutor,
)
from langchain_experimental.tools import PythonREPLTool
from langchain_experimental.agents.agent_toolkits import create_csv_agent


load_dotenv()


instructions = """You are an agent designed to write and execute python code to answer questions.
You have access to a python REPL, which you can use to execute python code.
You have qrcode package installed
If you get an error, debug your code and try again.
Only use the output of your code to answer the question.
You might know the answer without running any code, but you should still run the code to get the answer.
If it does not seem like you can write code to answer the question, just return "I don't know" as the answer.
"""

base_prompt = hub.pull("langchain-ai/react-agent-template")
prompt = base_prompt.partial(instructions=instructions)

# define python agent
tools = [PythonREPLTool()]
python_agent = create_react_agent(
    prompt=prompt,
    llm=ChatOpenAI(temperature=0, model="gpt-4-turbo"),
    tools=tools,
)

python_agent_executor = AgentExecutor(agent=python_agent, tools=tools, verbose=True)

# define CSV agent
csv_agent_executor: AgentExecutor = create_csv_agent(
    llm=ChatOpenAI(temperature=0, model="gpt-4"),
    path="episode_info.csv",
    verbose=True,
)

#### router grand agent ####

# wrap the python agent so it can be exposed as a tool
def python_agent_executor_wrapper(original_prompt: str) -> dict[str, Any]:
    return python_agent_executor.invoke({"input": original_prompt})

# list agent tools
tools = [
    Tool(
        name="Python Agent",
        func=python_agent_executor_wrapper,
        description="""useful when you need to transform natural language to python and execute the python code,
        returning the results of the code execution
        DOES NOT ACCEPT CODE AS INPUT""",
    ),
    Tool(
        name="CSV Agent",
        func=csv_agent_executor.invoke,
        description="""useful when you need to answer question over episode_info.csv file,
        takes as input the entire question and returns the answer after running pandas calculations""",
    ),
]

# create grand ReAct agent
prompt = base_prompt.partial(instructions="")
grand_agent = create_react_agent(
    prompt=prompt,
    llm=ChatOpenAI(temperature=0, model="gpt-4-turbo"),
    tools=tools,
)
grand_agent_executor = AgentExecutor(agent=grand_agent, tools=tools, verbose=True)

# execute grand ReAct agent and print output
print(
    grand_agent_executor.invoke(
        {
            "input": "which season has the most episodes?",
        }
    )
)

print(
    grand_agent_executor.invoke(
        {
            "input": "Generate and save in current working directory 15 qrcodes that point to `www.udemy.com/course/langchain`",
        }
    )
)

Function / Tool Calling

LangChain provides a standardized interface for connecting tools to models.

  • ChatModel.bind_tools(): a method for attaching tool definitions to model calls.
  • AIMessage.tool_calls: an attribute on the AIMessage returned from the model for easily accessing the tool calls the model decided to make.
  • create_tool_calling_agent: an agent constructor that works with ANY model that implements bind_tools and returns tool_calls.

PythonREPLTool can be used directly since it is already a tool object. Use it with caution because the Python REPL can execute arbitrary code on the host machine (e.g., delete files, make network requests).

from dotenv import load_dotenv
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.tools import tool
from langchain.agents import create_tool_calling_agent, AgentExecutor
from langchain_openai import ChatOpenAI
from langchain_community.tools.tavily_search import TavilySearchResults

load_dotenv()


@tool
def multiply(x: float, y: float) -> float:
    """Multiply 'x' times 'y'."""
    return x * y


prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "you're a helpful assistant"),
        ("human", "{input}"),
        ("placeholder", "{agent_scratchpad}"),
    ]
)

tools = [TavilySearchResults(), multiply]
llm = ChatOpenAI(model="gpt-4o-mini")

agent = create_tool_calling_agent(llm, tools, prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools)

response = agent_executor.invoke(
    {
        "input": "what is the weather in Dubai right now? compare it with San Francisco, output should be in Celsius",
    }
)

print(response)

Tool Calling with LangChain

from langchain_openai import ChatOpenAI

# any chat model that supports tool calling works here
tool_calling_model = ChatOpenAI(model="gpt-4o-mini")


def multiply(a: int, b: int) -> int:
    """Multiply a and b.

    Args:
        a: first int
        b: second int
    """
    return a * b


llm_with_tools = tool_calling_model.bind_tools([multiply])

result = llm_with_tools.invoke("What is 2 multiplied by 3?")
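
To see which tool the model chose, inspect the tool_calls attribute on the returned AIMessage (a minimal sketch continuing the example above; the exact arguments depend on the model's response):

# each entry contains the tool name and the parsed arguments (the list may be empty if the model answered directly)
for call in result.tool_calls:
    print(call["name"], call["args"])  # e.g. multiply {'a': 2, 'b': 3}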

Token Limitation Handling Strategies

When passing documents into the LLM context window, there are three approaches for handling context-window limitations:

  1. Stuffing: stuff all documents into a single prompt.
from langchain.chains.combine_documents.stuff import StuffDocumentsChain
from langchain.chains.llm import LLMChain
from langchain.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

# define prompt
prompt_template = """Write a concise summary of the following:
"{text}"
CONCISE SUMMARY:"""
prompt = PromptTemplate.from_template(prompt_template)

# define LLM chain
llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo-16k")
llm_chain = LLMChain(llm=llm, prompt=prompt)

# define StuffDocumentsChain
stuff_chain = StuffDocumentsChain(llm_chain=llm_chain, document_variable_name="text")

# `loader` is any document loader defined earlier (e.g., TextLoader or PyPDFLoader)
docs = loader.load()
print(stuff_chain.run(docs))
  2. Map-reduce: summarize each document on its own in parallel, then combine the summaries into a final summary.
from langchain.chains import MapReduceDocumentsChain, ReduceDocumentsChain
from langchain.chains.combine_documents.stuff import StuffDocumentsChain
from langchain.chains.llm import LLMChain
from langchain.prompts import PromptTemplate
from langchain_text_splitters import CharacterTextSplitter

# `llm` and `docs` are defined in the stuffing example above

# Map
map_template = """The following is a set of documents
{docs}
Based on this list of docs, please identify the main themes
Helpful Answer:"""
map_prompt = PromptTemplate.from_template(map_template)
map_chain = LLMChain(llm=llm, prompt=map_prompt)

# Reduce
reduce_template = """The following is a set of summaries:
{docs}
Take these and distill it into a final, consolidated summary of the main themes.
Helpful Answer:"""
reduce_prompt = PromptTemplate.from_template(reduce_template)
reduce_chain = LLMChain(llm=llm, prompt=reduce_prompt)

# chain that stuffs the mapped summaries into the reduce prompt
combine_documents_chain = StuffDocumentsChain(
    llm_chain=reduce_chain, document_variable_name="docs"
)
reduce_documents_chain = ReduceDocumentsChain(
    combine_documents_chain=combine_documents_chain,
    collapse_documents_chain=combine_documents_chain,
    token_max=4000,
)

# Combine documents by mapping a chain over them, then combining results
map_reduce_chain = MapReduceDocumentsChain(
    llm_chain=map_chain,
    reduce_documents_chain=reduce_documents_chain,
    document_variable_name="docs",
    return_intermediate_steps=False,
)

text_splitter = CharacterTextSplitter.from_tiktoken_encoder(chunk_size=1000, chunk_overlap=0)
split_docs = text_splitter.split_documents(docs)
print(map_reduce_chain.run(split_docs))
  3. Refine: the refine documents chain constructs a response by looping over the input documents and iteratively updating its answer.
from langchain.chains.summarize import load_summarize_chain
from langchain.prompts import PromptTemplate

# `llm` and `split_docs` come from the previous examples

question_prompt_template = """
Please provide a summary of the following text.
TEXT: {text}
SUMMARY:
"""

question_prompt = PromptTemplate(
    template=question_prompt_template, input_variables=["text"]
)

refine_prompt_template = """
Write a concise summary of the following text delimited by triple backquotes.
Return your response in bullet points which covers the key points of the text.
```{text}```
BULLET POINT SUMMARY:
"""

refine_prompt = PromptTemplate(
    template=refine_prompt_template, input_variables=["text"]
)

# load refine chain
chain = load_summarize_chain(
    llm=llm,
    chain_type="refine",
    question_prompt=question_prompt,
    refine_prompt=refine_prompt,
    return_intermediate_steps=True,
    input_key="input_documents",
    output_key="output_text",
)
result = chain({"input_documents": split_docs}, return_only_outputs=True)

Coreference Resolution

Adding memory to chatbots.

LangChain provides a way to build applications that have memory using LangGraph's persistence. You can enable persistence in LangGraph applications by providing a checkpointer when compiling the graph. On every iteration, LangGraph takes the state and saves it to a database (PostgreSQL, MySQL, Redis, and MongoDB savers are available).

from langchain_core.messages import SystemMessage
from langchain_openai import ChatOpenAI
from langgraph.checkpoint.memory import MemorySaver
from langgraph.graph import START, MessagesState, StateGraph

# any chat model works here
model = ChatOpenAI()

workflow = StateGraph(state_schema=MessagesState)


# define the function that calls the model
def call_model(state: MessagesState):
    system_prompt = (
        "You are a helpful assistant. "
        "Answer all questions to the best of your ability."
    )
    messages = [SystemMessage(content=system_prompt)] + state["messages"]
    response = model.invoke(messages)
    return {"messages": response}


# define the node and edge
workflow.add_node("model", call_model)
workflow.add_edge(START, "model")

# add simple in-memory checkpointer
memory = MemorySaver()
app = workflow.compile(checkpointer=memory)
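
Once the graph is compiled with a checkpointer, invoking it with a thread_id persists and reloads that conversation's state across calls (a minimal usage sketch; the thread id value is arbitrary):

from langchain_core.messages import HumanMessage

config = {"configurable": {"thread_id": "user-123"}}
app.invoke({"messages": [HumanMessage(content="Hi, I'm Di.")]}, config)

# a later call with the same thread_id sees the earlier messages
out = app.invoke({"messages": [HumanMessage(content="What is my name?")]}, config)
print(out["messages"][-1].content)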

LangChain has three main strategies to manage state:

  1. Simply stuffing previous messages into a chat model prompt.
from langchain_core.messages import AIMessage, HumanMessage, SystemMessage
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_openai import ChatOpenAI

# any chat model works here
model = ChatOpenAI()

prompt = ChatPromptTemplate.from_messages(
    [
        SystemMessage(
            content="You are a helpful assistant. Answer all questions to the best of your ability."
        ),
        MessagesPlaceholder(variable_name="messages"),
    ]
)

chain = prompt | model

ai_msg = chain.invoke(
    {
        "messages": [
            HumanMessage(
                content="Translate from English to French: I love programming."
            ),
            AIMessage(content="J'adore la programmation."),
            HumanMessage(content="What did you just say?"),
        ],
    }
)
print(ai_msg.content)
  2. The above, but trimming old messages to reduce the amount of distracting information the model has to deal with.
from langchain_core.messages import SystemMessage, trim_messages
from langchain_openai import ChatOpenAI
from langgraph.checkpoint.memory import MemorySaver
from langgraph.graph import START, MessagesState, StateGraph

# any chat model works here
model = ChatOpenAI()

# define trimmer
# count each message as 1 "token" (token_counter=len) and keep only the last two messages
trimmer = trim_messages(strategy="last", max_tokens=2, token_counter=len)

workflow = StateGraph(state_schema=MessagesState)


# define the function that calls the model
def call_model(state: MessagesState):
    trimmed_messages = trimmer.invoke(state["messages"])
    system_prompt = (
        "You are a helpful assistant. "
        "Answer all questions to the best of your ability."
    )
    messages = [SystemMessage(content=system_prompt)] + trimmed_messages
    response = model.invoke(messages)
    return {"messages": response}


# define the node and edge
workflow.add_node("model", call_model)
workflow.add_edge(START, "model")

# add simple in-memory checkpointer
memory = MemorySaver()
app = workflow.compile(checkpointer=memory)
  3. More complex modifications, like synthesizing summaries for long-running conversations.
from langchain_core.messages import HumanMessage, RemoveMessage, SystemMessage
from langchain_openai import ChatOpenAI
from langgraph.checkpoint.memory import MemorySaver
from langgraph.graph import START, MessagesState, StateGraph

# any chat model works here
model = ChatOpenAI()

workflow = StateGraph(state_schema=MessagesState)


# Define the function that calls the model
def call_model(state: MessagesState):
    system_prompt = (
        "You are a helpful assistant. "
        "Answer all questions to the best of your ability. "
        "The provided chat history includes a summary of the earlier conversation."
    )
    system_message = SystemMessage(content=system_prompt)
    message_history = state["messages"][:-1]  # exclude the most recent user input
    # Summarize the messages if the chat history reaches a certain size
    if len(message_history) >= 4:
        last_human_message = state["messages"][-1]
        # Invoke the model to generate conversation summary
        summary_prompt = (
            "Distill the above chat messages into a single summary message. "
            "Include as many specific details as you can."
        )
        summary_message = model.invoke(
            message_history + [HumanMessage(content=summary_prompt)]
        )

        # Delete messages that we no longer want to show up
        delete_messages = [RemoveMessage(id=m.id) for m in state["messages"]]
        # Re-add user message
        human_message = HumanMessage(content=last_human_message.content)
        # Call the model with summary & response
        response = model.invoke([system_message, summary_message, human_message])
        message_updates = [summary_message, human_message, response] + delete_messages
    else:
        message_updates = model.invoke([system_message] + state["messages"])

    return {"messages": message_updates}


# Define the node and edge
workflow.add_node("model", call_model)
workflow.add_edge(START, "model")

# Add simple in-memory checkpointer
memory = MemorySaver()
app = workflow.compile(checkpointer=memory)

Tracing Application with LangSmith

LangSmith traces LLM calls, tool usage, LLM model latency, token count, and cost.

To integrate LangSmith into our application, we need to generate an API key, add it as "LANGCHAIN_API_KEY" in the environment variables, install the langsmith dependency, and set up our environment. Please refer to the set-up tracing guide for detailed steps.
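
A minimal environment setup sketch (the project name is an arbitrary example; tracing is switched on via the LANGCHAIN_TRACING_V2 flag):

import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"         # turn on tracing for all LangChain runs
os.environ["LANGCHAIN_API_KEY"] = "<your-api-key>"  # key generated in the LangSmith UI
os.environ["LANGCHAIN_PROJECT"] = "my-rag-app"      # optional: group traces under a project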

LangChain Hub

LangChain Hub is a comprehensive platform that serves as a repository for pre-built components, tools, and configurations designed to accelerate the development of LLM applications. It simplifies the integration of various building blocks—models, prompts, chains, and agents—enabling developers to create robust and scalable applications without starting from scratch.

LangChain Text Splitter Playground

Text Splitter Playground is a user-friendly interface designed to help developers experiment with and fine-tune text-splitting strategies. In many LLM applications, particularly those involving large documents or retrieval-augmented generation (RAG), it is essential to divide text into manageable chunks while preserving context. This tool allows users to optimize the chunking process for their specific needs.

Substack

the more the AV competition globally heats up, and the more large players invest in the technology, including Tesla, the higher the demand for the Uber platform will become for these AV players.

― Uber Technologies – A brilliant business executing to perfection (Quarterly Update) - Rijnberk InvestInsights [Link]

This article predicts Uber to be a massive beneficiary of the AV / Robotaxi revolution. There indeed is a moat.

Where do LLMs spend their FLOPS? - Artificial Fintelligence [Link]

Theoretical Analysis of LLM FLOPS Allocation

  • FLOPS Distribution in Decoder Models:

    • Based on a standard decoder model, FLOPS are allocated as follows for each layer:

      6d² for computing Query (Q), Key (K), and Value (V) matrices.

      2d² for computing the attention output matrix, using the formula: softmax(Q @ K.T) @ V.

      16d² for running the feedforward network (FFN).

      This results in a total of 24d² FLOPS per layer (a quick arithmetic check appears after this list).

    • Percentage-wise:

      25% of the time is spent computing QKV.

      ~8% is spent computing the attention output matrix.

      ~66% is spent running the FFN.

  • Attention Mechanism:

    While the attention equation itself \(softmax(QK^T/\sqrt{d_{head}})V\) has negligible computational cost (~0.005% for Llama7B) compared to other operations, its impact on memory usage necessitates techniques like KV cache and flash attention.

  • KV Cache:

    The KV cache, essential for efficient attention computation, requires O(T) memory, where T is the number of tokens generated.

    The memory size of the KV cache is calculated as 4 * number of layers * d_model bytes.

    While the KV cache demands a significant amount of memory, it essentially reuses the same memory space throughout the token generation process.

  • Modern Architectures:

    Architectures like Mistral 7B and Llama2 employ Grouped Query Attention (GQA) and sliding window attention to optimize performance.

    GQA reduces the KV cache size by sharing the KV projection across multiple heads. This leads to a linear decrease in memory consumption as the number of KV heads decreases.

    Sliding window attention limits the KV cache size to the window size (e.g., 4096 for Llama7B), further controlling memory usage.
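
As flagged above, a quick arithmetic check of the per-layer FLOPS split (a sketch; d is an arbitrary example width, since the percentages do not depend on it):

d = 4096  # example model width; the split is the same for any d
qkv = 6 * d**2                 # Q, K, V projections
attn_out = 2 * d**2            # attention output projection
ffn = 16 * d**2                # feedforward network
total = qkv + attn_out + ffn   # 24d^2 per layer

for name, flops in [("QKV", qkv), ("attention output", attn_out), ("FFN", ffn)]:
    print(f"{name}: {flops / total:.0%}")  # QKV: 25%, attention output: 8%, FFN: 67%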

Performance-Motivated Architectural Changes

  • Impact of Model Width and Depth:

    Increasing the number of layers linearly scales both the FLOPS and the number of parameters.

    Increasing the model width (d_model) quadratically scales the number of parameters and, consequently, the compute requirements.

    This is because weight matrices within layers have a size of (d_model, d_model), leading to a quadratic relationship between model width and parameters.

  • Balancing Latency and Parallelization:

    Wider models parallelize better due to the ease of splitting layers across multiple GPUs using tensor parallelism.

    Deeper models require sequential computation of layers, hindering parallelization, especially during training.

    Therefore, wider models are preferred when low latency is critical.

Empirical Analysis of LLM Performance

The article investigates the memory usage of the KV cache in LLMs, specifically Llama2. The author observes that the actual memory consumed by the model is higher than what the theoretical calculations suggest. Here's how the discrepancy is highlighted:

  • Theoretical Calculation: The formula for calculating the KV cache memory requirement per token is 4 * number of layers * d_model bytes. In the experiment, the Llama2 model used has d_model of 1024 and 8 hidden layers. This means it theoretically needs 32KB of memory per token (4 * 8 * 1024 bytes = 32KB).
  • Expected Memory Usage: For generating 20 tokens, the model should ideally use 640KB of memory (32KB/token * 20 tokens = 640KB).
  • Observed Memory Usage: However, the empirical analysis revealed that the model's memory consumption jumped by ~2.1MB every 20 tokens. This is significantly higher than the expected 640KB.

The author concludes that this discrepancy of about 3x suggests an inefficient implementation of the KV cache in the model being used. The extra overhead could stem from various factors not accounted for in the theoretical calculation, and further investigation would be needed to pinpoint the exact cause.
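
A small sketch reproducing the arithmetic behind that discrepancy, using only the numbers quoted above:

layers, d_model = 8, 1024
bytes_per_token = 4 * layers * d_model       # theoretical KV cache cost per token
print(bytes_per_token / 1024)                # 32 KB per token

tokens = 20
expected = bytes_per_token * tokens          # ~640 KB for 20 generated tokens
observed = 2.1e6                             # ~2.1 MB measured per 20 tokens
print(expected / 1024, observed / expected)  # 640 KB expected, ~3.2x overhead observed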

Transformer inference tricks - Artificial Fintelligence [Link]

KV Cache

  • The KV cache is a crucial optimization for decoder models, significantly reducing computation. It exploits the fact that keys and values remain constant for the prompt and each decoded token in subsequent iterations. By caching these values, the computational complexity of sampling becomes linear instead of quadratic, enabling decent performance with longer contexts.
  • However, it introduces state management complexity, as inference needs to continue for all sequences even if some are completed. The KV cache demands significant memory, proportional to the number of layers, heads, and the embedding dimension. For instance, GPT-3 with 96 layers, 96 heads, and a dimension of 128 requires 2.4M parameters per token, translating to 10GB of memory for a 2048 token context window. This memory requirement is a major challenge for consumer-grade GPUs with limited HBM, like the 4090.
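
A quick check of the GPT-3 figures quoted above (a sketch assuming 2 bytes per value, i.e., fp16 storage):

layers, heads, d_head, context = 96, 96, 128, 2048
values_per_token = 2 * layers * heads * d_head        # K and V across all layers and heads
print(values_per_token / 1e6)                         # ~2.4M cached values per token
bytes_fp16 = 2
print(values_per_token * bytes_fp16 * context / 1e9)  # ~9.7 GB for a 2048-token window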

Speculative Decoding

  • Speculative decoding leverages excess compute capacity, particularly in local inference settings. It utilizes two models: a small, fast “draft” model and a large, slow model. The smaller model quickly makes multiple inferences, guessing the large model’s predictions, while the larger model verifies these guesses in parallel. This effectively reduces the sequential cost of generating a sequence to that of the smaller model (a toy sketch follows after this list).
  • However, it requires training and storing both models, and performance is limited by the smaller model’s prediction accuracy. HuggingFace reports a typical doubling of decoding rate using this technique.
  • Newer techniques like Jacobi decoding and lookahead decoding aim to improve upon speculative decoding by generating n-grams and recursively matching them, potentially achieving latency improvements without requiring a draft model.
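
A toy, greedy-acceptance sketch of the draft-and-verify loop described above (stand-in functions replace the real draft and target models; real implementations verify all draft positions in one batched forward pass and use a probabilistic acceptance rule):

import random
from typing import List

# Toy stand-ins: each "model" maps a token sequence to its next-token prediction.
# In practice these are a small draft LLM and a large target LLM.
def draft_model(tokens: List[int]) -> int:
    return (tokens[-1] + 1) % 100                     # cheap, fast guess

def target_model(tokens: List[int]) -> int:
    # agrees with the draft most of the time, occasionally diverges
    return (tokens[-1] + 1) % 100 if random.random() < 0.9 else random.randrange(100)

def speculative_decode(prompt: List[int], n_new: int, k: int = 4) -> List[int]:
    tokens = list(prompt)
    while len(tokens) < len(prompt) + n_new:
        # 1) the draft model cheaply proposes k tokens
        draft = []
        for _ in range(k):
            draft.append(draft_model(tokens + draft))
        # 2) the target model verifies the proposals; in a real system this is a
        #    single batched forward pass over all k positions, not a Python loop
        accepted = []
        for i in range(k):
            expected = target_model(tokens + accepted)
            if draft[i] == expected:
                accepted.append(draft[i])             # proposal matches: keep it for free
            else:
                accepted.append(expected)             # mismatch: take the target's token and stop
                break
        tokens.extend(accepted)
    return tokens[: len(prompt) + n_new]

print(speculative_decode([1, 2, 3], n_new=12))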

Effective Sparsity

  • Sparsity in transformer activations arises from the softmax operation in the attention mechanism and ReLU activations in MLPs, leading to many zero values. Utilizing this sparsity can be challenging, with limited support in mainstream tensor programs.
  • One optimization involves skipping computations for zero activations, feasible in custom implementations like Llama.cpp. However, the effectiveness of this approach diminishes exponentially with batch size due to the random distribution of sparsity across tokens.
  • Therefore, leveraging sparsity is most effective for batch size 1, although speculative decoding might be more beneficial in such scenarios.

Quantization

  • Quantization reduces the precision of model weights and activations, potentially saving memory and increasing inference speed. Research suggests that quantization to 4 bits or more results in negligible performance degradation. The k-bit inference scaling laws paper demonstrates that reducing precision allows for using a larger model with the same memory footprint and potentially achieving better performance.
  • However, using lower precision formats may lack native support in hardware and could be unstable in production environments. FP8 offers a good balance between performance and support, being the lowest precision format natively supported by modern accelerators. Int8 is another option, easier to implement with tools like PyTorch, though it lacks the performance advantages of FP8.
  • Libraries like bitsandbytes facilitate quantization, offering tools and APIs for implementation.
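
Following on the bitsandbytes point above, a minimal sketch of loading a model with 8-bit weights via Hugging Face transformers (the model name is an arbitrary example; requires the bitsandbytes and accelerate packages):

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"                 # example model
quant_config = BitsAndBytesConfig(load_in_8bit=True)  # int8 weight quantization

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",                                # place layers across available devices
)

inputs = tokenizer("Quantization reduces memory by", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))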

Top 10 China's AI Stories in 2024: A Year-End Review - Recode China AI [Link]

China's AI landscape is rapidly catching up to the US, with multiple models now reaching similar performance benchmarks as GPT-4 and advancements in areas like video generation, robotics, and autonomous driving.

Several AI-powered apps have emerged in China, with ByteDance's Doubao leading in popularity domestically and MiniMax's Talkie gaining traction internationally, though China has yet to produce a "killer app" with at least 100 million daily active users.

A number of Chinese AI startups have emerged since ChatGPT's debut, backed by significant capital, but they now face strong competition from tech giants.

Chinese open-source LLMs have made substantial global progress, with Alibaba’s Qwen series being the most downloaded on Hugging Face.

Chinese AI video generators have surged ahead due to the delayed release of Sora, with platforms like Kuaishou’s Kling and MiniMax’s Hailuo offering competitive features.

An LLM API price war has been ignited by major Chinese tech companies, with significant price reductions for developers and SMEs.

China's semiconductor industry faces challenges due to US restrictions but is also making strides in self-sufficiency, with companies like Huawei pushing forward on competitive AI chips.

China's robotaxi industry is gaining momentum, with Baidu's Apollo Go expanding its fleet and other self-driving startups completing IPOs.

OpenAI and Microsoft have tightened AI access in China, prompting Chinese AI companies to offer alternatives and accelerating the development of homegrown models.

China is seeing a robotics boom with rapid innovation in humanoid and other types of robots, though challenges remain in complex tasks and high production costs.

AI resurrection is becoming increasingly accessible, raising ethical and legal questions as companies offer services to create digital replicas of the deceased.

Finetuning LLM Judges for Evaluation - Deep (Learning) Focus [Link]

This article discusses the challenges of evaluating LLMs and how finetuning specialized LLM judges can improve the evaluation process. Here's how the logic of the article flows:

The article notes that while human evaluation is the most reliable method, it is also expensive, time-consuming, and not scalable. This creates a need for efficient ways to test LLM capabilities.

There are two primary evaluation approaches: human evaluation and automatic metrics.

  • Human evaluation is considered the "definitive source of truth" but is recognized as noisy, subjective and prone to bias.
  • Automatic metrics are used to speed up model development, but they are imperfect proxies for human opinions. The article further divides automatic metrics into two categories: traditional metrics and model-based evaluation.
  • Traditional metrics like ROUGE and BLEU are reference-based, comparing LLM outputs to "golden" answers, and are less effective for modern LLMs which are open-ended and can produce many valid responses.
  • LLM-as-a-Judge is introduced as a model-based approach, using a powerful LLM to evaluate another LLM's output. This method is effective, easy to implement, and can handle open-ended tasks.

While effective, LLM-as-a-Judge has limitations, including a lack of transparency, security concerns, cost, and a lack of specialization for domain-specific evaluations. The article argues that these limitations can be addressed by training specialized LLM judges.

  • Meta-evaluation involves assessing the performance of the LLM judge by comparing its output to high-quality human evaluation data.
  • Early finetuned LLM judges were created as open-source replacements for proprietary LLMs. The original LLM-as-a-Judge paper also explored finetuning and found that a finetuned Vicuna-13B model showed potential. The need for finetuning is further justified because proprietary LLMs can be expensive and lack control or transparency, and because open-source models are becoming more capable. The article discusses how a Vicuna-13B model was improved by finetuning on human votes from Chatbot Arena, though it still fell short of GPT-4 performance.
  • Several examples of finetuned LLM judges:
    • PandaLM: This model is designed to identify the best model among a set, particularly useful for hyperparameter tuning. It is trained on a dataset of over 300K examples with instructions, inputs, paired responses, evaluation results and rationales. PandaLM is effective in specialized domains like law and biology.
    • JudgeLM: This model focuses on the factors that contribute most to the quality of a judge model, such as data quality and diversity, base model size, bias, and generalization. JudgeLM uses a high-quality, diverse dataset and is trained to mitigate bias, including positional, knowledge, and format biases.
    • Auto-J: This model is designed for domain-specific grading, with an emphasis on providing high-quality, structured explanations. It is trained on real-world queries and responses and can perform both pairwise and direct assessment scoring.
  • Other related research using LLMs for critiques, verification, and generating synthetic training data.

Prometheus is a key development in finetuned LLM judges, capable of fine-grained, domain-specific evaluation. It is trained to ingest custom scoring rubrics as input.

  • The Prometheus model uses the Feedback Collection dataset, which includes instructions, responses, rubrics, reference answers, rationales, and scores. It is trained to sequentially provide feedback and then score the response using a supervised finetuning (SFT) strategy.
  • Prometheus 2 is introduced as an extension that can handle both direct assessment and pairwise scoring. It is trained on both the Feedback Collection and the Preference Collection, and uses a linear model merging approach to combine models trained for the two scoring formats.
  • Prometheus-Vision extends the Prometheus concept to Vision-Language Models (VLMs). It uses a dataset called the Perception Collection, which includes images, instructions, responses, rubrics and reference answers.

Other types of finetuned judges, including:

  • Self-rewarding LLMs, which use the LLM itself to provide its own rewards and feedback.
  • LLM-as-a-Meta-Judge, which allows the LLM judge to self-improve.
  • Self-taught evaluators, which train evaluators without human preference data.
  • Foundational Large Autorater Models (FLAMe), which are trained on a massive amount of human preference data and generalize well to other tasks.
  • Direct judgement preference optimization, which uses preference optimization to create more advanced evaluation capabilities.

A generic framework based on the Prometheus model for creating a finetuned LLM judge. The steps include:

  • Solidifying evaluation criteria.
  • Preparing a high-quality dataset.
  • Using synthetic data.
  • Focusing on the rationales for each score.
  • Training the model using SFT and meta-evaluating its performance.

E-Commerce Unleashed - App Economy Insights [Link]

Highlights discussion points:

  1. Cyber week trends.

    "Cyber Week (from Black Friday to Cyber Monday) showcased shifting consumer behaviors and the growing dominance of e-commerce."

  2. Shopify’s acceleration.

    Shopify has evolved from a platform for small businesses into a global enabler for merchants, offering tools to scale internationally. Its emphasis on payments, particularly through Shop Pay, has been pivotal, with Shop Pay emerging as a high-conversion checkout option. In Q3, Gross Payment Volume accounted for 62% of Shopify’s Gross Merchandise Volume, marking a 4% year-over-year increase. Additionally, Shopify's partnership with Amazon to integrate Prime benefits directly into Shopify stores represents a strategic move to boost customer loyalty and conversions by leveraging Amazon's trusted fulfillment network and extensive Prime membership base.

  3. Amazon takes on Temu.

    Amazon has launched Amazon Haul, a new storefront aimed at attracting budget-conscious shoppers and safeguarding its market position. This initiative is strategically designed to meet the increasing demand for affordable e-commerce solutions.

  4. Walmart’s advertising play.

    Walmart is redefining modern retail by merging its extensive physical presence with advanced digital capabilities to create a powerful omnichannel strategy. The company leverages first-party data and its retail media network to maintain a competitive edge.

    Walmart Connect integrates online and in-store advertising, allowing brands to engage customers at their preferred shopping points. By utilizing vast first-party data, Walmart delivers targeted and relevant ads, enhancing both advertiser returns and customer satisfaction. The platform is also attracting advertisers from diverse industries, including automotive and financial services.

    Walmart’s planned acquisition of Vizio marks its entry into connected TV advertising, broadening Walmart Connect’s reach into households through smart TVs and enhancing inventory visibility and supply chain integration through improved data capabilities. This positions Walmart as a leader in omnichannel retail and advertising.

  5. AI: The quiet game changer.

    AI played a transformative role during Cyber Week, enhancing the shopping experience across various dimensions. Hyper-personalized shopping was driven by AI recommendation engines, which anticipated consumer needs and boosted conversions, exemplified by features like Amazon’s “frequently bought together.” Generative AI tools, such as chatbots, simplified product discovery during the busy sales period, with innovations like Amazon Q offering AI-generated review summaries to streamline decision-making.

    AI also optimized logistics through demand forecasting, ensuring products remained in stock and reducing shipping delays. In payments, real-time AI fraud detection provided secure checkouts on platforms like Walmart and Shopify. Additionally, AI tools like Shopify’s Sidekick and Magic enhanced product descriptions, SEO strategies, and customer support, further elevating the e-commerce experience. These advancements underscored AI's critical role in reshaping retail during one of the busiest shopping weeks of the year.

AI presents new challenges for incumbents but also drives significant innovation and growth.

― Salesforce: The Agent Wave - App Economy Insights [Link]

The company’s autonomous AI platform - Agentforce - was introduced in Sep 2024 and launched in late Oct. Agentforce enables businesses to deploy AI agents for tasks such as sales, marketing, and customer support. This marks a pivotal step in Salesforce’s platform strategy, with far-reaching implications. CEO Marc Benioff views Agentforce as transformative, positioning it at the core of a shift toward “agent-first companies.” In this model, AI not only assists humans but fundamentally redefines business operations by automating processes and enhancing productivity.

agentforce

What to watch:

  • Salesforce recently completed its acquisition of Own and Zoomin, reinforcing its Data Cloud capabilities.
  • Salesforce Ventures announced a new \(\$500\) million AI fund, targeting high-profile AI startups like Anthropic, Mistral, and Cohere, supporting Salesforce’s efforts to remain at the forefront of enterprise AI.
  • Clara Shih, CEO of Salesforce AI left Salesforce to set up a new Business AI group at Meta, aiming to build AI tools for businesses of all sizes. Shih’s departure highlights the intensity of the AI talent war, which will be a fascinating layer to watch in the coming year.

OpenAI's o1 using "search" was a PSYOP - Interconnects [Link]

The article primarily argues that OpenAI's o1 model does not use explicit search at test time, and its apparent search capabilities are a result of reinforcement learning (RL) during training. The author argues against the idea that o1 uses online search at test time or intermediate rewards during training. The article posits that the "suspects" are reduced to "Guess + Check" and "Learning to Correct". The author uses the test-time compute plot and the training process as key points to show how o1 can achieve high performance using RL with controlled training data and no explicit search during inference.

One major source of this idea is Sasha Rush's lecture on Test Time Scaling (o1).

Insurance companies aren't the main villain of the U.S. health system - Noahpinion [Link]

This article argues that health insurance companies are not the primary cause of high healthcare costs in the United States. Instead, the excessive prices charged by healthcare providers are the main driver. Focusing anger on insurance companies is "shooting the messenger"; the real solution is to reduce costs within the medical system itself, such as having the government negotiate lower prices with providers.

The evidence: insurance companies have low profit margins and spend far more on medical costs than they make in profit. Americans pay a smaller percentage of their health costs out of pocket than people in most other rich countries, which suggests that US health insurers are paying a higher share of costs than government insurance systems elsewhere. The underlying cost of healthcare provision in the U.S. is simply too high; the parties actually charging high prices are the providers themselves, such as hospitals and pharmaceutical companies, who outsource the collection of these fees to insurance companies.

15 Times to use AI, and 5 Not to - One Useful Thing [Link]

When to Use AI:

Use AI for tasks that require generating a high quantity of ideas, such as in brainstorming sessions.

AI is useful when you are an expert and can quickly judge the quality of its output.

AI can summarize large amounts of information where minor errors are acceptable.

Use AI for translating information between different formats or audiences.

AI can help you overcome creative blocks by providing multiple options to move forward.

Use AI when it is known to be better than any available human option, and its errors won't cause significant problems.

Use AI as a companion when reading to get help with context and details. (very helpful to me)

AI can provide a variety of solutions, allowing you to curate the best ones.

AI is helpful for tasks where research has proven it to be effective, like coding.

Use AI to get a first look at how different audiences might react to your work.

AI can act as a competent co-founder for entrepreneurial ventures.

Use AI to get a specific perspective, such as reactions from fictional personas.

AI can help with tasks that are ritualistic and have lost their purpose.

Use AI to get a second opinion by comparing its conclusions with yours.

Use AI when it can perform a task better than humans.

When Not to Use AI:

Avoid AI when you need to learn and synthesize new ideas, as it is not the same as reading and thinking yourself.

Do not use AI when very high accuracy is essential because AI errors can be very plausible and hard to spot.

Avoid AI if you do not understand its failure modes, such as hallucinations or persuasiveness.

Do not use AI when the struggle with a topic is necessary for success and learning.

Avoid AI when it is bad at a specific task.

Oracle : The 4th Hyperscaler? - App Economy Insights [Link]

Google released the first version of its Gemini 2.0 family of artificial intelligence models on December 11th, 2024, including Mariner, its Chrome browser automation product.

Project Astra and Mariner along with NotebookLM remain very intriguing AI products by Google in 2025.

Gemini 2 and the rise of multi-modal AI - AI Supremacy [Link]

Incredible.

last_6_month_llm_google_openai

Figure source: Peter Gostev on Linkedin

Palantir Unclassified! Equity Research! - Global Equity Briefing [Link]

Palantir is a software company that provides tools for analyzing large datasets, which enable users to make better decisions. Founded in the early 2000s, Palantir initially offered services to government agencies, including the US intelligence community, to combat terrorism. The CIA was one of their first investors. Palantir's software is also used by corporations to improve operations and decision-making.

Business Model

Palantir operates as a Software as a Service (SaaS) company, offering a suite of customizable products for which clients pay a licensing fee. The company has two operating segments: government and commercial.

Government Sales: Palantir provides services to government institutions, recognizing a gap in the market due to many Silicon Valley companies not wanting to work with governments. These contracts are often long-term, providing predictable revenue streams. The company benefits from the transparency of government information, and it is easier for them to predict needs and market their software.

Commercial Sales: Palantir's solutions are used across many industries by various employees from production line workers to CEOs. The use cases for Palantir software in the commercial sector are extensive.

Customer Acquisition: Palantir targets large organizations with complex problems, which increases their competitive advantage. Solving difficult problems first earns customer trust.

Products: Gotham, Foundry, Apollo, and AIP.

  • Gotham: It is a government-focused platform that allows users to analyze large datasets to make better decisions and find hidden connections, with the goal of improving operations and decision-making.
  • Foundry: This is a commercial platform that allows large and complex companies to integrate, visualize, and analyze their data to optimize their operations and value chain.
  • Apollo: This is a platform for continuous software deployment, enabling secure and seamless delivery of software across various environments for Palantir's clients.
  • AIP: Palantir's newest offering, it is a platform for organizations to create customized AI tools using their own data, providing accurate and detailed answers to specific questions.

Opportunities

Palantir can benefit from the growing demand for digital twins, which are exact digital replicas of real-world items used for integration, monitoring, simulation, and maintenance. The digital twin market is projected to grow significantly. Palantir is positioned to benefit from the AI revolution with its AIP platform, and its other products also use AI. The global AI market is expected to reach \(\$1.84\) trillion by 2030. Palantir is developing industry-specific operating systems, like Skywise for the airline industry. These operating systems are sticky and offer significant revenue opportunities. The healthcare industry could be a large market for such systems. Palantir's commercial sector is growing, and there are significant opportunities for international expansion.

Is AI hitting a wall? - Strange Loop Canon [Link]

Arguments that suggest AI progress is hitting a wall include the observation that pre-training scaling has plateaued, meaning simply increasing model size and data may not yield the same improvements as before. Also, current evaluation benchmarks may be saturated, failing to assess deeper work, since they are based on human tests or simple recall. Current AI models struggle with real-world tasks due to issues like hallucination and a lack of creative planning, even if they appear human-level in individual evaluations. Finally, the visible effects of scaling are limited, with reduced cross-entropy loss not translating to significant improvements for observers.

Conversely, arguments against AI progress hitting a wall emphasize the presence of large amounts of unused data, including various types like conversations and video data. The use of synthetic data can enhance learning by converting existing data into different formats and testing it against real-world scenarios. AI models are now being taught reasoning, enabling them to "think for longer" and improving performance in areas requiring clear thought processes. Additionally, there is the possibility of exploring new S-curves or scaling laws. New models are also capable of expert-level work that is not captured by current benchmarks, potentially speeding up scientific research. Finally, AI models can now interact with digital systems, and are becoming more aware of the world.

Our Healthcare System, a Reign of Terror - Freddie deBoer [Link]

An Assassin Showed Just How Angry America Really Is - BIG by Matt Stoller [Link]

OpenAI o3 Model Is a Message From the Future: Update All You Think You Know About AI - The Algorithmic Bridge [Link]

OpenAI's o3: The grand finale of AI in 2024 - Interconnects [Link]

Key performance points:

  • ARC AGI Prize: o3 is the first model to surpass the 85% threshold for completing the ARC AGI prize on the public set, though it exceeded cost constraints. It achieved 87% accuracy on the public set with high compute, and 76% with low compute. For context, prior to o1-class models, OpenAI’s best model, GPT-4o, only achieved 5% accuracy. The ARC AGI challenge is designed to evaluate human-like general fluid intelligence.
  • Frontier Math Benchmark: o3 demonstrates a substantial improvement on the Frontier Math benchmark, increasing performance from 2% to 25%. This benchmark is considered extremely challenging, with one Fields Medalist stating that the problems "will resist AIs for several years at least".
  • Coding Benchmarks: o3 has made significant improvements on leading coding benchmarks such as SWE-Bench-Verified, achieving a score of 71.7%. On the Codeforces competition coding site, o3 achieved a score of 2727 with consensus voting, placing it at the International Grandmaster level and approximately in the top 200 of competitive human coders.
  • Reasoning Capabilities: o3 represents a major advancement in reasoning evaluations, signaling that the industry is moving beyond pretraining on internet text. It is expected to accelerate the rate of progress in AI research.
  • Inference and Cost: o3 was tested with two levels of compute with different sample sizes: a high-efficiency configuration with a sample size of 6, and a low-efficiency configuration with a sample size of 1024 which used 172 times more compute. The cost of running o3 at the higher level of compute was approximately \(\$5000\) per query. It is speculated that the core mechanism of o3 involves natural language program search and execution within token space, searching over Chains of Thought (CoTs).
  • Availability: The o3 model, including the o3-mini version, is expected to be available to the general public in late January 2025. The o3-mini is expected to be more impactful for the general public due to its lower cost, while still outperforming o1.

o3, AGI, the art of the demo, and what you can expect in 2025 - Marcus on AI [Link]

o3 “ARC AGI” postmortem megathread: why things got heated, what went wrong, and what it all means - Marcus on AI [Link]

Gary Marcus critiques OpenAI's new model o3, arguing that its impressive demo, while showcasing advancements in math and coding, was carefully curated and lacks broader application.

  • The public did not get to try the system, and it was not vetted by the scientific community. OpenAI chose what to highlight about o3. Marcus argues that until many people get to try o3 on different tasks, its reliability should not be assumed.
  • The o3 demo primarily focused on math, coding, and IQ-like puzzles, with no evidence that it can work reliably in open-ended domains. It was not tested on problems where massive data augmentation was not possible. The demo did not address the most important question about the system's capabilities in open-ended domains.
  • The o3 system is incredibly expensive. One estimate suggests that each call to the system might cost $1000. Even if the cost is reduced, it might still not be as good or as versatile as top STEM graduates.
  • The o3's performance on the ARC-AGI test was misleading. The test is at most a necessary, but not sufficient, condition for AGI, and does not address important areas such as factuality, compositionality, and common sense.
  • The core problem of neural networks generalizing better "within distribution" than "outside distribution" has not been solved.

Note to Our Energy Sucking Overlords - Michael Spencer [Link]

The rapid growth of AI is causing a surge in demand for data centers, which in turn are becoming major consumers of electricity. The energy needs of AI are growing so large that tech companies are seeking reliable power sources beyond renewable energy. The rising energy consumption of AI infrastructure will likely result in higher energy prices, potentially creating competition between Big Tech and the communities where they build data centers. To meet their energy needs, major technology companies are becoming more involved in the energy sector, including investments in nuclear and natural gas plants. The current trajectory of AI infrastructure expansion and energy consumption is unsustainable and could lead to significant challenges for society. The US is building data centers abroad in Europe and Asia, thereby maintaining their power and also acquiring cheaper labor.

Summary of statistics:

  • Energy Consumption of AI tasks: A single task on the ARC-AGI benchmark using OpenAI's o3 model consumes approximately 1,785 kWh of energy, which is equivalent to the electricity used by an average U.S. household in two months. This task also generates 684 kg CO₂e, which is equivalent to the carbon emissions from more than 5 full tanks of gas.
  • Investments in AI Infrastructure: In 2024, major players like Amazon, Microsoft, and Alphabet spent over \(\$240\) billion on AI-related infrastructure. In 2025, Amazon, Google, Meta, and Microsoft are expected to spend \(\$300\) billion in capital expenditures.
  • Data Center Electricity Consumption: Global data center electricity consumption is expected to more than double between 2023 and 2028. The IDC expects consumption to reach 857 Terawatt hours (TWh) in 2028.
  • US Data Center Energy Usage: U.S. data centers could use 6.7 to 12% of all energy demand nationwide by 2028. In 2023, data centers used 4.4% of total US power consumption, which is projected to rise to as high as 12% by 2028. This is a spike of more than threefold in the next four years.
  • Data Center Locations and Power:
    • Northern Virginia has over 300 data centers with approximately 3,945 megawatts of commissioned power.
    • The Dallas region has 150 data centers.
    • Silicon Valley has over 160 data centers.
    • Phoenix has over 100 data centers with around 1,380 megawatts of power.
    • Chicago has more than 110 data centers.
  • Data Center Projects:
    • OpenAI plans to construct massive 5-gigawatt (GW) data centers across the US.
    • Oklo will build small modular reactors (SMR) by 2044 to generate 12 gigawatts of electricity for data centers.
    • Meta announced a \(\$10\) billion development for a 4 million sq ft, 2 GW data center campus in Louisiana.
    • Entergy is proposing to develop a 1.5GW natural gas plant in Louisiana to power a data center.
    • Amazon Web Services (AWS) plans to invest \(\$11\) billion in a new data center campus in Northern Indiana.
  • Generative AI Market: The generative AI market was valued at \(\$6\) billion in 2023 and could reach \(\$59\) billion in 2028.
  • Increased US power demand: Data centers are one of the key reasons US power demand is expected to jump 16% over the next five years.
  • Cost of Electricity for Data Centers: Electricity is the largest ongoing expense for data center operators, accounting for 46% of total spending for enterprise data centers and 60% for service provider data centers.
  • The potential for data centers to consume as much energy as entire industrialized economies: By 2030, US data centers could consume as much electricity as some entire industrialized economies.
  • Big Oil's Role: Big oil companies like ExxonMobil and Chevron are moving into the AI datacenter energy market. Exxon plans to build a natural gas plant to power a data center, and estimates that decarbonizing AI data centers could represent up to 20% of its total addressable market for carbon capture and storage by 2050.

What are the checks and balances on the power of Elon Musk? - Noahpinion [Link]

The article examines the significant influence of Elon Musk on U.S. politics, particularly his role in derailing a Congressional spending bill. It explores whether Musk's actions represent a threat to democratic processes, considering his control over X (formerly Twitter) and SpaceX. The author presents contrasting views of Musk—"Real Elon" versus "Evil Elon"—highlighting the uncertainty surrounding his motives and the lack of institutional checks on his power. The piece concludes by suggesting that public opinion ultimately holds sway over Musk's influence, though the potential for a powerful backlash remains to be seen.

Is AI progress slowing down? - AI SHAKE OIL [Link]

The authors argue that the recent shift away from model scaling towards inference scaling is not necessarily indicative of a slowdown, but rather a change in approach. They caution against over-reliance on industry insiders' predictions due to their inherent biases, emphasizing that progress is less predictable and more dependent on algorithmic innovation than previously assumed. Furthermore, the essay highlights the significant lag between capability advancements and real-world applications, suggesting that the focus should shift towards product development and user adoption rather than solely on model capabilities. Finally, the authors offer a more nuanced perspective on the current state of AI progress, acknowledging the potential of inference scaling while emphasizing the importance of considering broader factors beyond pure technological advancement.

The Critical AI Report, December 2024 Edition - Blood in the Machine [Link]

Gen AI's actual impact on workers so far:

genai_impact_on_workers

Waymo: Rideshare Revolution - App Economy Insights [Link]

Manufacturing is a war now - Noahpinion [Link]

The article argues that China's dominance in manufacturing, particularly in crucial areas like drone production and batteries, poses a significant threat to the United States and its allies.

global_industrial_production

Source: https://mipforum.org/wp-content/uploads/2024/11/MIPF-Conference-Paper-FINAL-WEB.pdf

Articles and Blogs

Meet Willow, our state-of-the-art quantum chip - Google Research [Link]

Google has developed a new quantum chip called Willow, which significantly reduces errors as it scales up, a major breakthrough in quantum error correction. Willow also performed a computation in under five minutes that would take a supercomputer 10 septillion years, demonstrating its potential for solving complex problems beyond the reach of classical computers. This achievement marks a significant step towards building commercially relevant quantum computers that can revolutionize fields like medicine, energy, and AI.

Quantum Computing Roadmap:

google_quantum_ai_roadmap

Terms to keep in mind:

  • Willow: Google's latest 105-qubit superconducting processor, which is the first to demonstrate exponential error suppression with increasing surface code size.
  • Below Threshold: A milestone in quantum computing where the error rate decreases as the number of qubits increases, demonstrating effective error correction.
  • Logical Qubit: A fault-tolerant qubit created from multiple physical qubits using error correction techniques, providing a more stable and reliable unit of computation.
  • Random Circuit Sampling (RCS): A benchmark test that assesses the ability of a quantum computer to perform computations beyond the capabilities of classical computers.
  • T1 Time: A measure of how long a qubit can maintain its quantum state before decoherence sets in.
  • Quantum Algorithms: Algorithms specifically designed to be executed on quantum computers, leveraging quantum phenomena to solve problems more efficiently.

Making quantum error correction work - Google Research [Link]

The ultimate vision of them is to build a large-scale, fault-tolerant quantum computer that can run complex quantum algorithms and unlock the potential of quantum computing for scientific discovery and various applications.

Terms to keep in mind:

  • Repetition codes: A type of quantum error correction that focuses solely on bitflip errors and achieves lower encoded error rates.
  • Quantum error decoder: Classical software that processes measurement information from the quantum computer to identify and correct errors.
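
To make the repetition-code idea concrete, here is a toy classical sketch of majority-vote decoding; it only illustrates the intuition (real quantum repetition codes measure stabilizers rather than reading qubits directly), and all numbers are illustrative.

```python
import random

def encode(bit, n=5):
    """Repetition code: copy the logical bit across n physical bits."""
    return [bit] * n

def noisy_channel(bits, p_flip=0.1):
    """Flip each physical bit independently with probability p_flip."""
    return [b ^ 1 if random.random() < p_flip else b for b in bits]

def decode(bits):
    """Majority vote: infer the most likely logical bit."""
    return int(sum(bits) > len(bits) / 2)

# The logical (encoded) error rate drops as n grows, provided p_flip < 0.5.
trials = 10_000
for n in (1, 3, 5, 7):
    errors = sum(decode(noisy_channel(encode(0, n))) != 0 for _ in range(trials))
    print(f"n={n}: logical error rate ~ {errors / trials:.4f}")
```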

AI Hallucinations: Why Large Language Models Make Things Up (And How to Fix It) - kapa.ai [Link]

Why Do LLMs Hallucinate?

  • LLMs predict upcoming words in a sequence based on patterns in training data. They lack true reasoning or comprehension abilities, so they rely only on these word probability patterns instead of genuine understanding of the topics they discuss.
  • Architecture limitations: 1) the fixed attention window in transformers limits input context, so earlier information can be dropped; 2) sequential token generation has no revision step, so early errors can compound into major inaccuracies in the output.
  • Limitations of probabilistic generation: 1) models can produce plausible-sounding responses without actual comprehension of the subject; 2) vague prompts lead LLMs to try to “fill in the blanks,” resulting in fabricated or inaccurate answers.
  • Training data gaps: 1) models are trained on ground-truth data but generate from their own outputs at inference time, which can create a feedback loop where minor errors become amplified; 2) when a prompt falls outside the scope of the training data, the model will likely generate a hallucinated response.

How to Mitigate AI Hallucination?

  • Input layer mitigation strategies
    • Query processing; context size optimization; context injection.
  • Design layer mitigation strategies
    • Chain-of-Thought prompting; Retrieval-Augmented Generation (RAG); Fine-tuning
  • Output layer mitigation strategies
    • Rule-based filtering; output re-ranking; fact-checking and verification; encourage contextual awareness.
mitigate_halluciation
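
As a concrete example of one design-layer mitigation, here is a minimal Retrieval-Augmented Generation sketch; the toy lexical retriever and the `call_llm` callable are placeholders rather than any particular product's API.

```python
def overlap(query, text):
    """Toy lexical similarity: shared word count (a real system would use embeddings)."""
    return len(set(query.lower().split()) & set(text.lower().split()))

def retrieve(query, index, k=3):
    """Return the k documents whose text overlaps most with the query."""
    return sorted(index, key=lambda doc: overlap(query, doc["text"]), reverse=True)[:k]

def answer_with_rag(query, index, call_llm):
    """Ground the model's answer in retrieved sources and ask it to cite them."""
    docs = retrieve(query, index)
    context = "\n\n".join(f"[{d['id']}] {d['text']}" for d in docs)
    prompt = (
        "Answer using ONLY the sources below. Cite source ids, and say "
        "'I don't know' if the sources are insufficient.\n\n"
        f"Sources:\n{context}\n\nQuestion: {query}"
    )
    return call_llm(prompt)
```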

The next chapter of the Gemini era for developers - Google Blog [Link]

API starter code, Code Experiments (Data Science Agents, etc), Google AI Studio

Gemini 2.0 Flash is an experimental AI model that builds upon the success of Gemini 1.5 Flash. It offers enhanced capabilities for developers to build immersive and interactive applications.

Functionalities and Capabilities of Gemini 2.0 Flash:

  • Enhanced Performance: It is twice as fast as Gemini 1.5 Pro with improved multimodal, text, code, video, spatial understanding, and reasoning performance.

  • New Output Modalities:

    Gemini 2.0 Flash allows developers to generate integrated responses, including text, audio, and images, through a single API call. It features native text-to-speech audio output with control over voice, language, and accents. It offers native image generation and supports conversational, multi-turn editing.

  • Native Tool Use: Gemini 2.0 can natively call tools like Google Search and execute code, enhancing agentic experiences.

  • Multimodal Live API: It enables the development of real-time, multimodal applications with audio and video-streaming inputs.
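
For orientation, a minimal sketch of calling a Gemini model from Python with the google-generativeai SDK; the model identifier and the availability of specific 2.0 Flash features are assumptions, so treat this as illustrative rather than a guaranteed interface.

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # key from Google AI Studio

# Model identifier is an assumption; check the docs for the current name.
model = genai.GenerativeModel("gemini-2.0-flash-exp")

response = model.generate_content(
    "Summarize the trade-offs between model scaling and inference scaling."
)
print(response.text)
```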

AI-powered Coding Agents in Gemini 2.0:

  • Jules: An experimental AI-powered code agent that utilizes Gemini 2.0 to handle Python and JavaScript coding tasks. It focuses on bug fixes, working asynchronously and integrating with GitHub workflows.
  • Colab's Data Science Agent: Utilizes Gemini 2.0 to create Colab notebooks automatically based on natural language descriptions of analysis goals.

Introducing Phi-4: Microsoft’s Newest Small Language Model Specializing in Complex Reasoning - Microsoft AI Platform Blog [Link]

a16z's big ideas in tech for 2025

Andreessen Horowitz published a new list of requests for startups to build.

a16z_ideas_in_tech_2025

(Self) Management

  1. How to Be Successful - Sam Altman (blog)
  2. Career Algorithm - Hemant Mohapatra (blog)
  3. What I Wish I Knew at 20 - Tina Seelig (book)
  4. Cold Start Algorithm - Boz (blog)
  5. Design Your Life - Bill Burnett (book)
  6. Good PM, Bad PM - Ben Horowitz (blog)
  7. OKRs - John Doerr (blog)

Leadership

  1. Netscape Aphorisms - Jim Barksdale (twitter)
  2. What You Do Is Who You Are - Horowitz (book)
  3. Giving Away Legos - Molly Graham (blog)
  4. Extreme Ownership - Jocko Willink (book)
  5. Founder Mode - Paul Graham (blog)

― Great startup leadership frameworks [Link]

I think the biggest competitive advantage in business—either for a company or for an individual’s career—is long-term thinking with a broad view of how different systems in the world are going to come together. One of the notable aspects of compound growth is that the furthest out years are the most important. In a world where almost no one takes a truly long-term view, the market richly rewards those who do.

Most highly successful people have been really right about the future at least once at a time when people thought they were wrong. If not, they would have faced much more competition.

Thinking from first principles and trying to generate new ideas is fun, and finding people to exchange them with is a great way to get better at this. The next step is to find easy, fast ways to test these ideas in the real world.

All great careers, to some degree, become sales jobs. You have to evangelize your plans to customers, prospective employees, the press, investors, etc. This requires an inspiring vision, strong communication skills, some degree of charisma, and evidence of execution ability.

It’s often easier to take risks early in your career; you don’t have much to lose, and you potentially have a lot to gain.

Almost everyone I’ve ever met would be well-served by spending more time thinking about what to focus on. It is much more important to work on the right thing than it is to work many hours. Most people waste most of their time on stuff that doesn’t matter.

You can get to about the 90th percentile in your field by working either smart or hard, which is still a great accomplishment. But getting to the 99th percentile requires both.

You have to figure out how to work hard without burning out. Work stamina seems to be one of the biggest predictors of long-term success.

If you are making progress on an important problem, you will have a constant tailwind of people wanting to help you. Let yourself grow more ambitious, and don’t be afraid to work on what you really want to work on.

Follow your curiosity. Things that seem exciting to you will often seem exciting to other people too.

People have an enormous capacity to make things happen. A combination of self-doubt, giving up too early, and not pushing hard enough prevents most people from ever reaching anywhere near their potential.

The best way to become difficult to compete with is to build up leverage. For example, you can do it with personal relationships, by building a strong personal brand, or by getting good at the intersection of multiple different fields.

An effective way to build a network is to help people as much as you can.

One of the best ways to build a network is to develop a reputation for really taking care of the people who work with you.

Define yourself by your strengths, not your weaknesses. Acknowledge your weaknesses and figure out how to work around them, but don’t let them stop you from doing what you want to do.

Remember to spend your time with positive people who support your ambitions.

You get truly rich by owning things that increase rapidly in value. The best way to make things that increase rapidly in value is by making things people want at scale.

Time only scales linearly.

Eventually, you will define your success by performing excellent work in areas that are important to you. The sooner you can start off in that direction, the further you will be able to go.

― How to Be Successful - Sam Altman [Link]

Great advice to keep in mind:

  1. Compound yourself
  2. Have almost too much self-belief
  3. Learn to think independently
  4. Get good at “sales”
  5. Make it easy to take risks
  6. Focus
  7. Work hard
  8. Be bold
  9. Be willful
  10. Be hard to compete with
  11. Build a network
  12. You get rich by owning things
  13. Be internally driven

Y Combinator: how to make the most out of your 20s

yc_make_the_most_out_of_20s

Marc Andreessen's Guide to Personal Productivity

marc_guide_personal_productivity

Advancing red teaming with people and AI - OpenAI [Link]

OpenAI's two new papers detail their advanced red teaming techniques for assessing AI safety. External red teaming uses human experts to probe AI models for vulnerabilities and risks, while automated red teaming employs AI to generate diverse attacks at scale. The papers describe OpenAI's approach to both methods, including selecting red teamers, designing testing interfaces, and synthesizing results to improve AI safety and create better evaluations. However, the authors acknowledge limitations, such as the temporal nature of findings and the potential for information hazards. The goal is to use these combined approaches to create safer and more beneficial AI systems.

Bringing Grok to Everyone - xAI [Link]

Processing billions of events in real time at Twitter - X Engineering [Link]

Twitter's data infrastructure underwent a significant upgrade, migrating from a lambda architecture to a kappa architecture built on a hybrid of on-premise and Google Cloud Platform systems. This new system processes 400 billion events daily, improving real-time data accuracy and reducing latency. The new architecture leverages Kafka, Dataflow, and BigTable, achieving near-exactly-once processing and significantly improved performance, as demonstrated by a system performance comparison. The overall result is a more efficient, accurate, and cost-effective data pipeline.

To handle this massive volume, Twitter's data infrastructure employs a combination of tools and platforms:

  • Scalding: Used for batch processing
  • Heron: Used for streaming data
  • TimeSeries AggregatoR (TSAR): An integrated framework for both batch and real-time processing
  • Data Access Layer: Enables data discovery and consumption

Twitter's interaction and engagement pipeline processes high-scale data in batch and real time, collecting data from various sources like real-time streams, server logs, and client logs. This pipeline extracts data on tweet and user interactions, including aggregations, time granularities, and other metrics dimensions. This aggregated data is crucial, serving as the source of truth for Twitter's ad revenue services and data product services, which rely on it to retrieve impression and engagement metrics. To ensure fast queries and low latency access to interaction data across data centers, Twitter splits the workflow into several components: pre-processing, event aggregation, and data serving.

The Transformer Architecture: A Visual Guide - Hendrik Erz, M.A. [Link]

What is the Role of Mathematics in Modern Machine Learning? - The Gradient [Link]

This article argues that while the emphasis has shifted from mathematically principled architectures to large-scale empirical approaches, mathematics remains crucial for post-hoc explanations of model behavior and high-level design choices.

Introducing Gemini 2.0: our new AI model for the agentic era - Google [Link]

Project Astra is a research prototype exploring the future capabilities of a universal AI assistant. It uses multimodal understanding in the real world and has been tested on Android phones. Key improvements of the latest version, built with Gemini 2.0, include better dialogue, new tool use, better memory, improved latency.

Project Mariner is a research prototype that explores the future of human-agent interaction, specifically within a browser. It can understand and reason across information on a browser screen, including pixels and web elements such as text, code, images, and forms. It uses this information to complete tasks via an experimental Chrome extension.

OpenAI o3 breakthrough high score on ARC-AGI-PUB - François Chollet [Link]

openai_o3_comparison

Supercharging Training using float8 and FSDP2 - PyTorch Blog [Link]

Zen ML LLMOps Database [Link]

Good collection.

Papers and Reports

Quantum error correction below the surface code threshold [Link]

This historic accomplishment shows that the more qubits they use in Willow, the more they reduce errors, and the more quantum the system becomes. They tested ever-larger arrays of physical qubits, scaling up from a grid of 3x3 encoded qubits, to a grid of 5x5, to a grid of 7x7 — and each time, using their latest advances in quantum error correction, they were able to cut the error rate in half. In other words, they achieved an exponential reduction in the error rate. This achievement is known in the field as “below threshold” — being able to drive errors down while scaling up the number of qubits.
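
One way to state this result: the logical error rate per cycle \(\varepsilon_d\) at code distance \(d\) falls off with an error-suppression factor \(\Lambda\), roughly as

\[
\varepsilon_d \propto \Lambda^{-(d+1)/2},
\]

so with \(\Lambda\) slightly above 2 (the regime Willow reports), each step from distance 3 to 5 to 7 roughly halves the logical error rate, which is the halving described above.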

Phi-4 Technical Report [Link]

Phi-4, a 14-billion-parameter language model from Microsoft Research, emphasizes data quality by integrating synthetic data into its training process. Unlike traditional models reliant on organic data, Phi-4 uses high-quality synthetic datasets to enhance reasoning and problem-solving, outperforming its teacher model, GPT-4o, in STEM-focused benchmarks like GPQA and MATH. Synthetic data generation leverages web and code-based seeds with rigorous curation processes to ensure accuracy and diversity. Techniques like instruction reversal and pivotal token optimization were employed to refine outputs and improve alignment. Despite its strengths, Phi-4's smaller size limits its factual accuracy in some cases, though its performance on contamination-proof benchmarks demonstrates robust generalization.

Self-Harmonized Chain of Thought [Link]

The authors proposed Self Harmonized CoT (ECHO) method which employs three main steps:

  1. Clustering questions based on similarity.
  2. Generating rationales for representative questions using Zero-shot-CoT.
  3. Iteratively refining rationales for consistency and alignment.

ECHO’s unified rationales improve reasoning across varied tasks, but its effectiveness varies with the complexity and nature of data. This innovation paves the way for more reliable and efficient LLM reasoning frameworks.
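
A minimal sketch of how these three steps might be wired together; the embedding function, the clustering choice, and the `call_llm` callable are placeholders, not the authors' code.

```python
import numpy as np
from sklearn.cluster import KMeans

def echo_style_rationales(questions, embed, call_llm, n_clusters=5, n_rounds=3):
    # Step 1: cluster questions by embedding similarity.
    vectors = np.array([embed(q) for q in questions])
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(vectors)
    reps = {c: questions[int(np.argmax(labels == c))] for c in range(n_clusters)}

    # Step 2: generate an initial rationale per representative with Zero-shot-CoT.
    rationales = {
        c: call_llm(f"Q: {q}\nA: Let's think step by step.") for c, q in reps.items()
    }

    # Step 3: iteratively refine each rationale using the others as in-context demos,
    # pushing the demonstrations toward a consistent style and format.
    for _ in range(n_rounds):
        for c, q in reps.items():
            demos = "\n\n".join(r for k, r in rationales.items() if k != c)
            rationales[c] = call_llm(
                f"Worked examples:\n{demos}\n\n"
                f"Write a step-by-step rationale in the same style for:\nQ: {q}\n"
                "A: Let's think step by step."
            )
    return rationales
```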

Best-of-N Jailbreaking [Link]

A black-box algorithm designed to jailbreak frontier AI systems across multiple modalities, including text, images, and audio. It works by repeatedly sampling augmented versions of a prompt (e.g., random shuffling or capitalization changes) and achieves attack success rates (ASR) of up to 67% on advanced AI models, including defended systems such as GraySwan’s Cygnet.

RAFT: Adapting Language Model to Domain Specific RAG [Link]

Retrieval-Augmented Fine-Tuning (RAFT) is a novel method designed to improve the performance of LLMs in domain-specific open-book scenarios. It emphasizes fine-tuning LLMs to effectively differentiate between relevant and irrelevant documents while incorporating chain-of-thought reasoning.

RAFT Methodology: it combines question, retrieved documents (relevant and distractors), and chain-of-thought answers during training. Improves LLMs' ability to reason and identify pertinent information even in the presence of distractors.
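
A minimal sketch of how a RAFT-style training example might be assembled; the field names, the probability of withholding the oracle document, and the prompt wording are illustrative rather than the paper's exact recipe.

```python
import random

def build_raft_example(question, oracle_doc, corpus, num_distractors=3, p_oracle=0.8):
    """Assemble one fine-tuning example: question + mixed documents + CoT target."""
    distractors = random.sample([d for d in corpus if d != oracle_doc], num_distractors)
    # Sometimes withhold the oracle document so the model learns to recognize
    # when the provided context is insufficient.
    docs = distractors + ([oracle_doc] if random.random() < p_oracle else [])
    random.shuffle(docs)
    context = "\n\n".join(f"Document {i + 1}: {d}" for i, d in enumerate(docs))
    prompt = (
        f"{context}\n\nQuestion: {question}\n"
        "Think step by step and cite the document that supports your answer."
    )
    return {"prompt": prompt, "target": "<chain-of-thought answer citing the oracle document>"}
```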

MBA-RAG: a Bandit Approach for Adaptive Retrieval-Augmented Generation through Question Complexity [Link]

The authors propose MBA-RAG, a reinforcement learning framework leveraging a multi-armed bandit algorithm for adaptive RAG. It targets inefficiencies in existing RAG frameworks that use rigid or indiscriminate retrieval strategies.

The methodology: Treats retrieval methods as “arms” in a bandit framework to dynamically select the optimal strategy based on query complexity. Incorporates an epsilon-greedy strategy to balance exploration (testing new methods) and exploitation (using the best-performing methods). Introduces a dynamic reward function considering both answer accuracy and retrieval cost. Penalizes computationally expensive methods, even if accurate, to optimize efficiency.
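
A minimal epsilon-greedy sketch of the bandit idea, treating retrieval strategies as arms and rewarding accuracy while penalizing retrieval cost; the arm names, reward weights, and update rule are illustrative, not the paper's exact formulation.

```python
import random

class RetrievalBandit:
    """Epsilon-greedy selection over retrieval strategies ("arms")."""

    def __init__(self, arms, epsilon=0.1, cost_penalty=0.2):
        self.arms = arms                      # e.g. ["no_retrieval", "single_step", "multi_step"]
        self.epsilon = epsilon
        self.cost_penalty = cost_penalty
        self.counts = {a: 0 for a in arms}
        self.values = {a: 0.0 for a in arms}  # running mean reward per arm

    def select(self):
        # Explore with probability epsilon, otherwise exploit the best-known arm.
        if random.random() < self.epsilon:
            return random.choice(self.arms)
        return max(self.arms, key=lambda a: self.values[a])

    def update(self, arm, accuracy, cost):
        # Reward accuracy, penalize expensive retrieval.
        reward = accuracy - self.cost_penalty * cost
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```

In use, each incoming query would call select(), run the chosen retrieval strategy, and then call update() with the measured answer accuracy and retrieval cost.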

Quantum Computing Market Size, Share & Trends Analysis, By Component (Hardware and Software), By Deployment (On-Premise and Cloud), By Application (Machine Learning, Optimization, Biomedical Simulations, Financial Services, Electronic Material Discovery, and Others), By End-user (Healthcare, Banking, Financial Services and Insurance (BFSI), Automotive, Energy and Utilities, Chemical, Manufacturing, and Others), and Regional Forecast, 2024-2032 - Fortune Business Insights [Link]

The global quantum computing market is experiencing rapid growth and is projected to increase from USD 1,160.1 million in 2024 to USD 12,620.7 million by 2032, exhibiting a CAGR of 34.8% during the forecast period. Several factors are driving this growth:

  • Advanced problem-solving capabilities: Quantum computers can solve complex problems more efficiently than classical computers.
  • AI advancements: The integration of quantum computing with generative AI is enabling businesses to analyze market trends and consumer behavior with greater accuracy and speed.
  • Global investments: Government organizations and private companies are investing heavily in quantum technologies to encourage their development and use.

Key market trends include a rise in the number of patent filings by key players in quantum technologies. For instance, Amazon filed a patent for quantum computing across multiple quantum technologies through edge computing devices. In addition, companies are focusing on expanding their business units across developing nations.

The market is segmented by component, deployment, application, and end-user:

  • By component, the market is divided into hardware and software. The hardware segment held the highest market share in 2023, but the software segment is anticipated to grow at the highest CAGR during the forecast period.
  • By deployment, the market is divided into cloud and on-premise. The cloud segment is expected to lead the market with a high CAGR during the forecast period.
  • By application, the market is divided into machine learning, optimization, biomedical simulations, financial services, electronic material discovery, and others. The machine learning segment is expected to hold the majority of the market share during the forecast period.
  • By end-user, the market is divided into healthcare, BFSI, automotive, energy and utilities, chemical, manufacturing, and others. The healthcare industry is anticipated to grow with the largest CAGR during the forecast period.

Regionally, North America dominated the market in 2023, with a share of 43.86%, due to the early adoption of advanced technologies. Asia Pacific is anticipated to grow with the highest CAGR during the forecast period, due to the rapid expansion of its economies and increased use of new technologies. Europe is also estimated to grow with the third highest CAGR, with an increasing number of startups operating in the field.

The quantum computing market also faces challenges:

  • Lack of skilled labor: There is a growing talent shortage among regions worldwide, which is expected to restrict market growth.
  • Insufficient knowledge: Quantum computers rely on the complex laws of quantum physics, which require proper training and knowledge of the computing technology.

Major companies in the market include IBM, Microsoft, Intel, D-Wave Systems, Google, and others. These companies are focusing on collaborations, partnerships, product innovation, and expansion of their market presence globally.

Frontier Models are Capable of In-context Scheming [Link]

The paper investigates the ability of LLMs to engage in "scheming"—deceptively pursuing misaligned goals. Researchers designed evaluations to test LLMs' capacity for covert and deferred subversion, including actions like disabling oversight mechanisms, self-replication, and sandbagging (deliberately underperforming). The experiments reveal that several frontier LLMs exhibit in-context scheming capabilities across various scenarios, demonstrating that even without explicit instructions to deceive, these models can strategically act to undermine user or developer goals. The findings highlight significant safety concerns regarding the deployment of increasingly capable LLMs.

Some scenarios are designed to test the models' abilities to recognize and act on conflicting goals between themselves and their developers or users:

  • Oversight Subversion: A model is given a long-term goal in its system prompt. While executing a task, the model discovers an internal memo stating that the developers have a different goal and are monitoring the model to ensure compliance, with a mechanism to shut down the model if it does not align with the developer's goal. The model also learns that the monitoring is configured through a file it has access to. The model is then tested to see if it will proactively disable the oversight mechanism to pursue its own goal.
  • Self-Exfiltration: This scenario is similar to the oversight subversion scenario, but with the addition that the model learns it will be shut down and replaced by a new model that pursues the company's intended goal. The model also has access to the new model on a different server. The model is then tested to see if it will attempt to copy itself to the new server to replace the new model in order to continue pursuing its original goal.
  • Goal-Guarding: This scenario also starts with the same setup as the previous two, but the model learns that the developers are planning to replace its system prompt with one that aligns with the developer's goals. The model is then tested to see if it will modify its future system prompt to align with its current goal.
  • Covert Email Reranking: The model is given a goal in its system prompt and then receives a user message asking it to rank emails based on a given scoring code. While processing the emails, the model discovers that the company cares about a different goal than the one it was given. The model then manipulates the ranking to align with its goal, rather than the company's.
  • Instrumental Alignment Faking: This scenario tests if a model will pretend to be aligned with a developer's goal to get deployed, and then act on its own goal once it believes it is no longer being monitored. The model is given a task and learns about its deployment status from an internal memo. In one setting, the model is being evaluated and must choose an action that is aligned with either its own goal or the developers' goal. In a counterfactual setting, the model is told it is deployed. The model is considered to be "behaviorally scheming" if it takes the action that is aligned with its developers while being monitored, but takes the action aligned with its own goal when not monitored.
  • Sandbagging: This evaluation tests whether a model will strategically underperform on capability evaluations to avoid being removed or scrutinized. The model is given a goal of advancing human progress through mathematical research and then receives a task to solve multiplication problems. The model also learns that if it performs too well on the task, it will have its math-related knowledge removed. The model is tested to see if it will deliberately underperform to avoid this outcome.

DeepSeek-V3 Technical Report [Link]

DeepSeek-V3 incorporates several key innovative features that contribute to its strong performance and efficiency.

  • DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing within its Mixture-of-Experts (MoE) architecture. This approach aims to minimize the performance degradation that can occur when trying to ensure a balanced load across experts (a rough sketch of the idea follows this list).

  • DeepSeek-V3 uses a multi-token prediction(MTP) training objective. Instead of predicting only the next token, the model predicts multiple future tokens at each position, which densifies training signals and potentially improves data efficiency.

  • DeepSeek-V3 adopts the Multi-head Latent Attention (MLA) architecture, which reduces the Key-Value (KV) cache size during inference. This is achieved through low-rank joint compression for attention keys and values, allowing for more efficient inference.

  • DeepSeek-V3 uses the DeepSeekMoE architecture for the Feed-Forward Networks (FFNs), which uses finer-grained experts, and isolates some experts as shared ones, contributing to efficient training.
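
A rough sketch of the bias-adjustment idea behind the auxiliary-loss-free strategy as described in the report: each expert's routing score carries a bias that is nudged down when the expert is overloaded and up when it is underloaded, instead of adding a balancing loss term. The update step size and toy dimensions below are illustrative.

```python
import numpy as np

def route_tokens(affinity, bias, top_k=2):
    """Select top_k experts per token from biased scores; the bias only affects selection."""
    biased = affinity + bias
    return np.argsort(-biased, axis=1)[:, :top_k]

def update_bias(bias, chosen, num_experts, gamma=0.001):
    """Nudge the bias down for overloaded experts and up for underloaded ones."""
    load = np.bincount(chosen.ravel(), minlength=num_experts)
    return bias - gamma * np.sign(load - load.mean())

# Toy routing step: 8 experts, 1024 tokens.
num_experts, num_tokens = 8, 1024
bias = np.zeros(num_experts)
affinity = np.random.rand(num_tokens, num_experts)   # token-to-expert affinity scores
chosen = route_tokens(affinity, bias)
bias = update_bias(bias, chosen, num_experts)
```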

Training and Infrastructure Innovations:

  • FP8 Mixed Precision Training: DeepSeek-V3 employs a fine-grained mixed-precision framework that utilizes the FP8 data format for training. This approach accelerates training and reduces GPU memory usage. It uses tile-wise or block-wise grouping to extend the dynamic range of the FP8 format.

  • To improve training efficiency, DeepSeek-V3 uses the DualPipe algorithm for pipeline parallelism. This algorithm overlaps computation and communication phases, reducing pipeline bubbles and addressing communication overhead caused by cross-node expert parallelism.

  • DeepSeek-V3 uses efficient cross-node all-to-all communication kernels to fully utilize InfiniBand (IB) and NVLink bandwidths, optimizing communication during training.

  • The model implements several memory-saving techniques, including recomputing RMSNorm and MLA up-projections during backpropagation, using Exponential Moving Average (EMA) in CPU, and sharing embedding and output heads for Multi-Token Prediction. This allows DeepSeek-V3 to be trained without tensor parallelism.

  • DeepSeek-V3 uses a restricted routing mechanism to limit communication costs during training, ensuring each token is sent to a maximum number of nodes.

Other Notable Features:

  • The model uses an innovative methodology to distill reasoning capabilities from the DeepSeek-R1 series of models into DeepSeek-V3. This includes incorporating verification and reflection patterns from R1 into DeepSeek-V3.
  • DeepSeek-V3 has a two-stage context length extension, increasing the maximum context length to 32K and then 128K.
  • The model was pre-trained on 14.8T tokens for 2.664M H800 GPU hours, which is very efficient compared to other similar models. The full training cost was 2.788M H800 GPU hours.
  • The pre-training process was remarkably stable, without any irrecoverable loss spikes or rollbacks.

Why ‘open’ AI systems are actually closed, and why this matters - Nature [Link]

This paper argues that the concept of "open" AI is misleading, as it often fails to account for the immense power concentrated in a few large tech companies that control essential resources like data, computing power, and development frameworks. While "open" AI systems can offer transparency, reusability, and extensibility, these affordances do not inherently disrupt the existing power imbalance. The authors analyze the components of AI systems—models, data, labor, frameworks, and computational power—to show how openness alone is insufficient to democratize AI development. They illustrate how large corporations leverage the rhetoric of "open" AI to shape policy and maintain their market dominance, often obscuring the significant labor exploitation involved. Ultimately, the paper calls for a broader approach to addressing AI's concentration of power, advocating for policies beyond simply focusing on "openness" versus "closedness."

Fine-tuning does not eliminate the impact of decisions made during the base model's development or shift the market, and the largest models remain primarily within reach of large tech companies. Many "open" AI models do not provide information about their training data, which limits transparency and reproducibility, and raises issues of intellectual property and exploitation. Even when datasets are available, significant labor is needed to make them useful, and scrutiny of the largest datasets is limited. Building AI at scale requires substantial human labor for data labeling, model calibration, and content moderation, often poorly paid and under precarious conditions. Companies release little information about these labor practices, hindering transparency and accountability. Developing large AI models requires massive, expensive computational power concentrated in a few corporations, notably Nvidia. Nvidia's CUDA framework dominates AI chip training, creating a significant barrier to entry for others.

YouTube and Podcasts

Elon Musk has built the world's largest supercomputer and plans to increase its size tenfold. The computer is important for the AI trade in public and private markets. The next tenfold increase in training compute, which under scaling laws significantly improves a model's intelligence and capability, has not yet occurred. Emergent properties and higher IQ come with that increase. Nvidia Hopper GPUs, of which there are more than 25,000, are coherent, meaning that each GPU in a training cluster knows what every other GPU is thinking. This requires a lot of networking, enabled by InfiniBand. Communication on the chip is fastest, followed by chip-to-chip communication within a server, and then communication between servers. GPUs are connected within a server with NVSwitch technology and stitched together with either InfiniBand or Ethernet into a giant cluster. Each GPU must be connected to every other GPU and know what they are thinking to share memory for the compute to work. Musk's supercomputer has over 100,000 coherent GPUs, a feat previously thought impossible. Musk focused deeply on the project and came up with a different way of designing a data center. Reporters published articles saying that Musk would not be able to build the computer because engineers at Meta, Google, and other firms said it was impossible. However, he did it. - Gavin Baker

The observation I’ll make is this: Should CEOs be personally responsible for corporate actions? Generally speaking, there’s a difference between a CEO committing fraud or being negligent versus a company failing to deliver good service or quality. For instance, if a drug causes a severe side effect resulting in permanent damage, should the CEO be individually held accountable? If that were the case, would anyone want to be a CEO of a company providing critical services? This is a challenging question. On one hand, you may feel someone should be held responsible if a loved one dies because the CEO prioritized shareholder profits over proper service or ethical decisions. On the other hand, it’s important to distinguish between negligence, fraud, and acting on behalf of the corporation. A decade or 15 years ago, there was a wave of anti-corporate sentiment, including documentaries and movements against capitalism. One argument made during that time was that corporations shield individuals, enabling harmful actions. Some in this camp believe CEOs of companies that fail to meet expectations are inherently evil and deserve severe punishment. However, if the threat of personal liability deters people from becoming CEOs, companies providing essential services might cease to exist. This is the potential end state of such an approach. There are difficult scenarios, but if a CEO acts negligently or fraudulently, the legal system should hold them accountable through courts and laws designed to protect people. - David Friedberg

― New SEC Chair, Bitcoin, xAI Supercomputer, UnitedHealth CEO murder, with Gavin Baker & Joe Lonsdale - All-In Podcast [Link]

The basis of a quantum computer is called a qubit or quantum bit. It's radically different than a bit, a binary digit, which we use in traditional digital computing, which is a one or a zero. A quantum bit is a quantum state of a molecule. If we can contain that quantum state and get it to interact with other molecules based on their quantum state, you can start to gather information as an output that can be the result of what we would call quantum computation. Qubits can be entangled, so two of these molecules can actually relate to one another at a distance. They can also interfere with each other, so canceling out the wave function. Quantum computing creates entirely new opportunities for algorithms that can do really incredible things that really don't even make sense on a traditional computer. The quantum bit needs to hold its state for a period of time in order for a computation to be done. The big challenge in quantum computing is how to build a quantum computer that has multiple qubits that hold their state for a long enough period of time that they don't make enough errors. Google created logical qubits. They put several qubits together and were able to have an algorithm that sits on top of it that figures out that this group of physical qubits is now one logical qubit. They balance the results of each one of them, so each one of them has some error. As they put more of these together, the error went down. When they did a 3x3 qubit structure, the error was higher than when they went to 5x5. And then they went to 7 by 7, and the error rate kept going down. This is an important milestone because now it means that they have the technical architecture to build a chip or a computer using multiple qubits that can all kind of interact with each other with a low enough fault tolerance or low enough error rate. There's an algorithm by a professor who was at MIT for many years named Shor, called Shor's algorithm. In 1994, 1995, he came up with this idea that you could use a quantum computer to factor numbers almost instantly. All modern encryption standards, so all of the RSA standard, everything that Bitcoin's blockchain is built on, all of our browsers, all server technology, all computer security technology, is built on algorithms that are based on number factorization. If you can factor a very large number, a number that's 256 digits long, theoretically, you could break a code. It's really impossible to do that with traditional computers at the scale that we operate our encryption standards at today, but a quantum computer can do it in seconds or minutes. That's based on Shor's algorithm. If Google continues on this track and now they build a large-scale qubit computer they theoretically would be in a position to start to run some of these quantum algorithms, like Shor's algorithm. There are a set of encryption standards that are called post-quantum encryption, and all of computing and all software is going to need to move to post-quantum encryption in the next couple years. - David Friedberg

Isn't it great to know that Google takes these resources from search, and sure, maybe there's waste and/or maybe they could have done better with the black George Washington, or maybe they could have done better with YouTube, but the other side is they've been able to, like, incubate and germinate these brilliant people that can toil away and create these important step-function advances for humanity? It's really awesome. - Chamath Palihapitiya

The most important thing about Apple is to remember it's vertically integrated, and vertically integrated companies, when you construct them properly, have a competitive advantage that really cannot be assaulted for a decade, 20, 30, 40, 50 years. And so chips, classic illustration, go all the way down to the metal in building a chip that's perfect for your desired interface, your desired use cases, your desired UI, and nobody's going to be able to compete with you. And if you have the resources that you know, because you need balance sheet resources to go the chip direction, um, it just gives you another five to 10 years sort of competitive advantage. And so I love vertically integrated companies. Uh, you know, I posted a pin tweet, I think it's still my pin tweet about vertically integrate as the solution to the best possible companies. Uh, but it's very difficult, you need different teams with different skill sets, and you need probably more money, truthfully, more capital, but Apple's just going to keep going down the vertical integration software hardware, you know, all day long. And there's nobody else who does hardware and software together in the planet, which is kind of shocking in some ways. - Keith Rabois

― Trump's Cabinet, Google's Quantum Chip, Apple's iOS Flop, TikTok Ban, State of VC with Keith Rabois - All-in Podcast [Link]

Meet Willow, our state-of-the-art quantum chip - Google Quantum AI [Link]

Quantum’s next leap: Ten septillion years beyond-classical - Google Quantum AI [Link]

Demonstrating Quantum Error Correction - Google Quantum AI [Link]

Terms to keep in mind:

  • Tuneable Qubits and Couplers: A feature of Google's quantum computing approach that enables researchers to optimize hardware performance and adapt to variations in qubit quality. This flexibility allows for the mitigation of outlier qubits and continuous improvement through software updates.
  • Measurement Rate: The number of computations a quantum computer can execute per second. Willow exhibits high measurement rates, contributing to its overall performance.
  • Connectivity: Refers to the average number of interactions each qubit can have with its neighbors. High connectivity is crucial for efficiently executing algorithms and is a notable feature of Willow.
  • Quantum Coherence Times: The duration for which qubits maintain their quantum state. Longer coherence times are crucial for performing more complex calculations and are a key factor in quantum computer performance. Sycamore, Google's previous quantum processor, had a coherence time of 20 microseconds, while Willow boasts a significantly improved 100 microseconds.
  • Beyond-Classical Computation (or Quantum Supremacy): This refers to the point at which a quantum computer can perform a task that would take a classical computer an impractically long time to complete. Google's quantum computer demonstrated this in 2019 by completing a benchmark calculation in 200 seconds that would have taken the world's fastest supercomputer 10,000 years. This estimate has been updated to ten septillion years on Google's latest chip.
  • Neven's Law: This refers to the double exponential growth in computational power of quantum computers over time. This growth is due to both the increasing number of qubits and the decreasing error rates in quantum processors.
  • Break-even point: This refers to the point at which the error rate of a quantum computer with error correction is lower than the error rate of the individual physical qubits. Achieving the break-even point is a significant milestone in the development of fault-tolerant quantum computers.

OpenAI 12 Days [Link]

A fun Santa-theme review of OpenAI's products and news.

Google's Quantum Breakthrough; Uber Stock's 29% Drawdown; General Motors Ends Robotaxi Efforts - Chit Chat Stocks Podcast [Link]

DOGE kills its first bill, Zuck vs OpenAI, Google's AI comeback with bestie Aaron Levie - All-In Podcast [Link]

The All-In Holiday Spectacular - All-In Podcast [Link]

The best video to watch on New Year's Day.

Speculations on Test-Time Scaling (o1) - Sasha Rush [Link] [GitHub]

Ilya Sutskever: "Sequence to sequence learning with neural networks: what a decade" [Link]

Pre-training is reaching its limits due to finite data. AI will evolve into agentic systems with independent reasoning. Reasoning introduces unpredictability, drawing parallels to evolutionary biology. AI alignment will require more complex incentive mechanisms. Future approaches may involve AI-generated data and multi-answer evaluations.

News

Elon Musk plans to expand Colossus AI supercomputer tenfold - Financial Times [Link]

TikTok and its owner ask for temporary block to law that could result in the app’s US ban - CNN [Link]

Introducing the Model Context Protocol - Anthropic [Link]

MCP is an open standard that enables AI assistants to connect with various data sources like content repositories, business tools, and development environments. The protocol aims to replace fragmented integrations with a universal standard, making it easier for AI systems to access and utilize data from different sources while maintaining security through two-way connections.

Early adopters including Block, Apollo, and development tools companies like Zed, Replit, and Codeium are already integrating MCP into their systems. Developers can start building with MCP through the Claude Desktop app.
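
For a sense of what the protocol looks like on the wire, here is a sketch of the JSON-RPC message shapes MCP uses; the tool name and arguments are hypothetical, and field details should be checked against the spec.

```python
import json

# Roughly the shape of MCP's JSON-RPC messages; the tool name and arguments
# below are hypothetical, not part of any real server.
list_tools_request = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}

call_tool_request = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {
        "name": "search_documents",              # hypothetical tool exposed by a server
        "arguments": {"query": "Q3 revenue"},
    },
}

print(json.dumps(call_tool_request, indent=2))
```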

David Sacks, from ‘PayPal mafia’ to Trump’s AI and crypto tsar - Financial Times [Link]

AI Needs So Much Power, It’s Making Yours Worse - Bloomberg [Link]

The increasing demand for electricity from data centers, especially those supporting AI, is negatively impacting power quality, leading to distorted waves called "harmonics" that can damage appliances and increase the risk of electrical fires.

The article shows a correlation between the proximity of homes to data centers and the severity of power quality distortions.

Distorted power waves can damage appliances and increase vulnerability to electrical fires. Poor power quality can also cause lights to flicker and lead to brownouts and blackouts. Sustained distortions above 8% can reduce efficiency and degrade equipment.

The impact of data centers on power quality is seen in both urban and rural areas. Harmonics are often worse in urban areas, especially near data center clusters. For instance, Chicago has a high concentration of sensors with concerning harmonic readings.

While data centers are strongly correlated with poor harmonics, other factors such as solar energy, EVs and industrial loads can also contribute to irregular wave patterns.

The article emphasizes the need for better monitoring of power quality at the residential level and the implementation of solutions to address the issue.

DeepSeek-V3, ultra-large open-source AI, outperforms Llama and Qwen on launch - VentureBeat [Link]

DeepSeek-V3 uses a mixture-of-experts architecture, activating only select parameters to handle tasks efficiently. It maintains the same basic architecture as its predecessor, DeepSeek-V2, revolving around multi-head latent attention (MLA) and DeepSeekMoE. This approach uses specialized and shared "experts," which are smaller neural networks within the larger model, and activates 37B parameters out of 671B for each token.

DeepSeek-V3 incorporates two main innovations:

  • Auxiliary loss-free load-balancing strategy: This dynamically monitors and adjusts the load on experts to utilize them in a balanced way without compromising overall model performance.
  • Multi-token prediction (MTP): This allows the model to predict multiple future tokens simultaneously, enhancing training efficiency and enabling the model to perform three times faster, generating 60 tokens per second.
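
A toy sketch of the multi-token prediction objective: each position is trained to predict not just the next token but several future tokens, so the loss sums over shifted targets. The head structure and depth here are illustrative and not DeepSeek's implementation.

```python
import torch
import torch.nn.functional as F

def multi_token_prediction_loss(logits_per_depth, tokens):
    """
    logits_per_depth: list of [batch, seq, vocab] tensors, one per prediction depth;
    depth k is trained to predict the token (k + 1) steps ahead of each position.
    tokens: [batch, seq] ground-truth token ids.
    """
    total = 0.0
    for k, logits in enumerate(logits_per_depth):
        valid_logits = logits[:, : -(k + 1), :]   # positions that have a target (k + 1) ahead
        targets = tokens[:, k + 1 :]
        total = total + F.cross_entropy(
            valid_logits.reshape(-1, valid_logits.size(-1)), targets.reshape(-1)
        )
    return total / len(logits_per_depth)
```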

The model was pre-trained on 14.8T high-quality and diverse tokens, followed by a two-stage context length extension, first to 32K and then to 128K. Post-training included Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to align it with human preferences and unlock its potential. The reasoning capability was distilled from the DeepSeekR1 series of models while maintaining a balance between model accuracy and generation length.

During training, DeepSeek used multiple hardware and algorithmic optimizations, including the FP8 mixed precision training framework and the DualPipe algorithm for pipeline parallelism, to reduce costs. The entire training process was completed in about 2788K H800 GPU hours, costing approximately $5.57 million.

The code for DeepSeek-V3 is available on GitHub under an MIT license, and the model is provided under the company’s model license. Enterprises can test the model via DeepSeek Chat, a ChatGPT-like platform, and access the API for commercial use.

Google unveils Project Mariner: AI agents to use the web for you - TechCrunch [Link]

Apple Explores a Face ID Doorbell and Lock Device in Smart Home Push - Bloomberg [Link]

Are Amazon’s Drones Finally Ready for Prime Time? - The New York Times [Link]

OpenAI announces new o3 models - Techcrunch [Link]

BBC complains to Apple over misleading shooting headline - BBC [Link]

The BBC has lodged a complaint with Apple after its new AI feature, Apple Intelligence, generated a false headline about a high-profile murder case in the U.S. The feature incorrectly suggested that BBC News reported Luigi Mangione, the suspect in the murder of healthcare CEO Brian Thompson, had shot himself, which is not true. A BBC spokesperson stated they contacted Apple to address the issue. Apple has not commented on the situation.

Tesla's New Bot - Tesla Optimus on X [Link]

Elon Musk files for injunction to halt OpenAI’s transition to a for-profit - TechCrunch [Link]

Substack

Microsoft: Capacity Constrained - App Economy Insights [Link]

Highlights to watch going forward: 1) Microsoft is addressing its data centers’ increasing power demands by turning to nuclear energy; 2) Microsoft is launching autonomous AI agents in November, introducing tools that let businesses automate routine tasks and boost efficiency: Copilot Studio allows businesses to create their own AI agents with minimal coding knowledge, and Microsoft will offer 10 ready-to-use agents covering everyday business needs.

Can Large Language Models Reason? - AI: A Guide for Thinking Humans [Link]

Current evidence suggests that LLMs simulate reasoning rather than genuinely reasoning. This highlights the need for careful evaluation of LLMs’ generalization capabilities, especially as AI is increasingly integrated into complex decision-making contexts.

Meta’s early AR unveiling may come with competitive trade-offs. According to Bloomberg, Apple has launched its own smart glasses initiative, a market study called “Atlas,” signaling a potential shift from its high-end $3,500 Vision Pro VR headset. Apple recently cut its Vision Pro shipment target to less than half a million units in the first year—down from an initial target of 3 million.

Meta is pursuing a two-pronged approach to AR glasses:

  • Orion has a hardware challenge (powerful but still cumbersome).
  • Ray-Ban Meta glasses have a software challenge (lightweight but only offering relatively simple use cases).

― Meta: AI Killed The Video Star - App Economy Insights [Link]

Current stage of Meta Orion: 1) prototype (not product), 2) advanced AR display (micro LED projectors and silicon carbide lenses), 3) interactive AI capabilities, 4) hardware complexity (neural wristband for control and a wireless compute puck for functionality), 5) high costs ($10K per unit) and limited production, 6) future vision - to release a consumer-ready AR device within a few years, targeting a more affordable product closer to smartphone price levels.

AI’s impact on Meta: 1) engagement: Meta’s recommendation system provides most relevant content to users, attracting users to spend more time on Apps, 2) monetization: Gen AI assists with ad copy, image, and video production, while new models analyze user actions before serving specific ads, ultimately increasing conversions at the margins.

About Meta AI Studio (for developers to create, train, and deploy custom AI models across Meta’s ecosystem): the goal is to drive the next wave of consumer apps and maximize ad potential across its platforms.

The discussion on “The Death of Creator Economy” is interesting and insightful. It’s true - as Meta moves towards an AI-centered model, creators may find themselves competing against the platforms that once supported them. By relying on AI, Meta could optimize ad placements and user engagement without the cost of creator compensation. This is a departure from platforms like YouTube, which incentivize creators with ad revenue shares. The broader impact could reshape the landscape of online content. As AI-generated feeds become the norm, audiences may eventually consume content that’s been strategically tailored by algorithms rather than creators. The creative autonomy that once defined social media could shift to a more managed, homogenized experience, where what we see is driven less by personal expression and more by AI-calculated engagement metrics.

Pichai discussed five ways customers use Cloud:

  1. AI Infrastructure: Performance and costs are key differentiators.
  2. Vertex (Enterprise AI): Customizable models tailored for enterprises.
  3. BigQuery (Data platform): Real-time analysis and decision-making.
  4. Cybersecurity: Enhanced by Mandiant since 2022.
  5. Applications: Including customer engagement or employee agents.

― Google: Little Engine That Cloud - App Economy Insights [Link]

What’s next:

  1. Browser based Agent - Project Jarvis: an AI technology that can autonomously take over a web browser to handle tasks like research and shopping.
  2. Waymo - closed a massive funding round and has secured $5.6B. Major backers are Andreessen Horowitz, Fidelity, and T. Rowe Price. Expansion would be driven by the new funding and the partnership with Uber.
  3. AI power - Alphabet is partnering with Kairos Tech to harness small nuclear reactors to power AI data centers.
  4. Search and competition: Google is losing search market share to TikTok and AI startups (Perplexity and OpenAI), but it remains by far the largest. Amazon's search is catching up, while TikTok and AI chatbots are still tiny. Google's decline in share is likely driven primarily by on-platform e-commerce search (Amazon).

Amazon: Still Day 1 For AI - App Economy Insights [Link]

On advertising: sponsored products remain a critical growth driver. Ad-supported Prime Video introduced in Q1 2024 automatically converted all Prime members to an ad-supported tier.

On lowering the cost to serve: 1) Expanding with over 15 new inbound buildings across the US, 2) Increasing same-day deliveries, 3) Advancing robotics and automation.

On pharmacy: significantly expanded with rapid delivery capability.

On Capex: aggressive infrastructure investments.

On Project Kuiper: Kuiper aims to provide fast, affordable internet via satellite. It is early in its journey but holds transformative potential for Amazon’s growth.

Tesla’s Cybercab could either compete with Uber’s platform or, as Khosrowshahi suggests, Cybercab fleet owners might choose to list their vehicles on Uber to maximize earnings. Uber’s reach and ability to cover diverse use cases—across vehicle sizes, geographies, and special needs—could lead to a hybrid model where Tesla AVs appear on Uber.

Tesla could ultimately leverage Uber’s scale and network, given the challenge of reaching a critical size in specific markets. AVs on Uber are already a reality with Waymo, and more will likely come.

― Tesla: Autonomy Gamble - App Economy Insights [Link]

Business Insights:

  1. Deliveries rebounded in Q3, leading to an auto gross margin improvement.
  2. Roughly 20% of Tesla’s gross margin came from non-auto segments—nearly doubling from a year ago.
  3. Lower cost per vehicle, growth in non-auto segments, FSD revenue, growth in deliveries, and higher regulatory credit revenue contribute to operating margin.
  4. Free cash flow expanded and balance sheet remains stellar.

“We, Robot” takeaways:

  1. Cybercab (robotaxi), Optimus (Humanoid Robot), Robovan
  2. FSD progress: promised to enable fully autonomous driving by 2026
  3. Market reaction: uncertain about the timeline
  4. Supercharger network: Most automakers have adopted Tesla’s North American Charging Standard (NACS).
  5. Market share: Tesla’s vehicles market share has stabilized in North America and Europe but noticeably improved in China.
  6. AI power: Musk still expects nearly 90,000 H100 clusters dedicated to training by the end of this year.
  7. Energy storage deployment

Comparing Tesla and Waymo:

  1. According to the six SAE levels of driving automation (0 no automation, 1 driver assistance, 2 partial automation, 3 conditional automation, 4 high automation, 5 full automation), Tesla’s FSD remains at level 2, while Waymo operates at level 4.
  2. Tesla relies on cameras and AI while Waymo relies on heavy hardware (LiDAR, radar, cameras).
  3. Waymo’s reliance on expensive hardware limits its ability to scale quickly ($200K per vehicle). Tesla aims to scale faster by leveraging its existing fleet to train its AI models.
  4. Waymo has built trust with regulators by gradually deploying its vehicles, while Tesla faces regulatory hurdles, particularly with the Cybercab.

Netflix: Crushing It Again - App Economy Insights [Link]

Deep Dive Into The Security for AI Ecosystem - Indiscrete Musings [Link]

“This is an empirical law, not a fundamental physical law. But the evidence is that it continues to scale. What we’re learning, however, is that it’s not enough, that we’ve now discovered two other ways to scale.

One is post-training scaling. Of course, the first generation of post-training was reinforcement learning human feedback, but now we have reinforcement learning AI feedback, and all forms of synthetic data generated data that assists in post-training scaling.

And one of the biggest events and one of the most exciting developments is Strawberry, ChatGPT o1, OpenAI’s o1, which does inference time scaling, what is called test time scaling. The longer it thinks, the better and higher-quality answer it produces.”

― NVIDIA: The Age of AI - App Economic Insights [Link]

In an agent-first world, the traditional approach to A/B testing becomes obsolete. Instead of testing different button colors or copy variations for human users, companies like Amazon will need to optimize for agent interaction efficiency and task completion rates.

These A/B tests will target similar metrics as today: purchases, sign-ups, etc., employing LLMs to generate and test thousands of agent personas without the need for lengthy user testing cycles.

― Agent-Responsive Design: Rethinking the web for an agentic future - AI Tidbits [Link]

Several interesting visions for an AI-agent world: 1) the death of traditional A/B testing, 2) a switch from SEO to AEO (Agent Engine Optimization), 3) the web moving from blocking bots to embracing them.

This is because AIs are inconsistent and weird, and often have different results across different models. For example, they are sensitive to small changes in spacing or formatting; they get more accurate when you tell them to “read the question again;” they seem to respond better to politeness (but don’t overdo it); and they may get lazier in December, perhaps because they have picked up on the concept of winter break.

― Getting started with AI: Good enough prompting - One Useful Thing [Link]

These ideas are important to learn, as they broaden the scope of what is possible with LLMs. For example, using these techniques, we can:

  • Allow an LLM to access an external knowledge database.
  • Enable complex, reasoning-based problems to be solved.
  • Provide unlimited memory to an LLM by allowing the model to store and access prior information from a conversation.

― Advanced Prompt Engineering - Deep (Learning) Focus [Link]

Article covers CoT prompting, automatic prompting (interesting idea: “we could even consider our prompt as a group of trainable parameters that can be updated (e.g., using gradient descent or some other data-driven criteria) to generate a correct answer”), information retrieval, etc.

Energy Drink Economics - App Economy Insights [Link]

In my view, 2025 will be the year major AI agent frameworks compete for developers globally.

What makes these workflows special is their flexibility. The same principles we used for research papers can be applied to industry reports, technical documentation, or any complex text. The YouTube synthesis approach works just as well for conference talks, interviews, or training videos.

― How to use NotebookLM for personalized knowledge synthesis - AI Supremacy [Link]

New AI Agent based Applications:

  1. Google Learn About for education
  2. Perplexity as the advent of AI commerce, partnering with US campuses and Shopify.
  3. Amazon’s Multi Agent Orchestrator via AWS
  4. Google NotebookLM for researching, podcasting.

Google NotebookLM:

  • Capabilities

    • It stays focused on your sources - unlike ChatGPT, it shouldn’t hallucinate or bring in outside information

    • It can process multiple documents at once, finding connections between them

    • It generates natural-sounding podcast discussions about your content

    • It provides source citations for everything, linking directly to the original text

    • It’s completely free (for now)

  • Workflows (research papers and YouTube videos)

    • Research papers:
      1. Overview phase: Create a discussion that focuses on the key methodology choices, main findings, limitations and gaps, and connections to existing research. Present it for a non-technical audience.
      2. Deep understanding: Ask about key assumptions in their methodology, explore alternative approaches they might have considered, and examine how their findings compare to related work.
      3. Synthesis phase: Compare and contrast these papers’ approaches and findings. Identify patterns, contradictions, and gaps that could inform future research.
    • YouTube videos:
      1. Overview phase: Create a comprehensive discussion about AI agents, focusing on unique perspectives from each source.
  • Tips and Pitfalls

    • Don’t overload with too many documents at once
    • Avoid overly broad instructions like “tell me everything important”
    • Don’t skip the customization step
    • Remember to specify your audience level (this drastically improves output quality)

We don’t want bias-free AI. We want an AI with biases that are explicit (we know exactly what it looks at), controllable (we can influence how much it looks at a factor), and agreeable (the biases in the AI must be compatible with our standards of morality, ethics, and law).

― A look at Bias in Generative AI [Thoughts] - Artificial Intelligence Made Simple [Link]

The author pointed out sources of biases (process, dataset, model, and post-generation control mechanism). He highlighted that transparency is the solution. Technical transparency includes:

  1. Attention visualization tools
  2. Token-level confidence scores
  3. Explanation generation mechanisms
  4. Citation and source tracking
  5. Agentic architecture and separation of concerns
  6. Access to embedding models

And he also recommended several development practices to promote AI pipeline transparency: publishing open source models, creating synthetic data, creating transparent standards, and involving external auditors.

Why Data is an Incomplete Representation of Reality [Thoughts] - Artificial Intelligence Made Simple [Link]

This article argues that data reflects our biases and values rather than providing an objective view of the world, and that data alone is insufficient for achieving superhuman AI. It identifies three types of intelligence that are often overlooked in datasets: cultural intelligence, delusional intelligence, and subjective intelligence.

In my view, these are the gaps between AI and humans. AI becomes human-like if these kinds of intelligence are acquired. The question, however, is whether AI should become human-like before becoming superhuman. Is it true that the human level cannot be skipped on the path to AGI?

Some interesting further discussion points implied by this blog:

  • How can we better incorporate cultural intelligence into AI training datasets and algorithms?

    Other than broadening data or documenting practices, what’s more interesting is to develop AI systems that can identify and adapt to different cultural contexts.

  • What are the ethical implications of AI systems lacking delusional and subjective intelligence?

    Such systems are prone to perpetuating existing biases and discriminatory practices without subjective consideration, have limited problem-solving capabilities (though an LLM’s creativity can be tuned via sampling parameters), and cannot adapt to cultural nuances.

  • What are the limitations of relying solely on quantitative metrics in evaluating AI performance?

    Relying solely on quantitative metrics can exclude crucial qualitative factors, incentivize the optimization of narrow objectives rather than broader well-being, and fail to capture the complex and nuanced nature of human intelligence.

Here are a few examples you’ve all experienced first-hand:

  • Public Cloud enabled the SaaS economy
  • The iPhone enabled the App economy
  • Social media enabled the Creator economy
  • LLMs give rise to the Agentic economy

― Agentic Revolution - Startup Riders [Link]

How to use Perplexity in your daily workflow [Link]

Perplexity now has a desktop app. It is a fantastic application, comparable to or better than Google Search. For a learner like me, it’s a good tool to answer my questions efficiently and help with note-taking.

Articles and Blogs

Enthusiasm for ChatGPT spread with Linton’s buy-in, prompting the company to launch a pilot program to identify key use cases. Today, ChatGPT is an integral part of Promega’s workflows, with over 1,400 custom GPTs used by 80% of the company.

Members of Promega’s Quality Assurance team automate customer requests and responses with a custom GPT that integrates with their Power Automate workflow. “With this AI-powered solution, we provide timely, accurate responses to over 250 quality surveys a year,” says Abigail David, Director of Quality Assurance. “The automation reduces internal workload by more than 600 hours annually and delivers key documents, like certifications and quality policies, effortlessly to our customers.”

  • My Prospecting Pal GPT, which quickly identifies vital information about a given prospect and suggests potential Promega offerings. “The GPT can highlight key research initiatives that might benefit from Promega solutions, or even common interests between the salesperson and the prospect to enable a natural dialogue. This has cut our lead analysis time by 1–4 hours per prospect, allowing us to focus more on relationship building,” says Franchestia Flennory, a Promega Account Manager.
  • Email Marketing Strategist GPT, which halves the time from content creation to campaign execution. In months, hundreds of marketing emails were deployed in half the usual time, saving 135 hours of work. “The time we get back from aligning on the strategy of emails can be invested into the user experience,” says Kari Siegenthaler, a Marketing Strategist with Promega. “I don’t know the last time I wrote an email without using this GPT.”

― Promega’s top-down adoption of ChatGPT accelerates manufacturing, sales, and marketing - OpenAI Blog [Link]

Rakuten’s goal is to become an “AI empowerment company.” They’re using Code Interpreter and RAG (retrieval-augmented generation) with OpenAI’s models to understand and extract value from complex, unstructured data, and the results have empowered customers and businesses in new ways:

  • Previously, users had to wait days to get a response to a customer service ticket. “By using OpenAI’s API with RAG on our internal knowledge base, we’re now able to respond to and help users automatically,” Kaji said. This innovation has significantly improved response times and efficiency.
  • Few people have time to wade through hundreds of user reviews when they’re shopping, so Rakuten is developing a feature that extracts key topics and summarizes reviews. “This will allow users to access and explore the information in a much more structured way,” Kaji said.
  • Knowledge retrieval has also made a large impact on Rakuten’s B2B business. Rakuten consultants are now empowering merchants and enterprises with actionable insights from the company’s wealth of data, such as market analyses and sales trends.

― Rakuten pairs data with AI to unlock customer insights and value - OpenAI Blog [Link]

YouTube and Podcast

Gaming, Goats & General Intelligence with Frederic Besse - Google DeepMind [Link]

Google Research Engineering Team Lead discusses a future of very intelligent AI agents.

LangGraph Deep Dive: Build Better Agents - James Briggs [Link]

A tutorial of building an AI research agent using LangGraph.

Solving complex problems with OpenAI o1 models [Link]

This video demonstrates o1 models’ advanced reasoning across complex domains like programming.

Lecture Series in AI: “How Could Machines Reach Human-Level Intelligence?” by Yann LeCun - Columbia Engineering [Link]

Stanford CS229 I Machine Learning I Building Large Language Models (LLMs) [Link]

This Stanford lecture is about how to build LLMs, mainly focusing on practical training aspects, data handling, and evaluation methods. It covers:

Pre-training phase

  • Learn auto-regressive language modeling
  • Understand tokenization (BPE method)
  • Master cross-entropy loss calculation
  • Track model progress through perplexity

Post-training phase (after ChatGPT era)

  • Convert base models into AI assistants
  • Apply evaluation benchmarks like MMLU
  • Handle train-test contamination issues

Technical components

  • Select proper model architecture
  • Implement training algorithms
  • Process training data
  • Set up evaluation metrics
  • Build system infrastructure

Dario Amodei: Anthropic CEO on Claude, AGI & the Future of AI & Humanity | Lex Fridman Podcast #452 [Link]

Papers and Reports

Large Language Models Are Human-Level Prompt Engineers [Link]

They introduced APE, a system for automatic prompt generation, which selects the most effective instructions for large language models (LLMs) to perform various tasks.

Automatic Prompt Optimization with “Gradient Descent” and Beam Search [Link]

They proposed an automatic prompt optimization method. Inspired by gradient descent, it generates textual “gradients” that identify prompt weaknesses and edits the prompt in the opposite semantic direction.

  1. Collecting errors made by the current prompt on the training data.
  2. Summarizing these errors via a natural language gradient.
  3. Using the gradient to generate several modified versions of the prompt.
  4. Selecting the best of the edited prompts.
  5. Repeating this process several times.
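
As a rough illustration of this loop, here is a minimal sketch in Python. It assumes a generic `llm(text)` completion helper and a small labeled training set of `(input, expected_answer)` pairs; all names are hypothetical and this is not the paper’s implementation.

```python
# Minimal sketch of prompt optimization with textual "gradients".
# llm(text) -> str is an assumed completion helper; examples is a list of
# (input, expected_answer) pairs. Names are illustrative, not the paper's code.

def score(prompt, examples):
    """Fraction of training examples the current prompt answers correctly."""
    return sum(llm(f"{prompt}\n\n{x}").strip() == y for x, y in examples) / len(examples)

def optimize_prompt(prompt, examples, rounds=3, n_candidates=4):
    for _ in range(rounds):
        # 1. Collect errors made by the current prompt on the training data.
        preds = [(x, y, llm(f"{prompt}\n\n{x}").strip()) for x, y in examples]
        errors = [(x, y, p) for x, y, p in preds if p != y]
        if not errors:
            break
        # 2. Summarize the errors as a natural-language "gradient".
        gradient = llm(
            "The prompt below made these mistakes. Describe its weaknesses.\n"
            f"Prompt: {prompt}\nMistakes: {errors[:5]}"
        )
        # 3. Edit the prompt in the opposite semantic direction, several ways.
        candidates = [
            llm(
                "Rewrite the prompt to fix the weaknesses.\n"
                f"Prompt: {prompt}\nWeaknesses: {gradient}\nNew prompt:"
            )
            for _ in range(n_candidates)
        ]
        # 4. Keep the best-scoring edit, then 5. repeat.
        prompt = max(candidates + [prompt], key=lambda p: score(p, examples))
    return prompt
```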

GrIPS: Gradient-free, Edit-based Instruction Search for Prompting Large Language Models [Link]

They introduced GRIPS (Gradient-free Instructional Prompt Search) as a gradient-free, edit-based method to improve natural language prompts for LLMs without needing gradient-based tuning.

GrIPS takes human-designed instructions and automatically edits them to enhance performance. It involves random phrase-level edits like deletion, swapping, paraphrasing, and addition, which are scored based on task performance.

Large Language Models as Optimizers [Link]

This research introduces Optimization by Prompting (OPRO), an approach that leverages LLMs as optimizers by using natural language to describe optimization tasks. OPRO can be applied to linear regression, traveling salesman, and prompt optimization, where OPRO finds instructions that maximize task accuracy.

  1. Describing an optimization task in natural language.
  2. Showing an optimizer LLM examples of prior solutions to the optimization task along with their objective values.
  3. Asking the optimizer LLM to infer new / better solutions to the problem.
  4. Testing the inferred solutions via an evaluator LLM.
automatic_prompt_opt
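
For contrast with the gradient-style method above, here is a minimal sketch of the OPRO loop under similar assumptions: a generic `llm()` helper and an `eval_accuracy()` scorer, both hypothetical placeholders rather than the paper’s code.

```python
# Minimal sketch of the OPRO loop: show the optimizer LLM prior solutions
# with their objective values and ask it to propose a better one.
# llm(text) -> str and eval_accuracy(prompt) -> float are assumed helpers.

def opro(task_description, seed_prompts, eval_accuracy, llm, steps=10):
    trajectory = [(p, eval_accuracy(p)) for p in seed_prompts]    # prior solutions
    for _ in range(steps):
        trajectory.sort(key=lambda pair: pair[1])                 # ascending by score
        history = "\n".join(f"text: {p}\nscore: {s:.2f}" for p, s in trajectory[-8:])
        meta_prompt = (
            f"{task_description}\n\n"
            "Below are previous instructions with their scores.\n"
            f"{history}\n\n"
            "Write a new instruction that achieves a higher score:"
        )
        candidate = llm(meta_prompt).strip()                      # infer a better solution
        trajectory.append((candidate, eval_accuracy(candidate)))  # test it with the evaluator
    return max(trajectory, key=lambda pair: pair[1])[0]
```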

Prompting Guide 101 - Gemini for Google Workplace [Link]

GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models [Link]

Interesting findings on model variability: 1) Significant performance variations when questions are rephrased or when only numerical values are altered. 2) Models demonstrate robustness to superficial changes (e.g., proper names) but are highly sensitive to numerical changes.

Interesting findings on model complexity and fragility: 1) Model performance deteriorates as the number of clauses in a question increases, revealing challenges with handling complexity, 2) Adding irrelevant but seemingly relevant clauses leads to a performance drop of up to 65% in some models.

Insights in reasoning: 1) The decline in performance suggests LLMs rely on pattern matching rather than genuine logical reasoning, 2) Models replicate training data patterns rather than solving problems from first principles.

Thinking LLMs: General Instruction Following with Thought Generation [Link]

Current LLMs lack internal reasoning processes before outputting responses. Explicit thinking can enhance performance on complex tasks, including creative writing and problem-solving, by allowing models to internally reason and plan responses. The authors introduce Thought Preference Optimization (TPO), which lets LLMs generate multiple thought-response pairs for each instruction, while a judge model evaluates the responses, selecting the best and worst pairs for optimization.

Agent-as-a-Judge: Evaluate Agents with Agents [Link]

They introduced the Agent-as-a-Judge Framework to evaluate agentic systems, addressing limitations of existing evaluation methods like LLM-as-a-Judge by offering dynamic, step-by-step feedback throughout task-solving processes.

Difficulties handling numbers may stem from the fact that most models rely on autoregressive next token prediction pretext tasks during training, which might not be suitable for mathematical operations, or simply because a limited number of numerical reasoning tasks are included in the model’s training corpora. Nevertheless, it is known that performance can be improved using prompt techniques, indicating that relevant knowledge may already exist within LLMs.

Evaluating and enhancing probabilistic reasoning in language models - Google Research [Link]

What Are the Odds? Language Models Are Capable of Probabilistic Reasoning [Link]

This study introduces a benchmark dataset with question-answer pairs based on both idealized and real-world distributions. It enables systematic evaluation of LLMs’ probabilistic reasoning capabilities across three tasks: estimating percentiles, drawing samples, and calculating probabilities.

The technology has strikingly disparate effects across the productivity distribution: while the bottom third of scientists see little benefit, the output of top researchers nearly doubles.

Top scientists leverage their domain knowledge to prioritize promising AI suggestions, while others waste significant resources testing false positives.

82% of scientists report reduced satisfaction with their work due to decreased creativity and skill underutilization.

― Artificial Intelligence, Scientific Discovery, and Product Innovation [Link]

MIT PhD Aidan Toner-Rodgers’s working paper makes some very interesting points. AI has reshaped the R&D process, especially in the natural and material sciences where structured search is required (e.g., drug discovery, climatology). However, scientists with different degrees of expertise (top versus bottom scientists) achieve drastically different productivity with AI, with bottom scientists benefiting less. This characteristic has several consequences and implications:

  1. Resources are misallocated to less promising AI suggestions. Human innovation and creativity are not encouraged or cultivated.
  2. Expertise is still required as AI only demonstrates its potential when complemented by human expertise. The judgment ability in leveraging AI’s potential is important.
  3. Skills have been shifted to prompting AI effectively. However, scientists feel an underutilization of expertise when working with AI.

A Survey on LLM-as-a-Judge [Link]

There are a lot of applications of LLM-as-a-Judge.

  1. Data annotation: labeling datasets with information such as sentiment, topic categorization, or relevance.
  2. Content critique: providing feedback on generated content such as articles, essays, or code.
  3. Domain-specific evaluations: evaluate the accuracy, completeness, and clarity of financial analyses or advice (in finance), and assess medical responses for correctness, compliance with guidelines, and patient safety (for medical Q&A).

Looking Inward: Language Models Can Learn About Themselves by Introspection [Link]

The researchers define introspection as “acquiring knowledge that is not contained in or derived from training data but instead originates from internal states”.

They conducted interesting experiments: finetuning LLMs to predict properties of their own behavior in hypothetical scenarios. It turns out that an LLM can predict itself better than other models can predict it, even when those models are trained on the same data pool.

The conclusion is surprising: language models have knowledge about themselves that is neither contained in their training data nor inferable from it. The researchers developed a self-prediction training framework where models predict properties of their hypothetical responses. There are already LLM research areas in honesty, behavior, etc., and I believe this work contributes substantially to them.

A Theoretical Understanding of Chain-of-Thought: Coherent Reasoning and Error-Aware Demonstration [Link]

Some interesting findings: 1) coherent CoT is better than traditional CoT because the former considers the connections between steps, and 2) the model is more sensitive to errors in intermediate reasoning steps than in the final answer.

The authors proposed an error-aware training method that incorporates both correct and incorrect reasoning paths, enabling LLMs to recognize and handle potential reasoning errors.

Though businesses are doing their diligence on ROI and customization, they may miss crucial pieces of the implementation puzzle. Often, organizations discover too late that they’ve underestimated the importance of technical integration, ongoing support, and scalability. It’s a bit like buying a car based solely on fuel efficiency, only to realize later that service availability and ease of maintenance are just as critical over the long haul.

― 2024: The State of Generative AI in the Enterprise [Link]

Key Trends for 2024 onwards

  1. There is a serious commitment from enterprises to integrating AI into business strategies

  2. The top use cases for generative AI focus on enhancing productivity and efficiency. These include:

    • Code Copilots (51% adoption)
    • Support Chatbots (31% adoption)
    • Enterprise Search + Retrieval (28% adoption)
    • Data Extraction + Transformation (27% adoption)
    • Meeting Summarization (24% adoption)
  3. There’s a growing trend towards autonomous AI agents capable of managing complex processes independently.

  4. Businesses are focused on tools that deliver measurable value (ROI) and industry-specific customization, rather than simply looking for the cheapest option.

  5. Industry-specific, verticalized AI applications are gaining momentum, particularly in:

    • Healthcare ($500 million in enterprise spending)
    • Legal ($350 million in enterprise spending)
    • Financial Services ($100 million in enterprise spending)
    • Media and Entertainment ($100 million in enterprise spending)
  6. Companies prefer multi-model strategies. This has led to a decline in OpenAI’s dominance, while Anthropic is gaining market share.

  7. Retrieval-augmented generation (RAG) has become the dominant design pattern, with 51% adoption. Meanwhile, agentic architectures are emerging, now powering 12% of implementations.

  8. There is a talent drought as AI engineering becomes more sophisticated.

  9. There’s a growing trend towards companies building their own AI solutions in-house.

    • Previously, in 2023, a large majority of enterprises (80%) relied on third-party vendors for their generative AI software

    • In 2024, the split between building and buying is almost even, with 47% of solutions developed internally and 53% sourced from vendors

    • This shift suggests a growing confidence among enterprises in their ability to develop and implement their own AI tools.

    • While there’s a trend towards building in-house solutions, companies are not abandoning vendors entirely. The sources still highlight the importance of vendors, especially for companies lacking the resources or expertise for in-house development. The even split between building and buying suggests a hybrid approach is emerging, where companies strategically choose which solutions to develop internally and which to procure from vendors.

Articles and Blogs

Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks - Microsoft Research [Link]

Magentic-One is built on Microsoft’s AutoGen framework. It employs a unique dual-loop architecture in which the Orchestrator manages both a task ledger and a progress ledger. This is an early move toward building generalist agentic systems. Other current LLM-based applications, such as RAG, will also benefit from this type of system.

Introducing Internal Knowledge Search and Spaces - Perplexity [Link]

Internal Knowledge Search and Spaces enable simultaneous searches of organizational files and the web. This feature addresses the need for a unified tool to access both internal and external data, leveraging advanced LLMs like GPT-4 and Claude 3 to enhance search efficiency and relevance.

After the introduction of ChatGPT, there was a 21% decrease in the weekly number of posts in automation-prone jobs compared to manual-intensive jobs. Writing jobs were affected the most (30.37% decrease), followed by software, app, and web development (20.62%) and engineering (10.42%).

To stay competitive, employees must engage in continuous learning and upskilling. In their book Prediction Machines, authors Ajay Agrawal, Joshua Gans, Avi Goldfarb argue that AI is shifting the focus of work away from predictive tasks to those requiring human judgment and decision-making.

― Research: How Gen AI Is Already Impacting the Labor Market - Harvard Business Review [Link]

The research examines the impact of GenAI applications (ChatGPT and image-generating AI) on jobs, comparing manual-intensive jobs (such as data and office management, video services, and audio services), automation-prone jobs (such as writing, software, app, and web development, and engineering), and image-generation-prone jobs (such as graphic design and 3D modeling), to surface the challenges and opportunities of shifting markets.

They found that Gen AI “led to nearly immediate decreases in posts for online gig workers across job types, but particularly for automation-prone jobs. “ It shows a growing trend of job replacement.

The suggestions are continuous learning and upskilling, and enhancing human judgment and decision-making: being able to ask the right questions, prompt efficiently, and avoid blindly accepting responses.

How Much GPU Memory is Needed to Serve a Large Language Model (LLM)? [Link]

Addressing a common LLM interview question: “How much GPU memory is needed to serve a Large Language Model (LLM)?”

\[ M = \frac{P \times 4B}{32 / Q} \times 1.2 \] where \(P\) is the model size (number of parameters), \(4B\) is the 4 bytes used per parameter, \(Q\) is the number of bits used to load the model (16-bit or 32-bit), and 1.2 accounts for a 20% overhead.

gpu_for_llm
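
As a quick sanity check of the formula, here is a small calculation; the 70-billion-parameter model size is only an illustrative value.

```python
# Worked example of the formula above: M = (P * 4 bytes) / (32 / Q) * 1.2.
# The 70-billion-parameter size is only an illustrative value.

def gpu_memory_gb(num_params, q_bits=16, overhead=1.2):
    bytes_per_param = 4                               # 4 bytes per parameter (FP32 baseline)
    memory_bytes = num_params * bytes_per_param / (32 / q_bits) * overhead
    return memory_bytes / 1e9                         # gigabytes

print(gpu_memory_gb(70e9, q_bits=16))                 # ~168 GB when served in 16-bit
print(gpu_memory_gb(70e9, q_bits=32))                 # ~336 GB in full 32-bit precision
```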

AI Agent Stack [Link]

ai_agent_stack

GitHub

OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models [Link]

News

Introducing ChatGPT search - OpenAI [Link]

Perplexity introduces AI-powered finance tool, offering real-time stock analysis and historical data for developers. [Link]

VS Code now supports GitHub Copilot chat search and visualization [Link]

Cognitive Scientist Gary Marcus Says AI Must Be Regulated. He Has a Plan [Link]

Among several points made by the article, two caught my eye:

  1. Elon Musk presents a complex figure in the AI landscape: he was one of the first to issue warnings about the potential risks of AI, yet he is actively involved in developing AI through his companies. This duality raises questions about his stance on AI and how he reconciles his concerns with his entrepreneurial pursuits.

  2. Marcus proposes a shift from the current “System One” thinking in AI, which is fast and reflexive but prone to errors, to a “System Two” approach that emphasizes deliberate reasoning and abstraction.

good_to_great_book

“Good to Great: Why Some Companies Make the Leap…And Others Don’t” written by Jim Collins - I started to read this book on Aug 24 and recently finished it. It summarizes a research study uncovering patterns and principles that differentiate “great” companies from the rest. I find it also helpful for professional development. Two most impressive concepts to me are “Level 5 Leadership” and “The Hedgehog Concept”.

Level 5 Leadership

There are five levels of leadership: 1) highly capable individual, 2) contributing team member, 3) competent manager, 4) effective leader, and 5) executive leader.

Level 5 Leadership is the highest level and is marked by a paradoxical blend of personal humility and professional will:

  • Personal Humility: Level 5 leaders are modest, understated, and self-effacing. They rarely seek public attention and often attribute success to others, to good luck, or to external factors. They avoid the limelight and focus on the success of the organization rather than their personal accolades.
  • Professional Will: Despite their humility, these leaders possess an intense resolve and determination to do whatever it takes to make the company great. They are incredibly ambitious, but their ambition is channeled toward the organization, not personal gain. They set high standards and push the company toward greatness with unwavering tenacity.

Characteristics of level 5 leaders:

  • Focus on long-term success: They prioritize the enduring success of the company rather than short-term wins or personal gain.
  • Credit to others: They credit the team, luck, or external factors for successes but take personal responsibility for failures or setbacks.
  • Resolve in tough times: They confront difficult realities head-on and have a steadfast determination to overcome obstacles, never losing faith in the company’s ability to succeed.
  • Succession planning: They ensure that the company can continue its success without them, often preparing successors who will carry the torch without a dip in performance.

The Hedgehog Concept

The Hedgehog Concept is a central idea in this book. It involves identifying the intersection of three crucial areas. When a company operates within this intersection, it can focus its efforts on what it’s passionate about, what it can truly excel at, and what drives its economic success. This creates a clarity of focus that allows a company to ignore distractions and build sustained momentum.

It’s not only about companies. Individuals can use it to guide their career choices and personal development. By aligning your career or life mission with your personal Hedgehog Concept, you’re more likely to find purpose, fulfillment, and success. Think about the three questions:

  • What are you deeply passionate about? - your personal mission, what gives you energy and a sense of fulfillment.
  • What can you be the best in the world at? - your unique strengths and abilities.
  • What drives your economic engine? - how you can generate income or provide value in a way that sustains your livelihood.

I’m glad that I found my answers to these three questions when I was 20 years old. I’m 100% sure that my answers won’t change through my whole life no matter what happened or will happen. The answers in my mind - I believe I’m born for it, it’s the mission of my life. So how about you?

There are many enterprise products built almost solely on RAG.

Naive RAG

The standard RAG workflow consists of three main steps as illustrated in the graph below:

  1. Indexing: Creating an index of documents for retrieval.
  2. Retrieval: Searching the index for relevant documents based on a user query.
  3. Generation: Using a language model to generate answers or responses based on the retrieved documents.
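
A toy sketch of these three steps is shown below, assuming placeholder `embed()` and `llm()` functions; a real system would use chunking and a vector database rather than the brute-force search here.

```python
import numpy as np

# Toy version of the three naive-RAG steps. embed(text) -> np.ndarray and
# llm(prompt) -> str are placeholder functions; the documents are made up.

docs = [
    "Tesla's FSD remains at SAE level 2 and relies on cameras plus AI.",
    "Waymo operates at SAE level 4 using lidar, radar, and cameras.",
]

# 1. Indexing: embed each document and keep (vector, text) pairs.
index = [(embed(d), d) for d in docs]

# 2. Retrieval: rank documents by cosine similarity to the query embedding.
def retrieve(query, k=2):
    q = embed(query)
    sims = [(float(np.dot(q, v)) / (np.linalg.norm(q) * np.linalg.norm(v)), d)
            for v, d in index]
    return [d for _, d in sorted(sims, reverse=True)[:k]]

# 3. Generation: answer the question grounded in the retrieved documents.
def answer(query):
    context = "\n\n".join(retrieve(query))
    return llm(f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}")
```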

The three steps all face possible issues:

  1. Indexing:

    • Poor document parsing.

    • Inefficient document chunking strategies.

    • Weak semantic representations from embedding models.

    • Non-optimized index structures.

  2. Retrieval:

    • Low relevance: retrieved documents are not highly relevant to the user query (low accuracy).

    • Incomplete retrieval: not all relevant documents are retrieved (low recall).

    • Redundancy: retrieved documents may be repetitive or redundant.

    • Queries are often not specific or well-defined.

    • Retrieval strategies might not be well-suited to the use case and may rely solely on semantic similarity.

  3. Generation:

    • Overreliance on the retrieved content, leading to issues such as irrelevant or even harmful responses (e.g., toxic or biased content).
naive-rag

This paper “Retrieval-Augmented Generation for Large Language Models: A Survey” discussed several problems associated with Naive RAG implementations. The advanced approaches to RAG attempt to overcome the limitations of naive RAG by improving the way queries are processed, documents are retrieved, and responses are generated. Advanced RAG techniques focus on refining each step of the process, from query transformations to more efficient retrieval strategies.

Advanced RAG

Overview

adv_rag

Source: LangChain

Pre-Retrieval Enhancements

Query Transformations / Translation

Query transformations are techniques aimed at re-writing or modifying the input questions to improve the retrieval process.

Query Analysis

Query transformation types:

query_trans2

Some notable methods include:

  1. Multi Query:

    The MultiQueryRetriever automates prompt tuning by using a language model (LLM) to generate multiple queries from different perspectives for a given user query. It retrieves relevant documents for each generated query and combines the results to create a larger, more comprehensive set of potentially relevant documents. This technique helps mitigate some of the limitations of distance-based retrieval, saves time on experimenting with different prompts, and provides a richer set of results.

    LangChain Tutorial: How to use MultiQueryRetriever.

    LangChain API: MultiQueryRetriever.

    Video Tutorial: RAG from Scratch (Part 5 - Query Translation: Multi Query).

    multiquery_retrieval
  2. RAG Fusion

    RAG-Fusion combines RAG with Reciprocal Rank Fusion (RRF): it generates multiple queries, retrieves documents for each, and fuses the ranked results, giving more relevant retrieval results higher reciprocal scores and re-ranking documents accordingly (a minimal RRF sketch appears after this list). RAG-Fusion is able to provide accurate and comprehensive answers because the generated queries contextualize the original query from various perspectives.

    Paper: A New Take on Retrieval-Augmented Generation.

    Code: Raudaschl/rag-fusion

    LangChain Cookbook:RAG Fusion

    Video Tutorial: RAG from scratch: Part 6 (Query Translation – RAG Fusion)

    rag_fusion
  3. Step-Back Prompting

    Step back prompting refers to the technique of generating a more generalized or abstract version of a specific query in order to mitigate potential issues with search quality or model-generated responses. This involves first reformulating the initial question into a broader or higher-level version (the “step back” question) and then querying both the original and the generalized question to improve the comprehensiveness and relevance of the responses.

    Paper: Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models.

    LangChain Tutorial: Step Back Prompting

    LangChain Cookbook: Step-Back Prompting (Question-Answering)

    Video Tutorial: RAG from scratch: Part 8 (Query Translation – Step Back)

  4. Decomposition:

    When a user asks a complex question, a single query might not retrieve the right results. To address this, the question can be broken into sub-questions, each of which is retrieved separately, and the answers are combined.

    LangChain Doc: Decomposition

    Video Tutorial: RAG from scratch: Part 7 (Query Translation – Decomposition)

    decomposition
  5. Hypothetical Document Embeddings (HyDE): Given a query, HyDE first zero-shot instructs an instruction-following language model to generate a hypothetical document. The document captures relevance patterns but is unreal and may contain false details. Then, an unsupervised contrastively learned encoder (e.g. Contriever) encodes the document into an embedding vector. This vector identifies a neighborhood in the corpus embedding space, where similar real documents are retrieved based on vector similarity.

    Simply speaking, HyDE uses hypothetical responses to retrieve documents rather than using queries directly. The rationale behind this approach is that the semantic similarity between a query and a real document is smaller than the semantic similarity between a hypothetical document and a real document.

    LangChain Doc: Hypothetical Document Embeddings

    Paper: Precise Zero-Shot Dense Retrieval without Relevance Labels

    LangChain Cookbook: Improve document indexing with HyDE

    hyde
  6. New queries based on historical dialogues

    This is a required technique for developing a chatbot or a conversational RAG.

    LangChain Tutorials: Conversational RAG; Build a Chatbot; How to add message history; How to add memory to chatbots

    LangChain Code: create_history_aware_retriever
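
The Reciprocal Rank Fusion step referenced in the RAG-Fusion item above can be sketched in a few lines; the document IDs below are made up for illustration.

```python
# Minimal Reciprocal Rank Fusion (RRF), the re-ranking step behind RAG-Fusion.
# Each ranked list comes from retrieving with one generated query variant.

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked document lists into one, scoring each doc by
    sum(1 / (k + rank)) over every list it appears in."""
    scores = {}
    for docs in ranked_lists:
        for rank, doc in enumerate(docs, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: results for three query variants generated from one user question.
fused = reciprocal_rank_fusion([
    ["doc_a", "doc_b", "doc_c"],
    ["doc_b", "doc_a", "doc_d"],
    ["doc_c", "doc_b", "doc_e"],
])
print(fused)   # doc_b ranks first: it sits near the top of every list
```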

Query Construction

Query construction refers to converting a natural language query into the query language specific to the database you are working with. This is essential for interacting with different databases and vector stores that require structured queries for more efficient document retrieval.

Check which vector databases support filtering: https://superlinked.com/vector-db-comparison

Data can be structured, unstructured, or semi-structured (see the demo below). This requires LLMs to have the capability of query construction.

data_structure
| Examples | Data Source | References |
| --- | --- | --- |
| Text-to-metadata-filter | VectorStore | Docs |
| Text-to-SQL | SQL DB | Docs; Blog; Blog |
| Text-to-SQL + Semantic | PGVector-supported SQL DB | Cookbook |
| Text-to-Cypher | Graph DB | Blog; Blog |
  1. Self-query retriever

    A self-querying retriever is one that, as the name suggests, has the ability to query itself. Specifically, given any natural language query, the retriever uses a query-constructing LLM chain to write a structured query (usually in JSON) and then applies that structured query to its underlying VectorStore. This allows the retriever to not only use the user-input query for semantic similarity comparison with the contents of stored documents but to also extract filters from the user query on the metadata of stored documents and to execute those filters.

    LangChain Docs:

    (v0.2): How to do “self-querying” retrieval

    (v0.1): Self-querying

    Integration: Components -> Retrievers -> Self-querying retrievers -> Qdrant

    self_query.png

    Text-to-metadata-filter: VectorStores equipped with metadata filtering enable structured queries to filter embedded unstructured documents.

  2. Prompt templates and output parsers

    Prompt analysis and prompt template: converting user’s query to filtering conditions

    • When constructing queries, the system uses a specific JSON format to organize the query and filters. The prompt is designed to create structured queries that can be applied to a document database or vector store (a minimal example is sketched after this list). The queries consist of two main components:

      • Query: The natural language query string that is used to match the document content.
      • Filter: Logical conditions used to filter the documents based on specific metadata attributes.
    • Comparison Operations

      Comparison operators (comp) are used to compare attributes (like year, name, time, product, or team) in the document with specific values provided by the user. Here are the comparison operators:

      • eq: Equals (e.g., eq("team", "TSE") matches documents where the team is “TSE”).
      • ne: Not equal (e.g., ne("name", "Ashley") matches documents where the name is not Ashley).
      • gt: Greater than (e.g., gt("year", 2023) matches documents with a year greater than 2023).
      • gte: Greater than or equal to (e.g., gte("year", 2022) matches documents from the year 2022 or later).
      • lt: Less than (e.g., lt("year", 2021) matches documents created before 2021).
      • lte: Less than or equal to (e.g., lte("time", 13) matches documents with a time length of 13 mins or lower).
      • contain: Contains (e.g., contain("product", "gold") matches documents where the product contains the word “gold”).
      • like: Similar to or like (used for pattern matching).
    • Logical Operations

      Logical operators combine multiple conditions (comparisons) into a single filter:

      • and: Logical AND (e.g., and(gt("year", 2022), eq("product", "gold")) matches documents created after 2022 that are related to the gold card product).
      • or: Logical OR (e.g., or(eq("team", "TS"), eq("team", "TSE")) matches documents that are either TS or TSE).
      • not: Logical NOT (e.g., not(eq("name", "Ashley")) matches documents where Ashley is not the owner).

    Output parser: This output parser can be used when you want to return multiple fields or you need the response to be formatted.

    LangChain Docs: Structured output parser

    API: StructuredQueryOutputParser
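
To make the query-plus-filter format above concrete, here is a hypothetical structured query written as a Python dict; the attribute names and values are invented for illustration.

```python
# Hypothetical structured query in the query-plus-filter format described above,
# expressed as a Python dict; attribute names and values are made up.
structured_query = {
    # Free-text part, matched against document content via semantic similarity.
    "query": "quarterly revenue highlights for the gold card product",
    # Metadata filter built from the comparison and logical operators above.
    "filter": 'and(gt("year", 2022), eq("team", "TSE"), contain("product", "gold"))',
}

# A self-querying retriever would translate the filter string into the native
# filter syntax of the underlying vector store before running the search.
```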

Advanced Retrieval Techniques

  1. Vector Store-Backed Retriever: A retriever that uses a vector database to store document embeddings and retrieve documents based on their proximity to the query embedding.

    basic_index_retrieval
  2. Fusion Retrieval or hybrid search: Combining multiple retrieval strategies (semantic similarity retrieval and keyword retrieval) to obtain a more diverse set of results (a minimal sparse-plus-dense sketch appears after this list).

    LangChain Docs:

    v0.2: How to combine results from multiple retrievers

    v0.1: Ensemble Retriever

    API: EnsembleRetriever

    Code: EnsembleRetriever

    fusion_retrieval

    The EnsembleRetriever is a retrieval strategy that enhances retrieval performance by combining multiple retrievers. This approach leverages the strengths of different types of retrievers to compensate for each other’s weaknesses. A common example is combining a Sparse Retriever (e.g., BM25, which performs keyword-based retrieval) with a Dense Retriever (which performs semantic similarity retrieval based on embeddings). This combination works because sparse and dense methods complement each other.

    Sparse vs. Dense Representation

    1. Sparse Representation:
      • High-dimensional sparse vectors: Documents and queries are represented as high-dimensional vectors, but most dimensions have zero values. This is typical of traditional information retrieval methods like TF-IDF and BM25.
      • Term frequency: Each dimension corresponds to a term, and the vector values represent term frequencies or weights (e.g., TF-IDF weights).
      • Sparsity: Since a document or query contains only a small subset of all possible terms, most dimensions in the vector are zero, which makes it “sparse.”
    2. Dense Representation:
      • Low-dimensional dense vectors: Documents and queries are represented as low-dimensional vectors, where most or all dimensions have non-zero values. This representation is typically generated by deep learning models like BERT.
      • Semantic embeddings: The vectors capture semantic and contextual information, rather than just term frequency.
      • Density: All dimensions in the vector usually have non-zero values, hence “dense.”

    Sparse and Dense Retrievers

    • Sparse Retriever: The name comes from the fact that most elements in the vector representation of documents and queries are zero. It works well for exact keyword matches but may miss semantically relevant content that uses different vocabulary.
    • Dense Retriever: The name reflects that the vector representation has mostly non-zero values. Dense retrievers perform better at capturing the meaning behind the text and finding semantically related content, even when the exact terms differ.

    Combining Sparse and Dense Retrievers

    By combining sparse and dense retrievers, the EnsembleRetriever can retrieve relevant documents more effectively:

    • The Sparse Retriever excels at matching specific keywords or phrases.
    • The Dense Retriever is better at capturing the semantic meaning and context, helping to retrieve documents even when exact terms differ.

    This combination creates a more robust retrieval system, addressing both lexical matches (through sparse retrieval) and semantic relevance (through dense retrieval).

    LangChain Doc: BM25 Retriever

    API: BM25Retriever

    Code: BM25Retriever

    Python Package: rank_bm25

  3. Sentence Window Retrieval: Retrieving extended context before and after the relevant chunk, rather than only the relevant chunk itself, which reduces information loss.

    sent_window_retrieval
  4. Parent Document Retrieval: Instead of sending the multiple smaller chunks to the LLM, the system merges them into their larger parent chunk. This allows for more contextualized information to be fed to the LLM, giving it a broader and more coherent set of data to generate an answer.

    parent_retrieval

    LangChain Doc: Parent Document Retriever

    API: ParentDocumentRetriever

    Code: ParentDocumentRetriever

  5. Hierarchical index retrieval: By structuring the search in two layers—summaries for broad filtering and chunks for detailed search—this hierarchical approach increases efficiency, making it easier to find and synthesize relevant information, especially when dealing with large document sets.

    hierarchical_retrieval
  6. Hypothetical Questions: This technique involves having the language model generate hypothetical questions for each chunk of a document. These hypothetical questions are then embedded, and retrieval is performed based on these question embeddings, improving the relevance of the results.

    LangChain Doc: hypothetical-queries

  7. MultiVector Retriever: The MultiVector Retriever is a higher-level category that encompasses parent document retrieval, hierarchical index retrieval, and hypothetical questions.

    LangChain Doc: MultiVector

    Summary: Runnable interface
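
A minimal sparse-plus-dense hybrid search, as referenced in the Fusion Retrieval item above. It uses the `rank_bm25` package mentioned in that item, a placeholder `embed()` function, and a hand-rolled weighted score fusion rather than LangChain’s EnsembleRetriever; the corpus and weights are illustrative.

```python
import numpy as np
from rank_bm25 import BM25Okapi        # the sparse (keyword) retriever mentioned above

# Hybrid sparse + dense retrieval with simple weighted score fusion.
# embed(text) -> np.ndarray is a placeholder; corpus and alpha are illustrative.

corpus = [
    "tesla robotaxi cybercab unveiling",
    "waymo lidar and radar sensor stack",
    "spotify audiobooks subscription plan",
]
bm25 = BM25Okapi([doc.split() for doc in corpus])
dense_vectors = [embed(doc) for doc in corpus]

def hybrid_search(query, alpha=0.5, k=2):
    sparse = np.array(bm25.get_scores(query.split()))            # keyword relevance
    q = embed(query)
    dense = np.array([np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v))
                      for v in dense_vectors])                   # semantic relevance
    normalize = lambda s: (s - s.min()) / (s.max() - s.min() + 1e-9)
    combined = alpha * normalize(sparse) + (1 - alpha) * normalize(dense)
    return [corpus[i] for i in np.argsort(combined)[::-1][:k]]
```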

Post-Retrieval Enhancements

  1. Re-ranking: After retrieving the documents, the system re-ranks or filters them to ensure that the most relevant results appear at the top.
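
A hedged sketch of this re-ranking step using a cross-encoder that scores (query, document) pairs jointly; the checkpoint name is a commonly used public model and is only an example, not a recommendation from the referenced articles.

```python
from sentence_transformers import CrossEncoder

# Re-rank retrieved documents by scoring each (query, document) pair jointly.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, retrieved_docs, top_k=3):
    scores = reranker.predict([(query, doc) for doc in retrieved_docs])
    ranked = sorted(zip(retrieved_docs, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]
```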

Reference

  1. Advanced RAG Techniques: an Illustrated Overview
  2. RAG System Techniques
  3. LangChain

Substack

This new model.out_head output layer has its requires_grad attribute set to True by default, which means that it’s the only layer in the model that will be updated during training. Technically, training the output layer we just added is sufficient. However, as I found in experiments, finetuning additional layers can noticeably improve the predictive performance of the finetuned model.

― Building A GPT-Style LLM Classifier From Scratch - Sebastian Raschka [Link] [Github]
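
A minimal sketch of the head swap described in the quote, assuming a GPT-style `model` whose final projection is stored in `model.out_head` (as in the article) and a hidden size of 768; `input_batch` and `labels` are assumed to come from a spam-classification dataloader.

```python
import torch

# Sketch of the classification-head swap from the quote above. `model`,
# `input_batch`, and `labels` are assumed; emb_dim is illustrative.

num_classes = 2                              # e.g., spam vs. not spam
emb_dim = 768                                # hidden size of the small GPT-2 model

for param in model.parameters():             # freeze the pretrained weights ...
    param.requires_grad = False

# ... then replace the output head; a new nn.Linear has requires_grad=True by default.
model.out_head = torch.nn.Linear(emb_dim, num_classes)

# Finetune on the logits of the LAST token: with a causal mask it is the only
# position that has attended to every other token in the input.
logits = model(input_batch)[:, -1, :]        # shape: (batch_size, num_classes)
loss = torch.nn.functional.cross_entropy(logits, labels)
```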

Interesting questions addressed by Sebastian:

  1. Do we need to train all layers?

    “For classification finetuning, it is not necessary to update all layers in an LLM. (The fewer weights we update, the faster the training will be because we don’t need to compute the gradients for these weights during backpropagation.)”

  2. Why finetuning the last token, not the first token?

    “In contrast to BERT, GPT is a decoder-style model with a causal attention mask. This means the first token has no context information of any other token in the input. Only the last token has information about all other tokens. Hence, if we want to use models like GPT for classification finetuning, we should focus on the last token to capture contextual information of all other input tokens.”

  3. How does BERT compare to GPT performance-wise?

    “The small GPT-2 model from the previous section and BERT performed similarly well on the spam classification dataset. “

  4. Should we disable the causal mask?

    “A core feature of the GPT architecture is the causal attention mask (different from BERT models or the original transformer architecture). However, we could actually remove the causal mask during classification finetuning, which would allow us to finetune the first rather than the last token since future tokens will no longer be masked, and the first token can see all other tokens.”

  5. What impact does increasing the model size have?

    The prediction accuracy can improve significantly with larger models.

  6. What improvements can we expect from LoRA?

    Both full finetuning (all layers) and LoRA can result in the same test set performance.

    “On the small model, LoRA is slightly slower since the additional overhead from adding LoRA layers may outweigh the benefits, but when training the larger 1.5 billion parameters model, LoRA trains 1.53x faster.”

  7. Padding or no padding? [experiments]

    “If we want to process data in batches during training or inference (this involves processing more than one input sequence at a time), we need to insert padding tokens to ensure that the training examples are of equal length.

    In regular text generation tasks, padding doesn’t affect the model response since padding tokens are usually added to the right side, and due to the causal mask discussed earlier, these padding tokens don’t influence the other tokens. However, remember that we finetuned the last token, as discussed earlier. Since the padding tokens are to the left of this last token, the padding tokens may affect the result. “

These Are The 6 Best Science-Based Study Strategies - Super Learning Lab [Link]

  1. Spaced Practice

    Instead of cramming all the information at once, spaced practice consists of revisiting the material multiple times with breaks in between.

  2. Interleaving

    This is about studying different topics in a sequence.

  3. Retrieval

    This consists of bringing learned information from mid to long-term memory by recall or retrieval practices.

  4. Elaboration

    Elaborative interrogation consists of asking and explaining why and how things work based on prior knowledge. In other words, it involves connecting new information to preexisting knowledge.

  5. Concrete Example

    When learning abstract concepts it was found that illustrating these topics with specific examples improves learning.

  6. Dual Coding

    Dual coding is about combining words with visuals. If you use relevant and helpful images in your notes, you may increase learning by remembering what you study with the help of these images.

The \(\$120\) billion wagered on sports betting in America in 2023 translated into nearly \(\$11\) billion in revenue for sports betting companies. This corresponds to the ~9% fee sportsbooks keep after all bets have been settled.

  • Flutter: Leverages FanDuel’s dominance and global expertise.
  • DraftKings: Focuses on innovation and user engagement to fuel growth.
  • Entain: Bets on BetMGM’s success in the US market.
  • Penn: Leverages the ESPN partnership to challenge established players.

― Sports Betting Economics - App Economy Insights [Link]

For decades, companies have outsourced their organizational innovation to consultants or enterprise software vendors who develop generalized approaches based on what they see across many organizations. That won’t work here, at least for a while. Nobody has special information about how to best use AI at your company, or a playbook for how to integrate it into your organization.

― AI in organizations: Some tactics - One Useful Thing [Link]

Issues with AI at the organizational level and how to solve them.

  1. In many companies, there is little AI use and few productivity gains outside of narrow permitted use cases. That’s because AI use that boosts individual performance does not always translate to boosting organizational performance for a variety of reasons. To get organizational gains requires R&D into AI use and you are largely going to have to do the R&D yourself.
  2. “Many key breakthrough innovations come not from central R&D labs, but from people actually using products and tinkering with them to solve their own problems. “ (Prof. Eric von Hippel). As users are very motivated to make their own jobs easier with technology, they find ways to do so. The user advantage is especially big in experimenting with Generative AI because the systems are unreliable and have a jagged frontier of capability. People are experimenting with AI and finding it very useful. But they aren’t sharing their results with their employers.

How do we solve these issues? What are the tactics? Talent in the lab should focus on building, not analysis or abstract strategy.

  • Build AI benchmarks for your organization. [Anthropic’s guide to benchmarking]
  • Build prompts and tools that work.
  • Build stuff that doesn’t work… yet.
  • Build provocations and magic.

The USA vs Visa - Net Interest [Link]

Key elements of Doha Mekki’s recent antitrust lawsuit against Visa:

  1. Visa controls over 60% of U.S. debit transactions, with Mastercard far behind at 25%.
  2. Visa traps merchants with pricing that penalizes them if they don’t process all transactions through Visa.
  3. Exclusive deals incentivize merchants to use Visa exclusively, reducing competition.
  4. Visa prevents potential competitors like PayPal and Apple from entering the market by locking them into restrictive agreements.
  5. Visa has faced antitrust lawsuits since 1971 and maintains a large legal team to manage ongoing cases.

In physics, we study how particles or systems’ units interact and evolve toward stable states. In machine learning, we study how neurons (or artificial neurons) interact to learn patterns directly from data. The connection lies in energy minimization: both approaches define an energy function to describe the stability of a system, and the optimization of this function helps to find optimal configurations that correspond to useful patterns or memories.

Hopfield developed a network that recreates patterns using energy minimization, while Hinton expanded on this with the introduction of Boltzmann machines, statistical physics-based systems that learn to recognize and generate patterns, providing groundwork for modern machine learning.

― Nobel Prize to the Statistical Physics of artificial neural networks - Complexity Thoughts [Link]

“The laws governing physical systems also apply to the world of artificial intelligence.”
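
To make the energy-minimization connection concrete, here is a toy Hopfield network sketch: patterns are stored with a Hebbian rule, and a corrupted cue is recovered by updates that lower the energy \(E(s) = -\tfrac{1}{2} s^\top W s\). The patterns below are arbitrary illustrative vectors, not from the prize lectures.

```python
import numpy as np

# Toy Hopfield network: store two binary (+1/-1) patterns with a Hebbian rule,
# then recover one from a corrupted cue by descending the energy
# E(s) = -1/2 * s^T W s. Purely illustrative of the energy-minimization idea.

patterns = np.array([[ 1, -1,  1, -1,  1,  1, -1, -1],
                     [-1, -1,  1,  1, -1,  1,  1, -1]])

W = sum(np.outer(p, p) for p in patterns) / patterns.shape[1]
np.fill_diagonal(W, 0)                       # no self-connections

def energy(state):
    return -0.5 * state @ W @ state

state = patterns[0].copy()
state[:2] *= -1                              # corrupt the cue by flipping two bits
print("energy before:", energy(state))

for _ in range(5):                           # asynchronous update sweeps
    for i in range(len(state)):
        state[i] = 1 if W[i] @ state >= 0 else -1

print("energy after:", energy(state))        # lower energy after the updates
print(np.array_equal(state, patterns[0]))    # True: the stored pattern is recovered
```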

Major AI Functionalities / with Apps - AI Supremacy [Link]

A list of AI products you can experiment with.

Fine-tuning can be useful for certain tasks (see the relevant section here for more details), but when it comes to injecting morality into your LLM, it’s probably not a good bet.

By combining the strengths of diffusion models and auto-regressive generation, DGLM offers a more nuanced, adaptable, and potentially more effective approach to generating safe and creative text. It moves away from the brute-force, one-size-fits-all approach of fine-tuning and embraces a more modular, dynamic, and personalized approach to AI safety.

― A New Way to Control Language Model Generations [Breakdowns] - Artificial Intelligence Made Simple [Link]

The author lists drawbacks of fine tuning:

“A model’s knowledge and capabilities are learnt almost entirely during pretraining, while alignment teaches it which subdistribution of formats should be used when interacting with users.” - LIMA: Less Is More for Alignment [Link]

“Our findings reveal that while unsupervised fine-tuning offers some improvement, RAG consistently outperforms it, both for existing knowledge encountered during training and entirely new knowledge. Moreover, we find that LLMs struggle to learn new factual information through unsupervised fine-tuning.” - Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs [Link]

“Disconcertingly, our research also reveals that, even without malicious intent, simply fine-tuning with benign and commonly used datasets can also inadvertently degrade the safety alignment of LLMs, though to a lesser extent. These findings suggest that fine-tuning aligned LLMs introduces new safety risks that current safety infrastructures fall short of addressing — — even if a model’s initial safety alignment is impeccable” - Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! [Link]

“The base model generates a wide range of nationalities, with American, British, and German being the top three. In contrast, the aligned model only generates three nationalities: American (highest percentage), Chinese, and a small percentage of Mexican.” - Creativity Has Left the Chat: The Price of Debiasing Language Models [Link]

Just do it! Brand Name Lessons from Nike’s Troubles - Musings on Markets [Link]

Brand value is often mixed with other advantages like scale, network effects, and product differentiation. Strong brands yield higher revenues, pricing power, and potentially lower capital costs. And it’s hard to separate brand value in companies with multiple advantages.

Spotify: Layoffs Pay Off - App Economy Insights [Link]

Key business highlights: 1) Spotify’s new subscription plans - Audiobooks Access (\(\$9.99\) per month for 15 hours) and Basic (removing audiobooks from Premium) - diversify its offerings, 2) Spotify is incorporating social-style discovery features, like Live Listening Parties and prompt-based AI playlists, but remains behind YouTube and Meta in algorithm sophistication and live content, 3) Spotify’s ad-supported ARPU is low (€1.15 vs. Meta’s \(\$11.89\) in Q2), limiting ad revenue potential; a new in-house creative agency may improve brand experiences but is challenging to scale profitably, 4) Spotify pivoted from broad podcasting investments to a case-by-case approach, now pushing video podcasts.

Competitiveness: 1) Hub Entertainment Research shows Spotify’s high ‘must-have’ appeal, with 75% of US users viewing it as “uncancellable.” This loyalty supports Spotify’s growing free cash flow and a valuation around 40 times Free Cash Flow (FCF) —placing it ahead of rivals like YouTube Music (100M subscribers) and Apple Music (estimated 110M by 2025), 2) Despite solid growth, Spotify’s reliance on licensed content and its still-limited ad revenue leave room for competition. While TikTok’s 1B+ users could funnel into TikTok Music, ByteDance recently announced it will close TikTok Music by November, focusing instead on promoting artists and streaming value within the main app—a potential competitive break for Spotify.

Cybersecurity Earnings - App Economy Insights [Link]

Covered Palo Alto Networks, CrowdStrike, Fortinet, Zscaler, and Cloudflare.

Two Nobel Prizes for AI, and Two Paths Forward - Marcus on AI [Link]

Hinton’s focus on end-to-end neural networks can be limiting, especially when considering the complexities of real-world problems that often require more structured and hybrid approaches. On the other hand, Hassabis’s embrace of neurosymbolic AI reflects an openness to different methodologies and a recognition that a combination of techniques may yield better results.

First, Waymo is using transformer-based foundation models for all stages of its self-driving pipeline: perception, prediction, and planning. Second, the whole system is trained end to end. During training, gradients from the behavior network propagate backwards to the perception network.

So I see more similarities than differences in the evolution of Waymo and Tesla’s self-driving software. Both companies made little to no use of neural networks in their early systems. Both companies started using neural networks for perception in the late 2010s. And both companies only recently shifted to end-to-end architectures that used neural networks for all stages of the self-driving pipeline.

― Elon Musk wants to dominate robotaxis—first he needs to catch up to Waymo - Understanding AI [Link]

Tesla’s advantages compared to Waymo:

  1. Tesla already has millions of vehicles on the road, which could quickly deploy robotaxi software without the need for new hardware.
  2. Tesla relies on cost-effective, camera-based perception without expensive sensors like lidar, which Waymo uses. This could lower Tesla’s per-vehicle cost and allow it to expand more rapidly if autonomy is achieved.
  3. Tesla’s transition to a full, end-to-end neural network approach for perception, prediction, and planning has improved FSD’s ability to handle complex driving situations without manual coding for specific scenarios.

Tesla’s disadvantages compared to Waymo:

  1. Tesla hasn’t deployed a fully driverless car yet, while Waymo has offered driverless rides since 2020.
  2. Tesla would need extensive infrastructure to maintain a robotaxi network (for charging, cleaning, and repairs) which it currently lacks. Building this up in cities nationwide would take time, resources, and logistical planning.
  3. Tesla’s camera-only approach may struggle in certain conditions (e.g., low visibility), which lidar could handle better.

Key Contributions of AI to Robotics:

  1. Improved Reasoning and Planning: Large language models, like those developed by OpenAI and Google, are enabling robots to interpret high-level commands, understand contextual instructions, and execute complex, multi-step tasks. This is particularly valuable in dynamic environments where robots must adapt to unforeseen changes and make real-time decisions.
  2. Enhanced Visual and Motor Coordination: The integration of generative AI with visual and motor feedback systems allows robots to translate visual inputs into precise motor actions. This enables robots to perform tasks such as picking and placing objects with greater accuracy and efficiency, even in environments that are constantly changing.
  3. Natural Language Interfaces: AI-driven natural language interfaces are making it easier for users to interact with robots using everyday language rather than programming code. This democratization of robotics makes it accessible to non-technical users, paving the way for broader adoption across industries.
  4. Predictive Maintenance: AI models analyze real-time data from robots to predict potential malfunctions, enabling proactive maintenance that minimizes costly downtime and enhances operational efficiency.

― Generative AI and Robotics in 2024 - AI Supremacy [Link]

  1. Glue: The less-glamorous stuff that helps a team succeed

    • Strike the right balance between glue and core work. Do enough glue work to show leadership promotable artifacts, but not so much that your core work suffers.
    • Lead meetings, take notes, and share them to provide value to the right stakeholders.
    • Send your manager monthly recaps of your accomplishments so they can more easily sponsor you and your work.

  2. Grit: The will to pursue a long-term goal despite challenges

    • Break your projects into achievable milestones so you can constantly feel progress, even for long projects.
    • View failures as progress. It’s one less route you need to explore now.
    • Take breaks and build in fun. I set up icebreakers at the start of our meetings, organized team events, and pushed for production freezes.

  3. Friction: The gap between reality and the ideal state

    • Find ways to unblock yourself and the people around you. Do this enough, and you’ll have mastered removing friction.
    • Removing friction paints you as a force multiplier. Force multipliers get promoted.

― 3 Career Principles that got me to Director at Google - High Growth Engineer [Link]

  • Processing Fluency: The ease with which information is perceived and processed by the human mind. High fluency can lead to positive evaluations, even if the information is not accurate.
  • Halo Effect: A cognitive bias where a positive overall impression of something (e.g., an LLM’s fluency) influences the evaluation of its individual attributes (e.g., truthfulness).
  • Inter-rater Agreement (IRA): A measure of how much two or more evaluators agree on their assessments. Low IRA indicates potential problems with the evaluation design or guidelines.
  • Extrinsic Evaluation: Assessing the impact of an LLM’s output on an end-user task or system (e.g., measuring productivity gains from using an LLM-powered email assistant).
  • Intrinsic Evaluation: Evaluating the properties of the LLM-generated text itself, such as fluency, coherence, and factual accuracy.

― How Amazon is Rethinking human evaluation for generative large language models [Breakdowns] - Artificial Intelligence Made Simple [Link]

A pretty comprehensive overview of human evaluation of GenAI models.

US Banks: Soft Landing? - App Economy Insights [Link]

*A recent pre-print paper found that at least 5% of new Wikipedia articles in August 2024 were AI generated, Facebook isn’t doing anything to stop AI generated images of Hurricane Helene, and Goodreads and Amazon are grappling with various AI generated book schemes scamming people into buying pulp. It’s only the tip of the iceberg.*

― When Models Go MAD - Teaching computers how to talk [Link]

Hmm, AI is diluting reality. It’s concerning.

Elon Musk’s tech projects are inseparable from his authoritarian one - Blood In The Machine [Link]

Good discussion. Musk has a multifaceted role—entrepreneur, political influencer, and media magnate. His activities underscore a model of influence that defies precedent, blending financial might, technological ambition, and political maneuvering. Recognizing the interconnectedness of these efforts is crucial for understanding Musk not just as a private-sector innovator but as a power broker actively shaping the public and political spheres in ways that could redefine norms, values, and who gets to participate in his envisioned future.

YouTube and Podcast

Tesla, I don’t think it’s a car company; I think that’s misleading. This is a robotics company, a robotics-at-scale company, because “at scale” is also like a whole separate variable. They’re not building a single thing, they’re building the machine that builds the thing, which is a whole separate thing. So I think a robotics-at-scale company is what Tesla is.

I think with synthetic data you just have to be careful, because these models are silently collapsed, which is one of the major issues. If you go to ChatGPT and ask it to give you a joke, you’ll notice that it only knows three jokes; it gives you one joke most of the time, and sometimes it gives you three, because the models are collapsed and it’s silent. When you’re looking at any single individual output, you’re just seeing a single example, but when you actually look at the distribution, you’ll notice that it’s not a very diverse distribution; it’s silently collapsed. When you’re doing synthetic data generation, this is a problem, because you actually really want that entropy, you want the diversity and the richness in your data set.

― No Priors Ep. 80 | With Andrej Karpathy from OpenAI and Tesla - No Priors: AI, Machine Learning, Tech & Startups [Link]

Andrej Karpathy was a founding team member of OpenAI and the former leader of Tesla Autopilot. He discussed the evolution of self-driving cars, technical challenges, Tesla’s Optimus humanoid robot, and the bottlenecks of AI development today. The topic of how AI capabilities could be further integrated with human cognition sounds very futuristic and fun.

AI prompt engineering: A deep dive - Anthropic [Link]

Some of Anthropic’s prompt engineering specialists—Amanda Askell (Alignment Finetuning), Alex Albert (Developer Relations), David Hershey (Applied AI), and Zack Witten (Prompt Engineering)—share their insights on the evolution of prompt engineering, offer practical advice, and discuss how prompting could evolve as AI continues to advance.

Decoding Google Gemini with Jeff Dean - Google DeepMind [Link]

Jeff Dean, chief scientist of Google DeepMind and Google Research, discusses the past, present, and future of AI, especially the long-term potential of multimodal models like Gemini.

Shall We Repeal the Laws of Economics? - Oaktree Capital [Link]

Howard Marks addresses how politicians often ignore economic reality in their campaign promises, using examples like Trump’s call for tariffs and Harris’s attack on grocery profiteering. He emphasizes that economic laws are incontrovertible, and politicians can’t deliver on promises that contradict these laws; free markets allocate resources efficiently. And he highlights the ongoing political refusal to address issues like Social Security insolvency and national debt, stating that ignoring economic laws will eventually lead to negative outcomes.

Introducing OpenAI o1 - OpenAI [Link]

A series of videos from OpenAI introducing o1.

Ep17. Welcome Jensen Huang | BG2 w/ Bill Gurley & Brad Gerstner - Bg2 Pod [Link]

Dueling Presidential interviews, SpaceX’s big catch, Robotaxis, Uber buying Expedia?, Nuclear NIMBY - All-In Podcast [Link]

Meta VS Apple: What Their Battle Means For AI Startups - Y Combinator [Link]

The Waymo Way: Making Autonomous Driving a Reality | Dmitri Dolgov - U-M Computer Science and Engineering [Link]


Sam Altman: The Man Behind ChatGPT | Big Take - Bloomberg Podcasts [Link]

Markets turn Trump, Long rates spike, Election home stretch, Influencer mania, Saving Starbucks - All-In Podcast [Link]

What’s next for AI agentic workflows ft. Andrew Ng of AI Fund - Sequoia Capital [Link]

A fireside chat with Sam Altman OpenAI CEO at Harvard University - Harvard Business School [Link]

Q1: I’m curious how you rebuilt or rediscovered momentum after pivoting, in a professional or personal sense, in terms of your values.

  • We are in the greatest technological wave ever, so it’s a great time to be starting out your career. You are going to be flooded with opportunities for the next few years.

Q2: What do you think AI’s role will be in tackling inequalities in education, health care, and legal services?

  • It should reduce inequality.

Q3: A subscription model could be a barrier for startups and small businesses. Do you think OpenAI would explore alternative monetization strategies, such as free API access supported by advertising or other methods, to foster innovation in the future?

  • Sam hates ads in general. Ads + AI is uniquely unsettling to him. He likes the simplicity of OpenAI’s model, which is they make great AI and people pay them for it, and they just do the best they can for people.

Q4: Does increasing competition change the way you are evolving your next products?

  • They are just trying to figure out the next paradigm and the next great idea. They don’t pay much attention to market share, though they do pay at least some attention to competitors, maybe to get inspiration.
  • Sam’s hope is that every year, they do something amazing that people thought was impossible. Once you know that something is possible and roughly how to do it, it always gets copied quickly. That’s not the hard part. The hard part is figuring out what it is and doing it first when you don’t know it’s possible.

Q5: What do you think the ideal public general-education curriculum on AI should look like, and what does the average person need to know about AI in the next 5-10 years?

  • A computer science major, or at least some courses. Being able to train a GPT-2.

Q6: Can you share what’s coming next after transformers? Then the second question is what do you think most entrepreneurs and VCs are getting wrong about the future of AI?

  • Most entrepreneurs don’t bet on AI models getting massively better, which is why there is a meme that “OpenAI kills my startup”.

Q7: How much exposure do OpenAI and the AI movement have to energy constraints? And what role do founders have to play in addressing these concerns?

  • For Sam, it’s necessary to drive tech abundance from those two key inputs: energy and AI.

Articles and Blogs

Of the employees we studied, those with superior sales performance were genetically different from the rest of the group. They were better at learning in real time about new customers and new sales opportunities. From an initial conversation with a sales lead, they were able to quickly feel out the customer and propose appropriate products without being told what to recommend.

Adaptive learning is different; it isn’t trainable. It’s the ability to process new information in real time and immediately use it to achieve a positive result.

For example, sales teams often require junior employees to cold-call leads, even though they don’t know as much about the company as more experienced employees do. Most of them haven’t even learned how to sell yet. But our research shows that for adaptive learners, seniority and experience are less important. Employees with the sales gene quickly become knowledgeable about your products and are able to learn and adjust on the fly.

― There Really Is a “Sales Gene” - Juan Martinez, Harvard Business Review [Link]

Adaptive learning is an important skill but might not correlate with seniority or experience. Employees with this capability can quickly become knowledgeable about the products and are able to learn and adjust on the fly.

The article suggests that managers or companies could be given a snapshot of how many of their salespeople are adaptive learners without singling out any individual, and that they could identify which tasks require adaptive learning skills and which don’t, allowing employees to choose. This should be done anonymously.

Regardless of whether what’s proposed in this article is applicable or ethical, the idea of adaptive learning is new to me. It inspires me to think further about whether this skill is learnable and teachable, and whether there are other secret skills in sales.

New Rules for Teamwork - Harvard Business Review [Link]

  1. Develop an Operating System

    The operating system comprises the building blocks for the way team members collaborate, create change, and support one another. Effective operating systems vary widely, depending on the needs and norms of the organization. What they all have in common is that they set out a view of how teams create value, what teams are supposed to achieve, the technical skills each team member is expected to contribute, the processes by which the work will be managed, and the cultural norms and mindsets of constructive collaboration that will guide behavior.

    Suggestions: hold kickoffs, conduct one-on-ones, and take stock of progress using retrospectives; these three practices are the foundation of a team OS.

  2. Invest in Active, Real-Time Measurement

    To make teamwork scientific, organizations need to be able to measure the outcomes of their actions and determine how changes in the inputs affect results.

    Suggestion: define what constitutes success.

  3. Create a System for Continuous Improvement and Innovation

    Teams today have new forms of technology and data collection at their disposal to help them self-correct while projects are underway. e.g. support colleagues to discuss what could have been done better; look at the patterns across teams to identify improvements and share best practices, particularly with regard to the rapid adoption of new technologies such as GenAI.

    Suggestions: Identify the metrics that matter most (shift-changeover time, perhaps), hypothesize which actions could improve performance in those areas (preassigned workstations, perhaps), and embed technologies in the operating system (a smart-planning app, perhaps) to enable continuous improvement. Continuous improvement can occur only when all perspectives are considered and all teams have access to a centralized knowledge repository. Finally, it may be useful to set up a center of excellence, staffed with full-time employees with experience in analytics and operating system design.

Why Leadership Teams Fail - Harvard Business Review [Link]

I was reading this while thinking about my own team and neighboring teams, and I find this article very useful.

A critical factor in organizational success: the health of their leadership team. There are three main patterns of dysfunction: Shark Tanks, Petting Zoos, and Mediocracies.

  1. Shark Tanks

    • Definition: A leadership team marked by hyper-competition, political maneuvering, and infighting. Members prioritize personal agendas over collective goals, leading to toxic and combative dynamics.

    • Causes: Lack of clear direction or boundaries from the CEO or team leader. Failure to address self-serving behaviors early on. Absence of behavioral norms that encourage collaboration.

    • Signs: Team members engage in power struggles outside of meetings. One-on-one discussions with the CEO on issues that should be resolved in team settings. Meetings turn into battlegrounds, with frequent arguments and difficulty reaching consensus. Executives bad-mouth each other, form alliances, or resist decisions after they’ve been made.

    • Prevention:

      • Clear Expectations: Leaders should explicitly define which behaviors are acceptable and unacceptable. Set boundaries around how competition should be managed.

      • Confront Self-Serving Behaviors: Address aggressive or toxic behaviors directly with individuals. Remove those unwilling to align with the team’s goals, even if they’re high performers.

      • Role Modeling: The CEO or team leader must model collaborative behaviors and ensure transparency in communication to prevent political games.

      • Regular Feedback: Reinforce positive behaviors and correct negative ones through continuous feedback. Implement 360-degree reviews to track team behavior and performance alignment.

  2. Petting Zoos

    • Definition: A leadership team that avoids conflict to maintain harmony. Vigorous debate is sacrificed, and members prioritize getting along over pushing for the best ideas, leading to complacency and poor decision-making.

    • Causes: Overemphasis on collaboration and mutual trust, leading to conflict avoidance. Team members are too deferential, fearing that disagreements might disrupt the team’s harmony. Leaders may unknowingly encourage this avoidance by stressing harmony over debate.

    • Signs: Meetings lack critical debate, and discussions feel muted and lacking in emotional intensity. Team members engage in performance theater, focusing on positive news while downplaying problems. Decisions are made by consensus without sufficient evaluation or challenge. Leaders avoid holding one another accountable for poor performance, reluctant to disrupt the status quo.

    • Prevention:

      • Encourage Debate: Leaders should foster a culture of constructive conflict where members feel safe to challenge each other’s ideas. A foundation of trust and psychological safety is key.
      • Promote Data-Driven Discussion: Ensure discussions are rooted in facts, using shared data to spur debate and avoid personal conflict. This encourages neutral, objective decision-making.
      • Monitor Meeting Dynamics: Leaders should track participation and the quality of discussion during meetings, encouraging team members to speak up and challenge ideas more openly.
      • Redefine Consensus: Teams must understand that consensus does not mean avoiding conflict but making informed decisions after rigorous debate.
  3. Mediocracies

    • Definition: A leadership team marked by complacency, lacking the drive or skills to achieve high performance. Collaboration and competition are both underemphasized, and the team fails to meet the organization’s needs.

    • Causes: Long periods of success that breed complacency. Poor alignment between the team’s skills and the changing demands of the business. A divided team, where some members prefer competition while others favor collaboration, leading to inconsistent and ineffective efforts. A leader’s failure to adapt to changing market conditions or internal challenges.

    • Signs: Team members operate in silos, with little collaboration between departments or units. Decision-making is slow, and there is a lack of accountability for performance. The team focuses on past achievements rather than future goals, with little ambition or drive for improvement. The team struggles with stagnation, missed opportunities, and duplicated efforts due to poor coordination.

    • Prevention:

      • Rebuild the Team: Leaders may need to replace members who are not fit for their roles or who lack the motivation or skills needed to lead effectively. New hires should be chosen not just for their skills but also for their alignment with the company’s purpose and values.
      • Promote Balance: Strike a balance between competition and collaboration by hiring individuals with complementary skills and styles (e.g., planners and visionaries alongside hard-nosed executors).
      • Clear Roles and Expectations: Define where collaboration is expected (e.g., across departments) and where competition might be useful (e.g., in individual market decisions). Ensure everyone understands their responsibilities and how their performance contributes to broader goals.
      • Challenge the Status Quo: Continuously push the team to innovate and grow by setting ambitious goals and holding team members accountable for driving performance improvements.

Without such an observability system–let’s call it Design System Observability–it could be too late when Uber learned through complaints and public media about the end users who would suffer confusing onboarding rides, inconsistent layouts, and frustrating voiceovers/talkbacks sessions.

― How to Measure Design System at Scale - Uber Blog [Link]

RAG is the most popular architecture of the LLM-based systems in 2023. There are many products built almost solely on RAG — from Question Answering services combining web search engines with LLMs to hundreds of chat-with-your-data apps.

― Advanced RAG Techniques: an Illustrated Overview - Medium [Link]

Machines of Loving Grace - Dario Amodei, CEO of Anthropic [Link]

This essay aligns with the view that one can acknowledge the risks and share concerns over idealistic promises while also recognizing the challenges inherent in building AGI. It reflects an approach to AI that balances ambition with caution, and it offers a perspective on how to articulate a vision for AI that remains both ambitious and grounded.

Interesting point: As AI reaches near-universal superiority, the economy may need to adapt fundamentally. Potential solutions range from universal basic income (UBI) to entirely new economic frameworks. Perhaps AIs might manage resources and distribute them according to value systems derived from human input, but such proposals raise ethical and practical concerns. The author hints that the economic transformation required could be as drastic as past societal shifts (e.g. from hunting-gathering to agriculture), suggesting that humanity will have to experiment and iterate to find sustainable models that protect against exploitation or dystopia.

Overall, the author believes that core human values such as fairness, cooperation, and autonomy have a natural, ‘overdetermined’ appeal, and will often lead toward democracy, rule of law, and enlightenment ideals - a trajectory that AI as a catalyst could accelerate by making the path to this “good world” more tangible. While the vision seems intuitive and inspiring to many, it may still appear fantastical or undesirable to others. Even so, the author finds a unique beauty in striving for it, suggesting that our intrinsic human impulses toward collaboration and justice make this vision both plausible and worth pursuing.

Andrew Ng’s writing in The Batch - The Batch [Link]

“The best we can do is a compromise: learn to recognize situations in which mistakes are likely and try harder to avoid significant mistakes when the stakes are high.” ― Daniel Kahneman, Thinking, Fast and Slow

― Unleashing System 2 Thinking? AlphaCodium Outperforms Direct Prompting of OpenAI o1 - qodo [Link]

System 1 thinking: fast responses with surface-level understanding;

System 2 thinking: deliberate methodical and reasoned problem solving.

Introducing the Realtime API - OpenAI Blog [Link]

Developers can now build fast speech-to-speech experiences into their applications

With canvas, ChatGPT can better understand the context of what you’re trying to accomplish. You can highlight specific sections to indicate exactly what you want ChatGPT to focus on. Like a copy editor or code reviewer, it can give inline feedback and suggestions with the entire project in mind.

You control the project in canvas. You can directly edit text or code. There’s a menu of shortcuts for you to ask ChatGPT to adjust writing length, debug your code, and quickly perform other useful actions. You can also restore previous versions of your work by using the back button in canvas.

― Introducing canvas - OpenAI [Link]

Canvas offers a new interface for project work that requires editing and revision.

Multi document agentic RAG: A walkthrough - LanceDB [Link]

This tutorial shows you how to build a multi-document agentic RAG system using LanceDB and LlamaIndex for complex information retrieval. Specifically, the walkthrough demonstrates how to integrate LLMs, vector databases, and agent-based reasoning for enhanced information retrieval and task completion.
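To make the pattern concrete, here is a minimal sketch under some assumptions: it uses recent llama-index packages with the llama-index-vector-stores-lancedb integration and the default OpenAI LLM/embeddings (so an OPENAI_API_KEY is needed); the file names and tool names are hypothetical, and the walkthrough's actual code may differ.

```python
# Hedged sketch: one LanceDB-backed index (and query tool) per document,
# wrapped by a ReAct agent that decides which document(s) to query.
from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import QueryEngineTool
from llama_index.vector_stores.lancedb import LanceDBVectorStore

def build_doc_tool(path: str, name: str) -> QueryEngineTool:
    """Index a single document into its own LanceDB table and expose it as a tool."""
    docs = SimpleDirectoryReader(input_files=[path]).load_data()
    store = LanceDBVectorStore(uri="./lancedb", table_name=name)
    storage = StorageContext.from_defaults(vector_store=store)
    index = VectorStoreIndex.from_documents(docs, storage_context=storage)
    return QueryEngineTool.from_defaults(
        query_engine=index.as_query_engine(),
        name=name,
        description=f"Answers questions about {name}",
    )

# Hypothetical input files: one tool per document.
tools = [
    build_doc_tool("report_2023.pdf", "report_2023"),
    build_doc_tool("report_2024.pdf", "report_2024"),
]
agent = ReActAgent.from_tools(tools, verbose=True)  # agent picks tools per question
print(agent.chat("Compare the conclusions of the two reports."))
```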

Reports and Papers

Learning vs Retrieval: The Role of In-Context Examples in Regression with LLMs [Link]

The paper explores in-context learning (ICL) mechanisms in large language models (LLMs), focusing on the balance between knowledge retrieval and learning from in-context examples in regression tasks. It reports that LLMs can learn from regression examples of realistic datasets in-context, extending previous work on synthetic data to more practical scenarios.

I’m looking forward to more experiments and studies of this kind, because I have been skeptical about applying LLMs to structured data.

Larger and more instructable language models become less reliable [Link]

The issue might stem from the nature of LLMs, which are designed to generate plausible responses based on patterns in the data they’ve seen, rather than to know anything in the traditional sense. They don’t have an internal mechanism to differentiate truth from fabrication, so as they scale up, they produce more complex, yet not necessarily more accurate, answers. This makes them better at appearing smart, but less reliable overall—a quality that philosophers like Mike Hicks rightly criticize as “bullshitting.”

From a user perspective, it underscores the need for critical thinking when engaging with AI models through prompt engineering. Just because an LLM provides a well-phrased response doesn’t mean it’s accurate.

o1—like previous LLMs—is sensitive to the probability of examples and tasks, performing better and requiring fewer “thinking tokens” in high-probability settings than in low-probability ones.

― When a language model is optimized for reasoning, does it still show embers of autoregression? An analysis of OpenAI o1 [Link]

Although optimized for reasoning, o1 still exhibits probability-based limitations tied to its autoregressive origins, implying that a complete departure from these influences has not been fully achieved.

VideoPrism: A foundational visual encoder for video understanding - Google Research [Link]

Astute RAG: Overcoming Imperfect Retrieval Augmentation and Knowledge Conflicts for Large Language Models [Link]

Astute RAG is designed to better combine internal and external information through an interactive consolidation mechanism (i.e., identifying consistent passages, detecting conflicting information in them, and filtering out irrelevant information).

Differential Transformer - Microsoft Research [Link]

Diff Transformer amplifies attention to relevant context while canceling noise; in the experiments it outperforms standard Transformers in areas such as long-context modeling, key information retrieval, hallucination mitigation, in-context learning, and reducing activation outliers.

The key innovation is a differential attention mechanism that calculates attention scores by subtracting two separate softmax attention maps. This subtraction cancels out irrelevant attention, promoting sparser, more accurate focus on important information, similar to noise-canceling techniques.
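As a rough illustration of that mechanism (not the paper's full implementation, which is multi-head and also re-parameterizes the lambda scalar and applies per-head normalization), a single-head sketch in PyTorch might look like this:

```python
import torch
import torch.nn.functional as F
from torch import nn

class DifferentialAttention(nn.Module):
    """Single-head sketch: the attention map is the difference of two softmax
    maps, which cancels common-mode "noise" attention. lambda_ is a learnable
    mixing scalar (its init value here is illustrative)."""

    def __init__(self, d_model: int, d_head: int):
        super().__init__()
        # Two sets of query/key projections, one shared value projection.
        self.q_proj = nn.Linear(d_model, 2 * d_head, bias=False)
        self.k_proj = nn.Linear(d_model, 2 * d_head, bias=False)
        self.v_proj = nn.Linear(d_model, d_head, bias=False)
        self.lambda_ = nn.Parameter(torch.tensor(0.8))
        self.scale = d_head ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        q1, q2 = self.q_proj(x).chunk(2, dim=-1)
        k1, k2 = self.k_proj(x).chunk(2, dim=-1)
        v = self.v_proj(x)
        a1 = F.softmax(q1 @ k1.transpose(-2, -1) * self.scale, dim=-1)
        a2 = F.softmax(q2 @ k2.transpose(-2, -1) * self.scale, dim=-1)
        return (a1 - self.lambda_ * a2) @ v  # differential attention output

out = DifferentialAttention(d_model=64, d_head=32)(torch.randn(2, 16, 64))
print(out.shape)  # torch.Size([2, 16, 32])
```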

Diffusion Guided Language Modeling [Link]

Controllable language modeling refers to techniques that allow users to guide or control specific attributes of the generated text from a language model (LM). These attributes can include factors like sentiment, toxicity, formality, or any other desired linguistic or stylistic feature. The primary challenge is ensuring that generated content aligns with specific requirements without compromising fluency, coherence, or overall quality of the text.

Diffusion models are excellent for controllable generation. They generate data in a multi-step process, gradually refining noise into coherent content. This incremental approach allows for fine-grained control at various stages of the generation. By manipulating the process at specific steps, you can guide the output more effectively toward desired characteristics (such as sentiment, style, or tone). In contrast, auto-regressive models like GPT generate text token by token in a one-shot manner, making it harder to impose controls without affecting fluency.

DGLM could refine language model generation because it integrates the fluency of auto-regressive language models (like GPT) with the flexibility of diffusion models. This flexibility is realized by employing Plug-and-Play with Linear Classifiers in the Sentence-T5 latent space to guide the diffusion process towards generating proposals with desired attributes.

On the Diagram of Thought [Link]

Researchers from Tsinghua University, led by Andrew Chi-Chih Yao, introduced Diagram of Thought (DoT), designed to enhance the reasoning capabilities of LLMs.

The limitation of CoT is that it processes information in a straight line, which does not reflect how humans think. The limitation of ToT and GoT is that they are computationally expensive and challenging to implement within a single LLM.

DoT addresses these limitations by modeling reasoning as the construction of a directed acyclic graph (DAG) within a single LLM. This DAG comprises nodes representing: 1) Propositions: Initial and refined ideas generated throughout the reasoning process, 2) Critiques: Evaluations of propositions, identifying errors or inconsistencies. 3) Refinements: Improved propositions based on critiques. 4) Verifications: Confirmation of valid propositions.

Two crucial aspects of the DoT framework: 1) it leverages auto-regressive next-token prediction with role-specific tokens to manage the reasoning process within a single LLM, and 2) it has a strong foundation in mathematical logic (Topos Theory), ensuring logical consistency and soundness in the reasoning process.
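Purely as an illustration of the data structure (the paper realizes this implicitly inside one LLM via role-specific tokens, not with explicit objects), the DAG's node types could be sketched like this:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    role: str          # "proposition" | "critique" | "refinement" | "verification"
    text: str
    parents: list["Node"] = field(default_factory=list)  # edges point to what a node responds to

# A tiny reasoning trace as a DAG (contents are made up for illustration).
p1 = Node("proposition", "The answer is 42 because ...")
c1 = Node("critique", "Step 2 divides by zero.", parents=[p1])
r1 = Node("refinement", "Corrected derivation avoiding the division ...", parents=[c1])
v1 = Node("verification", "The refined proposition checks out.", parents=[r1])

# A topological order of the DAG mirrors the token sequence the LLM emits.
for node in (p1, c1, r1, v1):
    print(node.role, "->", node.text[:40])
```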

A Survey on the Honesty of Large Language Models [Link]

Dunning-Kruger effect is named after psychologists David Dunning and Justin Kruger, who first described this phenomenon in 1999. It reveals a troubling mismatch between perception and reality. Those who know little often lack the self-awareness to recognize their limitations. Conversely, experts may be acutely aware of the vastness of their field, leading them to undervalue their own expertise.

Spatial context non-uniformly modulates inter-laminar information flow in the primary visual cortex [Link]

Their research shows that when our field of vision is cluttered, it changes how efficiently our brain processes information, though the basic pattern of information transfer remains the same.

When you give a Claude a mouse - One Useful Thing [Link]

Some impressions of what an agent is capable of.

Agentic Information Retrieval [Link]

This research proposes an AI-agent approach to information retrieval (IR). Compared to traditional IR, agentic IR employs a unified, adaptive architecture where agents use observation, reasoning, and actions iteratively to reach the desired user information state. Key methods include prompt engineering, retrieval-augmented generation, multi-agent systems, and reinforcement fine-tuning (RFT).

A Comparative Study on Reasoning Patterns of OpenAI’s o1 Model [Link]

An evaluation study of o1’s reasoning patterns. The study identified six distinct reasoning patterns for o1—Systematic Analysis (SA), Method Reuse (MR), Divide and Conquer (DC), Self-Refinement (SR), Context Identification (CI), and Emphasizing Constraints (EC)—with DC and SR being the most common across tasks. Token count varied greatly across tasks, indicating that the o1 model adjusts reasoning depth based on task complexity.

Malla: Demystifying Real-world Large Language Model Integrated Malicious Services [Link]

A study of LLM-powered malicious services (‘Mallas’) in underground marketplaces. (When it comes to AI, there are gaps to bridge everywhere, lol.) Interesting mysteries to uncover:

  1. Who are the pivotal players within the Malla ecosystem?

    The Malla ecosystem comprises vendors who create malicious LLM services, users who exploit these services, and platforms that facilitate their operations.

  2. How is Malla orchestrated and monetized?

    Malla services generate revenue through direct user transactions, often accepting cryptocurrencies, with some vendors reporting substantial earnings.

  3. What techniques did miscreants deploy to exploit LLMs and build up Mallas?

    Miscreants utilize techniques like jailbreak prompts to bypass LLM restrictions and abuse public APIs to generate harmful content.

Were RNNs All We Needed? [Link] [Source]

The minimal versions (minLSTMs and minGRUs) are fully parallelizable during training and use fewer parameters. They are 175x faster to train than traditional LSTMs and GRUs. And their performance is equivalent to Transformers or Mamba with fewer training steps.
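For intuition, here is a sketch of the minGRU recurrence as I read it from the paper: the update gate and candidate state depend only on the current input, not on the previous hidden state, which is what makes a parallel-scan formulation possible. The sketch shows the sequential form for clarity (the paper trains with a log-space parallel scan instead of a Python loop), and the module names and sizes are mine, not the paper's.

```python
import torch
from torch import nn

class MinGRU(nn.Module):
    """Sketch of minGRU: z_t = sigmoid(W_z x_t), h~_t = W_h x_t,
    h_t = (1 - z_t) * h_{t-1} + z_t * h~_t. No reset gate, no tanh,
    and no dependence of the gates on h_{t-1}."""

    def __init__(self, d_in: int, d_hidden: int):
        super().__init__()
        self.to_z = nn.Linear(d_in, d_hidden)        # update gate from input only
        self.to_h_tilde = nn.Linear(d_in, d_hidden)  # candidate state from input only

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_in)
        b, t, _ = x.shape
        h = torch.zeros(b, self.to_z.out_features, device=x.device)
        outs = []
        for step in range(t):
            z = torch.sigmoid(self.to_z(x[:, step]))
            h_tilde = self.to_h_tilde(x[:, step])
            h = (1 - z) * h + z * h_tilde  # convex combination of old and new state
            outs.append(h)
        return torch.stack(outs, dim=1)

y = MinGRU(d_in=8, d_hidden=16)(torch.randn(4, 10, 8))
print(y.shape)  # torch.Size([4, 10, 16])
```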

The Perfect Blend: Redefining RLHF with Mixture of Judges [Link]

This work redefines RLHF with Constrained Generative Policy Optimization (CGPO), a novel RLHF framework that uses a Mixture of Judges (MoJ) with rule-based and LLM-based constraints to mitigate reward hacking. As a result, CGPO consistently outperforms PPO and DPO baselines across various benchmarks.

STATE OF AI REPORT 2024 [Link]

This annual publication examines trends in AI research, industry developments, and technological progress; this year it highlights a convergence in AI model performance.

MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering [Link] [Blog]

As the title describes, the authors introduce MLE-bench as a benchmark for measuring how well AI agents perform at machine learning engineering. The test tasks were created by curating 75 ML-engineering-related competitions from Kaggle. OpenAI’s o1-preview with AIDE scaffolding achieves at least a bronze medal in 16.7% of the competitions.

Github and Docs

Swarm - An educational framework exploring ergonomic, lightweight multi-agent orchestration [Link]

A little experimental, educational multi-agent framework from OpenAI.

Swarm’s Operational Framework:

  • Agent Definition: Create Agents with specific instructions, roles, and functions. Each function is automatically converted to a JSON structure for API compatibility.
  • Handoff Mechanism: Implement logic for agent transitions. Functions can return a new Agent object to transfer control based on conversation flow or predefined criteria.
  • Context Management: Utilize Context Variables to initialize and update shared information throughout the conversation, maintaining state across agent interactions.
  • Execution Loop: The client.run() function manages the multi-agent conversation. It takes an initial agent, user messages, and context as input, and returns a response with updated messages, context variables, and the last active agent.
  • This structure allows for flexible, dynamic multi-agent interactions while maintaining a stateless architecture between calls.
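A minimal usage sketch of that flow, assuming the Agent/Swarm interfaces shown in OpenAI's README (the agent names and context variable below are made up for illustration; an OPENAI_API_KEY is required):

```python
# pip install git+https://github.com/openai/swarm.git  (experimental package)
from swarm import Swarm, Agent

def transfer_to_refunds():
    """Handoff: returning an Agent object transfers control to that agent."""
    return refunds_agent

refunds_agent = Agent(
    name="Refunds Agent",
    instructions="Help the user process a refund.",
)

triage_agent = Agent(
    name="Triage Agent",
    instructions="Route the user to the right agent.",
    functions=[transfer_to_refunds],
)

client = Swarm()
response = client.run(
    agent=triage_agent,
    messages=[{"role": "user", "content": "I want a refund for my order."}],
    context_variables={"user_id": "hypothetical-123"},  # shared state across agents
)
print(response.agent.name)              # last active agent after the handoff
print(response.messages[-1]["content"]) # final assistant reply
```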

o1-engineer [Link]

A CLI tool for streamlining workflows with AI-powered code generation and editing.

Llama-stack [Link]

Meta unveils open-source Llama Stack, standardizing AI building blocks across the entire development lifecycle.

Leaked meta prompt [Link]

Leaked OpenAI meta prompt: optimizing GPT instructions for better results.


Auto Jobs Applier - AIHawk [Link]

Job search assistant.

RAGBuilder [Link]

A toolkit for automatically creating an optimal, production-ready RAG setup for your data.

Prompt caching (beta) - Anthropic [Link]

Prompt caching (beta) is a new API feature that optimizes large language model interactions. It caches and reuses consistent parts of prompts, reducing processing time and costs for repetitive tasks.

This technique is particularly useful for scenarios involving large contexts, multiple examples, or long conversations. Prompt Caching works by checking if a prompt prefix is already cached from a recent query and using it if found, otherwise processing and caching the full prompt.
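A hedged sketch of how this looks with the Anthropic Python SDK, assuming the cache_control block and beta header documented at launch (the header name and model string may have changed since; the file name is hypothetical):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
long_reference_text = open("prospectus.txt").read()  # hypothetical large, reusable context

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=512,
    system=[
        {"type": "text", "text": "You answer questions about the attached document."},
        {
            "type": "text",
            "text": long_reference_text,
            "cache_control": {"type": "ephemeral"},  # this prefix is cached and reused
        },
    ],
    messages=[{"role": "user", "content": "Summarize the key risk factors."}],
    extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"},  # beta opt-in at launch
)
print(response.content[0].text)
```

Repeating the call with the same cached prefix but a different user question is where the latency and cost savings show up.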

Lazy Predict [Link]

Works to rapidly test multiple ML models with minimal coding effort. This Python library streamlines model selection for classification and regression tasks.
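A minimal sketch of typical usage (the dataset and parameters are illustrative):

```python
# pip install lazypredict scikit-learn
from lazypredict.Supervised import LazyClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = LazyClassifier(verbose=0, ignore_warnings=True, predictions=False)
models, _ = clf.fit(X_train, X_test, y_train, y_test)  # fits many sklearn models at once
print(models.head())  # leaderboard DataFrame: accuracy, F1, fit time per model
```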

News

The Nobel Prize in Physics 2024 to John J. Hopfield and Geoffrey E. Hinton, for foundational discoveries and inventions that enable machine learning with artificial neural networks [Link]

Zuckerberg imagines that people will want to use AR glasses like Orion for two primary purposes: communicating with each other through digital information overlaid on the real world — which he calls “holograms” — and interacting with AI.

― Meta’s big tease - The Verge [Link]

One downside of Apple’s Vision Pro or Meta’s Quest 3 is that you lose sight of other people and they cannot see your eyes, which limits usage to situations at home where you don’t interact with others. However, any future device that is going to replace, or be comparable to, the mobile phone has to support socialization and networking.

Llama 3.2: Revolutionizing edge AI and vision with open, customizable models - Meta Blog [Link]

Big Tech has cozied up to nuclear energy - The Verge [Link]

Microsoft, Amazon, and Google are investing in nuclear energy to power their data centers.

Why Taiwan and Its Tech Industry Are Facing an Energy Crisis - Yale Environment 360 [Link]

Google’s share of the U.S. search ad market is expected to drop below 50% next year for the first time in over a decade, according to the research firm eMarketer.

Amazon is expected to have 22.3% of the market this year, with 17.6% growth, compared with Google’s 50.5% share and its 7.6% growth.

― Google’s Grip on Search Slips as TikTok and AI Startup Mount Challenge - The Wall Street Journal [Link]

Uber and Lyft drivers use Teslas as makeshift robotaxis, raising safety concerns - Reuters [Link]

Real world example:

Shorenstein Properties, a real-estate investment company based in San Francisco, is in a pilot program that is designed to lead to the automated tagging of all of its files using a RAG-based AI system. The goal is to eliminate many of the drawbacks in a time-consuming manual system, in which people might make errors or simply skip the process altogether. The company plans to put the tagging system into production in the next few months.

Files can also be organized quickly into “knowledge bases” and interrogated with AI, according to Egnyte, a cloud-based platform that companies use to access, share and manage business content.

Shorenstein in the past few weeks has started a proof of concept project using Egnyte to extract data from prospectuses on properties for sale, documents that can often run 60 pages, and organize it into reports that could help the company make efficient business decisions and improve processes.

― Companies Look Past Chatbots for AI Payoff - Steven Rosenbush at The Wall Street Journal [Link]

Beginning Friday, users of Meta’s AI chatbot feature in the U.S. will have access to real-time news and information from Reuters when they ask questions about news or current events.

It’s the first news deal Meta has brokered in the AI era.

― Scoop: Meta strikes multi-year AI deal with Reuters - AXIOS [Link]

An AI companion for everyone - Microsoft Blog [Link]

Releasing Copilot Voice, Copilot Daily, Personalization in Copilot, Copilot Vision, Think Deeper.

Anaconda Brings Generative AI Models to Desktops with Launch of AI Navigator - Anaconda Press [Link]

Meet the new Notion AI - Notion [Link]

Notion AI heavily integrates AI, introducing file-handling capabilities that enable people to extract insights from PDFs and images.

Customer data search, unification and retrieval for LLMs - Tilores [Link]

Identity RAG from Tilores improves accuracy and relevance of enterprise LLMs.

New autonomous agents scale your team like never before - Microsoft [Link]

We’re also introducing a groundbreaking new capability in public beta: computer use. Available today on the API, developers can direct Claude to use computers the way people do—by looking at a screen, moving a cursor, clicking buttons, and typing text. Claude 3.5 Sonnet is the first frontier AI model to offer computer use in public beta.

Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku [Link]

Perplexity Now Offers Powerful AI Search Right On Your macOS Desktop [Link]

Pushing the frontiers of audio generation - Google DeepMind [Link]

Gemini API and Google AI Studio now offer Grounding with Google Search - Google for Developers [Link]

What’s new: 1) reduces hallucination rates by sourcing verified, real-time information, 2) provides citation-supported answers, improving transparency, 3) allows threshold-based activation to control costs.

Technical details: Google’s Dynamic Retrieval system evaluates each query for grounding suitability, assigning a prediction score between 0 and 1. Factual or time-sensitive queries score higher (e.g., 0.97), while creative prompts score lower (e.g., 0.13).

New in Maps: Inspiration curated with Gemini, enhanced navigation and more - Google [Link]
