Substack

A Parquet file is composed of Row Groups, Column Chunks, and Pages.

Parquet is a self-describing file format: it carries all the information the consuming application needs, so software can understand and process the file efficiently without external information. The metadata is therefore a crucial part of Parquet. It includes the Magic Number, the FileMetadata, and the PageHeaders.
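
To make this concrete, here is a minimal sketch of inspecting that footer metadata with pyarrow (the file name example.parquet is a hypothetical placeholder):

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("example.parquet")   # hypothetical file
meta = pf.metadata                       # FileMetadata parsed from the footer

print(meta.num_rows, meta.num_row_groups, meta.num_columns)

rg = meta.row_group(0)                   # first row group
col = rg.column(0)                       # first column chunk in that row group
print(col.path_in_schema, col.compression, col.total_compressed_size)
```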

Google Dremel (the query engine behind BigQuery) inspired Parquet’s approach to storing nested and repeated fields. In its 2010 paper introducing Dremel, Google detailed a method for efficiently handling nested and repeated fields in analytics workloads using definition levels (for optional/nested fields) and repetition levels (for array-like fields). I wrote an article about this approach seven months ago; you can read it here:

― I spent 8 hours learning Parquet. Here’s what I discovered - Vu Trinh [Link]
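
As a quick refresher on the idea (a toy example of my own, not taken from the paper or the article): a repeated group containing a repeated field flattens into (value, repetition level, definition level) triples, which is what Parquet stores alongside the column values.

```python
# Schema sketch: "contacts" is a repeated group and "contacts.phones" is a repeated string,
# so the contacts.phones column has max repetition level 2 and max definition level 2.
record = {"contacts": [{"phones": ["555-1111", "555-2222"]}, {"phones": []}]}

contacts_phones_column = [
    ("555-1111", 0, 2),  # first value in the record (r=0), path fully defined (d=2)
    ("555-2222", 2, 2),  # a repeat at the phones level (r=2)
    (None,       1, 1),  # new contacts entry (r=1); phones is empty, so definition stops at 1
]
```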

The overall BigQuery architecture includes independent components for query execution, storage, a container management system, and a shuffler service:

  • Colossus: A distributed storage system that stores the data.
  • Dremel: The distributed query engine.
  • Borg: Google’s large-scale cluster management system, which reliably manages and orchestrates compute resources. (Borg is the predecessor of Kubernetes.) We will return to Borg when discussing the Vortex architecture.
  • Dedicated shuffle service: Because Dremel draws on the map-reduce paradigm, it needs to manage the data shuffle between stages efficiently, so Google built a separate shuffle service on top of disaggregated distributed memory. This service backs BigQuery and also supports other services, such as Google Dataflow.

― I spent 4 hours learning the architecture of BigQuery's storage engine - Vu Trinh [Link]

  1. Extract: The process’s first step is extraction. The needed data is gathered from various sources, such as relational databases or third-party APIs.
  2. Transform: Extracted data undergoes many potential transformations, including cleaning, filtering, combining from different sources, and formatting to conform to a target schema.
  3. Load: The transformed data is loaded into the destination with the predefined schema and constraints.

ELT solves many of the problems associated with ETL.

Most transformation logic can now be handled within the data warehouse using SQL, making it more accessible for users such as data analysts or data scientists. This eliminates the potential performance bottleneck of ETL pipelines.

Most importantly, ELT allows you to keep raw data in the warehouse. This approach offers several advantages. You don’t need to plan transformation logic in advance; instead, the logic can evolve over time based on analytical needs—an especially valuable benefit in today’s agile software development environment.

― ETL and ELT - Vu Trinh [Link]
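
A minimal sketch of the ELT pattern, using DuckDB as a stand-in warehouse (the file, table, and column names below are hypothetical): the raw data is landed as-is, and the transformation lives as SQL inside the warehouse, so it can evolve later without re-extracting anything.

```python
import duckdb

con = duckdb.connect("warehouse.duckdb")

# Load: land the extracted data untouched so it can be re-transformed later.
con.execute("""
    CREATE TABLE IF NOT EXISTS raw_orders AS
    SELECT * FROM read_csv_auto('orders.csv')
""")

# Transform: shape the data with SQL inside the warehouse, on demand.
con.execute("""
    CREATE OR REPLACE VIEW clean_orders AS
    SELECT order_id, CAST(amount AS DOUBLE) AS amount, order_date
    FROM raw_orders
    WHERE amount IS NOT NULL
""")
```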

Apache Airflow Overview - Vu Trinh [Link]

How did Airbnb build their semantic layer? - Vu Trinh [Link]

In a meltdown, discipline beats brilliance.

Over the years, I’ve found that having a simple rule-based system helps me stay grounded. Here are the 4 rules I follow to protect my portfolio:

  • I invest a fixed amount monthly — rain or shine.
  • I don’t add to losers — keeping them relatively small.
  • I don’t sell winners — staying the course and being patient.
  • I invest for at least 5 years — to give compounding time to work.

― Bear Market Survival Guide - App Economy Insights [Link]

Salesforce & AI Strategy - Generative Value [Link]

This article discusses the history of Salesforce, what made it successful, the state of the business, and the AI opportunity (or threat) today.

Everything Wrong with MCP - Shrivu's Substack [Link]

How to future-proof your career in the age of AI - Operator's Handbook [Link]

Key Takeaways:

The author’s call to "lean into human strengths while actively engaging with AI" is a compelling middle path. The essay underscores that the future belongs to those who combine AI literacy with irreplaceable human skills—judgment, influence, and adaptability.

Human Competitive Advantages:

  • Judgment & Conviction: Ability to make decisions with incomplete/ambiguous data. Distinguishing impactful work from "interesting but useless" projects. Simplifying complexity into actionable frameworks.
  • Influence & Execution: Navigating organizational politics and incentives. Building trust and adoption for AI-driven outputs. Understanding unspoken processes and relationships.

Actionable Skills to Cultivate:

  • Develop "taste" by studying excellence in your field.
  • Gain hands-on experience to pressure-test AI outputs.
  • Learn to align stakeholders and drive consensus.
  • Build strong interpersonal relationships and reputation.

Adaptability as the Ultimate Skill:

  • AI will keep evolving, so continuous learning and flexibility are critical.
  • Focus on areas where humans add unique value (judgment, influence, creativity).

This is a very interesting point: "Develop 'taste' by studying excellence in your field."

Just like any skill, taste sharpens with exposure and effort. The more you study, critique, and create, the better you’ll get at recognizing—and producing—excellence. In a world flooded with AI-generated content, the people who thrive will be those who can separate the remarkable from the mediocre.

Blogs and Articles

How Airbnb Standardized Metric Computation at Scale - Airbnb Blog [Link]

Digital hygiene - Andrej Karpathy [Link]

Good tips and tricks for digital hygiene, given the pervasive nature of internet fraud and the data collection practices of major tech companies.

Measuring AI Ability to Complete Long Tasks - METR [Link]

[Figure: metr-length-of-tasks-log]

The "think" tool: Enabling Claude to stop and think in complex tool use situations - Anthropic [Link]

Anthropic introduces a "think" tool designed to enhance Claude's complex problem-solving by providing a dedicated space for structured reasoning during tasks. This tool differs from extended thinking by allowing Claude to pause and consider necessary information mid-response, particularly beneficial for multi-step processes and tool use. Evaluations on benchmarks like τ-Bench demonstrated significant performance improvements, especially in policy-heavy domains like airline customer service, where optimized prompting alongside the "think" tool proved most effective.
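
For a sense of what this looks like in practice, here is a rough sketch of registering a "think"-style tool with the Anthropic Python SDK; the tool description, model name, and prompt below are my own illustrative assumptions, not quotes from Anthropic's post.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# A no-op tool: calling it changes nothing; it simply gives the model a place
# to record intermediate reasoning mid-response.
think_tool = {
    "name": "think",
    "description": (
        "Use this tool to think through a complex step. It does not fetch new "
        "information or change any state; it only records the reasoning."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "thought": {"type": "string", "description": "The reasoning to record."}
        },
        "required": ["thought"],
    },
}

response = client.messages.create(
    model="claude-3-7-sonnet-latest",  # assumed model name
    max_tokens=1024,
    tools=[think_tool],
    messages=[{"role": "user", "content": "Rebook my flight per the airline policy."}],
)
```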

Tiny Agents: a MCP-powered agent in 50 lines of code - HuggingFace [Link]

Anthropic CEO wants to open the black box of AI models by 2027 - Techcrunch [Link]

Powerful AI will shape humanity’s destiny, and we deserve to understand our own creations before they radically transform our economy, our lives, and our future.

― The Urgency of Interpretability - Dario Amodei [Link]

Interpretability isn’t just academic—it’s a prerequisite for safe, controllable AI. The window to solve it is narrowing as AI grows more powerful. By steering resources toward this goal now, we might avoid a future where humanity builds systems it doesn’t understand but can’t afford to stop.

The Jobs That Will Fall First As AI Takes Over The Workplace - Forbes [Link]

Takeaways:

  1. Timeline for Disruption:
    • By 2030: 30% of U.S. jobs could be automated (McKinsey).
    • By 2035: White-collar restructuring in finance, legal, and media (Larry Fink, Jamie Dimon).
    • By 2045: 50% of jobs may be fully automated (Goldman Sachs).
    • By 2050: AI could dominate 60-80% of jobs, depending on innovation pace.
  2. Most Vulnerable Jobs (Near-Term):
    • Administrative: Data entry, scheduling, customer service (60% automatable, per IPPR).
    • Finance & Legal: Bookkeeping, contract drafting, paralegal work (AI tools like Harvey already achieve 90% accuracy).
    • Creative & Media: Basic graphic design, copywriting, journalism (30% at risk by 2035, Pew Research).
    • Routine STEM Tasks: Coding, data analysis (40% automatable by 2040, WEF).
  3. More Resilient Jobs (Longer-Term):
    • Healthcare: Nursing, therapy, and patient care (empathy-driven roles).
    • Skilled Trades: Construction, repair, maintenance (physical labor is harder to automate).
    • Education & Leadership: Teaching, high-level management (requires emotional intelligence).

To protect your career:

  • Focus on critical thinking, creativity, and AI collaboration (e.g., prompt engineering, AI-augmented decision-making).
  • Target resilient sectors: healthcare, education, skilled trades, and AI-adjacent roles (e.g., cybersecurity, AI ethics).
  • Push for employer- or government-sponsored programs to transition into hybrid (human + AI) roles.
  • Embrace hybrid roles: jobs that combine technical skills with human judgment (e.g., AI-assisted healthcare diagnostics) will thrive.

As Ray Dalio warns, the economy faces a "great deleveraging" where AI disrupts jobs faster than new ones emerge. The key is adaptability—those who proactively reinvent their skills today will shape the workforce of tomorrow.

Curation is the new leadership superpower. Here are 3 ways to adopt a curation mindset - FastCompany [Link]

The most transformative leaders of the next decade will be those who master the art of curation—seeing their role as a conduit for the best ideas, not the source of them.

The Obsolescence of the "Omniscient Leader": The pace of change, hyper-specialization, and interconnected challenges (e.g., AI, climate, global markets) make it impossible for one person to have all the answers. Leaders must shift from being "the smartest in the room" to becoming "architects of collective intelligence."

Curation as the Core Leadership Skill:

  1. Curating Talent: Prioritize cognitive diversity over homogeneity. Example: Diverse teams solve problems faster (39% efficiency boost).
  2. Curating Ideas: Create systems where unconventional thinking flourishes (e.g., Google’s 20% time → Gmail, Maps). Actively seek "outliers" (contrarians, outsiders) to challenge groupthink.
  3. Curating Innovation: Design for "structured serendipity" (e.g., Pixar’s open office, IDEO’s cross-industry brainstorming). Embrace cross-disciplinary collisions (e.g., NASA’s tech inspiring sportswear, biomimicry in architecture).

How to Cultivate a Curation Mindset:

  • Facilitate, don’t dictate: Ask better questions; let solutions emerge from debate (e.g., Amazon’s "Disagree and Commit").
  • Optimize for collaboration, not just efficiency: Space matters (physical or virtual).

Perplexity CEO says its browser will track everything users do online to sell ‘hyper personalized’ ads - TechCrunch [Link]

Perplexity is building a browser (Comet) to track user behavior across the web—explicitly to fuel targeted advertising. It highlights the company’s ambition to emulate Google’s surveillance-capitalism playbook.

Perplexity’s move confirms that the AI search revolution is less about displacing Google’s model than replicating it—with AI as a smarter wrapper for the same ads.

Today’s Most Crucial Leadership Skill Is Systems Thinking - Forbes [Link]

Leaders who master systems thinking don’t just survive uncertainty—they thrive in it, turning complexity into competitive advantage.

Five Key Tools of Systems Thinking for Strategic Leaders

  1. Problem Statements: Move from surface-level fixes to systemic solutions. Example: Instead of asking, “How do we get customers to recycle?”, ask, “How can we redesign products and infrastructure for circularity?”
  2. Stakeholder Mapping: Identify all affected parties—not just obvious ones. Example: For electric vehicles, consider miners of critical minerals, urban planners, and regulators, not just automakers and buyers.
  3. Iceberg Analysis: Look beneath visible events to uncover hidden structures and mindsets. Example: Employee burnout isn’t just about workload—it’s shaped by corporate culture, incentive systems, and societal norms.
  4. Causal Loops: Visualize feedback loops to see how actions create ripple effects. Example: A cost-cutting measure in one department may increase inefficiencies elsewhere.
  5. Iteration & Testing: Embrace adaptive strategies, not rigid plans. Example: Pilot small-scale solutions, measure impact, and refine before full rollout.

Perplexity CEO shares the Elon Musk–inspired mantra that helped him build the $9 billion rival to OpenAI - Fortune [Link]

Srinivas’s journey highlights resilience, speed, and Silicon Valley’s tight-knit founder network as key drivers of startup success.

  1. "It’s Only Over When You Give Up" – Aravind Srinivas, CEO of AI search startup Perplexity, draws inspiration from Elon Musk’s perseverance during SpaceX’s early failures. He told Harvard students that success comes from relentless self-belief, even when others doubt you.
  2. Rocketing Valuation – Perplexity, competing with Google and OpenAI, grew from a $1 billion to a $9 billion valuation and is now in talks to raise funds at an $18 billion valuation.
  3. Forget Pitch Decks, Build Fast – Srinivas advises founders to focus on rapid product iteration rather than lengthy business plans. He admits he doesn’t even know how to make a pitch deck—Perplexity’s success came from live demos.
  4. OpenAI Alumni Network – Despite competing with OpenAI, Srinivas maintains a strong relationship with Sam Altman (his former boss at OpenAI). This mirrors the "PayPal Mafia" dynamic, where ex-OpenAI employees now lead major AI firms like Anthropic and Safe Superintelligence.

Marc Andreessen predicts one of the few jobs that may survive the rise of AI automation - Fortune [Link]

Andreessen’s logic suggests focusing on roles where trust, psychology, and networks matter more than data crunching. But don’t underestimate AI’s ability to creep into those domains too.

How To Get Noticed Without Self-Promotion By Using Strategic Visibility - Forbes [Link]

Core Lessons:

  1. Hard Work ≠ Visibility: Doing great work is necessary but insufficient. If leaders don’t know what you’re doing, they can’t reward it. Waiting for annual reviews is too late—visibility requires consistent, intentional updates.
  2. Humility Has a Hidden Cost: While modesty is admirable, staying silent can render you invisible. Gallup’s data on declining engagement (just 36% in 2020) highlights how disengagement hurts promotion prospects. Visibility isn’t ego-driven; it’s about ensuring your impact is recognized.
  3. Visibility ≠ Bragging: Framing contributions as useful knowledge (e.g., "Here’s how I solved X") builds trust and leadership credibility. Sharing wins, failures, and best practices helps the team and positions you as a problem-solver.
  4. Tactical Ways to Increase Visibility
    • Share knowledge: Lead "lessons learned" sessions or contribute to internal newsletters.
    • Mentor others: Their success reflects your leadership.
    • Speak up strategically: One substantive insight per meeting > empty chatter.
    • Volunteer for high-impact projects: Align with organizational priorities.
    • Write internally: Document best practices to showcase thought leadership.
  5. Emotional Intelligence (EQ) Matters More Than Extroversion
    • Visibility is about meaningful engagement, not being the loudest.
    • Avoid self-deprecating language ("I’m sorry, but…")—speak with conviction.
  6. What Leaders Actually Notice
    • Initiative, influence, and alignment with goals matter more than face-time.
    • Working smart (not just late) and collaborating effectively signal leadership potential.

YouTube and Podcast

DOGE updates + Liberation Day Tariff Reactions with Ben Shapiro and Antonio Gracias - All-In Podcast [Link]

2027 Intelligence Explosion: Month-by-Month Model — Scott Alexander & Daniel Kokotajlo - Dwarkesh Patel [Link]

Trump vs Harvard, Nvidia export controls, how DEI killed Hollywood with Tim Dillon - All-In Podcast [Link]

E187 | The Tariff War Can’t Solve America’s Manufacturing Woes; the Old Order Is Collapsing - 硅谷101播客 [Link]

How DeepSeek Rewrote the Transformer [MLA] - Welch Labs [Link]

A lecture explaining the architecture and optimizations behind DeepSeek's models, particularly multi-head latent attention (MLA), which improves Transformer efficiency.

Live Demo: Reinforcement Fine-Tuning for LLMs — Build Smarter Models with Less Data | Tutorial - Predibase [Link]

This video covers why reinforcement fine-tuning (RFT) beats supervised fine-tuning (SFT) on reasoning tasks, gives a live demo of an end-to-end RFT workflow, and walks through a PyTorch-to-Triton case study showing real-world impact.

Model Context Protocol (MCP), clearly explained (why it matters) - Greg Isenberg [Link]

Trump Rally or Bessent Put? Elon Back at Tesla, Google's Gemini Problem, China's Thorium Discovery - All-In Podcast [Link]

Suffering is mostly mental anguish and mental pain and it just means you don't want to do the task at hand.

The kind of fame that pure actors and celebrities have, I wouldn't want, but the kind of fame that's earned because you did something useful, why dodge that?

People will always want more status, but I think you can be satisfied at a certain level of wealth.

Not the kind of confidence that would say I have the answer but the kind of confidence that I will figure it out and I know what I want or only I am a good arbiter of what I want.

Pride is the enemy of learning, so when I look at my friends and colleagues, the ones who are still stuck in the past and have grown the least are the ones who were the proudest, because they sort of feel like they already had the answers and so they don't want to correct themselves publicly.

I think everybody puts themselves first; that's just human nature. You're here because you survived; you're a separate organism.

The happier you are, the more you can sustain doing something, the more likely you're going to do something that will in turn make you even happier, and you'll continue to do it, and you'll outwork everybody else. The more free you are the better you can allocate your time.

There are no problems in the real world other than maybe things that inflict pain on your body. Everything else has to become a problem in your mind first.

Your family is broken but you're going to fix the world. People are running out there to try and fix the world when their own lives are a mess.

I think the only true test of intelligence is if you get what you want out of life, and there are two parts to that: one is getting what you want, so you know how to get it, and the second is wanting the right things, knowing what to want in the first place.

Usually I think people end up there because they are going on autopilot with sort of societal expectations or other people's expectations or out of guilt or out of like mimetic desire.

Probably the biggest regret will be staying in the relationship after you knew it was over. You should have left sooner; the moment you knew it wasn't going to work out, you should have moved on.

We are naturally hardwired to be pessimists, but modern society is very different. Despite whatever problems you may have with modern society, it is far, far safer than living in the jungle and just trying to survive, and the opportunities are far greater.

Leave all those labels alone. It's better just to look at the problem at hand, look at reality the way it is, try to take yourself out of the equation in a sense.

The less you think about yourself the more you can think about a mission or about God or about a child or something like that.

I don't think there are any formulas. I think it's unique to each person. It's like asking a successful person how they became successful: each one of them will give you a different story. You can't follow anyone else's path.

A lot of change is more about desire and understanding than it is about forcing yourself or trying to domesticate yourself.

When your mind is under stress, it's because it has two conflicting desires at once... and anxiety I think is sort of this pervasive unidentifiable stress where you're just kind of stressed out all the time and you're not even sure why and you can't even identify the underlying problem. I think the reason for that is because you have so many unresolved problems unresolved stress points that have piled up in your life that you can no longer identify what the problems are.

Life is going to play out the way it's going to play out; there will be some good and some bad, and most of it is actually just up to your interpretation.

The gut is what decides; the head is kind of what rationalizes it afterwards. The gut is the ultimate decision maker.

You can't change other people, you can change your reaction to them.

If you do want to change someone's behavior, I think the only effective way to do it is to compliment them when they do something you want, not to insult them or be negative or critical when they do something you don't want.

If you can't decide, the answer is no.

Almost invariably the advice that you would give yourself 10 years ago is still the advice that you need to hear today.

On mental things, I think understanding is way more important. Once you see the truth of something, you cannot unsee it... when we really do see something clearly, it changes our behavior immediately, and that is far more efficient than trying to change your behavior through repetition.

Truth is often painful; if it wasn't, we'd all be seeing truth all the time. Reality is always reflecting truth, that's all it is, so why would you not have accessed it already?... Wisdom is the set of things that cannot be transmitted. If they could be transmitted, we'd read the same five philosophy books and we'd all be done, we'd all be wise. You have to learn it for yourself; it has to be rediscovered for yourself in your own context.

You're probably better off only caring about things that are local or things that you can affect. So if you really care about something that's in the news, then by all means care about it, but make a difference: go do something about it.

Desire is a contract to be unhappy until you get what you want.

The real currency of life is attention: it's what you choose to pay attention to and what you do about it.

― 44 Harsh Truths About Human Nature - Naval Ravikant (4K) - Chris Williamson [Link]

Key Learnings:

  • Someone who can do the job peacefully or happily is more effective than someone carrying unnecessary emotional turmoil.
  • Fame sought for its own sake is fragile and leads to a constant need to perform.
  • People often say things they don't really believe, driven by a desire to be seen as something they are not.
  • Status is zero-sum and insatiable, unlike wealth. Status is often comparative, like leaderboards, where one person's gain can be another's loss.
  • Self-esteem comes from aligning actions with internal values, especially when difficult. Genuine sacrifice, doing something you want less for something you value more, can build self-esteem.
  • True confidence is not having all the answers but the self-belief to figure things out.
  • Pride is an enemy of learning and can lead to being stuck in past mistakes.
  • Everyone puts themselves first; unapologetic self-prioritization is rare but perhaps more honest. Much of what appears as altruism might be a waste of time if it goes against one's true desires.
  • Happiness and freedom are intertwined with efficiency and productivity.
  • Many emotional problems arise from the mind creating problems where none exist in the real world. He advises observing one's thoughts objectively to realize unnecessary emotional energy expenditure.
  • People often try to fix the world while their own lives are in disarray. He questions the credibility of those who cannot manage their own lives but seek to solve global issues.
  • True intelligence is getting what you want out of life by wanting the right things and knowing how to get them.
  • Many people go through life unconsciously following societal or mimetic desires. He emphasizes the importance of thinking things through for oneself rather than blindly following others.
  • Staying too long in bad situations (relationships, jobs) is a common regret.
  • We are naturally hardwired for pessimism due to evolutionary pressures to avoid ruin.
  • Humans are dynamic and labels like optimist, pessimist, introvert, extrovert are self-limiting.
  • Overthinking about oneself can lead to misery; focusing on something bigger can bring happiness. Overthinking and rumination do not help with happiness.
  • There are no universal formulas for success or happiness; each person's path is unique.
  • Lasting change comes from desire and understanding, not forcing oneself. He suggests aligning actions with genuine wants for maximal effectiveness.
  • Anxiety often stems from having many unresolved and conflicting desires.
  • Our interpretations of experiences shape our reality. The same experience can lead to different emotional responses based on individual interpretation.
  • The "gut" is the ultimate decision-maker, representing refined judgment accumulated through evolution and experience. He advises trusting this instinct once it's developed.
  • You cannot change other people, only your reaction to them. He adds that people change through their own insights or trauma, not by being told to.
  • Negative reinforcement is less effective than positive reinforcement in changing behavior.
  • If faced with a difficult choice and unable to decide, the answer is often "no." He also suggests that when choosing between two equal options, take the more painful path in the short term.
  • Understanding is more important than discipline for mental change.
  • Truth, though often painful, is constantly reflected by reality; wisdom is the personal rediscovery and contextual application of timeless truths. He also mentions that many important life lessons are "unteachable" in the sense that they must be experienced firsthand to be truly understood.
  • Memorization is becoming less valuable in the age of readily available information; understanding, judgment, and taste are more crucial. He links understanding to solving real problems and finding generalizable truths.
  • Philosophy evolves with new knowledge and perspectives. He explains how advancements in science and technology lead to different philosophical outlooks, and even moral philosophy progresses over time.
  • Many philosophical paradoxes can be resolved by considering different scales and timeframes. Naval suggests that seemingly contradictory questions like free will and determinism can be understood by shifting perspectives.
  • Coordination is essential for societal function; pure libertarianism is unsustainable.
  • Modern AI, while powerful, currently lacks true creativity and deep understanding.
  • Meaning can be more important than moment-to-moment happiness.
  • In an age of news saturation, it's a battle to maintain focus on what truly matters and what one can influence. He emphasizes that attention is the real currency of life and should be spent consciously. Attention, not time or money, is the most fundamental resource in life.
  • Getting past one's past is a skill achieved by processing it to be rid of it, not to dwell on it.

I think agents are real, but I think that we are far away from that, because we're still at the phase of how do you build reliable software in production for an enterprise, versus the toy apps that you see on the internet, which is like "let me vibe-code something." I think these things are worlds apart still. - Chamath Palihapitiya

I think we have not yet figured out how to move the budgets from experimentation to mainline production. Meaning, where large chunks of the US economy are comfortable enough with the ways in which hallucinations are managed such that they will replace legacy deterministic code with this new probabilistic, model-generated code, meaning model-enabled code. - Chamath Palihapitiya

― Trump's First 100 Days, Tariffs Impact Trade, AI Agents, Amazon Backs Down - All-In Podcast [Link]

Papers and Reports

Orchestrating Agents and Data for Enterprise: A Blueprint Architecture for Compound AI [Link]

This paper contributes to the enterprise AI landscape by offering a comprehensive architectural blueprint for deploying agentic, modular, and data-integrated AI systems that can efficiently leverage LLMs and enterprise assets.

Github

Google Gemini 2.0 with MCP (Model Context Protocol) Servers - Gemini Samples [Link]

Maestro - A Framework for Claude Opus, GPT and local LLMs to Orchestrate Subagents - maestro [Link]

MCP-Agent [Link]

Local Deep Researcher [Link]

News

Accelerate Generalist Humanoid Robot Development with NVIDIA Isaac GR00T N1 - NVIDIA [Link]

Announcing the Agent2Agent Protocol (A2A) - Google for Developers [Link]

Key Takeaways:

  1. A2A is an open-source protocol backed by 50+ tech giants (e.g., Salesforce, SAP, Cohere) and consultancies (e.g., Accenture, Deloitte). It allows agents from different vendors/frameworks to communicate, share data, and coordinate tasks without being locked into a single platform.
  2. Solving Enterprise Pain Points: Breaks down silos by letting agents interoperate across HR (Workday), CRM (Salesforce), ERP (SAP), and other systems. Example: A hiring manager’s agent can autonomously source candidates, schedule interviews, and run background checks by collaborating with specialized agents.
  3. Design Principles:
    • Agent-Centric: Supports unstructured, multi-agent collaboration (beyond rigid "tool" roles).
    • Built on Standards: Uses HTTP, JSON-RPC, and SSE for easy integration.
    • Secure by Default: Enterprise-grade auth (aligned with OpenAPI).
    • Long-Running Tasks: Handles tasks lasting hours/days with real-time updates.
    • Multimodal: Supports text, audio, video, and UI negotiations (e.g., web forms, iframes).

The Chinese goods Americans most rely on, from microwaves to Barbies - Financial Times [Link]

Apple Vision Pro 2 Reportedly Cheaper & Lighter, Mac-Tethered Headset Coming Too - Upload [Link]

Others

Machine learning (ML) solutions applied to business problems across various industries:

  • ML and LLM system design: 500 case studies to learn from [Link]
  • Machine Learning and Data Science Applications in Industry - Firmai [Link]
  • Business Machine Learning - Firmai [Link]
  • ML System Design Case Studies Repository [Link]
  • 500+ Artificial Intelligence Project List with Code [Link]

Blogs and Articles

How to Build a Graph RAG App - Steve Hedden [Link]

A walkthrough of building a Graph RAG app that improves LLM accuracy using knowledge graphs. It covers data preparation, search refinement with MeSH terms, and article summarization.

We believe that, in 2025, we may see the first AI agents “join the workforce” and materially change the output of companies. We continue to believe that iteratively putting great tools in the hands of people leads to great, broadly-distributed outcomes.

― Reflections - Sam Altman [Link]

Structured Report Generation Blueprint with NVIDIA AI [Link] [YouTube]

Sky-T1: Train your own O1 preview model within $450 - NovaSky [Link]

Agents - Chip Huyen [Link]

This guide provides a comprehensive exploration of AI-powered agents, focusing on their capabilities, planning, tool selection, and failure modes. It delves into the factors determining an agent's performance, how LLMs can plan, and how to augment planning capabilities. It also provides insights into agent failures and how to evaluate them effectively.

The Batch Issue 284 - DeepLearning.AI - Andrew Ng [Link]

Andrew Ng highlights AI Product Management’s growth as software becomes cheaper to build.

DeepSeek V3 LLM NVIDIA H200 GPU Inference Benchmarking - DataCrunch [Link]

Global-batch load balance almost free lunch to improve your MoE LLM training - Qwen [Link]

MoE models struggle with expert underutilization due to micro-batch-level load balancing, which fails when data within a batch lacks diversity. This results in poor expert specialization and model performance.

The paper proposes global-batch load balancing, where expert selection frequencies are synchronized across all parallel groups, ensuring more effective domain specialization and improved performance.

Global-batch load balancing outperforms micro-batch balancing in all tested configurations. It shows improved performance and expert specialization, with models achieving better results across various data sizes and domains.
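
A minimal sketch of the core mechanism (PyTorch with torch.distributed; the function name is mine, and this omits the differentiable router-probability term that the full auxiliary loss combines with these frequencies):

```python
import torch
import torch.distributed as dist

def global_expert_frequencies(expert_indices: torch.Tensor, num_experts: int) -> torch.Tensor:
    """Expert selection frequencies over the global batch rather than the local micro-batch."""
    # Local counts: how often each expert was selected in this rank's micro-batch.
    counts = torch.bincount(expert_indices.flatten(), minlength=num_experts).float()
    # Synchronize across all data-parallel ranks so the balancing signal reflects
    # the whole (more diverse) global batch instead of one narrow micro-batch.
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(counts, op=dist.ReduceOp.SUM)
    return counts / counts.sum().clamp(min=1.0)
```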

How to Evaluate LLM Summarization - Isaac Tham [Link]

A quantitative, research-backed framework for evaluating LLM summaries, focusing on conciseness and coherence. This guide explores challenges in summarization evaluation, defines key quality metrics (conciseness, coherence), and improves the Summarization Metric in the DeepEval framework. Includes a GitHub notebook for applying these methods to assess summaries of long-form content systematically.

We just gave sight to smolagents - HuggingFace [Link]

This tutorial shows how to integrate vision capabilities into autonomous agents using smolagents. It explains passing images to agents in two ways: at initialization or dynamically via callbacks. It demonstrates building a web-browsing agent with vision using the MultiStepAgent class and helium. The agent performs actions like navigation, popup handling, and dynamic webpage analysis.

On DeepSeek and Export Controls - Dario Amodei [Link]

Highlighting export controls' impact on AI geopolitics.

Workflows and Agents - LangGraph [Link]

Review of common patterns for agentic systems.

Constitutional Classifiers: Defending against universal jailbreaks - Anthropic [Link] [YouTube]

Anthropic invites everyone to test its new safety classifier that eradicates jailbreaks and further increases Claude's over-refusal rate.

Open-source DeepResearch – Freeing our search agents - HuggingFace [Link]

Hugging Face challenges OpenAI’s Deep Research with an open-source alternative, beating previous SOTA by 9 points.

Choosing the Right AI Agent Framework: LangGraph vs CrewAI vs OpenAI Swarm - Yi Zhang [Link]

Compare LangGraph, CrewAI, and OpenAI Swarm frameworks for building agentic applications with hands-on examples. Understand when to use each framework, and get a preview of debugging and observability topics in Part II.

How to Scale Your Model - Google DeepMind [Link]

Learn how to scale LLMs on TPUs by understanding hardware limitations, parallelism, and efficient training techniques. Explore how to estimate training costs, memory needs, and optimize performance using strategies like data, tensor, pipeline, and expert parallelism. Gain hands-on experience with LLaMA-3, and learn to profile and debug your code.

Three Observations - Sam Altman [Link]

Sam outlines AI trends: AI’s scaling limits, cost reduction, and the future of autonomous agents.

How to deploy and fine-tune DeepSeek models on AWS - HuggingFace [Link]

Deploy and fine-tune DeepSeek-R1 models on AWS using Hugging Face with GPUs, SageMaker, and EC2 Neuron.

Building a Universal Assistant to connect with any API - Pranav Dhoolia [Link]

Convert any OpenAPI spec into an MCP-compatible API assistant without writing custom integration code. Use a generic MCP server to expose API endpoints dynamically. This approach simplifies integration, expands compatibility, and makes scaling API support more efficient.

From PDFs to Insights: Structured Outputs from PDFs with Gemini 2.0 - Philschmid [Link]

Learn to convert PDFs into structured JSON using Gemini 2.0. Set up the SDK, process files, manage tokens, and define JSON schemas with Pydantic. Covers real-world examples like invoices and forms, best practices, and cost management; it works within the free tier.

The Hidden Ways We Really Work Together - Microsoft [Link]

Managing LLM implementation projects - Piotr Jurowiec [Link]

Discover how to implement LLMs from initial planning to deployment. Establish project goals, select suitable architectures, preprocess data, train and evaluate models, optimize hyperparameters, and incorporate domain expertise. Tackle challenges such as hallucinations, security risks, regulatory compliance, and scalability limitations. Develop systematic workflows for building and managing LLM-based applications.

How to build a ChatGPT-Powered AI tool to learn technical things fast - AWS [Link]

What Problem Does The Model Context Protocol Solve? - AIhero [Link]

Learn how the Model Context Protocol (MCP) simplifies integrating large language models (LLMs) with external APIs.

MCP acts as a connector between LLMs and external data sources, facilitating interactions with tools without requiring LLMs to understand intricate APIs. By providing a standardized interface, it streamlines integrations with platforms like GitHub, enhancing workflow speed and efficiency.

Most AI value will come from broad automation, not from R&D - Epoch AI [Link]

Epoch AI's article argues against the popular notion that the primary economic benefit of artificial intelligence will stem from its application in research and development. Instead, the authors posit that AI's most significant value will arise from its widespread deployment in automating existing labor across various sectors.

Substack

Tencent: Betting Big on AI - App Economy Insights [Link]

[Figure: Tencent corporate overview]

Tencent's proprietary HunYuan framework has developed into a central AI platform catering to both consumers and enterprises. Originally centered on text and conversational AI, HunYuan has expanded to support multimodal capabilities, including image, video, and 3D generation, where it has attained top rankings in industry benchmarks.

[Figure: HunYuan thesis]

Tencent has a dual-core AI strategy: it combines its proprietary T1 model with external AI, such as DeepSeek’s R1, in a “double-core” approach. The Yuanbao chatbot utilizes both—T1 for deep reasoning and R1 for quick responses—while WeChat Search enhances accuracy by integrating T1 with DeepSeek.

[Figure: Tencent multi-model strategy]

Google: Biggest Deal Ever - App Economy Insights [Link]

Alphabet has agreed to its largest acquisition to date, a $32 billion deal for cloud security startup Wiz. If completed, this move could redefine GCP’s security portfolio, strengthening its stance as AI-driven cloud computing becomes the focal point.

[Figure: Google's biggest acquisition]

Papers and Reports

Whitepaper Agents - Authors: Julia Wiesinger, Patrick Marlow and Vladimir Vuskovic [Link]

Google’s whitepaper explains how AI agents use reasoning, tools, and external data to automate tasks, turning large language models (LLMs) into workflow automation systems. Google suggests using LangChain for prototyping and Vertex AI for scaling production-ready agents. Its framework provides a standardized approach to ensure reliable AI agent execution.

Key Components

  1. Decision Engine – The LLM plans and executes tasks using reasoning methods like ReAct or Chain-of-Thought.
  2. Tool Integration – Agents interact with APIs, databases, and real-time data.
  3. Orchestration Layer – Manages task execution and decision-making.

Tool Types

  1. Extensions – Directly call APIs for automation.
  2. Functions – Allow developers to control execution.
  3. Data Stores – Use retrieval-augmented generation (RAG) for external data access.

Use Cases

Agents handle tasks like personalized recommendations, workflow automation, and database queries. For example, they can fetch a user’s purchase history and generate tailored responses.

Introducing smolagents, a simple library to build agents - HuggingFace [Link]

Memory Layers at Scale - Meta [Link]

2 OLMo 2 Furious - Allen AI [Link]

HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs [Link]

This paper shows how to build domain-specific reasoning models using a two-stage training process. HuatuoGPT-o1, a medical LLM, enhances complex reasoning using this two-stage approach: (1) supervised fine-tuning (SFT) with complex Chain-of-Thought (CoT) and (2) reinforcement learning (RL) using a verifier to refine reasoning.

[Figure: HuatuoGPT-o1]

Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps - Google DeepMind [Link]

Google DeepMind introduces an inference-time search over sampling noise, improving diffusion model outputs beyond what scaling denoising steps alone achieves.

Chain of Agents: Large language models collaborating on long-context tasks - Google Research [Link] [Paper]

SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training - Google DeepMind [Link] [Link]

Explaining why reinforcement learning outperforms supervised fine-tuning for model generalization.

Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge [Link]

A new preference optimization algorithm for LLM-as-a-Judge models.

Hallucination Mitigation using Agentic AI Natural Language-Based Frameworks [Link]

Generative AI models often produce hallucinations, making them less reliable and reducing trust in AI systems. In this work, a multi-agent system is designed using over 300 prompts to induce hallucinations. AI agents at different levels review and refine outputs using distinct language models, structured JSON communication, and the OVON framework for seamless interaction. New KPIs are introduced to measure hallucination levels.

ELEGNT: Expressive and Functional Movement Design for Non-Anthropomorphic Robot - Apple [Link] [Link]

This is very cool.

π0 and π0-FAST: Vision-Language-Action Models for General Robot Control - Hugging Face [Link]

Hugging Face publishes the first open-source robotics foundation models for real-world applications.

Claude’s extended thinking - Anthropic [Link]

Claude 3.7 Sonnet introduces extended thinking, visible reasoning, and improved agentic capabilities for complex tasks.

Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers [Link]

This study evaluated the ability of LLMs to generate novel, expert-level research ideas compared to human experts by recruiting over 100 NLP researchers for idea generation and blind reviews. Results showed that LLM-generated ideas were rated as more novel than human ideas (p < 0.05) but slightly less feasible. While LLMs demonstrated promising ideation capabilities, challenges such as limited idea diversity and unreliable self-evaluation were identified, highlighting areas for improvement in developing effective research agents.

Transformers without Normalization [Link]

Yann LeCun and his team have proposed Dynamic Tanh (DyT) as an alternative to conventional normalization layers in deep learning models. This innovative method, leveraging the scaled tanh function, delivers performance on par with or superior to techniques like LayerNorm and RMSNorm. Notably, its ability to lower computational costs while preserving model efficiency makes it particularly compelling.
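
A minimal sketch of a DyT layer along the lines described in the paper (PyTorch; the 0.5 initialization for alpha is the commonly cited default, and the layer is applied over the last, channel, dimension):

```python
import torch
import torch.nn as nn

class DyT(nn.Module):
    """Dynamic Tanh: gamma * tanh(alpha * x) + beta, a drop-in for LayerNorm/RMSNorm."""

    def __init__(self, dim: int, init_alpha: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1) * init_alpha)  # learnable scalar
        self.gamma = nn.Parameter(torch.ones(dim))             # per-channel scale
        self.beta = nn.Parameter(torch.zeros(dim))             # per-channel shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.gamma * torch.tanh(self.alpha * x) + self.beta
```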

Energy [Link]

Most serious issues:

  • Aging and overburdened energy infrastructure is the most serious issue.
  • Energy demand is at its highest growth rate in 20 years. EV adoption and AI workloads are accelerating the strain on the grid.
  • There is increased frequency of extreme weather events causing outages.
  • The U.S. experienced twice as many weather-related power outages from 2014–2023 compared to 2000–2009.

Some key trends:

  • Energy demand is rising rapidly, especially due to data centers, AI, and EV adoption.
  • Extreme weather is causing more frequent and severe power outages.
  • The transition to renewables is accelerating, but grid interconnection delays are slowing progress.
  • Grid infrastructure is aging and requires massive investment, but funding gaps remain.
  • Transformer shortages and long lead times are hindering grid expansion and maintenance.
  • Cybersecurity threats and physical attacks on substations are emerging risks.

YouTube and Podcasts

Fixing the American Dream with Andrew Schulz - All-In Podcast [Link]

E179 | A Technical Breakdown of DeepSeek: Why Did It Trigger a Drop in Nvidia's Stock? - 硅谷101播客 [Link]

This Year in Uber’s AI-Driven Developer Productivity Revolution - Gradle [Link]

GraphRAG: The Marriage of Knowledge Graphs and RAG: Emil Eifrem - AI Engineer [Link]

Great intro to Graph RAG from Neo4j co-founder Emil Eifrem. Check out the Neo4j GenAI ecosystem.

Some articles mentioned: "Retrieval-Augmented Generation with Knowledge Graphs for Customer Service Question Answering" [Link], and "GraphRAG: Unlocking LLM discovery on narrative private data" [Link].

Building a fully local "deep researcher" with DeepSeek-R1 - LangChain [Link]

This tutorial reviews DeepSeek R1's training methods, explains downloading the model via Ollama, and demonstrates JSON-mode testing. Test its local "deep research" assistant, which performs web research and iterative summarization with reflection for improved results.

Building Effective Agents with LangGraph - LangChain [Link]

This video shows the difference between agents and workflows and when to use each. You'll implement patterns like prompt chaining, parallelization, and routing using LangGraph. The session covers building agents, applying advanced patterns, and understanding how LangGraph enhances automation and optimization in AI systems.

Nvidia's GTC 2025 Keynote: Everything Announced in 30 Minutes - Amrit Talks [Link]

How I use LLMs - Andrej Karpathy [Link]

White House BTS, Google buys Wiz, Treasury vs Fed, Space Rescue - All-In Podcast [Link]

Satya Nadella – Microsoft’s AGI Plan & Quantum Breakthrough - Dwarkesh Patel [Link]

Satya Nadella discusses AI, AGI skepticism, economic growth, quantum computing, AI pricing, gaming models, and legal challenges. Notes on the insights that impressed me:

  1. Nadella believes hyperscalers (like Microsoft Azure, AWS, and Google Cloud) will be major beneficiaries of AI advancements. The exponential growth in compute demand for AI workloads—both for training and inference—will drive massive infrastructure needs. Hyperscalers are well-positioned to meet this demand due to their ability to scale compute, storage, and AI accelerators efficiently.
  2. He argues that hyperscale infrastructure is not a winner-takes-all market. Enterprises and corporations prefer multiple suppliers to avoid dependency on a single vendor. This structural dynamic ensures competition and prevents monopolization.
  3. While there may be a few dominant closed-source AI models, Nadella predicts that open-source alternatives will act as a check, preventing any single entity from monopolizing the AI model space. He draws parallels to the coexistence of closed-source (e.g., Windows) and open-source systems in the past.
  4. He highlights that governments worldwide are unlikely to allow private companies to dominate AI entirely. Regulatory and state involvement will likely shape the landscape, further preventing a winner-takes-all scenario.
  5. In consumer markets, network effects can lead to winner-takes-all dynamics (e.g., ChatGPT's early success). However, in enterprise markets, multiple players will thrive across different categories.
  6. He disagrees with the notion that AI models or cloud infrastructure will become commoditized. At scale, the complexity of managing hyperscale infrastructure and the know-how required to optimize it create significant barriers to entry and sustain profitability.
  7. Microsoft aims to build a versatile hyperscale fleet capable of handling large training jobs, inference workloads, and specialized tasks like reinforcement learning (RL). The company focuses on distributed computing, global data center placement, and high utilization of resources to meet diverse AI demands.
  8. Nadella envisions a future where AI agents and specialized models will drive even greater compute demand. He emphasizes the importance of building infrastructure that can support both training and inference at scale, while also accommodating evolving AI research and development.
  9. Microsoft Research (MSR) has a history of investing in fundamental, curiosity-driven research, often with no immediate payoff. Nadella emphasizes the importance of maintaining this culture, even if the benefits may only materialize decades later. Nadella highlights the difficulty of transitioning from research breakthroughs to scalable products. The role of leadership is to ensure that innovations are not only technically sound but also commercially viable.
  10. Nadella envisions quantum computing being accessed via APIs, similar to how cloud services are used today. This could democratize access to quantum capabilities for research and industry.

Instrumenting & Evaluating LLMs - Hamel Husain [Link]

How to Get Your Data Ready for AI Agents (Docs, PDFs, Websites) - Dave Ebbelaar [Link]

The AI Cold War, Signalgate, CoreWeave IPO, Tariff Endgames, El Salvador Deportations - All-In Podcast [Link]

Github

a smol course - HuggingFace [Link]

A good course to help you align LLMs with specific use cases. It includes instruction tuning, preference alignment using DPO/ORPO, LoRA, prompt tuning, and multimodal model adaptation, and it covers creating synthetic datasets, evaluation, and efficient inference.

AI Hedge Fund [Link]

POC for educational purposes.

build-your-own-x [Link]

Interesting resources / tutorials for building technologies from scratch to deepen practical understanding.

Open R1 - HuggingFace [Link]

A fully open reproduction of DeepSeek-R1.

AI Tutor Using RAG, Vector DB, and Groq [Link]

GenAI Agents: Comprehensive Repository for Development and Implementation [Link]

DeepSeek Open Infra [Link]

Awesome MCP Servers [Link]

This repository provides access to a selection of curated Model Context Protocol (MCP) servers designed for seamless AI model-resource interaction. It features both production-ready and experimental servers, offering capabilities like file access, database connections, and API integrations. There are frameworks, tutorials, and practical tips to enhance model deployment and maximize resource efficiency in real-world applications.

News

Replit Integrates xAI [Link]

Expand your research with proprietary financial data - Perplexity [Link]

Crunchbase, FactSet, and DeepSeek R1 now power enterprise insights.

Google to acquire cloud security startup Wiz for $32 billion after deal fell apart last year - CNBC [Link]

Leave it to Manus - Manus [Link]

I just finished reading "Daring Greatly: How the Courage to Be Vulnerable Transforms the Way We Live, Love, Parent, and Lead" by Brené Brown. This is the second book of hers I have read. Her words feel like whispers from God.

What she says about wholeheartedness:

“Wholehearted living is about engaging in our lives from a place of worthiness. It means cultivating the courage, compassion, and connection to wake up in the morning and think, No matter what gets done and how much is left undone, I am enough. It's going to bed at night thinking, Yes, I am imperfect and vulnerable and sometimes afraid, but that doesn't change the truth that I am also brave and worthy of love and belonging.”

What she says about vulnerability:

“Vulnerability is based on mutuality and requires boundaries and trust. It's not oversharing, it's not purging, it's not indiscriminate disclosure, and it's not celebrity-style social media information dumps. Vulnerability is about sharing our feelings and our experiences with people who have earned the right to hear them. Being vulnerable and open is mutual and an integral part of the trust-building process.”

“If we're going to find our way out of shame and back to each other, vulnerability is the path and courage is the light. To set down those lists of what we're supposed to be is brave. To love ourselves and support each other in the process of becoming real is perhaps the greatest single act of daring greatly.”

What she says about perfectionism: (I love this part!)

“The problem was thankfully never fixed, and in time the box overflowed as more and more art piled up. I think the dilemma exists because art, among all the other tidy categories, most closely resembles what it is like to be human. To be alive. It is our nature to be imperfect. To have uncategorized feelings and emotions. To make or do things that don't sometimes necessarily make sense.

Art is all just perfectly imperfect.

My fixation with these words from Leonard Cohen's song "Anthem" comes from how much comfort and hope they give me as I put "enough" into practice: "There's a crack in everything. That's how the light gets in."”

What she says about oversharing:

“It's an important question, and the answer is that I don't tell stories or share vulnerabilities with the public until I've worked through them with the people I love. I have my own boundaries around what I share and what I don't share and I stay mindful of my intentions.

First, I only share stories or experiences that I've worked through and feel that I can share from solid ground. I don't share what I define as "intimate" stories, nor do I share stories that are fresh wounds.

Second, I follow the rule that I learned in my graduate social work training. Sharing yourself to teach or move a process forward can be healthy and effective, but disclosing information as a way to work through your personal stuff is inappropriate and unethical.

Last, I only share when I have no unmet needs that I'm trying to fill. I firmly believe that being vulnerable with a larger audience is only a good idea if the healing is tied to the sharing, not to the expectations I might have for the response I get.”

What she says about disengagement:

“Disengagement is the issue underlying the majority of problems I see in families, schools, communities, and organizations and it takes many forms, including the ones we discussed in the "Armory" chapter. We disengage to protect ourselves from vulnerability, shame, and feeling lost and without purpose. We also disengage when we feel like the people who are leading us—our boss, our teachers, our principal, our clergy, our parents, our politicians-aren't living up to their end of the social contract.”

“The gap starts here: We can't give people what we don't have. Who we are matters immeasurably more than what we know or who we want to be. The space between our practiced values (what we're actually doing, thinking, and feeling) and our aspirational values (what we want to do, think, and feel) is the value gap, or what I call "the disengagement divide." It's where we lose our employees, our clients, our students, our teachers, our congregations, and even our own children.”

What she says about vulnerabilities in Sales:

“My answer was no. And yes. In that scenario vulnerability is recognizing and owning that you don't know something; it's looking the customer in the eye and saying, "I don't know the answer to that, but I'll find out. I want to make sure you have the correct information." I explained that the unwillingness to engage with the vulnerability of not knowing often leads to making excuses, dodging the question, or-worst-case scenario-bullshitting. That's the deathblow in any relationship, and the one thing I've learned from talking to people who sell for a living is that sales is all about relationships.”

And her Daring Greatly Leadership Manifesto:

“To the CEOs and teachers. To the principals and the managers. To the politicians, community leaders, and decision makers:

We want to show up, we want to learn, and we want to inspire.

We are hardwired for connection, curiosity, and engagement.

We crave purpose, and we have a deep desire to create and contribute.

We want to take risks, embrace our vulnerabilities, and be courageous.

When learning and working are dehumanized, when you no longer see us and no longer encourage our daring, or when you only see what we produce or how we perform, we disengage and turn away from the very things that the world needs from us: our talent, our ideas, and our passion.

What we ask is that you engage with us, show up beside us, and learn from us.

Feedback is a function of respect; when you don't have honest conversations with us about our strengths and our opportunities for growth, we question our contributions and your commitment.

Above all else, we ask that you show up, let yourself be seen, and be courageous. Dare Greatly with us.”

Substack

The key aspect of managing up is to learn to speak the language of your counterpart. If you can speak their language, you can understand their goals and fears, and you can communicate at the level they're at. You'll be in a better position to be an effective report. - Umberto Nicoletti, Head of R&D at Proemion

The better we understand the goals that our managers have, the less surprising their actions will be. […] Some of the situations where managers act in ways that most dismay or surprise us are when they are acting on their fears and worries. - Joe Chippindale, CTO Coach

― Frameworks for Managing Up as a Software Engineer - High Growth Engineer [Link]

Building Trust:

  1. Sincerity — you are honest and transparent, even when it’s uncomfortable. This includes admitting mistakes early, being upfront with challenges, and sharing both good and bad news, without sugar-coating the latter.
  2. Reliability — this is about consistency and following through. You do what you say you'll do, you set realistic expectations, and communicate proactively through regular update habits. More on this later in the updates section.
  3. Care — you have their best interests in mind. This means understanding their goals and challenges, being proactive in helping them succeed, and showing empathy when things get tough.
  4. Competence — finally, you deliver results. This goes beyond technical skills: it's about delivering business value, learning and growing from feedback, and understanding the big picture.

Speaking their language:

  1. Map their context

    1. What makes you successful? — What are your goals and concerns?
    2. What makes me successful? — How can I help you reach your goals?

    The only way for you to be successful is to make your manager successful. To do that, you need to be able to map your goals and concerns into their own.

  2. Translate impact across altitudes

    For any item you report to your manager, the question you should ask yourself is: why should my manager care about this? And, more subtly: what exactly does my manager care about in this?

  3. Create explicit agreements

    1. Scope of ownership — do you know what decisions you can make autonomously vs when you need to involve your manager?
    2. Success criteria — how do you know if what you do is successful? Do you know how impact will be measured?
    3. Mutual expectations — do you know what your manager needs from you? And do they know what you need from them?

Creating effective updates

  1. Define your update stack

    1. Async messages (daily) — about significant progress or blockers.
    2. Written reports (weekly) — structured updates about key results and next steps.
    3. 1:1s (weekly or biweekly) — deeper conversations about growth, wellbeing, and strategy.
  2. Make every update count

    1. Why does this matter to my manager?
    2. What should they do with this information?
  3. Build a feedback loop

    Use 1:1s, retrospectives, and feedback moments to inspect your update process: what's working? What feels like noise? What critical information is missing?

JD Vance's AI Summit Paris Speech - Artificial Intelligence Survey [Link] [YouTube]

Here are some of JD Vance's main points regarding AI, on behalf of the Trump Administration:

  • Vance emphasizes AI's potential for revolutionary applications and economic innovation and advocates against being too risk-averse. This is the main stance of this optimistic speech - more of AI opportunity, less of AI safety.
  • He states the administration aims to ensure American AI technology remains the gold standard and the U.S. is the preferred partner for AI expansion. The U.S. wants to partner with other countries in the AI revolution with openness and collaboration, but this requires international regulatory regimes that foster creation rather than strangling it.
  • He expresses concern that excessive regulation could stifle the AI industry and supports a deregulatory approach. He mentions the development of an AI action plan that avoids overly precautionary regulatory regimes, while ensuring that all Americans benefit from the technology and its transformative potential. The administration is troubled by foreign governments tightening regulations on U.S. tech companies with international footprints. Vance states that preserving an open regulatory environment has encouraged American innovators to experiment.
  • He stresses that American AI should not be co-opted for authoritarian censorship and should be free from ideological bias.
  • He notes the importance of building the most powerful AI systems in the U.S. with American-designed and manufactured chips.
  • He believes AI should be a tool for job creation and making workers more productive, prosperous, and free. The administration will always center American workers in its AI policy and ensure that AI makes workers more productive. For all major AI policy decisions coming from the federal government, the Trump Administration will guarantee American workers a seat at the table.
[Image: JD Vance's tweets from the AI Action Summit]

Elon Musk Blocked a Bill to Stop Amazon from Helping Kids Kill Themselves - BIG by Matt Stoller [Link]

In December, Elon Musk pushed to slim down the government funding legislation, which led to the removal of several provisions. One provision removed due to Musk's intervention was the Youth Poisoning Prevention Act, which would have prevented consumers from buying concentrated sodium nitrite, a chemical often used in teenage suicides. This chemical, while used in low concentrations as a food preservative, is lethal in high concentrations and has no household uses.

The article highlights that Musk, who wields significant political power, can make harmful mistakes, sometimes unknowingly. The author notes that the removal of the provision was acknowledged as a mistake that could be fixed; yet despite bipartisan support for these priorities, no action has been taken to reinstate them. He questions whether anyone will address and rectify the issues that arise from actions taken by figures like Musk and Trump.

Deep Research, information vs. insight, and the nature of science - Interconnects [Link]

This is a very interesting point: the article considers how AI might challenge Thomas Kuhn's theories of scientific revolutions. Kuhn's The Structure of Scientific Revolutions describes how science evolves, with scientists forming paradigms around theories and using them to gain knowledge until limitations necessitate a new paradigm. Here's how AI might challenge Kuhn's theories:

  • AI is accelerating scientific progress, potentially faster than paradigms can be established. The fundamental unit of scientific progress is shrinking so quickly that it redefines how experiments are done.
  • Kuhn emphasizes that scientific knowledge is a process, not a set of fixed ideas. AI's emergence challenges this.
  • Kuhn suggests science is done by a community that slowly builds out the frontier of knowledge, rather than filling in a known space. The article questions how the dynamics of science will change with AI systems.
  • Kuhn states that to reject one paradigm requires the simultaneous substitution of another. The article implies that AI's rapid advancements may disrupt this process.

Check out this impressive list of stories they’ve broken since Trump took office:

When It Comes to Covering Musk's Government Takeover, WIRED Is Showing Everyone How It's Done - The Present Age [Link]

The Media Is Missing the Story: Elon Musk Is Staging a Coup - The Present Age [Link]

Mainstream media is refusing to tell the truth. WIRED deserves support.

For reference: “Character Limit: How Elon Musk Destroyed Twitter” by Kate Conger and Ryan Mac.

2025: the Year of Datacenter Mania - AI Supremacy [Link]

This is an overview of what's happening and going to happen around Data Center Construction, covering a wide range of areas.

AI Expansion and Energy Demand:

The AI race is intensifying, leading to significant capital expenditure by Big Tech and raising concerns about potential harmful consequences. AI data centers' power demands are rapidly increasing, with estimates of needing 10 gigawatts of additional capacity in 2025 alone.

Goldman Sachs Research projects a 165% increase in data center power demand by 2030. Global AI data center power demand could reach 68 GW by 2027 and 327 GW by 2030, compared with a total global data center capacity of 88 GW in 2022. Training a single AI model could require up to 1 GW in one location by 2028 and 8 GW by 2030.

Infrastructure and Logistical Challenges:

Power infrastructure delays are increasing wait times for grid connections, which can take four to seven years in key regions. Data centers face struggles with local and state permits, especially for backup generators and environmental impact assessments.

A lack of data center infrastructure in the U.S. could cause a shift of construction to other countries. Countries with greater compute access may gain economic and military advantages.

Environmental and Health Concerns:

There are growing concerns that the impact of data centers on human health is being overlooked, and one of President Biden's executive orders acknowledges that data centers are harmful to health.

The environmental cost of AI includes concerns about water consumption, air pollution, electronic waste, and critical materials, in addition to public health concerns around pollution.

Energy Solutions and the Nuclear Option:

To meet AI’s growing power needs, some experts advocate for nuclear energy as the most viable long-term solution. Nuclear energy produces no carbon emissions during operation and offers a reliable, constant energy supply. Tech giants like Microsoft and Google are recognizing nuclear energy’s potential, with Microsoft exploring small modular reactors (SMRs). The adoption of nuclear energy faces obstacles such as high upfront costs, regulatory hurdles, and public skepticism.

Global AI Race and Investments:

The EU is mobilizing $200 billion in AI investments, signifying a global race for AI leadership. The UAE is investing billions in AI data centers in France and is implicated, along with SoftBank and Oracle, in OpenAI's data center project in Abilene, Texas.

The Question of Sustainability:

AI's rapid expansion is testing the limits of power infrastructure, natural resources, and sustainability efforts. If AI continues to expand at its current rate, there is a risk of a gridlocked future limited by energy availability. The future of AI depends on sustainability and the willingness to sacrifice energy for intelligence.

Amazon: Outspending Everyone - App Economy Insights [Link]

Some highlights:

  1. Amazon is planning to invest over $100 billion in 2025, primarily in AI-related infrastructure. This is more than any other company, and a 20% increase from 2024.
  2. AWS revenue grew 19% Y/Y, and roughly half of the growth is attributed to AI. AWS has a 30% market share in cloud infrastructure. Amazon is focused on custom silicon (Trainium and Inferentia) to improve AI efficiency.
  3. The de minimis exemption, which allows imports under $800 to avoid US tariffs, gives companies like Shein and Temu a competitive edge. Amazon Haul was launched last year to compete directly with these companies. Should the de minimis loophole be eliminated, Amazon's superior logistics network could give it an advantage in fulfillment and reliability.
  4. Amazon Prime's multi-faceted membership is highly effective at reducing churn. A 2022 study by the National Research Group found that Prime has one of the lowest churn rates, second only to cloud storage and music streaming services. Amazon's detailed purchase data provides advertisers with a valuable advantage, enabling highly targeted CTV ads with industry-leading returns on ad spend (ROAS).

Uber’s Three-Pronged AV Strategy:

  1. Fleet partnerships: Uber isn’t building its own AVs. Instead, it partners with companies like Waymo, Motional, and Aurora, integrating their fleets into Uber’s network.
  2. Hybrid model: AVs can’t handle all trips—human drivers will fill gaps, handling extreme weather, complex routes, and peak hours for decades.
  3. Fleet infrastructure: Uber is investing in charging depots and fleet management to maximize AV asset utilization.

While Tesla is vertically integrated, its rideshare strategy may take a different path. If Tesla adopts an asset-light model, Tesla owners—not Tesla itself—would decide whether to list their AVs on Uber. If maximum utilization is the goal, Uber could be the logical choice.

When it comes to demand aggregation, Uber remains the undisputed leader—its network effects ensure that as long as it aggregates supply, demand will follow, and gross profit will scale.

While the rideshare market will become more fragmented, Uber could still be the biggest fish in a much larger pond. After all, Uber is already the Airbnb for cars.

Tesla has a massive opportunity once the pieces fall into place. But with auto sales under pressure and market share declining, it still faces a long road ahead before claiming the top spot in any market.

― Tesla vs. Uber: Collision Course? - App Economy Insights [Link]

Uber's business model is one of my favorite business models. Not only because it's asset-light, its network effect, etc, but also because it created millions of jobs.

Uber's AV strategy is designed to balance innovation with practicality, ensuring that the company remains competitive while minimizing risks and costs. By leveraging partnerships, maintaining a hybrid model, and investing in infrastructure, Uber is well-positioned to lead the transition to autonomous mobility.

While a partnership between Uber and Tesla is possible and could offer significant synergies, it is not guaranteed. The decision would depend on whether both companies can align their goals and overcome competitive tensions. If Tesla decides to prioritize its own ride-hailing network (Tesla Network), it may choose to compete rather than collaborate with Uber. However, if Tesla sees more value in leveraging Uber’s platform and customer base, a partnership could be a strategic move for both companies.

[Image: Uber investor presentation]

Microsoft: AI Efficiency Paradox - App Economy Insights [Link]

Google: Capex Arms Race - App Economy Insights [Link]

The End of Search, The Beginning of Research - One Useful Thing [Link]

Huang’s take: “We've really only tapped consumer AI and search and some amount of consumer generative AI, advertising, recommenders, kind of the early days of software. […] Future reasoning models can consume much more compute.”

DeepSeek-R1, he said, has “ignited global enthusiasm” and will push reasoning AI into even more compute-intensive applications.

― NVIDIA: AI's 3 Scaling Laws - App Economy Insights [Link]

Huang introduced a framework for AI’s evolving compute demands, outlining three scaling laws:

  • Pre-training scaling: Traditional model growth through data consumption, now enhanced by multimodal learning and reasoning-based data.
  • Post-training scaling: The fastest-growing compute demand, driven by reinforcement learning from human and AI feedback. This phase now exceeds pre-training in compute usage due to the generation of synthetic data.
  • Inference & reasoning scaling: The next major shift, where AI engages in complex reasoning (e.g., chain-of-thought, search). Inference already requires 100x more compute than early LLMs and could scale to millions of times more.

Jensen Huang outlined a three-layer AI transformation across industries:

  • Agentic AI (Enterprise AI): AI copilots and automation tools boosting productivity in sectors like automotive, finance, and healthcare.
  • Physical AI (AI for Machines): AI-driven training systems for robotics, warehouses, and autonomous vehicles.
  • Robotic AI (AI in the Real World): AI enabling real-world interaction and navigation, from self-driving cars to industrial robots.

Grab: The Uber Slayer - App Economy Insights [Link]

DeepSeek isn’t a threat—it’s validation. If AI inference costs are falling, Meta stands to benefit more than almost any other company. Instead of challenging its strategy, DeepSeek reinforces that heavy AI investments will pay off—not the other way around.

― Meta: DeepSeek Tailwinds - App Economy Insights [Link]

Elon Musk and spiky intelligence - Silver Bulletin [Link]

Interesting study on spiky intelligence, using Elon as a case study. Concepts highlights:

Spiky Intelligence: This refers to individuals who exhibit exceptional abilities in certain areas while being deficient in others. It contrasts with the idea of general intelligence (the "g factor"), where most cognitive abilities are positively correlated. Spiky intelligence is often seen in people who excel in abstract, analytical reasoning but may lack emotional intelligence, empathy, or practical judgment.

Berkson’s Paradox: This statistical phenomenon explains why successful individuals often appear to have significant weaknesses. In highly competitive fields, it’s rare to find people who excel in all dimensions, so success often goes to those with a few standout traits.

[Image: Berkson's paradox with selection effects]

YouTube and Podcasts

DOGE vs USAID, Crypto Framework, Google's $75B AI Spend, US Sovereign Wealth Fund, GLP-1s - All-In Podcast [Link]

DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters - Lex Fridman Podcast #459 [Link] [Transcript]

This is a very good one: a five-hour introduction and overview of the current AI landscape.

Those driver jobs weren't even there 10 years ago. Uber came along and created all these driver jobs. DoorDash created all these driver jobs. So what technology does—yes, technology destroys jobs—but it replaces them with opportunities that are even better. And then, either you can go capture that opportunity yourself, or an entrepreneur will come along and create something that allows you to capture those opportunities. AI is a productivity tool. It increases the productivity of a worker; it allows them to do more creative work and less repetitive work. As such, it makes them more valuable. Yes, there is some retraining involved, but not a lot. These are natural language computers—you can talk to them in plain English, and they talk back to you in plain English. But I think David is absolutely right. I think we will see job creation by AI that will be as fast or faster than job destruction. You saw this even with the internet. Like, YouTube came along—look at all these YouTube streamers and influencers. That didn’t used to be a job. New jobs—really, opportunities—because 'job' is the wrong word. 'Job' implies someone else has to give it to me, like they're handed out, as if it's a zero-sum game. Forget all that—it's opportunities. After COVID, look at how many people are making money by working from home in mysterious little ways on the internet that you can't even quite grasp. - Naval Ravikant

You know, as long as you remain adaptive and you keep learning, and you learn how to take advantage of these tools, you should do better. And if you wall yourself off from the technology and don't take advantage of it, that's when you put yourself at risk. - David Sacks

If you trained on the open web, your model should be open source. – Naval Ravikant.

To keep the conversation moving, let me segue a point that came up that was really important into tariffs. And the point is, even though the internet was open, the U.S. won a lot of the internet—a lot of U.S. companies won the internet. And they won that because we got there "the firstest with the mostest," as they say in the military. And that matters because a lot of technology businesses have scale economies and network effects underneath, even basic brand-based network effects. If you go back to the late '90s and early 2000s, very few people would have predicted that we would have ended up with Amazon basically owning all of e-commerce. You would have thought it would have been perfect competition and very spread out. And that applies to how we end up with Uber as basically one taxi service or how we end up with Airbnb. Meta—Airbnb—it's just network effects, network effects, network effects rule the world around me. But when it comes to tariffs and when it comes to trade, we act like network effects don't exist. The classic Ricardian comparative advantage dogma says that you should produce what you're best at, I produce what I'm best at, and we trade. And then, even if you want to charge me more for it—if you want to impose tariffs for me to ship to you—I should still keep tariffs down because I'm better off. You're just selling me stuff cheaply—great. Or if you want to subsidize your guys—great, you're selling me stuff cheaply. The problem is, that is not how most modern businesses work. Most modern businesses have network effects. As a simple thought experiment, suppose that we have two countries, right? I'm China, you're the U.S. I start out by subsidizing all of my companies and industries that have network effects. So I'll subsidize TikTok, I'll ban your social media but push mine. I will subsidize my semiconductors, which tend to have winner-take-all dynamics in certain categories. Or I'll subsidize my drones and then, exactly—BYD, self-driving, whatever. And then, when I win, I own the whole market and I can raise prices. And if you try to start up a competitor, it's too late—I've got network effects. Or if I've got scale economies, I can lower my price to zero, crash you out of business, no one in their right mind will invest, and then I'll raise prices right back up. So you have to understand that certain industries have hysteresis, or they have network effects, or they have economies of scale—and these are all the interesting ones. These are all the high-margin businesses. So in those, if somebody is subsidizing or they're raising tariffs against you to protect their industries and let them develop, you do have to do something. You can't just completely back down. - Naval Ravikant

I think Sam and his team would do better to leave the nonprofit part alone, leave an actual independent nonprofit board in charge, and then have a strong incentive plan and a strong fundraising plan for the investors and the employees. So I think this is workable. It's just that trying to grab it all seems way off, especially when it was built on open algorithms from Google, open data from the web, and nonprofit funding from Elon and others. - Naval Ravikant

― JD Vance's AI Speech, Techno-Optimists vs Doomers, Tariffs, AI Court Cases with Naval Ravikant - All-In Podcast [Link]

"AI won't take your job; it's someone using AI that will take your job." – Richard Baldwin. The discussion around AI's impact on jobs is often framed as a zero-sum game, but the reality is more nuanced. While AI will displace certain jobs (e.g., self-driving cars replacing drivers), it will also create new opportunities and industries that we can't yet fully envision. The key is adaptability—those who learn to use AI tools will thrive, while those who resist will fall behind.

The Stablecoin Future, Milei's Memecoin, DOGE for the DoD, Grok 3, Why Stripe Stays Private - All-In Podcast [Link]

How to build full-stack apps with OpenAI o1 pro - Part 1 - Mckay Wrigley [Link]

Learn app development using OpenAI o1-Pro with a structured six-prompt workflow.

Open Deep Research - LangChain [Link]

Build and run a deep research agent with LangGraph Studio, customize configurations, compare architectures, and analyze costs.

Paper and Reports

Probabilistic weather forecasting with machine learning [Link]

GenCast's success stems from its ability to generate ensembles of sharp, realistic weather trajectories and well-calibrated probability distributions.

The methodology of GenCast involves several key components:

  • GenCast employs a second-order Markov assumption, conditioning its predictions on the two previous weather states rather than just one, which the authors found works better in practice.
  • GenCast is implemented as a conditional diffusion model. Diffusion models are generative machine learning methods that can model the probability distribution of complex data and generate new samples by iteratively refining a noisy initial state. The model predicts a residual with respect to the most recent weather state. The sampling process begins with random noise, which is then refined over a series of steps.
  • At each step of the iterative refinement process, GenCast uses a denoiser neural network. This network is trained to remove noise that has been artificially added to atmospheric states. The architecture of the denoiser includes an encoder, a processor, and a decoder. The encoder maps the noisy target state to an internal representation on a refined icosahedral mesh, the processor is a graph transformer, and the decoder maps the internal mesh representation back to a denoised target state.
  • GenCast uses a noise distribution that respects the spherical geometry of global weather variables. Rather than using independent and identically distributed (i.i.d.) Gaussian noise on the latitude-longitude grid, it samples isotropic Gaussian white noise on the sphere and projects it onto the grid.
  • GenCast's performance is evaluated using various metrics (a minimal CRPS computation is sketched after this list), including:
    • CRPS (Continuous Ranked Probability Score): Measures the skill of a probabilistic forecast.
    • RMSE (Root Mean Squared Error): Measures how closely the mean of an ensemble of forecasts matches the ground truth.
    • Spread/Skill Ratios and Rank Histograms: Used to evaluate the calibration of the forecast distributions.
    • Brier Skill Score: Evaluates probabilistic forecasts of binary events, specifically the prediction of extreme weather events.
    • Relative Economic Value (REV): Characterizes the potential value of a forecast over a range of probability decision thresholds.
    • Spatially Pooled CRPS: Evaluates forecasts aggregated over circular spatial regions of varying sizes to assess the model's ability to capture spatial dependencies.
    • Regional Wind Power Forecasting: Evaluates the model's ability to predict wind power generation at wind farm locations using a standard idealized power curve.
    • Tropical Cyclone Track Prediction: Uses the TempestExtremes tropical cyclone tracker to extract cyclone trajectories from the forecast and analysis data. The model's ability to forecast cyclone tracks is evaluated using position error and track probability.
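
Since CRPS carries most of the weight in GenCast's headline comparisons, here is a minimal sketch of one common ensemble CRPS estimator (my illustration, not GenCast's evaluation code), applied per variable and per grid point; the first term rewards accuracy against the verifying observation, the second rewards ensemble spread:

import numpy as np

def crps_ensemble(members, obs):
    # members: ensemble forecasts of one variable at one grid point; obs: verifying value
    members = np.asarray(members, dtype=float)
    accuracy = np.mean(np.abs(members - obs))
    spread = 0.5 * np.mean(np.abs(members[:, None] - members[None, :]))
    return accuracy - spread

print(crps_ensemble([11.8, 12.4, 12.9, 13.1], obs=12.6))  # a sharp, well-centered ensemble scores low (good)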

The United States currently leads the world in data centers and AI compute, but unprecedented demand leaves the industry struggling to find the power capacity needed for rapidly building new data centers. Failure to address current bottlenecks may compel U.S. companies to relocate AI infrastructure abroad, potentially compromising the U.S. competitive advantage in compute and AI and increasing the risk of intellectual property theft.

― AI's Power Requirements Under Exponential Growth - RAND [Link] [pdf]

Genome modeling and design across all domains of life with Evo 2 - Arc Institute [Link]

Evo 2 is a powerful genome modeling and design tool that operates across all domains of life. It can:

  • Analyze and generate genetic sequences from molecular to genome scale.
  • Accurately assign likelihood scores to human disease variants, distinguishing between pathogenic and benign mutations in both coding and noncoding regions.
  • Predict whether genes are essential or nonessential using mutational likelihoods, helping in bacterial and phage gene essentiality studies.
  • Generate large-scale DNA sequences with structured features like tRNAs, promoters, and genes with intronic structures.
  • Provide zero-shot fitness predictions for protein and non-coding RNA sequences, correlating well with experimental fitness measurements.
  • Robustly predict the pathogenicity of various mutation types, achieving state-of-the-art performance for noncoding and splice variants.

Large Action Models: From Inception to Implementation [Link]

Microsoft Research published one of the most comprehensive papers in this area, outlining a complete framework for large action models (LAMs). The core idea is to bridge the gap between the language understanding capability of LLMs and the need for real-world action execution.

[Image: Microsoft's large action model (LAM) framework]

Towards an AI co-scientist [Link]

Uncovering the Impact of Chain-of-Thought Reasoning for Direct Preference Optimization: Lessons from Text-to-SQL [Link]

Direct Preference Optimization (DPO) does not consistently improve performance in the Text-to-SQL task and sometimes even degrades it. Existing supervised fine-tuning (SFT) methods are limited by the lack of high-quality training data, and prompting-based methods are expensive, slow, and raise data privacy concerns.

To solve the problems, they generate synthetic CoT solutions to improve training datasets, leading to more stable and significant performance improvements in DPO. They integrate execution-based feedback to refine the model’s SQL generation process, making the optimization process more reliable. And they create a quadruple-based preference dataset to help the model learn to distinguish between correct and incorrect SQL responses more effectively.
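
A minimal sketch of the execution-feedback idea (my own illustration; the paper's exact quadruple format is not reproduced here): run both candidate SQL programs against a reference SQLite database and keep a chosen/rejected pair, with its synthetic CoT, only when execution separates them.

import sqlite3

def run_sql(db_path, sql):
    # Return (sorted rows, error) so results can be compared order-insensitively.
    try:
        with sqlite3.connect(db_path) as conn:
            return sorted(conn.execute(sql).fetchall()), None
    except sqlite3.Error as err:
        return None, str(err)

def preference_record(question, cand_a, cand_b, gold_sql, db_path):
    # cand_a and cand_b are (chain_of_thought, sql) tuples sampled for the same question.
    gold_rows, _ = run_sql(db_path, gold_sql)
    a_ok = run_sql(db_path, cand_a[1])[0] == gold_rows
    b_ok = run_sql(db_path, cand_b[1])[0] == gold_rows
    if a_ok == b_ok:
        return None  # execution feedback gives no usable preference signal
    chosen, rejected = (cand_a, cand_b) if a_ok else (cand_b, cand_a)
    return {"prompt": question,
            "chosen": chosen[0] + "\n" + chosen[1],
            "rejected": rejected[0] + "\n" + rejected[1]}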

MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking [Link]

Google DeepMind developed an innovative approach - Myopic Optimization with Non-myopic Approval (MONA), to mitigate multi-step reward hacking. This MONA methodology is built on two key principles. The first is myopic optimization, where agents focus on maximizing rewards for immediate actions rather than planning multi-step strategies. This ensures that agents do not develop complex, unintelligible tactics. The second principle is non-myopic approval, where human overseers assess the agent's actions based on their expected long-term utility. These evaluations serve as the primary mechanism for guiding agents toward behavior aligned with human-defined objectives, without relying on direct feedback from outcomes.
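
A toy reading of that contrast (my illustration, not DeepMind's code): under MONA, the optimization target for a step drops the bootstrapped future return and instead adds the overseer's approval of that single action, so a step that quietly sets up a multi-step reward hack no longer looks attractive.

def standard_return(rewards, gamma=0.99):
    # Ordinary multi-step credit: reward obtained at later steps flows back to earlier actions.
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def mona_target(immediate_reward, overseer_approval):
    # MONA-style target: immediate reward plus the overseer's judgment of the action's
    # long-term usefulness; outcomes of later steps do not flow back.
    return immediate_reward + overseer_approval

print(standard_return([0.0, 0.0, 10.0]))          # ~9.8: a hack-enabling setup step looks great
print(mona_target(0.0, overseer_approval=-1.0))   # -1.0: the overseer disapproves of the setup step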

[Image: Google DeepMind's MONA approach]

The Ultra-Scale Playbook: Training LLMs on GPU Clusters - Hugging Face [Link]

This book from Hugging Face explains 5D parallelism, ZeRO, CUDA kernel optimizations, and compute-communication overlap in large-scale AI training. It breaks down scaling bottlenecks, PyTorch internals, and parallelism techniques like ZeRO-3, pipeline, sequence, and context parallelism.
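
To make the ZeRO part concrete, here is a back-of-envelope sketch (my numbers, not the playbook's) using the usual mixed-precision Adam accounting of about 16 bytes of model state per parameter, showing how each ZeRO stage partitions that state across the data-parallel group:

def per_gpu_gb(n_params_billion, dp_degree, zero_stage):
    params = 2.0 * n_params_billion   # fp16 weights: 2 bytes/param
    grads = 2.0 * n_params_billion    # fp16 gradients
    optim = 12.0 * n_params_billion   # fp32 master weights + Adam momentum + variance
    if zero_stage >= 1:
        optim /= dp_degree            # ZeRO-1 shards optimizer state
    if zero_stage >= 2:
        grads /= dp_degree            # ZeRO-2 also shards gradients
    if zero_stage >= 3:
        params /= dp_degree           # ZeRO-3 also shards the parameters themselves
    return params + grads + optim

print(per_gpu_gb(7, 64, zero_stage=0))  # ~112 GB of model state: a 7B model overflows one 80 GB GPU
print(per_gpu_gb(7, 64, zero_stage=3))  # ~1.75 GB of model state; activations come on top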

Articles and Blogs

The research found six distinct leadership styles, each springing from different components of emotional intelligence. The styles, taken individually, appear to have a direct and unique impact on the working atmosphere of a company, division, or team, and in turn, on its financial performance. And perhaps most important, the research indicates that leaders with the best results do not rely on only one leadership style; they use most of them in a given week—seamlessly and in different measure—depending on the business situation. Imagine the styles, then, as the array of clubs in a golf pro’s bag. Over the course of a game, the pro picks and chooses clubs based on the demands of the shot. Sometimes he has to ponder his selection, but usually it is automatic. The pro senses the challenge ahead, swiftly pulls out the right tool, and elegantly puts it to work. That’s how high-impact leaders operate, too.

Leaders who have mastered four or more—especially the authoritative, democratic, affiliative, and coaching styles—have the very best climate and business performance.

The leader can build a team with members who employ styles she lacks.

― Leadership That Gets Results - Harvard Business Review [Link]

WeatherNext - Google DeepMind [Link]

GenCast predicts weather and the risks of extreme conditions with state-of-the-art accuracy - Google DeepMind [Link]

Morgan Stanley stated that ASICs perform exceptionally well in certain specific application scenarios but depend heavily on the custom needs of particular clients; ASIC development costs are usually lower, but their system and software deployment costs may be much higher than those of commercially scalable GPUs, leading to a higher total cost of ownership. In addition, NVIDIA's CUDA ecosystem is very mature and widely used across global cloud computing services, and its market position remains as solid as ever.

― Morgan Stanley: ASICs are overheated, and NVIDIA's position is difficult to shake. - moomoo [Link]

NVIDIA possesses a robust competitive advantage in the AI chip market due to its mature ecosystem, continuous R&D investments, and strong technical capabilities.

  • NVIDIA's CUDA ecosystem is well-established, enabling clients to easily deploy and run various workloads. The maturity of this ecosystem means that customers may find it easier to use NVIDIA products compared to adapting software for ASICs or other alternatives.
  • NVIDIA has a leading position in the AI chip market, which is reinforced by its presence on every cloud platform across the globe. Investments within NVIDIA's ecosystem benefit from global dissemination, further solidifying its market dominance.
  • NVIDIA invests significantly in R&D. The company is expected to invest approximately \(\$16\) billion in R&D this year. This level of investment allows NVIDIA to maintain a 4-5 year development cycle and continuously introduce leading high-performance chips. Custom ASIC development budgets are typically smaller (less than \(\$1\) billion), giving NVIDIA an edge in innovation.
  • NVIDIA is difficult to surpass in providing high-end training capabilities. The company focuses on training multi-modal AGI models.

This means DeepResearch can identify cross-domain links or examples that might otherwise be overlooked, offering fresh perspectives. In professional settings, this can support more well-rounded decision-making – for example, a product manager can quickly gather insights from scientific research, market data, and consumer opinions in one place, rather than relying on multiple teams or lengthy research processes. It makes you multifaceted!

― #87: Why DeepResearch Should Be Your New Hire - Turing Post [Link]

Deep Research and Knowledge Value - Stratechery [Link]

OpenAI launched Deep Research in ChatGPT, which is an agentic capability that conducts multi-step research on the internet for complex tasks. It synthesizes knowledge in an economically valuable way but does not create new knowledge.

As demonstrated in the article, it can be useful for researching people and companies before conducting interviews. However, it can also produce reports that are completely wrong by missing major entities in an industry.

This is a good point - The Internet revealed that news was worthless in terms of economic value because the societal value does not translate to economic value. Deep Research reveals how much more could be known, but the increasing amount of "slop" makes it more difficult to find the right information. Information that matters and is not on the Internet has future economic value wrapped up in it.

Proprietary data is valuable, and AI tools like Deep Research make it more difficult to harvest alpha from reading financial filings. Prediction markets may become more important as AI increases the incentive to keep things secret.

To summarize the impact: Deep Research delivers good value, but it is limited by the quality of information on the Internet and the quality of the prompt. There is value in the search for and sifting of information, and this may be lost with on-demand reports. AI will replace knowledge work. Secrecy is a form of friction that imposes scarcity on valuable knowledge. And Deep Research is not yet good at understanding some things.

Massive Foundation Model for Biomolecular Sciences Now Available via NVIDIA BioNeMo - NVIDIA Blog [Link]

Grok 3: Intelligence, Performance & Price Analysis - Artificial Analysis [Link]

Grok resets the AI race - The Verge [Link]

Grok-3 (chocolate) is the first-ever model to break a 1400 score and is now #1 in Arena.

Grok 3 Beta — The Age of Reasoning Agents - Grok Blog [Link]

Motivated by unmet needs in the modern scientific discovery process and building on recent AI advances, including the ability to synthesize across complex subjects and to perform long-term planning and reasoning, we developed an AI co-scientist system. The AI co-scientist is a multi-agent AI system that is intended to function as a collaborative tool for scientists. Built on Gemini 2.0, AI co-scientist is designed to mirror the reasoning process underpinning the scientific method. Beyond standard literature review, summarization and “deep research” tools, the AI co-scientist system is intended to uncover new, original knowledge and to formulate demonstrably novel research hypotheses and proposals, building upon prior evidence and tailored to specific research objectives.

― Accelerating scientific breakthroughs with an AI co-scientist - Google Blog [Link]

[Image: AI co-scientist system components]

An Interview with Uber CEO Dara Khosrowshahi About Aggregation and Autonomy - Stratechery [Link]

OpenAI Deep Research Guide - DAIR.AI [Link]

This is a super helpful guide to Deep Research Tools.

It provides inspiring use cases, examples, and tips.

[Image: Deep Research use cases and examples]
[Image: Claude deep research use cases chart]

Studies on the brain affirm the benefits of Tom’s visualization technique: Imagining something in vivid detail can fire the same brain cells actually involved in doing that activity. The new brain circuitry appears to go through its paces, strengthening connections, even when we merely repeat the sequence in our minds. So to alleviate the fears associated with trying out riskier ways of leading, we should first visualize some likely scenarios. Doing so will make us feel less awkward when we actually put the new skills into practice.

― Primal Leadership: The Hidden Driver of Great Performance - Harvard Business Review [Link]

Imagine it, fake it, and make it.

Our research tells us that three conditions are essential to a group’s effectiveness: trust among members, a sense of group identity, and a sense of group efficacy.

― Building the Emotional Intelligence of Groups - Harvard Business Review [Link]

Team is so important to leaders.

Interrupt the ascent.

When people are continually promoted within their areas of expertise, they don’t have to stray far from their comfort zones, so they seldom need to ask for help, especially if they’re good problem solvers. Accordingly, they may become overly independent and fail to cultivate relationships with people who could be useful to them in the future. What’s more, they may rely on the authority that comes with rank rather than learning how to influence people. A command-and-control mentality may work in certain situations, particularly in lower to middle management, but it’s usually insufficient in more senior positions, when peer relationships are critical and success depends more on the ability to move hearts and minds than on the ability to develop business solutions.

― The Young and the Clueless - Harvard Business Review [Link]

Don't fall into the independence trap.

Accelerating scientific breakthroughs with an AI co-scientist - Google Research [Link]

News and Comments

Introducing deep research - OpenAI [Link]

Introducing Perplexity Deep Research - Perplexity [Link]

The trend is deep research.

Shopify Tells Employees to Just Say No to Meetings - Bloomberg [Link]

Who will control the future of AI? - The Washington Post [Link]

Sam promotes a U.S.-led strategy to ensure AI development aligns with democratic values and remains under the leadership of the U.S. and its allies.

This new architecture used to develop the Majorana 1 processor offers a clear path to fit a million qubits on a single chip that can fit in the palm of one’s hand.

Microsoft is now one of two companies to be invited to move to the final phase of DARPA’s Underexplored Systems for Utility-Scale Quantum Computing (US2QC) program – one of the programs that makes up DARPA’s larger Quantum Benchmarking Initiative – which aims to deliver the industry’s first utility-scale fault-tolerant quantum computer, or one whose computational value exceeds its costs.

― Microsoft’s Majorana 1 chip carves new path for quantum computing - Microsoft [Link]

In the near term, Google’s approach with superconducting qubits (like Willow) is more mature. This technology has already demonstrated impressive benchmarks and is backed by years of incremental improvements. Its error correction techniques, while still challenging, are well‑studied, and scaling up using transmon qubits is an area where significant progress has been made.

On the other hand, Microsoft’s topological approach with Majorana 1 aims to use a completely new type of qubit—one that is “protected by design” thanks to its topological nature. In theory, this means lower error rates and potentially a much more scalable architecture with fewer physical qubits needed per logical qubit. However, this method is still very experimental, and questions remain over whether true Majorana zero modes have been reliably created and controlled.

In summary, for near‑term practical applications, Google’s path appears to be the safer bet. But if Microsoft’s topological qubit platform can overcome its technical hurdles, it may ultimately provide a more efficient and scalable route to fault‑tolerant quantum computing.

Tencent's Weixin app, Baidu launch DeepSeek search testing - Reuters [Link]

OpenAI tries to ‘uncensor’ ChatGPT - Techcrunch [Link]

Elon Musk Ally Tells Staff ‘AI-First’ Is the Future of Key Government Agency - WIRED [Link]

Thomas Shedd, a former Tesla engineer and ally of Elon Musk, is implementing an "AI-first strategy" at the General Services Administration's Technology Transformation Services (TTS). Shedd envisions the agency operating like a software startup, automating tasks and centralizing federal data. This shift is causing concern among GSA staff, who report being thrown into unexpected meetings and facing potential workforce cuts. Shedd is promoting collaboration between TTS and DOGE (formerly the United States Digital Service), though specifics about the new AI-driven projects and data repository remain unclear. A cybersecurity expert expressed concern that automating government tasks is difficult and that the attempt is raising red flags. Employees also voiced concerns regarding working hours and potential job losses.

OpenAI o3-mini - OpenAI [Link]

Introducing deep research - OpenAI [Link]

Gemini 2.0 is now available to everyone - Google DeepMind [Link]

The all new le Chat: Your AI assistant for life and work - Mistral AI [Link]

My thoughts regarding the AI landscape at the current stage:

As open-source AI becomes more affordable, it is poised to become as ubiquitous and accessible as electricity—financially viable for everyone. The AI and AGI arms race, whether between nations, open- and closed-source models, or competing companies, is effectively over or should be over, and the outcome is clear. Compute power still remains essential, but semiconductor giants like NVIDIA should look beyond language model training and inference, shifting their focus to the next frontiers, such as robotics and world models. Now is the time for developers and startups to concentrate on the vertical integration of AI, where real economic value can be realized.

DeepSeek - Background

DeepSeek began as a research offshoot of High-Flyer—a hedge fund that had already amassed a large GPU inventory (reportedly 10,000 Nvidia A100s in 2021). Over time, this resource base appears to have grown, with estimates suggesting that—when you account for research, ablation experiments, and shared infrastructure with trading—the effective pool might be closer to 50,000 GPUs. This expansive compute power enables DeepSeek to run many experiments simultaneously and quickly iterate on new architectures.

By leveraging a shared infrastructure with its hedge fund operations, DeepSeek can reinvest profits from quant trading into AI research. This model of “doing more with less” not only challenges the notion that massive, multibillion-dollar compute expenditures are necessary to build world-class AI models but also has broader implications for the industry. It raises questions about the future economics of AI development and the potential for more cost-efficient, research-driven models to shift market dynamics, as seen by the notable impact on Nvidia’s stock and market sentiment.

Export Controls on GPUs to China

In essence, the U.S. government originally imposed limits on chips that exceed certain thresholds in both interconnect bandwidth and compute (FLOPs) to restrict China’s ability to train massive AI models. Early on, chips that combined high interconnect speeds with high FLOPs were off‐limits.

For example, the H100—one of Nvidia’s top GPUs—was deemed too powerful. In response, Nvidia developed the H800, which maintained the same floating point performance (FLOPs) as the H100 but had its interconnect bandwidth intentionally reduced to meet U.S. export criteria. However, when the government later decided to tighten controls further (targeting chips solely on FLOPs), even the H800 was banned. This led Nvidia to innovate once again with the H20, a chip that now offers full interconnect bandwidth (and even improved memory characteristics over the H100) but with a deliberate cut in overall FLOPs to satisfy export rules.

The strategic rationale behind these controls is to “decap” China’s compute—especially for large-scale AI training—by limiting how many of the most advanced GPUs (and thus the overall density of compute) can be legally acquired. While Chinese companies can still purchase GPUs to train models, the overall capacity available for training (which is critical for developing super-powerful AI) is being capped. This is seen as a way to maintain U.S. and allied leadership in AI, particularly in a world where super-powerful AI may soon offer decisive military and economic advantages.

Sidenote - GPUs for AI

Key GPU Specifications:

  • FLOPS (Compute Power): Critical for training large models (e.g., GPT-4) but less critical for inference tasks like reasoning.
  • Memory Bandwidth/Capacity: Determines how much data (e.g., KV cache in transformers) can be stored and accessed quickly, crucial for long-sequence tasks.
  • Interconnect Speed: Affects communication between GPUs in clusters, important for distributed training but less regulated now.

H20 vs. H100: Tradeoffs for AI Workloads:

  • H20 (China-Specific): its strength is higher memory bandwidth and capacity than the H100, making it better suited for reasoning tasks (e.g., long-context inference, chain-of-thought). However, its FLOPS (≈1/3 of the H100 on paper, ≈50-60% in practice) are reduced, limiting its utility for training.

    Regulatory Context: Designed to comply with U.S. export controls that focus on FLOPS, allowing Nvidia to ship 1M units to China in 2023 (20-25% of total GPUs).

  • H100: Optimized for FLOPS-heavy training but less efficient for memory-bound inference tasks

Why Memory Matters for Reasoning:

  • KV Cache in Transformers stores the keys/values of all tokens in a sequence for the attention mechanism. The cache grows linearly with sequence length while attention compute grows quadratically, so long sequences (e.g., 10K+ tokens in reasoning tasks) put heavy pressure on memory (a back-of-envelope sizing follows this list).
  • Autoregressive Generation: Output tokens require sequential processing, forcing repeated KV cache access. This limits parallelism and increases memory pressure. Tasks like agentic AI or chain-of-thought involve generating long outputs (10K+ tokens), stressing memory bandwidth/capacity.
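
To make the memory argument concrete, here is a back-of-envelope sizing of a standard multi-head/GQA KV cache (illustrative shapes, roughly Llama-2-70B-like; they are not DeepSeek's, whose MLA compresses this cache further):

def kv_cache_gb(seq_len, n_layers=80, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    # 2 tensors (K and V) per layer, each [n_kv_heads, head_dim] per token, stored in fp16.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem / 1e9

print(kv_cache_gb(4_000))    # ~1.3 GB for one 4K-token sequence
print(kv_cache_gb(128_000))  # ~42 GB for one 128K-token sequence: bandwidth and capacity, not FLOPs, bind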

DeepSeek - Technical Comments

Paper: DeepSeek-V3 Technical Report

GPU Infrastructure with Nvidia Hardware

  • DeepSeek trains on Nvidia GPUs. These are equipped with many cores (organized into streaming multiprocessors, or SMs) that perform the heavy lifting during both training and inference.
  • The GPUs they used were those legally available in China, which imposed certain limitations—especially on interconnect bandwidth between units. This meant that DeepSeek needed to overcome hardware constraints that might not be present with the very latest high-end GPUs elsewhere.

Custom Low-Level Optimization

  • Instead of relying solely on Nvidia's standard NCCL (NVIDIA Collective Communications Library) for handling inter-GPU communications, DeepSeek's engineers developed custom scheduling techniques. They even scheduled communications at the SM level, which is more granular than the typical approach.
  • Their implementation involved programming approaches that went deep into the hardware—down to using PTX (an intermediate assembly-like language for CUDA). This allowed them to squeeze extra efficiency from each GPU by reducing the overhead in communication between layers of the model.

Efficiency via Architectural Choices

  • One of the key innovations was using a sparse Mixture of Experts (MoE) architecture. With a model that can have hundreds of billions of parameters overall but only activates a fraction (e.g., around 37 billion at a time), the compute and memory demands are dramatically reduced. This architectural choice means that even if the hardware isn't the absolute latest, it can still be very cost-effective by not needing to run every parameter for every token. (A toy routing sketch follows this list.)
  • DeepSeek's novel attention mechanism MLA (Multi-Head Latent Attention) reduces memory usage by 80–90% compared to traditional transformer attention. This optimization lowers computational costs, especially for long-context processing, without sacrificing performance.
  • By optimizing both the hardware usage (through custom scheduling and low-level programming) and the model architecture (via MoE and MLA), DeepSeek manages to cut down on the cost of training. This is crucial given the significant compute expense associated with large-scale language models.
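
As referenced above, here is a toy sketch of sparse MoE routing (my illustration, not DeepSeek's implementation): a router scores the experts for each token and only the top-k run, so the active parameter count per token is a small fraction of the total.

import numpy as np

def moe_layer(x, router_w, experts, k=2):
    # x: [d_model]; router_w: [d_model, n_experts]; experts: list of [d_model, d_model] matrices
    scores = x @ router_w
    top = np.argsort(scores)[-k:]                       # indices of the k highest-scoring experts
    gates = np.exp(scores[top]) / np.exp(scores[top]).sum()
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top))

rng = np.random.default_rng(0)
d_model, n_experts = 16, 8
y = moe_layer(rng.normal(size=d_model),
              rng.normal(size=(d_model, n_experts)),
              [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)])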

Pre-Training and Context Window Extension

  • Pre-trained on 14.8 trillion tokens drawn from a multilingual corpus (primarily English and Chinese) with a higher proportion of math and programming content compared to previous iterations.
  • Utilizes a two-phase extension (via the YaRN framework) to expand the context length from 4K tokens to 32K and finally to 128K tokens.
  • Reported training cost for V3 is approximately $5.58 million, consuming about 2.788 million GPU-hours on Nvidia H800 GPUs. This figure is significantly lower than the hundreds of millions typically reported by US rivals.

Post-Training: Supervised Fine-Tuning & Reinforcement Learning

  • V3 is fine-tuned on a carefully curated dataset of approximately 1.5 million examples (both reasoning and non-reasoning tasks) to improve instruction-following and output formatting.
  • DeepSeek employs GRPO—a group relative policy optimization method—to reward outputs based on correctness (accuracy rewards) and presentation (format rewards). (A minimal sketch of the group-relative advantage follows this list.)
  • R1 leverages RL to fine-tune the reasoning process, rewarding chain-of-thought quality and encouraging the model to generate self-reflective “aha moments.”
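
As referenced above, a minimal sketch of the group-relative advantage at the core of GRPO (my illustration): sample several completions per prompt, score each with rule-based rewards (accuracy plus format), and normalize within the group instead of relying on a learned value model.

def grpo_advantages(rewards):
    # rewards: scores for a group of completions sampled from the same prompt
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5 or 1.0
    return [(r - mean) / std for r in rewards]

print(grpo_advantages([1.0, 0.0, 0.0, 1.5]))  # correct, well-formatted completions get positive advantage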

Speed-to-Market and Safety Tradeoffs

  • DeepSeek prioritizes rapid deployment over extensive safety testing, avoiding delays and costs associated with ethical reviews (common in Western firms like Anthropic). This "ship-first" approach reduces development cycle expenses.

  • Releasing model weights publicly attracts third-party hosting and innovation, indirectly expanding reach without bearing full infrastructure costs.

The Tech and Business Perspective

The release of DeepSeek-R1 marks a pivotal moment in the AI industry, igniting discussions about open-source dominance, market disruption, and geopolitical implications.

Industry Leaders Weigh In:

Yann LeCun (Meta’s Chief AI Scientist)

LeCun emphasized the growing power of open-source models over proprietary approaches:

"To people who see the performance of DeepSeek and think China is surpassing the US in AI. You are reading this wrong. The correct reading is: Open source models are surpassing proprietary ones."

Andrej Karpathy (OpenAI Co-founder)

Karpathy pointed out the continued need for large-scale computing while praising DeepSeek’s efficiency:

"Does this mean you don't need large GPU clusters for frontier LLMs? No, but you have to ensure that you're not wasteful with what you have, and this looks like a nice demonstration that there's still a lot to get through with both data and algorithms."

Satya Nadella (Microsoft CEO)

Nadella underscored the significance of DeepSeek, highlighting its role in making AI reasoning more accessible:

"We should take the developments out of China very, very seriously." "DeepSeek has had some real innovations. … Obviously, now all that gets commoditized." "When token prices fall, inference computing prices fall, that means people can consume more, and there will be more apps written."

Mark Zuckerberg (Meta CEO)

Zuckerberg acknowledged DeepSeek's novel infrastructure optimizations:

"DeepSeek had a few pretty novel infrastructure optimization advances, which, fortunately, they published them, so we can not only observe what they did, but we can read about it and implement it, so that'll benefit us." "Always interesting when there's someone who does something better than you. Let's make sure we are on it."

Aravind Srinivas (Perplexity AI CEO)

Srinivas stressed the importance of foundational innovation:

"We need to build, not just wrap existing AI."

Marc Andreessen (Andreessen Horowitz Co-founder)

He likened DeepSeek-R1 to a historic milestone:

"DeepSeek R1 is AI's Sputnik moment."

Tim Cook (Apple CEO)

Cook gave a measured response during an earnings call:

"In general, I think innovation that drives efficiency is a good thing."

Academic and Research Perspectives

AI Researchers on DeepSeek-R1:

Timnit Gebru (AI Ethics Researcher)

Gebru reflected on past AI development priorities:

"At Google, I asked why they were fixated on building THE LARGEST model. Why are you going for size? What function are you trying to achieve? They responded by firing me."

Ethan Mollick (Wharton AI Professor)

Mollick focused on accessibility rather than capabilities:

"DeepSeek is a really good model, but it is not generally a better model than o1 or Claude. But since it is both free and getting a ton of attention, I think a lot of people who were using free 'mini' models are being exposed to what an early 2025 reasoner AI can do and are surprised."

Andrew Ng (AI Researcher and Entrepreneur)

Ng saw the market reaction as an opportunity for developers:

"Today's 'DeepSeek selloff' in the stock market—attributed to DeepSeek V3/R1 disrupting the tech ecosystem—is another sign that the application layer is a great place to be. The foundation model layer being hyper-competitive is great for people building applications."

Global Academic Community Response:

Huan Sun from Ohio State University noted that DeepSeek's affordability is expanding LLM adoption in research. Cong Lu from the University of British Columbia highlighted R1’s rapid adoption, surpassing 3 million downloads on Hugging Face in a week. Meanwhile, safety concerns emerged as studies revealed R1 is 11 times more likely to generate harmful content compared to OpenAI models, prompting calls for better safeguards.

Impact Discussion

Market and Industry Impact

The release of DeepSeek-R1 caused massive shifts in financial markets. U.S. tech stocks collectively lost \(\$1\) trillion, with Nvidia suffering record losses due to the rising competition from this cost-efficient model. Investors are recalibrating AI development strategies as DeepSeek achieved comparable performance to OpenAI’s models at just \(\$6\) million versus OpenAI’s \(\$100\) million.

Integration into Cloud Ecosystems

AWS and Microsoft Azure have incorporated DeepSeek-R1, enabling developers to explore its capabilities securely and cost-effectively. The emergence of cost-effective models like DeepSeek R1 is forcing a shift in AI economics, emphasizing efficiency over massive capital investments. As a result, competition in the AI sector is intensifying, ushering in a “warring states era” where companies are scrambling for innovation in cost-effective models.

Geopolitical and National Security Implications

The success of DeepSeek R1 has intensified concerns that the U.S. is losing its technological edge to China. Policymakers are reassessing export controls on advanced chips in light of DeepSeek's ability to innovate using restricted hardware. Security concerns have also prompted the U.S. Navy to ban the use of DeepSeek R1 due to potential security and ethical risks, fueling debates over the implications of adopting foreign-developed AI systems.

Open-Source vs Proprietary Models

DeepSeek R1 is accelerating the democratization of AI by lowering barriers for smaller developers and researchers, fostering innovation. However, transparency concerns remain as DeepSeek has not disclosed its training data, raising ethical and bias-related questions.

Ethical and Technical Questions

Concerns have emerged regarding potential censorship, as some versions of DeepSeek R1 appear to align with Chinese narratives. Additionally, skepticism exists over whether DeepSeek’s reported costs and capabilities are fully accurate, with some experts questioning the factors that contributed to its success.

Public Sentiment and the Future of AI

Public reaction to DeepSeek-R1 has been mixed. Some view this as a “Sputnik moment,” encouraging U.S. firms to accelerate AI innovation while leveraging open-source models to stay competitive. Others see it as a wake-up call, with President Donald Trump urging U.S. industries to adapt quickly to maintain leadership in AI development.

Persistence in LangGraph

Persistence is a cornerstone for building robust, production-grade applications. LangGraph introduces a game-changing feature that ensures application states are stored and retrievable at any point, which redefines reliability and scalability in workflow management. This capability is especially vital when executing workflows involving interruptions, user inputs, or debugging. Whether you're building a simple app or an enterprise-grade system, persistence ensures your application is always ready to handle interruptions and user interactions gracefully.

The "Persisting Agent Stage" enables seamless workflows, especially in user-facing applications. Here’s why this feature is critical:

  1. Human-in-the-Loop Workflows: Many applications rely on user input to make decisions or advance processes. With persistence, LangGraph allows the graph execution to pause, checkpoint the state into persistent storage, and resume later. This means the application can wait for user input and continue without losing context.
  2. Debugging and History: Persistence creates a robust mechanism for saving the application state after every step. This makes debugging easier and enables the creation of detailed execution histories.
  3. Support for Multi-Session Scenarios: Applications often require users to switch between sessions while maintaining their progress. Persistence ensures continuity by saving states into persistent storage.

At the heart of this feature is the CheckPointer object, a persistence layer implemented by LangGraph. Here’s how it works:

  • Integration with Databases: The CheckPointer can save states into various database types, including:

    • Document databases: Firestore, MongoDB
    • Relational databases: PostgreSQL, SQLite, MySQL
    • Graph databases: Neo4j, AWS Neptune

    For example, the following section will focus on persisting states into an SQLite database, a popular choice for local environments. The process can also be extended to managed cloud databases like Google Cloud SQL or AWS RDS.

  • State Management: As each node in the graph executes, the CheckPointer saves the updated state into the database. This ensures that states are recoverable after interruptions, enabling the graph to resume execution from exactly where it left off.

To implement persistence, follow these simple steps:

  1. Import the CheckPointer object from LangGraph.
  2. Create an instance of CheckPointer and configure it with a connection string (local or cloud-based database).
  3. Pass the CheckPointer instance to your graph during creation. LangGraph will handle state persistence automatically after each node execution.
from langgraph.checkpoint.sqlite import SqliteSaver

memory = SqliteSaver.from_conn_string("checkpoints.sqlite")
graph = workflow.compile(checkpointer=memory)

The result is that you can pause the graph, fetch user input, and continue execution seamlessly, all while ensuring states are securely stored in your chosen database.

MemorySaver + Interrupts = Human In The Loop

Human-in-the-loop systems are essential to modern applications, allowing seamless integration of human feedback into automated workflows. With the help of the MemorySaver feature, you can build applications using LangGraph that pause, capture user input, and resume execution effortlessly.

In workflows involving human interaction, there are moments where the application needs to pause, gather feedback from the user, and then continue processing. For instance, consider a sequence of tasks where:

  1. A process executes its initial steps.
  2. The system pauses to collect human input.
  3. The workflow resumes, incorporating the user’s feedback.

This type of flow requires interrupts to halt the execution and persistence to save the current state of the workflow. LangGraph provides the tools to manage both seamlessly.

Implementation

To illustrate, let’s build a straightforward graph with the following steps:

  1. Start with a simple initial node.
  2. Execute a task and pause for human feedback.
  3. Resume execution with the updated state and complete the workflow.

We use LangGraph's MemorySaver, a checkpointing tool that saves the workflow’s state in memory after each node’s execution. This ephemeral storage method is perfect for local testing and prototyping. Here’s a simplified version of the setup:

from dotenv import load_dotenv

load_dotenv()
from typing import TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.memory import MemorySaver


class State(TypedDict):
    input: str
    user_feedback: str


def step_1(state: State) -> None:
    print("---Step 1---")


def human_feedback(state: State) -> None:
    print("---human_feedback---")


def step_3(state: State) -> None:
    print("---Step 3---")


builder = StateGraph(State)
builder.add_node("step_1", step_1)
builder.add_node("human_feedback", human_feedback)
builder.add_node("step_3", step_3)
builder.add_edge(START, "step_1")
builder.add_edge("step_1", "human_feedback")
builder.add_edge("human_feedback", "step_3")
builder.add_edge("step_3", END)


memory = MemorySaver()

# Interrupt before the human_feedback node so a person can supply input.
graph = builder.compile(checkpointer=memory, interrupt_before=["human_feedback"])

graph.get_graph().draw_mermaid_png(output_file_path="graph.png")

The graph visualization, generated with Mermaid.ink, is shown here:

hitl-graph

MemorySaver Implementations

Integrating human feedback into automated systems is a growing trend in AI development. It bridges the gap between machine automation and human judgment, enabling better decision-making, improved accuracy, and adaptability. In this section, we explore how to incorporate human-in-the-loop functionality into a graph-based system while leveraging memory storage to track execution states. This walkthrough showcases the process from initialization to final execution.

from dotenv import load_dotenv

load_dotenv()
from typing import TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.memory import MemorySaver


class State(TypedDict):
    input: str
    user_feedback: str


def step_1(state: State) -> None:
    print("### Step 1 ###")


def human_feedback(state: State) -> None:
    print("### Human Feedback ###")


def step_3(state: State) -> None:
    print("### Step 3 ###")


builder = StateGraph(State)
builder.add_node("step_1", step_1)
builder.add_node("human_feedback", human_feedback)
builder.add_node("step_3", step_3)
builder.add_edge(START, "step_1")
builder.add_edge("step_1", "human_feedback")
builder.add_edge("human_feedback", "step_3")
builder.add_edge("step_3", END)


memory = MemorySaver()

graph = builder.compile(checkpointer=memory, interrupt_before=["human_feedback"])

graph.get_graph().draw_mermaid_png(output_file_path="graph.png")

if __name__ == "__main__":
    thread = {"configurable": {"thread_id": "1"}}

    initial_input = {"input": "hello world"}

    # Run until the interrupt placed before the human_feedback node.
    for event in graph.stream(initial_input, thread, stream_mode="values"):
        print(event)

    print(graph.get_state(thread).next)

    user_input = input("How do you want to update the state? ")

    # Write the feedback into the checkpoint as if the human_feedback node produced it.
    graph.update_state(thread, {"user_feedback": user_input}, as_node="human_feedback")

    print("### State after update ###")
    print(graph.get_state(thread))

    print(graph.get_state(thread).next)

    # Resume from the saved checkpoint; passing None continues the interrupted run.
    for event in graph.stream(None, thread, stream_mode="values"):
        print(event)

The graph’s execution is tied to a thread variable, a dictionary initialized with a thread_id. This serves as a session or conversation identifier, distinguishing different graph runs. For simplicity, the thread_id is set to "1", though a more robust implementation would use a UUID. The graph processes events using graph.stream(), which accepts the initial input and thread details. Events are streamed with stream_mode="values", and each event is printed for transparency.
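
As a small aside, the UUID-based variant mentioned above could look like the following minimal sketch; uuid4 is just one reasonable choice of identifier:

import uuid

# One thread per user session; the UUID keeps separate runs from colliding.
thread = {"configurable": {"thread_id": str(uuid.uuid4())}}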

During execution:

  • Input is processed.
  • Node executions are logged.
  • Interruptions allow for dynamic human input.

Running the graph in debug mode provides insights into:

  • Memory storage (memory.storage) containing nested objects that log the graph state.
  • Transition logs for each node, showing updates or lack thereof.

At an interrupt, human feedback is solicited using Python's built-in input() function. This input updates the state dynamically. Once human input is integrated, the graph resumes execution. Subsequent steps process the updated state, leading to the graph’s completion.
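
If you prefer to inspect those recorded states programmatically rather than by poking at memory.storage in the debugger, a short sketch using the compiled graph's state-history API (reusing the graph and thread from the example above) might look like this:

# Walk the checkpoints recorded for this thread and show each state and its next node.
for snapshot in graph.get_state_history(thread):
    print(snapshot.values, "-> next:", snapshot.next)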

SqliteSaver

Switching from an ephemeral memory-based state saver to a persistent database saver can significantly enhance the durability and traceability of your graph’s execution. In this section, we’ll explore how to replace the in-memory MemorySaver with a SqliteSaver for long-term storage and easy debugging.

The MemorySaver is transient, meaning all state information vanishes after the program stops. By using an SQLite database, you can:

  • Persist graph states across runs.
  • Debug and troubleshoot using a structured database.
  • Resume executions exactly where they were interrupted.
import sqlite3

from dotenv import load_dotenv
from langgraph.checkpoint.sqlite import SqliteSaver

load_dotenv()
from typing import TypedDict
from langgraph.graph import StateGraph, START, END


class State(TypedDict):
    input: str
    user_feedback: str


def step_1(state: State) -> None:
    print("### Step 1 ###")


def human_feedback(state: State) -> None:
    print("### Human Feedback ###")


def step_3(state: State) -> None:
    print("### Step 3 ###")


builder = StateGraph(State)
builder.add_node("step_1", step_1)
builder.add_node("human_feedback", human_feedback)
builder.add_node("step_3", step_3)
builder.add_edge(START, "step_1")
builder.add_edge("step_1", "human_feedback")
builder.add_edge("human_feedback", "step_3")
builder.add_edge("step_3", END)

# check_same_thread=False lets the same connection be used when execution
# is paused and resumed from different threads.
conn = sqlite3.connect("checkpoints.sqlite", check_same_thread=False)
memory = SqliteSaver(conn)
graph = builder.compile(checkpointer=memory, interrupt_before=["human_feedback"])

graph.get_graph().draw_mermaid_png(output_file_path="graph.png")

if __name__ == "__main__":
    thread = {"configurable": {"thread_id": "1"}}

    initial_input = {"input": "hello world"}

    for event in graph.stream(initial_input, thread, stream_mode="values"):
        print(event)

    print(graph.get_state(thread).next)

    user_input = input("How do you want to update the state: ")

    graph.update_state(thread, {"user_feedback": user_input}, as_node="human_feedback")

    print("### State after update ###")
    print(graph.get_state(thread))

    print(graph.get_state(thread).next)

    for event in graph.stream(None, thread, stream_mode="values"):
        print(event)

We start by importing the required modules and initializing a connection to the SQLite database. The check_same_thread=False flag allows the connection to be shared across threads, which is necessary when execution is stopped and restarted from different threads. We then create an instance of SqliteSaver and pass it the SQLite connection. This saver integrates seamlessly with the graph execution pipeline, persisting states to the SQLite database.

  1. Initial Execution: Run the graph with the SqliteSaver. After execution, you’ll see a new file, checkpoints.sqlite, created in your project directory.
  2. Inspect the Database: Use your IDE’s database tools (e.g. SQLite3 Editor for VS Code) to load and inspect the checkpoints.sqlite file. You’ll find a table storing graph states, similar to what you’d see with MemorySaver, but now it’s persistent.
screenshot_sqlite_ide

Changing the thread_id allows you to simulate a new session while retaining access to previous runs. When resuming, the graph starts from the last recorded state. You can verify this by inspecting the database entries for the new thread_id.
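
For a quick look at what was persisted without leaving Python, you can also open the database file directly. This is a rough sketch: the checkpoints table and its thread_id column are assumptions based on the current SqliteSaver schema, which can vary between langgraph releases:

import sqlite3

conn = sqlite3.connect("checkpoints.sqlite")
# List the tables the checkpointer created.
print(conn.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall())
# Peek at which threads have recorded checkpoints (assumes a 'checkpoints' table).
print(conn.execute("SELECT DISTINCT thread_id FROM checkpoints").fetchall())
conn.close()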

For enhanced traceability, integrate LangSmith for tracking and debugging. LangSmith provides detailed insights, including thread metadata and execution traces.

I'm documenting a personal workflow for setting up the development of AI (Agent) applications.

1. Create a Structure

📦advanced-rag-app
┣ 📂graph
┃ ┣ 📂chains
┃ ┃ ┣ 📂tests
┃ ┃ ┃ ┣ 📜__init__.py
┃ ┃ ┃ ┗ 📜test_chains.py
┃ ┃ ┗ 📜__init__.py
┃ ┣ 📂nodes
┃ ┃ ┗ 📜__init__.py
┃ ┣ 📜__init__.py
┃ ┣ 📜consts.py
┃ ┣ 📜graph.py
┃ ┗ 📜state.py
┣ 📜Pipfile
┗ 📜Pipfile.lock

Suppose this is a Python project for an advanced RAG application. Below is an explanation of the purpose of each component.

📂graph: This directory represents the core components of your RAG application's data flow or computational graph.

  1. 📂chains: Contains logic for chains, which might define the sequences or workflows in your RAG pipeline.
    • tests/: Contains unit tests to validate the behavior of the chains.
      • test_chains.py: Test file for the chain-related logic.
    • __init__.py: Makes the chains directory a Python module.
  2. 📂nodes: Represents individual components or steps in the graph. Nodes could be processing units, like transformers or embeddings, used within the graph.
    • __init__.py: Initializes the nodes module.
  3. consts.py: A place for defining constants (e.g., default values, thresholds, or configuration keys) used across the graph module.
  4. graph.py: Contains the main implementation of the graph structure, potentially orchestrating the flow of nodes and chains.
  5. state.py: Likely manages the state of the application or graph, such as caching intermediate results or tracking the flow through the graph (see the sketch after this list).
  6. __init__.py: Initializes the graph module, potentially exposing key functions or classes to be used by other parts of the application.
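
To make the layout concrete, here is a minimal, hypothetical sketch of what graph/state.py and graph/consts.py might contain in a project like this; the field and node names are illustrative, not prescribed by the structure above:

# graph/state.py: the shared state passed between nodes (illustrative).
from typing import List, TypedDict


class GraphState(TypedDict):
    question: str          # the user's query
    documents: List[str]   # retrieved context documents
    generation: str        # the model's draft answer


# graph/consts.py: node-name constants reused across graph.py and the nodes (illustrative).
RETRIEVE = "retrieve"
GRADE_DOCUMENTS = "grade_documents"
GENERATE = "generate"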

📜Pipfile & 📜Pipfile.lock

  • Pipfile: Defines the project's Python dependencies and configurations, including development and production requirements.
  • Pipfile.lock: A generated file that locks the exact versions of the dependencies to ensure reproducibility.

2. Create a Virtual Environment

Pipenv is a powerful dependency management tool for Python that combines the functionality of pip and virtualenv into a single workflow. Designed to streamline Python development, Pipenv makes it easier to manage project-specific packages, virtual environments, and dependency conflicts.

pip3 install pipenv
# Pipenv might be installed, but your shell might not know where to find it.
# Add the bin directory to your PATH environment variable in your .zshrc file:
# Reload your .zshrc file.
echo 'export PATH="$PATH:$(python -m site --user-base)/bin"' >> ~/.zshrc
source ~/.zshrc
cd advanced-rag-app/
pipenv shell

Install commonly used packages.

pipenv install langchain
pipenv install langchain-openai
pipenv install langchain-community
pipenv install langchain-core
pipenv install langchainhub
pipenv install langgraph
pipenv install python-dotenv
pipenv install tavily-python
pipenv install langchain-chroma
pipenv install pytest
pipenv install black

LangChain: LangChain is a framework for building applications powered by LLMs. It provides tools for creating chains, agents, and retrieval strategies that form the cognitive architecture of applications. It supports tasks like decision-making, RAG, and more.

  • Chains: Pre-defined workflows combining multiple components.
  • Agents: Decision-making entities that select tools based on user input.
  • Memory: Maintains state across interactions for context-aware behavior.
  • Integration with external tools via modular packages.
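
As a quick illustration of the chain concept above, here is a minimal prompt-plus-model sketch; the model name is an assumption, and an OPENAI_API_KEY is assumed to be set in the environment:

from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# Compose a prompt template and a chat model into a runnable chain.
prompt = ChatPromptTemplate.from_template("Summarize in one sentence: {text}")
llm = ChatOpenAI(model="gpt-4o-mini")  # illustrative model name
chain = prompt | llm

print(chain.invoke({"text": "LangGraph adds persistence and interrupts to LangChain graphs."}).content)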

langchain-core: This package contains the foundational abstractions and interfaces for LangChain. It defines the base components like LLMs, vector stores, retrievers, and tools.

  • Lightweight dependencies with no third-party integrations.

  • Provides the "Runnable" interface for consistent invocation of components (e.g., stream, invoke, batch).

  • Relationship: Forms the backbone of LangChain by offering core functionality upon which other packages (e.g., langchain, langchain-community) build.
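
A tiny sketch of that Runnable interface, using RunnableLambda to wrap a plain Python function so it can be invoked, batched, or streamed like any other component:

from langchain_core.runnables import RunnableLambda

double = RunnableLambda(lambda x: x * 2)

print(double.invoke(3))          # 6
print(double.batch([1, 2, 3]))   # [2, 4, 6]
for chunk in double.stream(4):
    print(chunk)                 # 8

The same invoke/batch/stream calls work unchanged on chains, retrievers, and models, which is what makes composition in LangChain predictable.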

langchain-community: A community-maintained package containing third-party integrations for LangChain. It includes connectors for external LLMs, vector databases, and other tools.

  • Optional dependencies to keep it lightweight.

  • Encourages community contributions to expand LangChain's ecosystem.

  • Relationship: Extends the functionality of LangChain by enabling integrations beyond the core package.

langchain-openai: A specific integration package for OpenAI models within LangChain.

  • Provides seamless interaction with OpenAI's GPT models.

  • Includes utilities to handle inputs/outputs specific to OpenAI's API.

  • Relationship: A standalone integration package that depends on langchain-core but focuses exclusively on OpenAI's offerings.

langchain-chroma: An integration between LangChain and Chroma, enabling seamless use of Chroma’s vector database capabilities within LangChain applications.

  • Simplifies local prototyping by eliminating the need for external servers and supports features like in-memory or persistent storage modes.

  • Supports operations such as adding, querying, and updating embeddings, making it developer-friendly for tasks like similarity search and document retrieval.

  • Relationship: Chroma complements LangChain by providing a lightweight, efficient vector store that integrates seamlessly. Together, they enable developers to prototype locally and scale AI applications effectively.
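
A small sketch of that local-prototyping flow; it assumes an OpenAI embedding model (with an OPENAI_API_KEY in the environment) and Chroma's default in-memory storage, neither of which is a requirement:

from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

# An in-memory Chroma collection backed by OpenAI embeddings.
store = Chroma(collection_name="demo", embedding_function=OpenAIEmbeddings())
store.add_texts(["LangGraph persists graph state via checkpointers."])

print(store.similarity_search("How does LangGraph persist state?", k=1))

Because the collection lives in memory by default, this is handy for unit tests; pass persist_directory to keep the index on disk instead.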

LangGraph: An extension of LangChain designed for building multi-agent systems and stateful workflows using graph-based coordination.

  • Models workflows as nodes and edges in a graph structure.

  • Supports cyclical graphs and advanced agent coordination.

  • Exposes interfaces for creating custom flows or common agent types.

  • Relationship: Depends on langchain-core while adding graph-based capabilities. Complements LangChain by enabling complex multi-step workflows.

LangChainHub: A repository or platform for sharing reusable LangChain components such as chains, prompts, and templates.

  • Centralized location for community-contributed resources.

  • Facilitates rapid prototyping by providing ready-to-use modules.

  • Relationship: Acts as an auxiliary resource to the LangChain framework, promoting collaboration and reuse among developers.

Relationships Summary

  • langchain-core: the foundation of all other packages; defines core abstractions and interfaces.
  • langchain: built on langchain-core; implements the cognitive architecture (chains, agents).
  • langchain-community: extends langchain with third-party integrations; adds optional connectors for external tools.
  • langchain-openai: built on langchain-core; focuses exclusively on OpenAI model integration.
  • langchain-chroma: extends langchain-core; integrates the Chroma vector store.
  • LangGraph: extends langchain-core; enables graph-based multi-agent workflows.
  • LangChainHub: independent but complementary; a repository of reusable LangChain components.

3. Setup Debugger

Create a launch.json file by clicking on 'create a launch.json file' -> 'Python Debugger' -> Python File (Debug the currently active python file).

setup_debugger

Add one line to 'configurations'. This specifies a file that contains environment variable definitions. These variables will be loaded into the environment when debugging the application. Typically, this is a .env file.

"envFile": "${workspaceFolder}/.env"
launch_json

You can find launch.json in the directory /advanced-rag-app/.vscode/launch.json.

4. Setup Black Formatter

Open the VS Code Command Palette with Command+Shift+P and go to 'Preferences: Open User Settings'.

Search for "formatter" and select "Black" as the default formatter for the editor.

black_default_formatter

Search for "format on save" and enable the "Editor: Format on Save" option.

format_on_save

For more information, see Formatting Python in VS Code.

5. Configure Automatic Testing

pytest . -s -v

This is a way to run tests using the pytest testing framework in Python.

  1. pytest: Invokes the pytest framework to discover and run tests.

  2. .: Specifies the current directory as the location to look for test files.

    pytest will automatically search for files matching the naming conventions like test_*.py or *_test.py in the specified directory.

  3. -s: Instructs pytest to not capture standard output (stdout) during the test run.

    Without -s, pytest captures all output (e.g., print statements) and shows it only when a test fails.

    With -s, you can see print statements or other output in real time while the tests are running.

  4. -v: Stands for verbose mode.

    Provides more detailed output for each test case, including the test name, status (pass/fail), and sometimes additional context like line numbers. This is useful for debugging or understanding the progress of the test suite.
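
To see those flags in action, the tests directory from the structure above only needs a minimal test file; a hypothetical graph/chains/tests/test_chains.py could be as small as this:

# graph/chains/tests/test_chains.py (illustrative placeholder test)
def test_chains_placeholder() -> None:
    print("running test_chains_placeholder")  # visible in real time thanks to -s
    assert 1 + 1 == 2

Running pytest . -s -v from the project root will then discover and report this test verbosely, with the print statement shown as it runs.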

Go to 'Testing' -> 'Configure Python Tests' -> 'pytest (pytest framework)', and select the correct directory. You will find configurations created in .vscode/settings.json.

For more information, see Python Testing in Visual Studio Code.

Substack

How Meta Plans To Crank the Dopamine Machine With Infinite AI-Generated Content - The Algorithmic Bridge [Link]

This article discussed AI’s most dangerous potential - its ability to manipulate and addict humans through hyper-targeted entertainment. This trend, spearheaded by companies like Meta, risks reshaping human cognition and agency, raising existential questions about freedom, pleasure, and the future of society.

One good point is that the real “killer robots” are brain-hacking entertainment. A very plausible dystopia involves technology (e.g., AI-driven entertainment) manipulating human attention and cognition for profit. Traditional TV was a prototype of mental manipulation but lacked personalization. Current platforms such as Netflix and TikTok use algorithms to cater to preferences but still feel limited. Future AI will create hyper-personalized content tailored to individual preferences in real time, exploiting human psychology. Meta’s generative AI plans are the next step toward addictive, manipulative entertainment. Meta announced that AI content creators will be designed to enhance engagement on platforms like Facebook and Instagram. Connor Hayes, Meta’s VP for generative AI, explained how AI accounts will create and share engaging content.

The Five Stages of AGI Grief - Marcus on AI [Link]

Marcus uses the framework of the Kübler-Ross model of grief to describe the emotional responses people are having (or will likely have) to the ongoing developments in Artificial General Intelligence (AGI). He argues that many people are not yet facing the reality of AGI and are likely to go through similar stages of grief as it gets closer.

  1. Denial: Many people, including some experts, are still in denial about the possibility and speed of AGI development. They dismiss the progress, underestimate its potential, or claim it's decades away.
  2. Anger: Once denial fades, anger emerges, often directed at those perceived as enabling or hyping AGI. This can be targeted at AI researchers, tech companies, or even the technology itself.
  3. Bargaining: In this stage, people try to find ways to control or mitigate AGI, often through unrealistic expectations or proposed solutions.
  4. Depression: As bargaining fails, a sense of profound unease and hopelessness may set in. This is the realization that AGI could fundamentally change society in ways that are difficult to predict or control, leading to feelings of powerlessness.
  5. Acceptance: This is the final stage, where people begin to accept the reality of AGI and its potential impact. This isn't necessarily cheerful, but it's characterized by a shift from denial and fear towards a more realistic view.

The Taiwan posts - Noahpinion [Link]

Disney Paid Off Trump for a Reason - BIG by Matt Stoller [Link]

Fubo, a sports streaming service, had previously won a preliminary injunction against a joint venture between Disney, Fox, and Warner Bros, arguing that the venture was an illegal merger. However, Fubo's stock wasn't performing well, leading Fubo CEO David Gandler to sell a controlling stake in his company to Disney.

Here are the rationales behind this decision, according to the sources:

  • Fubo's CEO, David Gandler, profited from winning an antitrust suit and joined forces with a large corporation. Instead of being an underdog fighting against major corporations, Fubo has now joined forces with one of them. Fubo will now have Disney's resources, while its leaders imagine that it will operate somewhat independently.
  • Disney made a $16 million payment to settle a defamation suit brought by Trump, a settlement that legal analysts consider questionable, in order to gain credibility with Trump. The aim was to ensure that government enforcers would not interfere with the deal.
  • Fubo's leaders may be ignoring the risks involved in the merger. They are potentially exhibiting a kind of "malevolent naivete" and airbrushing away their own violation of the law.

The sources suggest that Fubo's leadership may not be considering some of the risks associated with mergers. Mergers carry significant risk, and they can fall apart for a variety of reasons. During the 18-24 months that it takes to clear financing and regulatory hurdles, a company under contract to be sold cannot make significant strategic decisions or investments, while the purchaser can do whatever they want. If the deal falls apart, the company that was to be sold could be in a significantly worse position.

The sources point out that there is a possibility that another private litigant could take Fubo's place and sue, using the legal precedent set by Fubo. This is evidenced by a letter sent by EchoStar to the court, in which the company states that it's considering suing along the same lines as Fubo. This may not matter to Disney, since they now control Fubo, but it should be a source of concern for Fubo's leadership team who have essentially bet their company on a violation of the law.

A private litigant, such as EchoStar, could take Fubo's place and sue Disney, Fox, and Warner Bros, using the same legal arguments that Fubo successfully used to win a preliminary injunction. This is a possibility because the legal precedent set by Fubo remains, even though Fubo is now under Disney's control.

Here's why this could be problematic for Fubo but not necessarily for Disney:

  • Fubo is in a vulnerable position due to the merger agreement. While the deal is pending, Fubo is restricted in its strategic decision-making and investments, effectively putting the company in "limbo". This means Fubo cannot make significant moves to respond to a new lawsuit.
  • Disney, as the purchaser, is not similarly restricted. They can continue to operate as they see fit. They have the resources to handle a new legal challenge.
  • If the merger fails, Fubo will have wasted 18-24 months with the potential for no significant strategic moves. It could end up in a weakened state compared to competitors who were not in a merger process. The company might even become "a limping and probably dead company". Failed mergers can also lead to leadership changes, such as the CEO getting fired.
  • Disney has already taken steps to ensure the deal's success, including a payment to gain credibility with the current administration. While another lawsuit could present a challenge, Disney has the resources and political connections to navigate it, which Fubo does not.
  • The incentive to complete the deal is different for Disney and Fubo. Disney will remain a major player regardless of the deal's outcome. However, Fubo's future is heavily dependent on the merger. This makes Fubo more vulnerable if the deal is challenged.

The rise and fall of "fact-checking" - Silver Bulletin [Link]

The main opinion of this article is that Meta's decision to replace fact-checkers with a community notes system is justifiable because fact-checkers have been politically biased and have not effectively addressed the issue of misinformation.

While the author agrees with Zuckerberg's decision, they also acknowledge that Zuckerberg's motivations may not be high-minded, but rather driven by political pressure and business incentives. Despite that, the author thinks the move is "pointing in the right direction," and agrees with Zuckerberg's claim that fact-checkers have been too politically biased. The author also admits their own biases and that Community Notes is a new program that might also have problems.

US Banks: Profits Surge - App Economy Insights [Link]

CES 2025: AI Takes Over - App Economy Insights [Link]

a16z's big ideas in tech for 2025 - ben lang's notes [Link]

Andreessen Horowitz’s list of big ideas in tech for 2025:

a16z_big_idea_2025

How AI-assisted coding will change software engineering: hard truths - The Pragmatic Engineer [Link]

Great article!

This "70% problem" suggests that current AI coding tools are best viewed as:

  • Prototyping accelerators for experienced developers
  • Learning aids for those committed to understanding development
  • MVP generators for validating ideas quickly

Current tools mostly wait for our commands. But look at newer features like Anthropic's computer use in Claude, or Cline's ability to automatically launch browsers and run tests. These aren't just glorified autocomplete - they're actually understanding tasks and taking initiative to solve problems.

Think about debugging: Instead of just suggesting fixes, these agents can:

  • Proactively identify potential issues
  • Launch and run test suites
  • Inspect UI elements and capture screenshots
  • Propose and implement fixes
  • Validate the solutions work (this could be a big deal)

― The 70% problem: Hard truths about AI-assisted coding - Elevate [Link]

Great pragmatic article! And it's well-said in the end: "Software quality was (perhaps) never primarily limited by coding speed...The goal isn't to write more code faster. It's to build better software. "

AI tools help experienced developers more than beginners, much as AI helps top biologists more than it helps average biologists: the results and efficiency of AI usage depend on the user's domain expertise. This is called the 'knowledge paradox'. AI can get the first 70% of a job done quickly, but effort spent on the final 30% sees diminishing returns. This is called the 'AI learning curve paradox'.

o1 isn’t a chat model (and that’s the point) - Latent Space [Link]

  • Provide Extensive Context: Give 10x more context than you think is necessary. This includes details about previous attempts, database schemas, and company-specific information. Think of o1 as a new hire that needs all the relevant information to understand the task. Put the context at the end of your prompt (a sketch of this structure follows the list below).

    Use tools like voice memos to capture context and paste transcripts. You can also save reusable segments of context for future use. AI assistants within other products can help extract context.

  • Focus on the Desired Output: Instead of telling o1 how to answer, clearly describe what you want the output to be. Let o1 plan and resolve its own steps, leveraging its autonomous reasoning.

  • Define Clear Evaluation Criteria: Develop specific criteria for what constitutes a "good" output so that o1 can evaluate its own output and improve. This moves the LLM-as-Judge into the prompt itself. Ask for one specific output per prompt.

  • Be Explicit About Output Format: o1 often defaults to a report-style output with numbered headings. Be clear if you need complete files or other specific formats.

  • Manage Context and Expect Latency: Since o1 is not a chat model, it will not respond in real time; treat it more like email. Make sure you can manage and see the context you are providing to the model. o1 is better suited to high-latency, long-running tasks.
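
Put together, a prompt following this advice might be structured like the hypothetical skeleton below; the task, criteria, and placeholder text are invented purely for illustration:

# A hypothetical o1-style prompt skeleton: goal and evaluation criteria up front,
# explicit output format, and the bulk of the context appended at the end.
prompt_template = """
Goal: produce a single, complete SQL migration file.

Evaluation criteria:
- Runs without manual edits on our target database.
- Preserves all existing data.

Output format: one complete SQL file, no commentary.

Context (schemas, previous attempts, internal conventions):
{context}
"""

print(prompt_template.format(context="...paste schemas, notes, and prior attempts here..."))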

The Deep Roots of DeepSeek: How It All Began - Recode China AI [Link]

Liang's Visions from his first public interview in May 2023:

AI Development:

  • Liang aims to build AGI, not just improve existing models like ChatGPT.
  • He prioritizes deep research over quick applications, requiring more resources.
  • He sees AI as a way to test ideas about human intelligence, like whether language is key to thought.
  • He plans to share DeepSeek’s results publicly to keep AI accessible and affordable.

Company Culture & Innovation:

  • He hires based on ability, creativity, and passion, preferring fresh graduates for key roles.
  • Employees should have freedom to explore and learn from mistakes.
  • Innovation can't be forced or taught.
  • A shared pace and curiosity drive the team, not strict rules or KPIs.

Competition:

  • Startups can still challenge big companies since AI tech is evolving.
  • No one has a clear lead in AI yet.
  • LLM applications will become easier, creating startup opportunities for decades.
  • AI believers stay in for the long run.
  • Unconventional approaches can be a game-changer.

Resources & Funding:

  • Securing GPUs and a strong engineering team is crucial.
  • Traditional VC funding may not fit DeepSeek’s research-heavy approach.
  • Innovation is costly, and some waste is inevitable.
  • GPUs are a solid investment as they hold value.

Is DeepSeek the new DeepMind? - AI Supremacy [Link]

Implications for the AI Industry:

  • DeepSeek's emergence challenges the dominance of Western AI firms like Google DeepMind, Meta, and OpenAI. The success of DeepSeek suggests that open-source models can outperform proprietary ones. It also calls into question the massive spending on AI infrastructure by Big Tech companies.
  • Its cost-effectiveness is causing enterprises to rethink their AI strategies. The availability of high-performing, cheaper models could disrupt the business model of companies that rely on expensive, proprietary models.
  • Its achievements indicate that China is becoming a leader in AI, particularly in inference-time compute and compute efficiency. This development raises concerns about America's shrinking lead in artificial intelligence.
  • Its open-source approach is seen as essential to keeping AI inclusive and accessible. The ability to run powerful models on a laptop could decentralize AI development and reduce reliance on Big Tech.

Arguments about US vs. China in AI:

  • The article suggests that the U.S. is losing its lead in AI innovation due to its focus on "Tycoon capitalism" and protectionist policies. The U.S. government's export controls on semiconductors, while intended to slow China's progress, may be inadvertently fueling China's self-reliance and innovation.
  • China has advantages in areas such as manufacturing, go-to-market strategies, talent (STEM programs and ML researchers), and patents. China's progress in various overlapping industries creates a "mutually reinforcing feedback loop". The article implies that DeepSeek's culture of empowering workers with autonomy and collaboration stands in strong contrast to the grueling work schedules, rigid hierarchies, and internal competition that are common in Chinese tech firms.
  • The article criticizes the massive AI infrastructure projects in the U.S. (dubbed "Project Oracle") as a scheme by the financial elite to control the future of AI. The author argues that these projects prioritize the interests of Big Tech and the financial elite over those of regular citizens and that these AI infrastructure projects are primarily intended to redistribute wealth globally to the elite.

Concerns about AI's Impact:

  • The author acknowledges concerns that AI could lead to wage deflation, particularly in white-collar jobs where AI can automate tasks.
  • It questions the assumption that AI will create more jobs than it displaces, noting that AI coding tools could negatively impact software engineers.
  • It also raises concerns about the potential for misuse of AI, including the use of AI for "authoritarian" control and as a weapon in trade wars. There are also concerns about the potential for backdoors, Trojans, model inversion attacks, sensitive information inference, and automated social engineering via the release of attractive but cheap AI services.

Additional Info:

  • DeepSeek is an offshoot of a quantitative hedge fund, High-Flyer, and is fully funded by them.
  • It is noted for being more transparent about its methods compared to some Western AI firms.
  • Its mission is to "unravel the mystery of Artificial General Intelligence (AGI) with curiosity". They focus on open-source development, research-driven innovation, and making advanced AI accessible to all.

Monopoly Round-Up: China Embarrasses U.S. Big Tech - BIG by Matt Stoller [Link]

  • DeepSeek, a Chinese AI firm, developed cost-effective AI models that rival U.S. models and released them on an open-source basis. This is a significant accomplishment, especially since the U.S. has placed export controls that prevent China from accessing the best chips. DeepSeek's approach focused on efficiency, rather than raw computing power, which challenges the assumption that computing power is the primary competitive barrier in AI. This development is considered embarrassing and threatening to big tech and U.S. security.
  • The U.S. has heavily invested in AI, with tech giants spending billions on data centers and infrastructure, betting that these investments will provide a competitive advantage. However, DeepSeek’s success suggests that this approach may be flawed. The sources suggest that the U.S. strategy of denying top chips to China may also be ineffective.
  • The sources argue that betting on monopolistic national champions is a disastrous national security strategy. It points out that history shows that monopolies are slow to innovate. The U.S. needs to prioritize competition over protecting monopolies. The sources criticize large U.S. tech firms (Meta, Microsoft, Google, Amazon, Apple) for becoming slothful bureaucracies that are not very good at developing and deploying technology.
  • Chinese policy is noted to be more aggressive in forcing competition in some sectors. China's electric vehicle industry is cited as an example of this. The Chinese government's crackdown on its big tech firms and financial sector is also mentioned as a move that has seemingly benefited the economy by driving innovation. The success of companies like ByteDance and DeepSeek is mentioned as evidence of this.
  • The sources highlight that U.S. anti-monopoly laws take too long to take effect. It uses the example of the Federal Trade Commission's case against Facebook for its acquisition of Instagram and WhatsApp. This case highlights how companies like Facebook acquire and bury innovative competitors rather than compete. It argues that if Facebook had been broken up, there would be tremendous innovation in social networking.
  • The sources express uncertainty about the future of AI, noting it might not live up to expectations. It also notes that the competitive advantages in AI are not as straightforward as previously thought.

In a rare interview for AnYong Waves, a Chinese media outlet, DeepSeek CEO Liang Wenfeng emphasized innovation as the cornerstone of his ambitious vision:

. . . we believe the most important thing now is to participate in the global innovation wave. For many years, Chinese companies are used to others doing technological innovation, while we focused on application monetization—but this isn’t inevitable. In this wave, our starting point is not to take advantage of the opportunity to make a quick profit, but rather to reach the technical frontier and drive the development of the entire ecosystem.

― 7 Implications of DeepSeek’s Victory Over American AI Companies - The Algorithmic Bridge [Link]

"Every job is a bundle of tasks.

Every new technology wave (including the ongoing rise of Gen AI) attacks this bundle.

New technology may substitute a specific task (Automation) or it may complement a specific task (Augmentation)"

Extend this analogy far enough, and you get this:

Once technology has substituted all tasks in a job bundle, it can effectively displace the job itself.

Of course, there are limits to this logic. This can only be true for a small number of jobs, which involve task execution only.

But most jobs require a lot more than mere task execution.

They require ‘getting things done’. They require achievement of objectives, accomplishment of outcomes.

In other words, most jobs involve goal-seeking.

This is precisely why previous generations of technologies haven’t fully substituted most jobs. They chip away at tasks in the job bundle without really substituting the job entirely.

Humans retain their right to play because of their ability to plan and sequence tasks together to achieve goals.

In most previous instances, technology augments humans far more than automating an entire job away.

And that is because humans possess a unique advantage: goal-seeking.

― Slow-burn AI: When augmentation, not automation, is the real threat - Platforms, AI, and the Economics of BigTech [Link]

AI agents are the first instance of technology directly attacking and substituting goals within a role or a team.

In doing so, they directly impact power dynamics within an organization, empowering some roles and weakening others, empowering some teams and weakening others.

― How AI agents rewire the organization - Platforms, AI, and the Economics of BigTech [Link]

This is a brilliant article.

Goal-seeking, for the first time, can be performed by technology.

  1. Scope of the role: Effectively, a goal-seeking AI agent can unbundle a goal from the role. They reduce the scope of the role.
  2. Scope of the team: They displace the role entirely in a team if the team can now achieve the same goal using an AI agent.
  3. Rebundling of roles: Role B is eliminated not because its tasks were fully substituted by technology, nor because its goals were fully substituted by technology, but because the scope of the role no longer justified a separate role.
  4. Reworking power structures: Teams have voting rights on the relevance of Roles. The fewer teams speaking to a role’s contributions, the lower the negotiating power for that role within the organization.
  5. Roles unbundle, teams rebundle: this cycle of unbundling and rebundling across roles and teams is inherent to the organization of work. AI isn’t fundamentally changing goal-seeking and resource allocation. It is merely inserting itself into the organization and re-organization of work.

YouTube and Podcasts

2025 Predictions with bestie Gavin Baker - All-In Podcasts [Link]

Interesting discussions about new year predictions. Here is a summary of the predictions:

Chamath Palihapitiya:

  • Biggest Political Winner: Fiscal conservatives. He believes austerity will reveal waste and fraud in the US government and that this will spill over to state elections.
  • Biggest Political Loser: Progressivism. He predicts a repudiation of class-based identity politics in multiple Western countries.
  • Biggest Business Winner: Dollar-denominated stablecoins, which he believes will grow substantially and challenge the dominance of Visa and Mastercard.
  • Biggest Business Loser: The "MAG 7" companies will see a drawdown in absolute dollars due to high concentration in the indices. He suggests that these companies may not be able to maintain their high valuations, though they are good businesses.
  • Biggest Business Deal: The collapse of traditional auto OEMs and a wave of auto mega-mergers, triggered by Tesla's strong position.
  • Most Contrarian Belief: A banking crisis in a major mainline bank, triggered by the total indebtedness of Pax Americana and the impact of higher interest rates.
  • Best Performing Asset: Credit Default Swaps (CDS) as an insurance policy against a potential default event.
  • Worst Performing Asset: The software industrial complex, or large, bloated enterprise software companies.
  • Most Anticipated Trend: Small, arcane regulatory changes related to the supplementary leverage ratio that allow the US to kick the debt can down the road.
  • Most Anticipated Media: The enormity of files that will be declassified and released by the Trump administration.
  • Prediction Market: The MAG 7 representation in the S&P 500 shrinks below 30%.

David Friedberg:

  • Biggest Political Winner: Young political candidates, marking a trend of a shift towards younger leaders.
  • Biggest Political Loser: Pro-war neoconservatives. He believes they will lose out to figures like JD Vance and Elon Musk.
  • Biggest Business Winner: Autonomous hardware and robotics, citing the rise of humanoid robots and their applications.
  • Biggest Business Loser: Old defense and aerospace providers, like Boeing and Lockheed Martin. He predicts a shift towards more tech-oriented and rationalized spending in defense. He also thinks Vertical SaaS companies will struggle as AI replaces their services.
  • Biggest Business Deal: Massive funding deals for hardware-based manufacturing buildout in the United States, potentially involving government support.
  • Most Contrarian Belief: A dramatic rise in socialist movements in the United States, fueled by economic inequality and disruption from AI.
  • Best Performing Asset: Chinese tech stocks or ETFs, based on potential deals between the US and China and the strong fundamentals of Chinese tech companies.
  • Worst Performing Asset: Vertical SaaS companies again, as AI replaces their services; also legacy car companies and real estate, because of overbuilding and high debt.
  • Most Anticipated Trend: The announcement of buildout of nuclear power in the United States.
  • Most Anticipated Media: AI Video Games with dynamic story lines
  • Prediction Market: Microsoft, AWS, and Google Cloud Revenue Growth.

Gavin Baker:

  • Biggest Political Winner: Trump and centrism; also Gen X and Elder Millennials.
  • Biggest Political Loser: Putin, due to Europe rearming, which shifts US resources to the Pacific, and Trump's likely tougher stance.
  • Biggest Business Winner: Big businesses that use AI thoughtfully, and the robotics industry, as well as companies that make high bandwidth memory.
  • Biggest Business Loser: Government service providers with over 35% of their revenue coming from the US government. He also thinks enterprise application software will be hurt by AI agents.
  • Biggest Business Deal: A wave of M&A after a period of inactivity and something significant happening with Intel. Also, he thinks independent AI labs will get acquired.
  • Most Contrarian Belief: The US will experience at least one year of greater than 5% real GDP growth due to AI and deregulation. He also thinks frontier AI labs will stop releasing their leading-edge models.
  • Best Performing Asset: Companies that make high bandwidth memory (HBM).
  • Worst Performing Asset: Enterprise application software.
  • Most Anticipated Trend: AI will make more progress per quarter in 2025 than it did per year in 2023 and 2024, due to scaling performance through reasoning, pre-training, and test time compute.
  • Most Anticipated Media: Season 2 of 1923
  • Prediction Market: US Treasury Market Report on Federal Debt in December 2025 above or below $38 trillion
  • UFOs: Believes there is a 25% chance the US government is sitting on knowledge of extraterrestrial life.

Jason Calacanis:

  • Biggest Business Winner: Tesla and Google for AI and Robotics
  • Biggest Business Loser: OpenAI
  • Biggest Business Deal: Partnerships between Amazon, Uber, Tesla, and Waymo for autonomy, delivery, and e-commerce
  • Most Contrarian Belief: OpenAI will lose its lead, stumble in its nonprofit-to-for-profit transition, and become the number four player in AI.
  • Best Performing Asset: MAG 7 stocks
  • Worst Performing Asset: Legacy car companies and Real Estate.
  • Most Anticipated Trend: Exits and DPI will shower down, along with a surge in M&A and IPOs
  • Most Anticipated Media: Legacy media outlets owned by billionaires attempting to steer towards the middle
  • Prediction Market: Over or under 750,000 deportations by Trump in the first year of office

Building Anthropic | A conversation with our co-founders - Anthropic [Link]

WTF is Artificial Intelligence Really? | Yann LeCun x Nikhil Kamath | People by WTF Ep #4 - Nikhil Kamath [Link]

The Next Frontier: Sam Altman on the Future of A.I. and Society - New York Times Events [Link]

LA's Wildfire Disaster, Zuck Flips on Free Speech, Why Trump Wants Greenland [Link]

Text, camera, action! Frontiers in controllable video generation - William (Bill) Peebles [Link]

Best of 2024 in Agents (from #1 on SWE-Bench Full, Prof. Graham Neubig of OpenHands/AllHands) - Latent Space [Link]

The State of AI Startups in 2024 [LS Live @ NeurIPS] - Latent Space [Link]

Best of 2024 in Vision [LS Live @ NeurIPS] - Latent Space [Link]

Red-pilled Billionaires, LA Fire Update, Newsom's Price Caps, TikTok Ban, Jobless MBAs - All-In Podcast [Link]

NVIDIA CEO Jensen Huang Keynote at CES 2025 - NVIDIA [Link]

CES 2025 is the world's biggest tech expo. Each January, CES kicks off the tech year by highlighting everything from groundbreaking gadgets to the processors driving our digital world.

NVIDIA's CES announcements showcased its dominance in the AI chip market while highlighting its bold expansion into emerging, high-growth sectors. By emphasizing robotics, autonomous vehicles, and broader accessibility to AI, NVIDIA demonstrated its commitment to staying central to this wave of innovation.

Highlights:

  1. GeForce RTX 50 Series GPUs

    NVIDIA unveiled its latest GeForce RTX 50 series GPUs, powered by the advanced Blackwell architecture and set to launch in January. These GPUs deliver significant improvements in gaming and AI performance, with the flagship RTX 5090 priced at \(\$1,999\) and the RTX 5070 at \(\$549\), surpassing the RTX 4090, which debuted at \(\$1,599\) in 2022.

    The 50 series also introduces DLSS 4, a cutting-edge Deep Learning Super Sampling technology that employs a transformer-based architecture to generate three AI-rendered frames for every traditionally rendered one, enhancing graphics quality and gaming experiences. NVIDIA partnered with Micron to supply memory chips for these GPUs.

    Although GeForce RTX GPUs contributed only 9% of NVIDIA’s revenue in the October quarter, the company’s primary growth continues to come from its Data Center segment, driven by AI demand.

  2. AI Advancements

    NVIDIA introduced Nemotron, a new family of AI models derived from Meta’s Llama models, including Llama Nemotron Nano, Super, and Ultra, aimed at advancing AI agent capabilities. CEO Jensen Huang projects that the AI agent market could be worth trillions of dollars.

    Additionally, NVIDIA confirmed that its Blackwell AI accelerators are in full production and are being adopted by leading cloud providers and PC manufacturers, further solidifying its position in AI technology.

  3. Robotics and Autonomous Vehicles

    NVIDIA debuted Cosmos, the "world's first physical AI model," designed to advance robotics. Trained on 20 million hours of video, Cosmos is open-licensed on GitHub and integrates seamlessly with NVIDIA’s Omniverse platform to provide physics-based simulations for AI model training in robotics and autonomous systems.

    In partnership with Toyota, NVIDIA is collaborating on developing the automaker's latest autonomous vehicles. Huang sees robotics and autonomous technology as a \(\$1\) trillion market opportunity, expecting NVIDIA’s automotive revenue to grow from \(\$4\) billion in FY25 to \(\$5\) billion in FY26, spanning Data Center and OEM segments.

  4. Project DIGITS

    NVIDIA announced Project DIGITS, a personal AI supercomputer aimed at democratizing access to powerful AI tools. Starting at \(\$3,000\), the system features the GB10 Grace Blackwell Superchip, 128GB of unified memory, and up to 4TB of NVMe storage. Users can connect two systems for enhanced processing capabilities.

    Designed for AI researchers and data scientists, Project DIGITS provides a cost-effective solution for building complex AI models without relying on large-scale data center resources.

A non-comprehensive summary of NVIDIA's efforts in AI (not a summary of this YouTube video):

  1. AI Compute Hardware:

    This category includes the physical processing units that perform the core calculations for AI models. These are primarily GPUs, but also include specialized CPUs and other accelerators.

    Focus: High-performance, parallel processing, low latency, memory bandwidth, energy efficiency for AI workloads.

    Examples:

    NVIDIA A100 Tensor Core GPU, NVIDIA A40 Tensor Core GPU, NVIDIA A10 Tensor Core GPU, NVIDIA H100 Tensor Core GPU, NVIDIA L40 GPU, NVIDIA L4 GPU, NVIDIA B100 "Blackwell" Data Center GPU, NVIDIA Grace CPU Superchip, GeForce RTX 30 Series desktop GPUs (Ampere; relevant for model development), GeForce RTX 50 Series desktop GPUs (Blackwell; relevant for model development), and Project DIGITS (a personal AI supercomputer system).

  2. AI Platforms & Systems:

    This category includes integrated hardware and software solutions designed to simplify the development and deployment of AI applications. It encompasses both edge and data center solutions.

    Focus: Ease of use, scalability, optimized performance for specific AI tasks, deployment solutions.

    Examples:

    NVIDIA DGX A100 System, NVIDIA Jetson AGX Xavier NX, NVIDIA Jetson Orin, NVIDIA Jetson Orin Nano, and the NVIDIA Omniverse Platform.

  3. AI Software & Development Tools:

    This category includes the software libraries, frameworks, and tools that allow developers to build, train, and deploy AI models. It covers both open source and proprietary tools.

    Focus: Developer productivity, model performance, framework support, customization.

    Examples:

    NVIDIA Merlin (software library), NVIDIA NeMo Framework, and NVIDIA TAO Toolkit.

  4. AI Applications & Solutions:

    This category focuses on specific, industry-focused AI applications built on top of NVIDIA hardware and software.

    Focus: Pre-built solutions, vertical market expertise, end-to-end solutions.

    Examples: Intelligent Video Analytics (IVA), autonomous vehicle solutions, AI-driven healthcare, generative AI.

  5. AI Research and Frameworks

    While related to AI Software and development tools, it deserves its own category because of the open source nature of much of the research based tools and APIs, allowing for community contributions and new technology development.

    Focus: Next-generation tools, advanced research, pushing the limits of AI, and new technologies and algorithms.

    Examples: Nemotron, NVIDIA FLARE (Federated Learning Application Runtime Environment), NVIDIA research publications and open-source projects, and TensorFlow and PyTorch (with NVIDIA's extensions).

So, my takeaway was entirely different. It was not a commentary on Masa, or Larry, or Sam. I think all of those three companies are, frankly, very good. It was more a comment that you have to be very careful to protect the president's legacy, if I were them, to make sure that the things that get announced are actually further down the technical spectrum and are actually going to be real. Because if they achieve these things, but it costs you a billion dollars and you only hire 50 people, there's going to be a little bit of egg on the face. And so, that was sort of my own takeaway. I think that the things were decoupled. It just seemed more like marketing and sizzle and kind of hastily put together. I think it would be great if OpenAI builds another incredible model, whatever comes after o3, o4, o5. But it's not clear that you have to spend $500 billion to do it. - Chamath Palihapitiya

― Trump's First Week: Inauguration Recap, Executive Actions, TikTok, Stargate + Sacks is Back! - All-In Podcast [Link]

There's a thing called Jevons Paradox, which kind of speaks to this concept. SAA actually tweeted about it. It's an economic concept where, as the cost of a particular use goes down, the aggregate demand for all consumption of that thing goes up. So, the basic idea is that as the price of AI gets cheaper and cheaper, we're going to want to use more and more of it. You might actually get more spending on it in the aggregate. That's right—because more and more applications will become economically feasible. Exactly. That is, I think, a powerful argument for why companies are going to want to continue to innovate on frontier models. You guys are taking a very strong point of view that open source is definitely going to win, that the leading model companies are all going to get commoditized, and therefore, there will be no return on capital—essentially forcing continued innovation on the frontier. - David Sacks

But then there's this dark horse that nobody's talking about—it's called electricity. It's called power. And all these vehicles are electric vehicles. If you said, 'You know, I just did some quick back-of-the-envelope calculations,' if all of the miles in California went to EV ride-sharing, you would need to double the energy capacity of California. Right? Let's not even talk about what it would take to double the energy capacity of the grid and things like that in California. Let's not even go there. Even getting 10% or 20% more capacity is going to be a gargantuan, five-to-ten-year exercise. Look, I live in LA—in a nice area in LA—and we have power outages all the freaking time because the grid is messed up. They're sort of upgrading it as things break. That's literally where we're at in LA, in one of the most affluent neighborhoods. That’s just the reality. So, I think the dark horse, kind of hot take, is combustion engine AVs. Because I don’t know how you can scale AVs really, really massively with the electric grid as it is. - Travis Kalanick

I just wanted to read a message from Brian Yutko, who's the CEO of Wisk, which is building a lot of these autonomous systems. He said: 'First, automatic traffic collision avoidance systems do exist right now. These aircraft will not take control from the pilot to save the aircraft, even if the software and systems on the aircraft know that it’s going to collide. That’s the big flip that needs to happen in aviation—automation can actually kick in and take over, even in piloted aircraft, to prevent a crash. That’s the minimum of where we need to go. Some fighter jets have something called Automatic Ground Collision Avoidance Systems that do exactly this when fighter pilots pass out. It’s possible for commercial aviation as well.' And then, the second thing he said is: 'We need to have better ATC (Air Traffic Control) software and automation. Right now, we use VHF radio communications for safety and critical instructions, and that’s kind of insane. We should be using data links, etc. The whole ATC system runs on 1960s technology. They deserve better software and automation in the control towers—it’s totally ripe for change. The problem is that attempts at reform have failed.' - Chamath Palihapitiya

― DeepSeek Panic, US vs China, OpenAI $40B?, and Doge Delivers with Travis Kalanick and David Sacks - All-In Podcast [Link]

Articles and Blogs

The Art of Leading Teammates - Harvard Business Review [Link]

A Team-Focused Philosophy

  • Put the team first, always, even when facing personal adversity.
  • Show appreciation for unsung colleagues.
  • Set the standard and create a culture of 100% effort.
  • Recognize teammates’ individual psychology and the best ways to motivate them.
  • Understand and complement the style of the formal leader.
  • Recognize and counteract the external forces that can cause selfish behavior.
  • Create opportunities to connect as people outside the office.

What Helps—and What Gets in the Way

  • The emotions and behaviors that define individuals are formed early.
  • Leaders work within a system.
  • It can be hard for individual team leaders to influence change across large organizations.
  • A leader’s style and influence will take time to evolve.

Early adopters of gen AI can eclipse rivals by using it to identify entirely new product opportunities, automate routine decisions and processes, deliver customized professional services, and communicate with customers more quickly and cheaply than was possible with human-driven processes.

Far from being a source of advantage, even in sectors where its impact will be profound, gen AI will be more likely to erode a competitive advantage than to confer one, because its very nature makes new insights and data patterns almost immediately transparent to anyone using gen AI tools.

If you already have a competitive advantage that rivals cannot replicate using gen AI, the technology may amplify the value you derive from that advantage.

Businesses that try to deny the power of gen AI will certainly fail. Those that adopt it will stay in the fight. But at this stage it looks likely that the only ones that will actually win with it will be those that can apply it to amplify the advantages they already have.

― AI Won’t Give You a New Sustainable Advantage - Harvard Business Review [Link]

For each problem, the article suggests what to ask about and offers sample questions:

  • Conflating correlation and causation. Ask about: the approach to determining causality. Sample questions: Was this analysis based on an experiment? If not, are there confounders (variables that affect the independent and dependent variables)? To what extent were they addressed in the analysis?
  • Misjudging the potential magnitude of effects. Ask about: sample size and the precision of the results. Sample questions: What was the average effect of the change? What was the sample size and the confidence interval (or range of likely values the true effect would fall into, and the degree to which one is certain it would fall into that range)? How would our course of action change, depending on where the true effect might lie?
  • A disconnect between what is measured and what matters. Ask about: outcome measures. Sample questions: What outcomes were measured? Were they broad enough? Did they capture key intended and unintended consequences? Were they tracked for an appropriate period of time? Were all relevant outcomes reported? How do we think they map to broader organizational goals?
  • Misjudging generalizability. Ask about: the empirical setting and subgroup analysis. Sample questions: How similar is the setting of this study to our business context? Does the context or time period of the analysis make it more or less relevant to our decision? What is the composition of the sample being studied, and how does it influence the applicability of the results? Does the effect vary across subgroups or settings? Does this tell us anything about the generalizability of the results?
  • Overweighting a specific result. Ask about: broader evidence and further data collection. Sample questions: Are there other analyses that validate the results and approach? What additional data could we collect, and would the benefit of gathering it outweigh the cost of collecting it? How might this change our interpretation of the results?

― Where Data-Driven Decision-Making Can Go Wrong - Harvard Business Review [Link]

Will Psychedelics Propel Your Career? - Harvard Business Review [Link]

Do you want to take a 'trip'? lol

How Scalable Compute Resources Can Boost LLM Performance - HuggingFace [Link]

This blog explains how to scale test-time compute for models like OpenAI's o1 - apply dynamic inference strategies to improve performance without increasing pretraining budgets. These techniques allow smaller models to outperform larger models on tasks such as math problems.

We introduce deliberative alignment, a training paradigm that directly teaches reasoning LLMs the text of human-written and interpretable safety specifications, and trains them to reason explicitly about these specifications before answering. We used deliberative alignment to align OpenAI’s o-series models, enabling them to use chain-of-thought (CoT) reasoning to reflect on user prompts, identify relevant text from OpenAI’s internal policies, and draft safer responses.

― Deliberative alignment: reasoning enables safer language models - OpenAI [Link]

Moravec’s paradox is the observation by artificial intelligence and robotics researchers that, contrary to traditional assumptions, reasoning requires very little computation, but sensorimotor and perception skills require enormous computational resources. The principle was articulated by Hans Moravec, Rodney Brooks, Marvin Minsky, and others in the 1980s.

― Common misconceptions about the complexity in robotics vs AI - Harimus Blog [Link]

Yes, as Yann LeCun mentioned in one of his previous campus lectures, LLMs might help, but they are not the right solution for robotics. This article makes several good points:

  • Sensorimotor Tasks Are More Complex. The source emphasizes that sensorimotor tasks are harder than many people realize. It was once assumed that perception and action were simple compared to reasoning, but this has turned out to be incorrect. This idea is known as Moravec's Paradox.
  • Real-World Interaction is the challenge. Robotics requires robots to interact with a dynamic, chaotic, and complex real world. Tasks that seem simple for humans, like picking up a coffee cup, involve complex, unconscious processes that are hard to program for a robot. Even small changes in the environment can require a complete rewrite of the robot's "move commands". Robots need to break down movements into muscle contractions and forces, which is more complex than it seems.
  • Data requirements are another challenge. LLMs thrive on massive amounts of data, like text and images from the internet. Robotics requires precise, high-quality data that is hard to collect. The variety and precision of the data also matter: unlike LLMs, where the quantity of data is key, in robotics the quality of the collected data matters more than the quantity.

Regarding the question "do we need better hardware to learn", I think we need a system of sensors that can capture every physical movement of a body and every angle a body can perceive. In terms of a world model, the system needs to be on a larger scale.

OpenAI has created an AI model for longevity science - MIT Technology Review [Link]

OpenAI's success with GPT-4b micro demonstrates the potential of LLMs to go beyond natural language processing and address highly specialized scientific problems. The model's ability to redesign Yamanaka factors to improve their effectiveness by 50x could be a game-changer in stem cell research, accelerating advancements in regenerative medicine. This development highlights a significant milestone in the use of AI for scientific discovery, particularly in the field of protein engineering and regenerative medicine.

A classic pattern in technology economics, identified by Joel Spolsky, is layers of the stack attempting to become monopolies while turning other layers into perfectly-competitive markets which are commoditized, in order to harvest most of the consumer surplus; discussion and examples.

― Laws of Tech: Commoditize Your Complement - [Link]

This is exactly Meta's strategy when initially competing with closed-source AI model businesses - commoditize their complements to increase demand for their own products. And there are more examples mentioned in this article.

  • Core Concept:

    Products have substitutes and complements. A substitute is an alternative product that can be bought if the first product is too expensive. A complement is a product usually bought together with another product. Demand for a product increases when the price of its complements decreases. Companies strategically try to commoditize their complements to increase demand for their own products. Commoditizing a complement means driving its price down to a point where many competitors offer indistinguishable goods. This strategy allows a company to become a quasi-monopolist and divert the majority of the consumer surplus to themselves.

  • How it works:

    A company seeks a chokepoint or quasi-monopoly in a product composed of multiple layers. It dominates one layer of the stack while fostering competition in another layer. This drives down prices in the commoditized layer, increasing overall demand. The company profits from increased demand for its core product while the competitors in the commoditized layer struggle with low margins. The goal is to make a complement free or very cheap, to increase profits elsewhere. This strategy is an alternative to vertical integration.

  • Examples of Commoditization:

    • Microsoft commoditized PC hardware by licensing its OS to many manufacturers, making the PC itself a commodity and increasing demand for MS-DOS.
    • IBM commoditized the add-in market by using off-the-shelf parts and documenting the interfaces, allowing other manufacturers to produce add-on cards for their PCs, which increased the demand for PCs.
    • Netscape open-sourced its web browser to commoditize browsers and increase demand for its server software.
    • Various companies contribute to open-source software to commoditize software and increase demand for hardware and IT consulting services.
    • Sun developed Java to make hardware more of a commodity.
    • The Open Game License (OGL) was created to commoditize the Dungeons and Dragons system and drive sales of core rulebooks.
  • Open Source as a Strategic Weapon:

    Open source can be a way for companies to commoditize their complements. It allows companies to share development costs and compete with dominant players. It can also neutralize advantages held by competitors and shift the focus of competition. Open sourcing can prevent a single company from locking up a technology.

  • Generalization:

    Many products are composed of layers, each necessary but not sufficient for the final product. The final product is valuable, but the distribution of revenue among the different layers is contentious. Commoditizing complements is a way to control the market without vertical integration. The division of revenue is influenced by power plays and market dynamics.

  • Additional Examples:

    The sources list many examples of commoditization in various industries, including hardware vs. software, banks vs. merchants, apps vs. OSes, game portals vs. game devs, telecom vs. users, and many more. The examples illustrate the breadth of this strategy across various tech and non-tech sectors. There are examples of companies commoditizing themselves, such as Stability AI, who commoditized image-generation models and saw little profit themselves.

  • Counter-Examples:

    • Sun Microsystems' strategy of making both hardware and software a commodity was not successful.
    • Some companies, like Apple, try to control both the hardware and software aspects of their products, which goes against the commoditization strategy.
  • Other Factors:

    Antitrust actions can influence companies and prevent them from crushing competitors. Fear of antitrust actions may have stopped Microsoft from crushing Google.

  • Consequences:

    The commoditization of complements can lead to intense competition in certain layers of the tech stack. It can also lead to a concentration of power and revenue in the hands of companies that control key chokepoints.

Reports and Papers

Mixtral of Experts [Link]

Key innovation: Sparse Mixture of Experts (SMoE) with TopK=2.

The state of Generative AI and Machine Learning at the end of 2023 - Intel Tiber AI Studio [Link]

Trends and insights of AI development and deployment in the enterprise - a survey result.

Does Prompt Formatting Have Any Impact on LLM Performance? [Link]

Prompt formats significantly affect LLM performance, with differences as high as 40% observed in code translation tasks for GPT-3.5-turbo. Larger models like GPT-4 demonstrate more resilience to prompt format changes.

JSON format outperformed Markdown in certain tasks, boosting accuracy by 42%. GPT-4 models exhibited higher consistency in responses across formats compared to GPT-3.5 models.

Deliberative Alignment: Reasoning Enables Safer Language Models [Link]

Their training methodology has two stages: 1) supervised fine-tuning on (prompt, CoT, output) datasets where CoTs explicitly reference safety policies, 2) high-compute RL using a reward model informed by safety policies, improving reasoning and adherence.

deliberative_alignment

Genesis is a comprehensive physics simulation platform designed for general purpose Robotics, Embodied AI, & Physical AI applications. It is simultaneously multiple things:

  • A universal physics engine re-built from the ground up, capable of simulating a wide range of materials and physical phenomena.
  • A lightweight, ultra-fast, pythonic, and user-friendly robotics simulation platform.
  • A powerful and fast photo-realistic rendering system.
  • A generative data engine that transforms user-prompted natural language description into various modalities of data.

― Genesis: A Generative and Universal Physics Engine for Robotics and Beyond [Link]

Agents - Julia Wiesinger, Patrick Marlow, and Vladimir Vuskovic - Google [Link]

Agent AI: Surveying the Horizons of Multimodal Interaction [Link]

Foundations of Large Language Models [Link]

Atlas of Gray Matter Volume Differences Across Psychiatric Conditions: A Systematic Review With a Novel Meta-Analysis That Considers Co-Occurring Disorders [Link]

"Gray matter volume (GMV) differences across major mental disorders" refers to variations in the amount or density of gray matter in the brain when comparing individuals with mental disorders to those without. Gray matter consists of neuronal cell bodies, dendrites, and synapses and is essential for processing information, controlling movements, and supporting higher cognitive functions like memory, attention, and decision-making.

Structural Abnormalities: Mental disorders are often associated with changes in the brain's structure. GMV differences can highlight specific brain regions that are smaller, larger, or differently shaped in individuals with mental disorders.

Neurobiological Insights: Identifying GMV changes helps researchers understand the neurobiological basis of mental disorders and how these changes may contribute to symptoms like mood dysregulation, cognitive impairment, or altered behavior.

Target for Interventions: Understanding these differences can inform treatments such as targeted therapies, neurostimulation, or cognitive training to address the affected brain regions.

From Efficiency Gains to Rebound Effects: The Problem of Jevons’ Paradox in AI’s Polarized Environmental Debate [Link]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning [Link]

DeepSeek-R1 is an open-source reasoning model that matches OpenAI-o1 in math, reasoning, and code tasks.

News

NVIDIA Project DIGITS, A Grace Blackwell AI Supercomputer on your desk - NVIDIA [Link]

Constellation inks $1 billion deal to supply US government with nuclear power - Yahoo [Link]

Why 2025 will be the year of AI orchestration [Link]

2025 is anticipated to be the year of AI orchestration for several reasons:

  • In 2024, there was broad experimentation in AI, particularly with agentic use cases. In 2025, these pilot programs, experiments, and new use cases are expected to converge, leading to a greater focus on return on investment.
  • As organizations deploy more AI agents into their workflows, the need for infrastructure to manage them becomes more critical. This includes managing both internal workflows and those that interact with other services.
  • Decision-makers, especially those outside of the technology sector, are seeking tangible results from their AI investments. They are moving beyond experimentation and expect to see a return on their investment in 2025.
  • There will be a greater emphasis on productivity, which involves understanding how multiple agents can be made more effective. This will require a focus on accuracy and achieving higher productivity.
  • Many new orchestration options are emerging to address the limitations of existing tools such as LangChain. Companies are building orchestration layers to manage AI applications. These frameworks are still early in development, and the field is expected to grow.
  • There will be a focus on integrating agents across different systems and platforms, such as AWS's Bedrock and Slack, to allow for the transfer of context between platforms.
  • The emergence of powerful reasoning models like OpenAI's o3 and Google's Gemini 2.0 will make orchestrator agents more powerful.

Perplexity AI makes a bid to merge with TikTok U.S. - CNBC [Link]

OpenAI, Alphabet Inc.’s Google, AI media company Moonvalley and several other AI companies are collectively paying hundreds of content creators for access to their unpublished videos, according to people familiar with the negotiations.

― YouTubers Are Selling Their Unused Video Footage to AI Companies - Bloomberg [Link]

Stavridis says Trump’s plan for Greenland ‘not a crazy idea’ - The Hill [Link]

California’s Wildfire Insurance Catastrophe - WSJ [Link]

Rising premiums and limited coverage options could significantly impact Californians, particularly in wildfire-prone areas. The article calls out state leadership for failing to adapt policies to address climate-related risks effectively.

Our robotics team is focused on unlocking general-purpose robotics and pushing towards AGI-level intelligence in dynamic, real-world settings. Working across the entire model stack, we integrate cutting-edge hardware and software to explore a broad range of robotic form factors. We strive to seamlessly blend high-level AI capabilities with the physical constraints of physical.

― OpenAI has begun building out its robotics team - VentureBeat [Link]

It's surprising because I remember that in a public interview Sam said he was not going to go into hardware, since OpenAI would not be as efficient at it as companies with hardware foundations like Tesla, NVIDIA, Meta, etc. Now, it's hiring its first hardware robotics roles, as announced by Caitlin Kalinowski.

OpenAI’s $500B ‘Stargate Project’ could aid Pentagon’s own AI efforts, official says - Breaking Defense [Link]

This article highlights OpenAI's ambitious Stargate Project and its potential impact on both commercial and government sectors, particularly the U.S. Department of Defense (DoD). Stargate represents a bold step in building the next generation of AI infrastructure, and its success could profoundly influence the future of both private AI development and national security capabilities. The collaboration between industry leaders and government stakeholders will be key to overcoming technical and financial hurdles.

Here are key takeaways:

OpenAI's Stargate Project:

  • Objective: Build $500 billion worth of AI infrastructure, including new data centers and power solutions, primarily aimed at training and operating large AI models.
  • Initial Funding: $100 billion to be deployed immediately, with ongoing development starting in Texas and other potential sites in the U.S.
  • Collaborators: Japan-based SoftBank, Oracle, UAE-based MGX, NVIDIA, Microsoft, and Arm.

DoD Implications:

  • AI Challenges in Defense: The DoD faces significant bottlenecks in computing power to meet the demands of modern AI applications, from battlefield decision-making to intelligence analysis and coordinating multi-domain operations (CJADC2).
  • Reliance on Private Sector: Stargate could provide essential computing power to address the Pentagon's high-tech needs, especially where DoD lacks in-house capacity.
  • Field Applications: Supercomputing resources are essential for training and retraining AI models in dynamic environments, such as battlefield conditions where new inputs may arise.

Challenges:

  • Energy Demands: Generative AI models like ChatGPT consume immense electricity. The DoD must consider scalable and portable power sources, such as compact nuclear plants.
  • Funding Scrutiny: Despite public commitments, concerns about the financial capability of Stargate’s backers, including SoftBank, have raised questions.
  • Technical Constraints: Effective use of AI in military applications depends on robust, secure, and reliable infrastructure to handle high-bandwidth connections and avoid vulnerabilities to jamming.

Political and Economic Context:

  • The Stargate Project was announced at a high-profile White House event, underscoring its perceived importance to national interests.
  • Skepticism from figures like Elon Musk about the financial feasibility of such an enormous project adds to the intrigue surrounding its rollout.

Trump is planning 100 executive orders starting Day 1 on border, deportations and other priorities - AP News [Link]

A new neural-network architecture developed by researchers at Google might solve one of the great challenges for large language models (LLMs): extending their memory at inference time without exploding the costs of memory and compute. Called Titans, the architecture enables models to find and store during inference small bits of information that are important in long sequences.

Titans combines traditional LLM attention blocks with “neural memory” layers that enable models to handle both short- and long-term memory tasks efficiently. According to the researchers, LLMs that use neural long-term memory can scale to millions of tokens and outperform both classic LLMs and alternatives such as Mamba while having many fewer parameters.

― Google’s new neural-net LLM architecture separates memory components to control exploding costs of capacity and compute [Link]

TikTok restoring service after Trump vows to delay ban - AXIOS [Link]

TikTok's response to the Supreme Court decision [Link]

Amazon bought more renewable power last year than any other company - TechCrunch [Link]

Ozempic, Wegovy and other drugs are among 15 selected for Medicare’s price negotiations [Link]

Waymo Finds a Way Around US Restrictions Targeting Chinese Cars [Link]

More Speech and Fewer Mistakes - Meta News [Link]

NVIDIA Cosmos - NVIDIA [Link]

Announcing The Stargate Project - OpenAI [Link]

OpenAI announces the Stargate Project, a $500 billion effort to create advanced AI infrastructure. The project begins with an immediate $100 billion deployment for data centers, starting in Texas. It supports OpenAI’s goal of scaling artificial general intelligence (AGI) and training advanced AI models, with a focus on high-value fields like personalized medicine and biotechnology.

NVIDIA GPUs power compute-intensive workloads. Oracle provides high-capacity cloud infrastructure. Microsoft Azure supports scalable distributed AI model training.

Introducing Operator - OpenAI [Link]

It's an AI agent that automates tasks directly in a web browser. You can use Operator to complete repetitive tasks like filling out forms, booking travel, or ordering items online. It uses a new model called Computer-Using Agent (CUA), which integrates GPT-4's vision capabilities with reinforcement learning to interact with graphical user interfaces (GUIs).

Introducing Citations on the Anthropic API - Anthropic [Link]

Happy New Year (/≧▽≦)/

There is a lot to talk about regarding Reinforcement Learning from Human Feedback (RLHF). How about starting with Reinforcement Learning (RL) basics?

Warning: Extremely long article ahead :)

Overview

The process of training a model using reinforcement learning from human feedback (RLHF) involves three key steps, as outlined in the paper titled “Training language models to follow instructions with human feedback” by OpenAI.

instructGPT_overview_RLHF

Reinforcement Learning

Introduction

Reinforcement Learning (RL) is a machine learning approach where an agent learns to make decisions by interacting with an environment to maximize cumulative rewards.

The agent is the decision-maker or learner in the RL framework. It performs actions in the environment and learns from the feedback it receives. The environment represents everything external to the agent that it interacts with. It provides feedback in response to the agent’s actions. The state is a representation of the current situation of the environment as perceived by the agent. An action is a decision or move taken by the agent at each step based on its policy (a mapping from states to actions). The reward is a scalar feedback signal provided by the environment to indicate how good or bad an action was in achieving the agent’s goal.

An RL problem is typically formalized as a Markov Decision Process (MDP), which includes:

  • States (\(S\)): The set of all possible situations in which the agent can find itself.
  • Actions (\(A\)): The set of all possible moves or decisions available to the agent.
  • Transition Dynamics (\(P(s'|s,a)\)): The probability of transitioning to a new state \(s'\) given the current state \(s\) and action \(a\).
  • Rewards (\(R(s,a)\)): The immediate feedback received after taking action \(a\) in state \(s\).
  • Policy (\(\pi(a|s)\)): A strategy that defines how the agent selects actions based on states.

The goal of RL is to find an optimal policy \(\pi^*\) that maximizes cumulative rewards (also called return). This involves balancing short-term rewards with long-term planning using trial-and-error interactions with the environment.

agent_env_rewards_RL_intro
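To make these components concrete, a toy MDP can be written down directly as Python data structures; the states, actions, transition probabilities, and rewards below are invented purely for illustration.

# A hypothetical two-state MDP written as plain Python data structures.
states = ["healthy", "sick"]
actions = ["exercise", "rest"]

# Transition dynamics P(s' | s, a): dict keyed by (state, action) -> {next_state: probability}
P = {
    ("healthy", "exercise"): {"healthy": 0.9, "sick": 0.1},
    ("healthy", "rest"):     {"healthy": 0.7, "sick": 0.3},
    ("sick", "exercise"):    {"healthy": 0.5, "sick": 0.5},
    ("sick", "rest"):        {"healthy": 0.3, "sick": 0.7},
}

# Immediate rewards R(s, a)
R = {
    ("healthy", "exercise"): 1.0,
    ("healthy", "rest"):     0.5,
    ("sick", "exercise"):    -0.5,
    ("sick", "rest"):        0.0,
}

# A stochastic policy pi(a | s): a probability distribution over actions for each state
pi = {
    "healthy": {"exercise": 0.8, "rest": 0.2},
    "sick":    {"exercise": 0.4, "rest": 0.6},
}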

The challenges arising from the nature of the environment and its dynamics are non-stationary environments, stochastic rewards, and random states:

  • In non-stationary environments, the dynamics of the environment (e.g., transition probabilities or reward functions) change over time. This forces RL agents to continuously adapt their policies, which can lead to a drop in performance during the readjustment phase and forgetting previously learned policies.
  • Stochastic rewards occur when the reward function is probabilistic rather than deterministic. This introduces noise into the feedback signal, making it harder for the agent to discern which actions truly lead to higher rewards.
  • Random states refer to situations where the agent’s observations are noisy or partially observable, making it harder to infer the true state of the environment. Such randomness complicates policy learning because the agent may need to rely on memory or belief states (e.g., Partially Observable Markov Decision Processes, POMDPs) to make decisions. It increases the dimensionality and complexity of the state space.

The challenges related to algorithmic design and computational feasibility are:

  • RL algorithms require a significant amount of interaction with the environment to learn effectively, making them data-intensive. Many RL algorithms, particularly model-free methods like policy gradient techniques, require a large number of samples to converge.
  • RL agents face the exploration-exploitation dilemma, where they need to balance trying new actions to discover potentially better rewards (Exploration) and using known actions that yield high rewards (Exploitation).
  • Many RL problems involve enormous state and action spaces, such as games like Go or real-world robotics tasks. The exponential growth of possible states and actions makes it computationally challenging for RL algorithms to find optimal solutions.
  • Poorly designed rewards can lead to unintended behaviors (e.g., an agent exploiting loopholes in the reward structure). Sparse or delayed rewards make it difficult for the agent to associate its actions with outcomes.
  • RL agents often struggle to generalize learned behaviors across different tasks or environments. Agents trained in specific simulations (e.g., driving simulators) may fail to perform well in real-world scenarios due to differences in dynamics, noise, or variability.
  • RL algorithms are highly sensitive to hyperparameter choices (e.g., learning rate, discount factor). Poor tuning can lead to slow convergence or failure to converge at all, making training unpredictable and requiring significant expertise.
  • RL agents often use complex models (e.g., deep neural networks), making their decisions difficult to interpret. This lack of transparency is problematic in safety-critical applications like healthcare or autonomous driving, where understanding the reasoning behind decisions is essential.

Multi-Armed Bandit (MAB)

The multi-armed bandit (MAB) problem is a classic RL problem that exemplifies the exploration-exploitation tradeoff. It provides a simplified framework for decision-making under uncertainty.

Here is a simple scenario to help understand the Multi-Armed Bandit (MAB) problem. Imagine a doctor has three types of prescription drugs to treat a particular disease and \(N\) patients to treat. At the beginning, the doctor has no knowledge about which drug is the most effective. The goal is to identify the best action—the drug that can cure the highest number of patients.

To achieve this goal, we can define action values as:

\[ Q_t(a) = E[R_t \mid A_t = a], \]

where: - \(R_t\) is a random variable representing whether a patient is cured (reward), - \(a\) is an action, which in this case corresponds to selecting a specific type of drug for the patients.

The best action is the one that maximizes the expected reward:

\[ a^* = \arg\max_a Q(a). \]

It’s important to note that an expectation, \(E[x]\), is typically calculated as:

\[ E[x] = \sum x p(x), \]

where \(p(x)\) represents the probability distribution of \(x\). However, in real-world scenarios where \(p(x)\) is unknown and data is limited, the expectation can be approximated using sample averages:

\[ E[x] \approx \frac{\sum x}{N}, \]

where \(N\) is the total number of observations of \(x\). This approximation process is known as Monte Carlo estimation.
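A minimal sketch of this sample-average (Monte Carlo) approximation, using an invented Bernoulli reward with cure probability 0.7 standing in for the unknown distribution:

import random

random.seed(0)

p_cure = 0.7                      # "true" cure probability, assumed purely for illustration
samples = [1 if random.random() < p_cure else 0 for _ in range(1000)]

# Sample-average (Monte Carlo) estimate of E[x] = sum(x) / N
estimate = sum(samples) / len(samples)
print(estimate)                   # approaches 0.7 as N grows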

The action value \(Q_t(a)\) can be estimated by Sample-Average Method using the following formula:

\[ Q_t(a) = \frac{\text{Total rewards received when action } a \text{ was taken before time } t}{\text{Number of times action } a \text{ was taken before time } t}. \]

Mathematically, this can be expressed as:

\[ Q_t(a) = \frac{\sum_{i=1}^{t-1} R_i \cdot I_{A_i = a}}{\sum_{i=1}^{t-1} I_{A_i = a}}, \]

where: - \(R_i\) is the reward received at time step \(i\), - \(I_{A_i = a}\) is an indicator function that equals 1 if action \(a\) was selected at time step \(i\) and 0 otherwise.

The best action can then be selected by the greedy approach: \[ a^* = \arg\max_a Q_t(a). \] In our case, as demonstrated in the diagram below, after 4 rounds \(Q_t(1)=0.5, Q_t(2)=0.75, Q_t(3)=0.25\), so the best action determined by \(\arg \max_a Q_t(a)\) is Action \(A_2\) (\(a=2\)).

However, this approach has drawbacks, such as small sample sizes and a non-stationary environment (e.g., patients are in different conditions). An intuitive alternative is to give Action \(A_1\) and Action \(A_3\) more opportunities. This is the Exploration-Exploitation Tradeoff: balancing trying new actions to discover potentially better rewards (exploration) against using known actions that yield high rewards (exploitation).

A better approach is called Epsilon-Greedy Strategy which is a simple yet effective method for addressing the exploration-exploitation tradeoff in RL. It involves:

  1. Exploration: With a probability of \(\epsilon\), the agent chooses a random action, allowing it to explore the environment and gather new information.
  2. Exploitation: With a probability of \(1-\epsilon\), the agent selects the action that has the highest estimated reward (greedy action) based on its current knowledge.

In our case, let \(\epsilon = 20\%\): actions \(A_1\) and \(A_3\) are each given \(10\%\), and \(A_2\) is given \(80\%\). The next (5th) action is decided by randomly sampling \(A_1,A_2,A_3\) with probabilities \(10\%, 80\%, 10\%\). If the sampled action is \(A_1\) and the reward is \(1\), then its action value is updated to \(Q_t'(1) = (0+1+1+0+1)/5 =0.6\).

doctor_treatment_example

The epsilon-greedy approach can be written as a short, runnable Python routine (a direct translation of the pseudocode; the reward function is a stand-in Bernoulli reward with made-up cure probabilities):

import random

random.seed(0)

K = 3                  # number of arms (e.g., the three drugs above)
num_turns = 1000
epsilon = 0.1

def reward(a):
    # Stand-in Bernoulli reward; the cure probabilities are made up for illustration
    p = [0.4, 0.75, 0.25][a]
    return 1.0 if random.random() < p else 0.0

# Initialize
Q = [0.0] * K          # estimated value of each arm
N = [0] * K            # number of times each arm has been pulled

# Epsilon-greedy algorithm
for t in range(num_turns):
    if random.random() < epsilon:
        A = random.randrange(K)                    # exploration: pick a random arm
    else:
        A = max(range(K), key=lambda a: Q[a])      # exploitation: arm with the highest Q(a)

    # Pull the selected arm and observe reward R
    R = reward(A)

    # Update the estimates for the selected arm (incremental update formula)
    N[A] += 1
    Q[A] += (1 / N[A]) * (R - Q[A])

print(Q, N)

Note that there is a math trick behind the incremental update Q(A) <- Q(A) + (1 / N(A)) * (R - Q(A)): \[ \begin{equation} \begin{aligned} Q_{n+1} &= {1\over n}\sum^n_{i=1}R_i \space \text{ (the running average after } n \text{ pulls)}\\ &= {1\over n}(R_n + \sum^{n-1}_{i=1}R_i)\\ &= {1\over n}(R_n + (n-1){1\over n-1}\sum^{n-1}_{i=1}R_i)\\ &= {1\over n} (R_n + (n-1)Q_n)\\ &= {1\over n}(R_n+ n \times Q_n - Q_n)\\ &= Q_n + {1\over n} (R_n - Q_n) \end{aligned} \end{equation} \] The larger the \(\epsilon\), the more opportunities non-greedy actions get to be explored; in the comparison below (from Sutton and Barto), the \(\epsilon\)-greedy methods reach a higher long-run average reward than the purely greedy method.

epsilon_greedy_method

(Source: Reinforcement Learning by Sutton and Barto, Chapter 2)

Agent

Long Term Goal

The goal of the agent is the long-term reward \[ G_t = R_{t+1}+R_{t+2}+R_{t+3}+... \] so the objective is the expected return \(E[G_t]\): \[ E[G_t] = E[R_{t+1}+R_{t+2}+R_{t+3}+...] \] There are different types of agent tasks:

  • Episodic Task: Episodic tasks consist of distinct episodes, where each episode has a clear beginning and end. At the end of an episode, the environment resets to a starting state.

  • Continuing Task: Continuing tasks involve ongoing interactions with no natural endpoint. The agent interacts with the environment indefinitely. A key challenge in continuing tasks is that the cumulative reward (\(E[G_t]\)) can become unbounded as time progresses. This makes it difficult to optimize an unbounded objective directly.

    To make the objective bounded, a discount factor (\(\gamma\)) is introduced. The discount factor ensures that more weight is given to immediate rewards while gradually reducing the importance of future rewards. This approach stabilizes the optimization process. \(\gamma \in (0,1)\) is a scalar that determines how much future rewards are discounted compared to immediate rewards. In practice, \(\gamma\) is often set close to 1 (e.g., 0.95 or 0.98), allowing the agent to consider long-term rewards while still prioritizing recent ones.

    The following derivation demonstrates how discounting makes the objective bounded. \[ \begin{equation} \begin{aligned} G_t &= R_{t+1}+\gamma R_{t+2}+\gamma^2R_{t+3}+\cdots+\gamma^{k} R_{t+k+1}+ \cdots\\ &=\sum^{\infty}_{k=0}\gamma^k R_{t+k+1}\\ &\leq \sum^{\infty}_{k=0} \gamma^k R_{max} \space \text{ , where }R_{max} \text{ is an upper bound on the rewards}\\ &=R_{max} \sum^{\infty}_{k=0} \gamma^k \\ &= R_{max} {1\over 1-\gamma} < \infty \end{aligned} \end{equation} \] The value of \(\gamma\) influences how far-sighted or short-sighted the agent is. If \(\gamma\) is large, the agent is far-sighted, meaning it prioritizes long-term rewards over immediate ones. If \(\gamma\) is small, the agent is short-sighted, focusing heavily on immediate rewards while ignoring distant future outcomes.

    The cumulative reward can also be written recursively, showing how the current cumulative reward is determined by the next step's reward and cumulative reward (a small numeric sketch follows below): \[ \begin{equation} \begin{aligned} G_t &= R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+...\\ &=R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + ...)\\ &=R_{t+1} + \gamma G_{t+1} \end{aligned} \end{equation} \]
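As a quick numeric check of the two derivations above, the short sketch below computes a discounted return backwards via \(G_t = R_{t+1} + \gamma G_{t+1}\) and compares it against the \(R_{max}/(1-\gamma)\) bound; the rewards are made up for illustration.

gamma = 0.95
rewards = [1.0, 0.5, 2.0, 1.5, 0.0, 1.0]   # hypothetical rewards R_{t+1}, R_{t+2}, ...

# Compute G_t backwards using G_t = R_{t+1} + gamma * G_{t+1}
G = 0.0
for r in reversed(rewards):
    G = r + gamma * G
print(G)

# For an infinite reward stream bounded by R_max, the return is bounded by R_max / (1 - gamma)
R_max = max(rewards)
print(R_max / (1 - gamma))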

Policy

The outcome of RL is a policy \(\pi\), which is a mapping from an input state \(s\) to an output action \(a\).

Deterministic Policy: A deterministic policy maps each state to a single, specific action. In other words, given the same state, the agent will always select the same action. Deterministic policies may fail in environments with high uncertainty or noise, as they do not allow for the exploration of alternative actions. \[ \pi(s)=a \] Stochastic Policy: A stochastic policy maps each state to a probability distribution over possible actions. For a given state, the agent selects an action based on this distribution, meaning different actions can be chosen with varying probabilities. It requires learning and maintaining a probability distribution over actions, which can be computationally expensive. \[ \begin{aligned} \pi(a|s) &= P(a|s) \geq 0, \\ &\text{where }P(a|s) \text{ is the probability of selecting action } a \text{ in state } s.\\ \sum_{a\in A(s)}\pi(a|s)&=1 \end{aligned} \]

Bellman Equations*

State-Value Functions and Action-Value Functions

State-Value Functions: denoted as \(V_\pi(s)\), represents the expected cumulative future rewards starting from a particular state \(s\) and following a specific policy \(\pi\) thereafter. It measures the "goodness" of being in the state \(s\), considering the long-term rewards achievable from that state under policy \(\pi\). It does not depend on specific actions but rather on the overall behavior dictated by the policy. \[ \begin{aligned} V_\pi(s)= E_\pi[G_t|S_t=s], \\ \space G_t=\sum^\infty_{k=0}\gamma^kR_{t+k+1} \end{aligned} \] Action-Value Functions: denoted as \(Q_\pi(s,a)\), represents the expected cumulative future rewards starting from the state \(s\), taking action \(a\), and then following a specific policy \(\pi\) thereafter. It measures the “goodness” of taking action \(a\) in state \(s\), considering both immediate rewards and future rewards achievable under the policy \(\pi\). It provides more granular information than \(V(s)\), as it evaluates specific actions rather than just states. \[ Q_\pi(s,a) = E_\pi[G_t|S_t=s, A_t=a] \] The relationship between the state-value function and the action-value function can be expressed using the following formula: \[ V_\pi(s) =\sum_{a \in A} \pi(a|s) Q_\pi (s,a) \] This equation shows that the value of a state under policy \(\pi\), \(V_\pi(s)\), is the expected value of the action-value function \(Q_\pi(s,a)\), weighted by the probability of taking each action \(a\) in state \(s\) according to policy \(\pi(a|s)\).
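A one-line check of the relation \(V_\pi(s) = \sum_{a} \pi(a|s) Q_\pi(s,a)\), using hypothetical numbers for a single state:

# Hypothetical policy probabilities and action values for one state s
pi_s = {"a1": 0.2, "a2": 0.5, "a3": 0.3}          # pi(a|s)
Q_s  = {"a1": 1.0, "a2": 2.0, "a3": 0.5}          # Q_pi(s, a)

# V_pi(s) = sum_a pi(a|s) * Q_pi(s, a)
V_s = sum(pi_s[a] * Q_s[a] for a in pi_s)
print(V_s)   # 0.2*1.0 + 0.5*2.0 + 0.3*0.5 = 1.35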

State-Value Bellman Equation and Action-Value Bellman Equation

State-Value Bellman Equation:

bellman_equation_huggingface

(Source: The Bellman Equation: simplify our value estimation)

The State-Value Bellman Equation can be written in a recursive form. \[ \begin{aligned} V_\pi(s) &= E_\pi(G_t|S_t=s)\\ &= E_\pi(R_{t+1}+\gamma G_{t+1}|S_t=s)\\ &=\sum_a \pi(a|s) \sum_{s'}\sum_r p(s',r|s,a) [r + \gamma V_\pi(s')] \end{aligned} \]

The tree structure below, as an example, can help in understanding the recursive property of the State-Value Bellman Equation. Note that an action \(a\) does not necessarily lead to a specific state \(s'\); it can result in multiple possible states, each with a certain probability. These probabilities are determined by the environment, which we typically do not have direct access to.

bellman_equation_tree

Action-Value Bellman Equation:

action_value_equation_huggingface

(Source: Two types of value-based methods)

The Action-Value Bellman Equation can be written in a recursive form as well: \[ \begin{aligned} Q_\pi(s,a)&= E_\pi[G_t|S_t=s, A_t=a]\\ &=\sum_{s'}\sum_{r} P(s',r|s,a)[r+\gamma\sum_{a'}\pi(a'|s')Q_\pi(s',a')] \end{aligned} \] The tree structure below as an example can help understand the recursive property of the Action-Value Bellman Equation.

action_value_bellman_equation_tree

The main limitations of Bellman Equation:

  • In real-world problems, the number of states can be extremely large, requiring a separate Bellman equation for each state. This results in a system of simultaneous nonlinear equations due to the presence of the max operator, which can be difficult to solve.
  • Solving Bellman equations often requires iterative methods and can demand significant computational resources. This is particularly true when seeking high-precision approximations over many iterations.
  • In applications like the Bellman-Ford algorithm for finding shortest paths, the presence of negative cycles can pose a problem. If a cycle has a negative total sum of edges, it can lead to an undefined shortest path since iterating through the cycle can indefinitely reduce the path length.
  • The Bellman equation is inherently nonlinear because it involves maximizing over possible actions. This nonlinearity can complicate finding solutions, especially when dealing with large state spaces or complex reward structures.

Policy Iteration

In RL, an optimal policy is a policy that maximizes the expected cumulative reward (or return) for an agent across all states in a Markov Decision Process (MDP). This means that the state-value function \(v_{\pi}(s)\) or the action-value function \(q_{\pi}(s,a)\) under the optimal policy is greater than or equal to that of any other policy for all states and actions. \[ v_{\pi^*}(s) \geq v_{\pi}(s), \quad \forall s \in S, \ \forall \pi \] In finite MDPs, at least one optimal policy always exists. However, there may be multiple optimal policies that achieve the same maximum expected return.

Policy Iteration is a dynamic programming algorithm used in RL to compute the optimal policy \(\pi^*\) for a Markov Decision Process (MDP). It alternates between two main steps: policy evaluation and policy improvement, iteratively refining the policy until convergence.

The full policy iteration algorithm pseudocode is in the figure below (Source: Sutton & Barto summary chap 04 - Dynamic Programming):

policy_iteration_pseudocode

Here is a detailed explanation of this policy iteration's algorithm pseudocode.

Repeat steps until convergence:

  1. Policy evaluation: keep current policy \(\pi\) fixed, find value function \(V(\cdot)\).

    Iterate Bellman update until values converge:

    \[ V(s) \leftarrow \sum_{s',r}p(s',r|s,\pi(s))[r+\gamma V(s')] \] The Bellman operator computes future rewards but discounts them by multiplying with \(\gamma\). This ensures that differences in value functions become smaller with each iteration. In other words, Bellman shrinks distances. To see it mathematically,

    The Bellman operator for the state-value function under a fixed policy \(\pi\) is defined as \[ V^{\pi}(s) = r(s, \pi(s)) + \gamma \sum_{s'} P(s'|s,\pi(s))V(s') \] This operator updates the value function by combining the immediate reward and the discounted future rewards.

    We compute the difference after applying the operator to two value functions \(V_1\) and \(V_2\): \[ \Big|V^{\pi}_1(s)-V^{\pi}_2(s)\Big| = \Big|r(s,\pi(s))+\gamma\sum_{s'} P(s'|s,\pi(s))V_1(s')-\Big[r(s,\pi(s))+\gamma\sum_{s'} P(s'|s,\pi(s))V_2(s')\Big]\Big| \] Canceling out the immediate rewards, we get: \[ \Big|V^{\pi}_1(s)-V^{\pi}_2(s)\Big| = \gamma \Big|\sum_{s'} P(s'|s,\pi(s))\Big(V_1(s')-V_2(s')\Big)\Big| \] Since \(\gamma<1\), the difference between \(V_1^{\pi}(s)\) and \(V_2^{\pi}(s)\) is always smaller than the largest difference between \(V_1(s')\) and \(V_2(s')\). Because the Bellman operator shrinks distances, it is a contraction mapping and satisfies the contraction mapping property.

    In summary, Policy evaluation is a contraction mapping for a fixed policy \(\pi\). Policy evaluation converges because it applies a contraction mapping repeatedly to compute the value function for a fixed policy.

  2. Policy improvement: find the best action for \(V(\cdot)\) via one-step lookahead.

    During policy improvement, the current policy \(\pi\) is updated by selecting actions that maximize the expected return for each state \(s\).

    \(\pi(s) \leftarrow \arg \max_a \sum_{s',r}p(s',r|s,a)[r+\gamma V(s')]\)

    The intuition behind: \(V^{\pi}(s)\) measures how good it is to start from the state \(s\) and follow policy \(\pi\). By improving the actions selected by the policy, we ensure that the agent transitions into states with higher expected cumulative rewards. This iterative process ensures that each new policy improves upon or equals the previous one in terms of total expected rewards.
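Putting the two steps together, here is a compact sketch of policy iteration on a tiny, made-up two-state MDP (the transition probabilities and rewards are invented for illustration, and the policy is kept deterministic for simplicity):

# Policy iteration sketch on a toy 2-state MDP with a deterministic policy.
states = ["s0", "s1"]
actions = ["a0", "a1"]
gamma = 0.9

# P[(s, a)] = {s': prob}, R[(s, a)] = expected immediate reward (illustrative numbers)
P = {("s0", "a0"): {"s0": 0.8, "s1": 0.2}, ("s0", "a1"): {"s0": 0.2, "s1": 0.8},
     ("s1", "a0"): {"s0": 0.6, "s1": 0.4}, ("s1", "a1"): {"s0": 0.1, "s1": 0.9}}
R = {("s0", "a0"): 1.0, ("s0", "a1"): 0.0, ("s1", "a0"): 0.5, ("s1", "a1"): 2.0}

V = {s: 0.0 for s in states}
pi = {s: "a0" for s in states}    # arbitrary initial policy

stable = False
while not stable:
    # 1) Policy evaluation: iterate the Bellman backup for the fixed policy pi until convergence
    while True:
        delta = 0.0
        for s in states:
            v_old = V[s]
            a = pi[s]
            V[s] = R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
            delta = max(delta, abs(v_old - V[s]))
        if delta < 1e-8:
            break
    # 2) Policy improvement: greedy one-step lookahead with respect to V
    stable = True
    for s in states:
        q = {a: R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
             for a in actions}
        best = max(q, key=q.get)
        if best != pi[s]:
            pi[s] = best
            stable = False

print(pi, V)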

Overall, the idea of Policy Iteration can be demonstrated in the diagram below. The evaluation process usually takes a long time while the improvement process is usually fast.

policy_iteration_eval_impro_demo

The Generalized Policy Iteration (GPI) method can speed up the policy evaluation process. In GPI, policy evaluation process can be approximate or partial (e.g., only a few iterations instead of full convergence). Policy improvement can also be done incrementally or partially. GPI speeds up the process of finding an optimal policy by relaxing the strict requirements of full convergence in policy evaluation.

generalized_policy_iteration

Monte Carlo Method

Monte Carlo methods are specifically designed for episodic tasks, where each episode eventually terminates, regardless of the actions taken by the agent. An episode refers to a complete sequence of interactions between an agent and the environment, starting from an initial state and ending in a terminal state.

Monte Carlo for Policy Evaluation

Value function and policy updates occur only after an episode is complete. By waiting for the end of an episode, Monte Carlo methods ensure that all rewards following a state are included in the return calculation. Monte Carlo methods learn purely from sampled experience (episodes), without requiring knowledge of transition probabilities or reward models. Episodes allow Monte Carlo methods to handle stochastic environments by averaging returns across multiple episodes.

Here is an example of calculating the state-value functions and action-value functions by the Monte Carlo method once 2 episodes are completed. Given states \(S=[A,B,C,D,E]\), actions \(A=[1,2,3]\), and two episodes \(E_1,E_2\) (Note: \(A:(1,0.4)\) means state \(A\), action \(1\), and reward \(0.4\)): \[ \begin{aligned} &E_1 = \{A:(1,0.4), B:(2,0.5), A:(2,0.6), C:(2,0.1), B:(3,0.8), E:()\}\\ &E_2 = \{B:(2,0.5), A:(1,0.6), C:(2,0.3), A:(1,0.3), C:(2,0.8), E:()\} \end{aligned} \]

  1. State-Value Functions Calculation

    We can calculate \(V(A),V(B),V(C),V(D),V(E)\).

    e.g., there are 4 sequences starting from state \(A\), so the state-value function is: \[ \begin{aligned} V(A) &= [(0.4+\gamma 0.5 + \gamma^2 0.6 + \gamma^3 0.1 + \gamma^4 0.8)\\ &+(0.6+\gamma 0.1 + \gamma^2 0.8)\\ &+(0.6+\gamma 0.3 + \gamma^2 0.3 + \gamma^3 0.8)\\ &+(0.3+\gamma 0.8)] / 4 \end{aligned} \]

  2. Action-Value Functions Calculation

    We can calculate \(Q(A,1),Q(B,2),\cdots\).

    e.g., there are three sequences starting from state \(A\) with action \(1\), so the action-value function is \[ \begin{aligned} Q(A,1) &= [(0.4+\gamma 0.5 + \gamma^2 0.6 + \gamma^3 0.1+ \gamma^4 0.8)\\ &+(0.6+\gamma 0.3 + \gamma^2 0.3 + \gamma^3 0.8)\\ &+(0.3+\gamma 0.8)]/3 \end{aligned} \]
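As a sanity check, the hand calculation above can be reproduced in a few lines. The episodes are the \(E_1\) and \(E_2\) given above; since the text keeps \(\gamma\) symbolic, the value 0.9 below is just a placeholder.

gamma = 0.9   # placeholder; the text keeps gamma symbolic

# Episodes as lists of (state, action, reward); the terminal state E is omitted
E1 = [("A", 1, 0.4), ("B", 2, 0.5), ("A", 2, 0.6), ("C", 2, 0.1), ("B", 3, 0.8)]
E2 = [("B", 2, 0.5), ("A", 1, 0.6), ("C", 2, 0.3), ("A", 1, 0.3), ("C", 2, 0.8)]

def return_from(episode, start):
    # Discounted return of the tail of an episode starting at index `start`
    return sum((gamma ** k) * r for k, (_, _, r) in enumerate(episode[start:]))

# Every-visit Monte Carlo estimate of V(A): average return over all visits to state A
visits_A = [return_from(ep, i) for ep in (E1, E2)
            for i, (s, _, _) in enumerate(ep) if s == "A"]
V_A = sum(visits_A) / len(visits_A)

# Every-visit Monte Carlo estimate of Q(A, 1): average return over visits where (s, a) == (A, 1)
visits_A1 = [return_from(ep, i) for ep in (E1, E2)
             for i, (s, a, _) in enumerate(ep) if (s, a) == ("A", 1)]
Q_A1 = sum(visits_A1) / len(visits_A1)

print(V_A, Q_A1)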

The pseudocode for the above Monte Carlo for Policy Evaluation is as follows (Source: Reinforcement Learning by Sutton and Barto, Chapter 5):

montecarlo_statevalue

As part of the algorithm, it loops for each step of episode from the end \(T-1\) to the beginning of the episode. This allows for dynamic programming where some values can be stored and do not need to be re-calculated (see a simple demonstration below).

montecarlo_algo_dynamic_program

Monte Carlo for Policy Improvement

\[ \pi_{k+1}(s)=\arg \max_a q_{\pi_k}(s,a) \]

Here is an example of updating the policy using the action-value function. Given states \(S=[A,B,C,D,E]\), actions \(A=[1,2,3]\), \(\gamma=0.5\), and an episode \(E\) (Note: \(A:(1,0.7)\) means state \(A\), action \(1\), and reward \(0.7\)): \[ E = \{A:(1,0.7), B:(1,0.4), A:(3,1.5), C:(2,0.1), B:(3,0.7), A:(1,0.3)\}\ \] Working backwards through the episode (dynamic programming), the cumulative rewards are \(G_5(A,1)=0.3\), \(G_4(B,3)=0.7+0.5\times 0.3=0.85\), \(G_3(C,2)=0.1+0.5\times 0.85\approx 0.52\), \(G_2(A,3)=1.5+0.52\times 0.5\approx 1.76\), \(G_1(B,1)=0.4+1.76\times 0.5\approx 1.28\), \(G_0(A,1)=0.7+1.28\times 0.5\approx 1.34\) (the code sketch after the diagram below reproduces these numbers). We can maintain three lists to make the algorithm work:

  • Return matrix \(Returns(S,A)\), dimension \((S, A)\): It stores cumulative reward values \(G(S=s,A=a)\). One cell can store multiple values as the number of episodes increases.
  • \(Q(S,A)\) matrix: It's initialized as random numbers at the beginning. Updated whenever the return matrix is updated. \(Q(S,A)\) is the average value of the corresponding \(Returns(S,A)\).
  • \(\pi(s)\) list: It is updated by giving probability \(1-\epsilon\) to the action with the highest \(Q\) value for each state, according to the updated \(Q(S,A)\) matrix, with the remaining probability reserved for exploration. This implements the epsilon-greedy algorithm.

The final updating result of the above example is in the diagram below.

mc_action_value_update_res
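The backward computation of the returns for the episode above, with \(\gamma = 0.5\), reproduces the numbers in the example up to rounding; the Returns and \(Q\) bookkeeping described in the list is sketched alongside.

from collections import defaultdict

gamma = 0.5
episode = [("A", 1, 0.7), ("B", 1, 0.4), ("A", 3, 1.5), ("C", 2, 0.1), ("B", 3, 0.7), ("A", 1, 0.3)]

Returns = defaultdict(list)   # Returns(S, A): all observed returns for each state-action pair
Q = {}                        # Q(S, A): average of the corresponding Returns(S, A)

# Loop backwards through the episode: G_t = R_{t+1} + gamma * G_{t+1}
G = 0.0
for s, a, r in reversed(episode):
    G = r + gamma * G
    Returns[(s, a)].append(G)
    Q[(s, a)] = sum(Returns[(s, a)]) / len(Returns[(s, a)])
    print(s, a, round(G, 4))   # prints 0.3, 0.85, 0.525, 1.7625, 1.28125, 1.340625

# Greedy improvement per state (epsilon-greedy would keep some probability for the other actions)
best_action = {}
for (s, a), q in Q.items():
    if s not in best_action or q > Q[(s, best_action[s])]:
        best_action[s] = a
print(best_action)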

The pseudocode for the above Monte Carlo for Policy Improvement with action-value function is as follows (Source: Reinforcement Learning by Sutton and Barto, Chapter 5):

monte_carlo_action_value
mc_control_epsilon_pi

Main Limitation

  • Policy updates occur only after an episode is completed, which can slow down learning compared to methods like Temporal Difference (TD) learning that update incrementally after each step.
  • MC methods do not use bootstrapping (i.e., they do not update value estimates based on other estimates). While this avoids bias, it also means MC methods cannot leverage intermediate value estimates, leading to slower convergence.

Temporal Difference Learning

TD learning focuses on estimating the value function of a given policy by updating value estimates based on the difference between successive predictions, rather than waiting for an entire episode to conclude.

Given \[ \begin{aligned} V(S_t) &\leftarrow V(S_t) + \alpha \Big[G_t - V(S_t)\Big]\\ G_t &= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... = R_{t+1} + \gamma G_{t+1}\\ V_{\pi}(s) &= E_\pi[G_t|S_t=s] = E_\pi\Big[R_{t+1} + \gamma G_{t+1}| S_t=s\Big] = E_\pi\Big[R_{t+1} + \gamma V_\pi(S_{t+1})| S_t=s\Big] \end{aligned} \] we can derive the core update of TD learning: \[ V(S_t) \leftarrow V(S_t) + \alpha [R_{t+1} + \gamma V(S_{t+1}) - V(S_t)] \] The pseudocode of TD learning is as follows. (Source: Sutton & Barto summary chap 06 - Temporal Difference Learning)

td_learning_for_v
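A minimal sketch of the TD(0) update \(V(S_t) \leftarrow V(S_t) + \alpha [R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]\) on a small, invented chain environment (the dynamics, rewards, and hyperparameters are placeholders, not part of the cited pseudocode):

import random

random.seed(0)
alpha, gamma = 0.1, 0.95

states = [0, 1, 2, 3, 4]          # state 4 is terminal in this hypothetical chain
V = {s: 0.0 for s in states}

def step(s):
    # Hypothetical dynamics: move right with prob 0.7, left with prob 0.3; reward 1 on reaching 4
    s2 = min(s + 1, 4) if random.random() < 0.7 else max(s - 1, 0)
    r = 1.0 if s2 == 4 else 0.0
    return s2, r

for episode in range(500):
    s = 0
    while s != 4:
        s2, r = step(s)
        # TD(0) update: V(S_t) <- V(S_t) + alpha * [R_{t+1} + gamma * V(S_{t+1}) - V(S_t)]
        V[s] += alpha * (r + gamma * V[s2] - V[s])
        s = s2

print(V)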

From table to function approximation

Main limitations of table based methods:

  • As the number of state variables increases, the size of the state space grows exponentially.
  • Storing a value table becomes impractical for large or continuous state/action spaces due to memory limitations.
  • Table-based methods treat each state (or state-action pair) independently. They cannot generalize knowledge from one part of the state space to another, requiring every state to be visited multiple times for accurate learning.
  • In large state spaces, it is unlikely that an agent will visit all relevant states/actions frequently enough to converge to an optimal policy within a reasonable time frame.
  • Table-based methods are well-suited for small problems but fail to scale to real-world applications such as autonomous driving, robotics, or complex games like Go or StarCraft.

From tabular to parametric functions:

Fit a parametric function to approximate the value function \(V(s)\), mapping states \(s\) to their corresponding value estimates: \[ f(s,\theta) \approx V(s) \] where \[ f(s,\theta)=w^Ts+b \] To optimize this approximation, we minimize the mean squared error (MSE) between the observed (Monte Carlo) value \(v_\pi(s)\) and the prediction \(\hat{v}(s,{w})\), where \(\mu(s)\) is the probability distribution over states. This loss ensures that the predicted values are as close as possible to the observed values. \[ \ell(w) = \sum_s \mu(s) \Big[v_\pi(s)-\hat{v}(s,{w})\Big]^2 \] The optimal \({w}\) that minimizes the loss can be found by batch gradient descent (\(\eta\) is the learning rate): \[ w \leftarrow w - \eta \nabla \ell(w) \] where \[ \begin{aligned} \nabla \ell(w) &= \nabla \sum_s \mu(s) \Big[v_\pi(s) - \hat{v}(s,{w})\Big]^2\\ &= \sum_s \mu(s) \nabla \Big[v_\pi(s) - \hat{v}(s,{w})\Big]^2\\ &= -2 \sum_s \mu(s) \Big[v_\pi(s) - \hat{v}(s,{w})\Big]\nabla \hat{v}(s,{w}) \end{aligned} \] From Gradient Descent to Stochastic Gradient Descent:

While batch gradient descent computes gradients over the entire dataset (all states), this can be computationally expensive for large-scale problems. Instead, stochastic gradient descent (SGD) updates the parameters incrementally using one observation at a time. Given observations \((S_1, v_\pi(S_1)), (S_2, v_\pi(S_2)), (S_3, v_\pi(S_3)), ...\), SGD performs updates as follows (\(\alpha\) is learning rate). \[ \begin{aligned} {w}_2 &= {w}_1 + \alpha \Big[v_\pi(S_1) - \hat{v}(S_1,{w_1})\Big] \nabla \hat{v}(S_1,{w}_1)\\ {w}_3 &= {w}_2 + \alpha \Big[v_\pi(S_2) - \hat{v}(S_2,{w_2})\Big] \nabla \hat{v}(S_2,{w}_2)\\ &\cdots \end{aligned} \] The algorithm of Gradient Monte Carlo is as follows. (Source: Reinforcement Learning by Sutton and Barto)

gradient_montecarlo_v
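A sketch of the Gradient Monte Carlo idea with a linear approximation \(\hat{v}(s,w) = w^T x(s) + b\); the state features and Monte Carlo return targets below are invented, and the loop applies the SGD update \(w \leftarrow w + \alpha\big[v_\pi(S_t) - \hat{v}(S_t,w)\big]\nabla \hat{v}(S_t,w)\):

import random

random.seed(0)
alpha = 0.01

# Hypothetical 2-dimensional state features x(s) paired with Monte Carlo return targets v_pi(s)
data = [([1.0, 0.0], 2.0), ([0.5, 0.5], 1.5), ([0.0, 1.0], 1.0)] * 200
random.shuffle(data)

w = [0.0, 0.0]
b = 0.0

for x, target in data:
    v_hat = sum(wi * xi for wi, xi in zip(w, x)) + b      # linear approximation w^T x + b
    error = target - v_hat
    # SGD step: the gradient of v_hat with respect to w is x (and 1 for the bias b)
    w = [wi + alpha * error * xi for wi, xi in zip(w, x)]
    b += alpha * error

print(w, b)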

PPO Prior

Average Reward

The average reward is an alternative to the commonly used discounted reward framework. It measures the long-term average reward per time step under a given policy, making it particularly suitable for continuing tasks (non-episodic problems) where there is no natural endpoint or terminal state.

The average reward framework is particularly useful for continuing tasks, where:

  • The task does not have a natural termination point (e.g., robot navigation, server optimization, or industrial control systems).
  • The agent operates indefinitely, and evaluating its performance based on long-term behavior (rather than episodic returns) is more meaningful.

The average reward for a policy is defined as: \[ r(\pi)=\lim_{h \rightarrow \infty} {1\over h}\sum^h_{t=1} E\Big[R_t | S_0, A_{0:t-1} \sim \pi\Big] \] This simple example shows how average reward is calculated:

average_reward_policy

Differential Return

The differential return in RL is a concept that arises in the average reward framework, particularly for continuing tasks. It measures the cumulative deviation of rewards from the long-term average reward rate, \(r(\pi)\) , under a given policy \(\pi\).

Differential return aligns with the goal of maximizing long-term performance in continuing tasks by focusing on deviations from steady-state behavior. Unlike discounted returns, differential return does not rely on a discount factor \(\gamma\). This avoids biases introduced by choosing an arbitrary discount factor. It is particularly well-suited for tasks with no natural termination, such as robotics or industrial control systems.

The differential return at time step \(t\) is defined as: \[ G_t = R_{t+1} - r(\pi)+R_{t+2} - r(\pi)+R_{t+3} - r(\pi) + \cdots \] Then the value functions can be rewritten as \[ \begin{aligned} v_\pi(s) = \sum_a \pi(a|s)\sum_{r,s'} p(s',r|s,a) \Big[r-r(\pi) +v_{\pi}(s')\Big]\\ q_\pi(s,a) = \sum_{r,s'} p(s',r|s,a) \Big[r-r(\pi) +\sum_{a'}\pi(a'|s')q_{\pi}(s',a')\Big] \end{aligned} \] Algorithms like Gradient Monte Carlo can be rewritten using this differential return.

Policy Gradient and REINFORCE

Objective of policy gradient: \[ \begin{aligned} J(\theta) &= \sum_{s\in S}d^{\pi}(s)V^\pi(s) = \sum_{s\in S} d^\pi(s) \sum_{a\in A} \pi_\theta(a|s) Q^\pi(s,a)\\ d^\pi(s) &= \lim_{t\rightarrow \infty}P(s_t=s|s_0,\pi_\theta) \rightarrow \text{ converges (Markov property)}\\ &\max J(\theta): \\ &\max \sum_{s\in S} d^\pi(s) V^\pi(s) \implies \theta \leftarrow \theta + \nabla_\theta J(\theta) \end{aligned} \] The policy gradient theorem allows us to compute the gradient of the expected return with respect to the parameters of a parameterized policy, enabling optimization through gradient ascent. \[ \begin{aligned} \nabla_\theta J(\theta) &= \nabla _\theta\Big[\sum_{s\in S} d^\pi(s) \sum_{a\in A} \pi_\theta(a|s) Q^\pi(s,a)\Big]\\ &\propto \sum_{s\in S} d^\pi(s) \sum_{a\in A} Q^\pi(s,a) \nabla _\theta\pi_\theta(a|s)\\ &\implies \theta \leftarrow \theta + \eta\Big[\sum_{s\in S} d^\pi(s) \sum_{a\in A} Q^\pi(s,a) \nabla _\theta\pi_\theta(a|s) \Big] \end{aligned} \] Since Monte Carlo estimation relies on sampling, the gradient needs to be written in expectation form. The gradient derived above can therefore be rewritten as follows, supporting sampled gradient estimation. (Recall: \((\ln x)' = 1/x\)) \[ \begin{aligned} \nabla_\theta J(\theta) &\propto \sum_{s\in S} d^\pi(s) \sum_{a\in A} Q^\pi(s,a) \nabla _\theta\pi_\theta(a|s)\\ &=\sum_{s\in S}d^\pi(s) \sum_{a\in A} \pi_\theta(a|s) Q^\pi(s,a) {\nabla_{\theta} \pi_\theta(a|s)\over \pi_\theta(a|s)}\\ &=E_\pi\Big[Q^\pi(s,a)\nabla_\theta \ln\pi_\theta(a|s)\Big]\\ (&=E_{s \sim \pi, a \sim \pi_\theta(a|s)}\Big[Q^\pi(s,a)\nabla_\theta \ln\pi_\theta(a|s)\Big])\\ \end{aligned} \] Given the above theorems, the REINFORCE algorithm (Monte-Carlo Policy Gradient) is defined as follows. (Source: Sutton & Barto summary chap 13 - Policy Gradient Methods)

montecarlo_policy_gradient_pi

A differentiable policy ensures that small changes in \(\theta\) result in smooth changes in the action probabilities \(\pi(a|s,\theta)\). This is crucial for stable and efficient learning. The softmax function is commonly used to parameterize policies in discrete action spaces. The softmax function is smooth and differentiable, enabling gradient-based optimization. Softmax ensures that all actions have non-zero probabilities, promoting exploration during training.

Here the differentiable policy parameterization \(\pi(a|s,{\theta})\) can be defined by \[ \pi(a|s,\theta) = {\exp(h(s,a,\theta)) \over \sum_{b\in A} \exp(h(s,b,\theta))} \] where \(h(s,a,\theta)=w_a^Ts+b_a\) is a linear (or more generally non-linear) function representing the preference for action \(a\). The denominator normalizes the probabilities so that they sum to 1.

The log of the softmax function has a convenient derivative that simplifies gradient computation: \[ \nabla_\theta\ln \pi(a|s,\theta) = \nabla_\theta h(s,a,\theta) - \sum_b \pi(b|s,\theta) \nabla_\theta h(s,b,\theta) \]
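
To make the softmax parameterization and its log-gradient concrete, here is a small numpy sketch. It assumes a joint state-action feature vector \(\phi(s,a)\), a common simplification of the per-action linear preference above; the feature construction itself is hypothetical.

import numpy as np

def softmax_policy(theta, phi_sa):
    """pi(a|s, theta) for linear preferences h(s, a, theta) = theta^T phi(s, a).

    phi_sa: array of shape (num_actions, num_features), one feature row per action.
    """
    prefs = phi_sa @ theta
    prefs -= prefs.max()                 # subtract max for numerical stability
    exp_prefs = np.exp(prefs)
    return exp_prefs / exp_prefs.sum()

def grad_log_pi(theta, phi_sa, a):
    """grad_theta ln pi(a|s, theta) = phi(s, a) - sum_b pi(b|s, theta) phi(s, b)."""
    probs = softmax_policy(theta, phi_sa)
    return phi_sa[a] - probs @ phi_sa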

Main Limitations of REINFORCE

  • REINFORCE requires complete episodes to compute the return for each state, as it relies on Monte Carlo estimates of the expected return. This makes it unsuitable for continuing tasks (non-episodic environments) where there is no clear terminal state.
  • The gradient estimates in REINFORCE have high variance because they depend on the total return from sampled episodes, which can vary significantly due to stochasticity in the environment and policy.
  • Since REINFORCE updates the policy only after completing an episode, it does not make use of intermediate data. This results in poor sample efficiency, requiring a large number of episodes to learn effectively.
  • Unlike Temporal Difference (TD) methods, REINFORCE does not use bootstrapping (i.e., it does not update value estimates based on other estimates). It relies solely on complete returns from episodes.
  • The algorithm is highly sensitive to the choice of the learning rate. A poorly chosen learning rate can lead to divergence or extremely slow convergence.

Advantage Function

Advantage function employs the idea of differential return. In the REINFORCE algorithm, with advantage function, policy gradient can be re-written as \[ \begin{aligned} \nabla_\theta J(\theta) &=E_\pi\Big[Q^\pi(s,a)\nabla_\theta \ln\pi_\theta(a|s)\Big]\\ &=E_\pi\Big[A^\pi(s,a)\nabla_\theta \ln\pi_\theta(a|s)\Big]\\ \end{aligned} \] where \[ \begin{aligned} A^{\pi}(s,a) &= Q^\pi(s,a)-V^\pi(s)\\ V^\pi(s) &= \sum_{a\in A}\pi(a|s) Q(s,a) \end{aligned} \]

Off-Policy Policy Gradient

Off-policy policy gradient methods allow learning a target policy while using data generated from a different behavior policy. By reusing past experiences and learning from suboptimal actions, off-policy methods can significantly improve sample efficiency. Off-policy learning allows for better exploration strategies since it can incorporate data from various policies, including exploratory ones.

The policy gradient estimate is defined as \[ \nabla_\theta J(\theta) = E_\beta\Big[{\pi_\theta(a|s)\over \beta(a|s)} Q^\pi(s,a) \nabla _\theta \ln \pi_\theta (a|s)\Big] \] \(\beta(a|s)\) refers to the behavior policy that generates the data used for training. The behavior policy is not necessarily the optimal policy we want to learn (the target policy \(\pi(a|s)\)); instead, it can be any policy that provides useful exploration of the state-action space. \({\pi_\theta(a|s)\over \beta(a|s)}\) is the importance sampling weight that corrects for the mismatch between the two policies.

Trust Region Policy Optimization (TRPO)

TRPO is an advanced policy gradient method in RL designed to optimize policies while ensuring stable and reliable updates. It addresses some of the limitations of traditional policy gradient methods by incorporating a trust region constraint that limits how much the policy can change in a single update. The differences between REINFORCE and TRPO are that TRPO uses the off-policy policy gradient and the advantage function, together with a constraint on the Kullback-Leibler (KL) divergence between the old and new policies.

Recall that REINFORCE's policy gradient objective is: \[ J(\theta) = \sum_{s\in S} d^\pi(s) \sum_{a\in A} \pi_\theta(a|s) Q^\pi(s,a) \] The derivation of TRPO's policy gradient objective is: \[ \begin{aligned} J(\theta) &=\sum_{s \in S} d^{\pi}(s) \sum_{a\in A} (\pi_\theta(a|s) \hat{A}_{\theta_{old}}(s,a))\\ &=\sum_{s\in S} d^{\pi_{\theta_{old}}}(s) \sum_{a\in A} (\beta(a|s) {\pi_\theta(a|s)\over \beta(a|s)} \hat{A}_{\theta_{old}}(s,a))\\ &=E_{s\sim d^{\pi_{\theta_{old}}}, a \sim \beta}\Big[{\pi_\theta(a|s)\over \beta(a|s)} \hat{A}_{\theta_{old}}(s,a)\Big]\\ &=E_{s\sim d^{\pi_{\theta_{old}}}, a \sim \pi_{\theta_{old}}}\Big[{\pi_\theta(a|s)\over \pi_{\theta_{old}}(a|s)} \hat{A}_{\theta_{old}}(s,a)\Big] \end{aligned} \] The TRPO constrained optimization is defined as \[ \begin{aligned} &\max E_{s\sim d^{\pi_{\theta_{old}}}, a \sim \pi_{\theta_{old}}}\Big[{\pi_\theta(a|s)\over \pi_{\theta_{old}}(a|s)} \hat{A}_{\theta_{old}}(s,a)\Big]\\ & s.t. E_{s\sim d^{\pi_{\theta_{old}}}}\Big[D_{KL}\Big(\pi_{\theta_{old}}(\cdot | s)||\pi_\theta(\cdot|s)\Big)\Big] \leq \delta \end{aligned} \] One of the main limitations of TRPO is that the constrained optimization problem can be computationally intensive, especially for large state and action spaces.

PPO

PPO Objective

To address the computational expense of the constrained optimization in TRPO, researchers introduced the CLIP objective in policy gradient methods. The CLIP objective simplifies the optimization process while maintaining stable policy updates. Below are the TRPO objective and its corresponding CLIP version: \[ \begin{aligned} J^{TRPO}(\theta) &= E[r(\theta)\hat{A}_{\theta_{old}}(s,a)]\\ J^{CLIP}(\theta) &= E[\min(r(\theta)\hat{A}_{\theta_{old}}(s,a), \text{clip}(r(\theta),1-\epsilon, 1+\epsilon) \hat{A}_{\theta_{old}}(s,a))] \end{aligned} \] where \[ r(\theta) = {\pi_{\theta}(a|s) \over \pi_{\theta_{old}}(a|s)} \] and \(\text{clip}(r(\theta),1-\epsilon,1+\epsilon)\) restricts the ratio to the interval \([1-\epsilon, 1+\epsilon]\): if \(r(\theta) > 1+\epsilon\), the clipped value is \(1+\epsilon\); if \(r(\theta) < 1-\epsilon\), it is \(1-\epsilon\).

The CLIP objective ensures that the policy ratio \(r(\theta)\) does not deviate too far from 1 (the old policy), thereby limiting large updates to the policy. The term \(J^{CLIP}(\theta)\) takes the minimum of the “unclipped” objective and the “clipped” version, making the final objective a pessimistic lower bound on the unclipped objective and removing the incentive to push \(r(\theta)\) outside the clipping range.

The Proximal Policy Optimization (PPO) algorithm (Paper: Proximal Policy Optimization Algorithms) extends the CLIP objective by incorporating additional terms for value function optimization and entropy regularization. The full PPO objective is defined as: \[ J^{PPO}(\theta) = E[J^{CLIP}(\theta) - c_1(V_\theta(s)-V_{target})^2 + c_2 H(s,\pi_\theta(\cdot))] \] where

  • \(-(V_\theta(s) - V_{target})^2\) is the negative mean squared error (MSE), which we aim to maximize. It minimizes the difference between the predicted value function \(V_\theta(s)\) and the target value \(V_{target}\). The coefficient \(c_1\) controls the tradeoff between policy optimization and value function fitting.
  • \(H(s,\pi_\theta(\cdot))\) represents the entropy of the policy. Maximizing entropy encourages exploration by preventing premature convergence to deterministic policies. The coefficient \(c_2\) determines the weight of this entropy term.
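
Putting the pieces together, here is a minimal PyTorch sketch of the full PPO loss above (clipped surrogate, value MSE, and entropy bonus). The tensor shapes, coefficient values, and function signature are illustrative assumptions, not a reference implementation.

import torch

def ppo_loss(new_logps, old_logps, advantages, values, value_targets,
             entropy, clip_eps=0.2, c1=0.5, c2=0.01):
    """One PPO loss term: J_CLIP - c1 * value MSE + c2 * entropy, returned as a quantity to minimize.

    new_logps / old_logps:  log pi_theta(a|s) and log pi_theta_old(a|s), shape (B,)
    advantages:             advantage estimates A_hat(s, a), shape (B,)
    values / value_targets: critic predictions V_theta(s) and targets V_target, shape (B,)
    entropy:                per-sample policy entropy H(s, pi_theta), shape (B,)
    """
    ratio = torch.exp(new_logps - old_logps)                    # r(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    policy_obj = torch.min(unclipped, clipped).mean()           # J_CLIP

    value_loss = (values - value_targets).pow(2).mean()         # (V_theta(s) - V_target)^2
    entropy_bonus = entropy.mean()                               # H(s, pi_theta)

    # The objective maximizes J_CLIP - c1 * value_loss + c2 * entropy; we minimize its negative.
    return -(policy_obj - c1 * value_loss + c2 * entropy_bonus)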

Here is a pseudocode of PPO-Clip Algorithm (Source: OpenAI Spinning Up - Proximal Policy Optimization)

ppo_clip_algo

PPO Usage

State, Action, and Reward in the Context of LLMs

In the context of LLMs, the components of reinforcement learning are defined as follows:

  1. State: The state corresponds to the input prompt or context provided to the language model. It represents the scenario or query that requires a response.
  2. Action: The action is the output generated by the language model, i.e., the response or continuation of text based on the given state (prompt).
  3. Reward: The reward is a scalar value that quantifies how well the generated response aligns with human preferences or task objectives. It is typically derived from a reward model trained on human feedback.
  4. Policy: A policy refers to the strategy or function that maps a given state (input prompt and context) to an action (the next token or sequence of tokens to generate). The policy governs how the LLM generates responses and is optimized to maximize a reward signal, such as alignment with human preferences or task-specific objectives.

Steps of RLHF Using PPO

The RLHF process using PPO involves three main stages:

  1. Training a Reward Model: A reward model is trained to predict human preferences based on labeled data. Human annotators rank multiple responses for each prompt, and this ranking data is used to train the reward model in a supervised manner. The reward model learns to assign higher scores to responses that align better with human preferences.

  2. Fine-Tuning the LLM with PPO: After training the reward model, PPO is used to fine-tune the LLM. The steps are as follows:

    1. Initialize Policies: Start with a pre-trained LLM as both the policy model (actor) and optionally as the critic for value estimation.

      • The actor is the language model that generates responses (actions) based on input prompts (states).

        For example: Input: “Explain quantum mechanics.” Output: “Quantum mechanics is a branch of physics that studies particles at atomic and subatomic scales.”

      • The critic is typically implemented as a value function, which predicts how good a particular response (action) is in terms of achieving long-term objectives. This model predicts a scalar value for each token or sequence, representing its expected reward or usefulness.

        For example:

        Input: “Explain quantum mechanics.” → “Quantum mechanics is…” Output: A value score indicating how well this response aligns with human preferences or task objectives.

      • Both the actor and critic can be initialized from the same pre-trained LLM weights to leverage shared knowledge from pretraining. However, their roles diverge during fine-tuning: The actor focuses on generating responses. The critic focuses on evaluating those responses.

    2. Collect Rollouts: Interact with the environment by sampling prompts from a dataset. Generate responses (actions) using the current policy. Compute rewards for these responses using the trained reward model.

    3. Compute Advantage Estimates: Use rewards from the reward model and value estimates from the critic to compute advantages: \[ \hat{A}(s, a) = R_t + \gamma V(s_{t+1}) - V(s_t), \] where \(R_t\) is the reward from the reward model.

    4. Optimize Policy with PPO Objective: Optimize the policy using PPO's clipped surrogate objective: \[ J^{CLIP}(\theta) = \mathbb{E}\left[\min\left(r(\theta)\hat{A}(s, a), \text{clip}(r(\theta), 1-\epsilon, 1+\epsilon)\hat{A}(s, a)\right)\right], \] where \(r(\theta) = {\pi_\theta(a|s) \over \pi_{\theta_{old}}(a|s)}\) is the probability ratio between the new and old policies.

    5. Update Value Function: Simultaneously update the value function by minimizing mean squared error between predicted values and rewards: \[ \mathcal{L}_{\text{value}} = \mathbb{E}\left[(V_\theta(s) - R_t)^2\right]. \]

    6. Repeat: Iterate over multiple epochs until convergence, ensuring stable updates by clipping policy changes.

  3. Evaluation: Evaluate the fine-tuned LLM on unseen prompts to ensure it generates outputs aligned with human preferences. Optionally, collect additional human feedback to further refine both the reward model and policy.

The following diagram summarizes the high-level RLHF process with PPO, from preference data creation, to training a reward model, to using the reward model in an RL loop to fine-tune the LLM.

PPO_RLHF_flowchart

The following workflow chart illustrates the more detailed training process of RLHF with PPO. (Source: Secrets of RLHF in Large Language Models Part I: PPO)

RLHF_training_realworld

RLHF Training Tricks

There are practical challenges that arise during RLHF training. These challenges stem from the inherent complexities of RL, especially when applied to aligning LLMs with human preferences. Therefore, tricks are essential for addressing the practical limitations of RLHF, ensuring the training process remains efficient, stable, and aligned with human preferences while minimizing the impact of inherent challenges in RL systems. (Source: Secrets of RLHF in Large Language Models Part I: PPO)

RLHF_training_tricks

DPO

Bradley-Terry and Plackett-Luce Reward Model

The Bradley-Terry (BT) model is a probabilistic model used to compare pairwise preferences. It assumes that each item (e.g., a response or completion) has an intrinsic quality score, and the probability of one item being preferred over another depends on the relative difference in their scores.

Mathematically, the probability of option \(y_1\) being preferred over option \(y_2\) is given by: \[ P(y_1 \succ y_2|x) = {\exp(r(x,y_1)) \over \exp(r(x,y_1)) + \exp(r(x,y_2))} \] The loss of this reward model is \[ L_R(r_{\phi},D) = -E_{(x,y_w,y_l)\sim D} \Big[\log \sigma\Big(r_{\phi}(x,y_w) - r_{\phi}(x,y_l)\Big)\Big] \] However, the BT model has some limitations:

  • It assumes transitivity in preferences (if \(A>B\) and \(B>C\), then \(A >C\)), which may not always hold in real-world data.
  • It only handles pairwise comparisons and does not naturally extend to rankings involving more than two items.
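
Before moving to the Plackett-Luce generalization, here is a minimal PyTorch sketch of a training step for a Bradley-Terry reward model using the pairwise loss above. The reward_model interface (a callable returning one scalar score per prompt-response pair) is an assumption for illustration.

import torch
import torch.nn.functional as F

def bt_reward_loss(reward_model, x, y_w, y_l):
    """Bradley-Terry pairwise loss: -log sigmoid(r_phi(x, y_w) - r_phi(x, y_l)).

    reward_model(x, y) is assumed to return a score tensor of shape (B,) for a
    batch of prompts x, with y_w the preferred and y_l the dispreferred responses.
    """
    r_w = reward_model(x, y_w)      # r_phi(x, y_w)
    r_l = reward_model(x, y_l)      # r_phi(x, y_l)
    return -F.logsigmoid(r_w - r_l).mean()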

The Plackett-Luce (PL) model generalizes the Bradley-Terry model to handle rankings of multiple items, not just pairwise comparisons. It models the probability of a ranking as a sequence of choices. The first-ranked item is chosen based on its relative worth compared to all other items. The second-ranked item is chosen from the remaining items, and so on.

Mathematically, for a ranking \(i_1\succ i_2 \succ ... \succ i_J\), the probability is given by: \[ P(i_1\succ i_2 \succ ... \succ i_J ) = \prod^J_{j=1}{\alpha_{i_j}\over \sum^J_{k=j} \alpha_{i_k}} \] where \(\alpha_{i_j}\) is the worth or quality score of item \(i_j\). The denominator normalizes over all remaining items at each step.

The PL model has several advantages over the BT model:

  • Unlike BT, which only works with pairwise comparisons, PL can handle rankings involving multiple items.
  • PL can accommodate partial rankings (e.g., ranking only the top n items), making it more versatile in scenarios where full rankings are unavailable.
  • When human feedback involves ranking multiple responses rather than just picking one as better, PL captures this richer information better than BT.

DPO Objective

The main reason RLHF with PPO is hard is that it involves a lot of redundant effort: the policy model is ultimately all we need, yet PPO also requires training and maintaining a separate reward model and critic. DPO (Direct Preference Optimization) is a novel alternative to traditional RLHF for fine-tuning LLMs. It simplifies the RLHF process by eliminating the need for an explicit reward model and an RL algorithm. Instead, DPO reframes the problem of aligning LLMs with human preferences as a classification problem over human-labeled preference data.

The idea of DPO and the difference between DPO and PPO are shown in the figure below. (Source: Direct Preference Optimization: Your Language Model is Secretly a Reward Model)

DPO_idea

Recall the Bradley-Terry reward model: \[ \begin{aligned} P(y_1 \succ y_2|x) &= {\exp(r(x,y_1)) \over \exp(r(x,y_1)) + \exp(r(x,y_2))}\\ L_R(r_{\phi},D) &= -E_{(x,y_w,y_l)\sim D} \Big[\log \sigma\Big(r_{\phi}(x,y_w) - r_{\phi}(x,y_l)\Big)\Big] \end{aligned} \] The RLHF objective is defined as follows. Keep in mind that whether DPO or PPO is used, the objective is always the same: \[ \max_{\pi_\theta} E_{x \sim D, y \sim \pi_\theta(y|x)}\Big[r_{\phi}(x,y) - \beta D_{KL}\big[\pi_\theta(y|x) || \pi_{ref}(y|x)\big]\Big] \] where \(\beta D_{KL}\big[\pi_\theta(y|x) || \pi_{ref}(y|x)\big]\) is a regularization term. When applying RL to NLP, regularization is often needed; otherwise RL would explore every possible way to maximize the reward, including hidden tricks that make the model deviate from behaving like a language model.

The optimal policy \(\pi_r(y|x)\) that maximizes the objective is \[ \begin{aligned} \pi_r(y|x) &= {1\over Z(x)}\pi_{ref}(y|x)\exp\Big({1\over \beta}r(x,y)\Big)\\ Z(x) &= \sum_y \pi_{ref}(y|x) \exp\Big({1\over \beta}r(x,y)\Big) \end{aligned} \] where \(\pi_r(y|x)\) is a probability distribution.

Based on this optimal policy, we can derive the reward function for the optimal policy: \[ r(x,y)=\beta \log{\pi_r(y|x)\over \pi_{ref}(y|x)} + \beta \log Z(x) \] If we plug this reward function into the Bradley-Terry model, we obtain the probability of \(y_1\) being preferred to \(y_2\): \[ \begin{aligned} P^*(y_1 \succ y_2|x) &= {\exp(r^*(x,y_1)) \over \exp(r^*(x,y_1)) + \exp(r^*(x,y_2))}\\ &={\exp(\beta \log{\pi^*(y_1|x)\over \pi_{ref}(y_1|x)} + \beta \log Z(x)) \over \exp(\beta \log{\pi^*(y_1|x)\over \pi_{ref}(y_1|x)} + \beta \log Z(x)) + \exp(\beta \log{\pi^*(y_2|x)\over \pi_{ref}(y_2|x)} + \beta \log Z(x))}\\ &={1\over 1+\exp\Big(\beta \log {\pi^*(y_2|x) \over \pi_{ref}(y_2|x)} - \beta\log {\pi^*(y_1|x)\over \pi_{ref}(y_1|x)}\Big)}\\ &=\sigma\Big(\beta \log {\pi^*(y_1|x)\over \pi_{ref}(y_1|x)} - \beta \log {\pi^*(y_2|x)\over \pi_{ref}(y_2|x)}\Big)\\ \end{aligned} \] With this probability, we obtain DPO's objective function below, which can be optimized by Maximum Likelihood Estimation: \[ L_{DPO}(\pi_\theta; \pi_{ref}) = -E_{(x,y_w,y_l) \sim D} \Big[\log \sigma \Big(\beta \log {\pi_{\theta}(y_w|x)\over \pi_{ref}(y_w|x)} - \beta \log {\pi_{\theta}(y_l|x)\over \pi_{ref}(y_l|x)}\Big)\Big] \] Key Ideas of DPO Objective:

  • DPO's objective aims to increase the likelihood of generating preferred responses over less preferred ones. By focusing directly on preference data, DPO eliminates the need to first fit a reward model that predicts scalar rewards based on human preferences. This simplifies the training pipeline and reduces computational overhead.
  • Value functions exist to help reduce the variance of the reward model. In DPO, the value function is not involved because DPO does not rely on a traditional RL framework, such as Actor-Critic methods. Instead, DPO directly optimizes the policy using human preference data as a classification task, skipping the intermediate steps of training a reward model or estimating value functions.
  • DPO was originally designed to work with pairwise preference data; however, recent advancements and adaptations have extended its applicability to ranking preference data as well (e.g., RankDPO).

DPO paper has provided detailed steps of deriving the gradient of the DPO objective: (Source: Direct Preference Optimization: Your Language Model is Secretly a Reward Model)

dpo_gradients_derivation_paper

The simplified version of the DPO gradient, for better understanding, is written as follows. Intuitively, when \(\hat{r}_{\theta}(x, y_l)\) exceeds \(\hat{r}_{\theta}(x, y_w)\) by a large margin (i.e., the implicit reward model ranks the dispreferred response higher), the gradient takes a larger step during optimization. Conversely, when the difference is small, the objective is optimized with a smaller adjustment.

DPO_gradient_simple_version

DPO Usage

Here's how DPO is applied step by step:

1. Initial Setup and Supervised Fine-Tuning (SFT): Begin by fine-tuning a pre-trained LLM using supervised learning on a dataset that is representative of the tasks the model will perform. This step ensures the model has a strong foundation in the relevant domain, preparing it for preference-based optimization.

2. Collect Preference Data: Gather human feedback in the form of pairwise preferences or rankings. Annotators evaluate responses generated by the model and indicate which ones they prefer. Construct a dataset of prompts and corresponding preferred and less-preferred responses.

3. Iterative Rounds of DPO

  • Sampling and Annotation: In each round, sample a set of responses from the model for given prompts. Collect new preference annotations based on these samples, allowing for dynamic updates to the preference dataset. (Public preference data works as well. Off-policy and on-policy data both work).

  • Preference Optimization: Use DPO to adjust the model's outputs based on the collected preference data by minimizing the DPO loss defined above.

  • Model Update: Fine-tune the model using this loss function to increase the likelihood of generating preferred responses.

4. Evaluation and Iteration

  • Performance Assessment: After each round, evaluate the model’s performance on new prompts to ensure it aligns with human preferences. Use feedback from these evaluations to inform subsequent rounds of sampling and optimization.

  • Iterative Refinement: Continue this loop process over multiple rounds, iteratively refining the model's alignment with human preferences through continuous sampling and preference optimization.

DPO Performance

(Source: Direct Preference Optimization: Your Language Model is Secretly a Reward Model)

DPO_performance_paper

DPO Objective Pseudocode

\[ L_{DPO}(\pi_\theta; \pi_{ref}) = -E_{(x,y_w,y_l) \sim D} \Big[\log \sigma \Big(\beta \log {\pi_{\theta}(y_w|x)\over \pi_{ref}(y_w|x)} - \beta \log {\pi_{\theta}(y_l|x)\over \pi_{ref}(y_l|x)}\Big)\Big] \]

import torch.nn.functional as F

def dpo_loss(pi_logps, ref_logps, yw_idxs, yl_idxs, beta):
    """
    pi_logps: policy logprobs, shape (B,)
    ref_logps: reference model logprobs, shape (B,)
    yw_idxs: preferred completion indices in [0, B-1], shape (T,)
    yl_idxs: dispreferred completion indices in [0, B-1], shape (T,)
    beta: temperature controlling strength of KL penalty

    Each pair of (yw_idxs[i], yl_idxs[i]) represents the
    indices of a single preference pair.
    """

    pi_yw_logps, pi_yl_logps = pi_logps[yw_idxs], pi_logps[yl_idxs]
    ref_yw_logps, ref_yl_logps = ref_logps[yw_idxs], ref_logps[yl_idxs]

    pi_logratios = pi_yw_logps - pi_yl_logps
    ref_logratios = ref_yw_logps - ref_yl_logps

    losses = -F.logsigmoid(beta * (pi_logratios - ref_logratios))
    rewards = beta * (pi_logps - ref_logps).detach()

    return losses, rewards

DPO Variants

The key area of research involves developing variants of DPO and conducting theoretical analyses to understand its limitations and potential improvements. This includes exploring different loss functions or optimization strategies that can be applied within the DPO framework.

Main Difficulties in RLHF

Data Collection

In practice, collecting human feedback in the form of a preference dataset is a slow, manual process that needs to be repeated whenever alignment criteria change. Annotating preference data also becomes increasingly difficult as models become more advanced, because distinguishing between outputs becomes more nuanced and subjective.

Reward Hacking

Reward hacking is a common problem in reinforcement learning, where the agent learns to exploit the system by maximizing its reward through actions that deviate from the intended goal. In the context of RLHF, reward hacking occurs when training settles in an unintended region of the loss landscape. In this scenario, the model generates responses that achieve high reward scores, but these responses may fail to be meaningful or useful to the user.

In PPO, reward hacking occurs when the model exploits flaws or ambiguities in the reward model to achieve high rewards without genuinely aligning with human intentions. This is because PPO relies on a learned reward model to guide policy updates, and any inaccuracies or biases in this model can lead to unintended behaviors being rewarded. PPO is particularly vulnerable to reward hacking if the reward model is not robustly designed or if it fails to capture the true objectives of human feedback. The iterative nature of PPO, which involves continuous policy updates based on reward signals, can exacerbate this issue if not carefully managed.

DPO avoids explicit reward modeling by directly optimizing policy based on preference data. However, it can still encounter issues similar to reward hacking if the preference data is biased or if the optimization process leads to overfitting specific patterns in the data that do not generalize well. While DPO does not suffer from reward hacking in the traditional sense (since it lacks a separate reward model), it can still find biased solutions that exploit out-of-distribution responses or deviate from intended behavior due to distribution shifts between training and deployment contexts.

  • The article "Reward Hacking in Reinforcement Learning" by Lilian Weng discusses how reward hacking occurs when an RL agent exploits flaws or ambiguities in the reward function to achieve high rewards without genuinely learning the intended task. It highlights that in RLHF for language models, reward hacking is a critical challenge, as models might learn to exploit unit tests or mimic biases to achieve high rewards, which can hinder real-world deployment.
  • The research "Scaling Laws for Reward Model Overoptimization" explores how optimizing against reward models trained to predict human preferences can lead to overoptimization, hindering the actual objective.
    1. Impact of Policy Model Size: Holding the RM size constant, experiments showed that larger policy models exhibited similar overoptimization trends as smaller models, despite achieving higher initial gold scores. This implies that their higher performance on gold rewards does not lead to excessive optimization pressure on the RM.
    2. Relationship with RM Data Size: Data size had a notable effect on RM performance and overoptimization. Models trained on fewer than ~2,000 comparison labels showed near-chance performance, with limited improvement in gold scores. Beyond this threshold, all RMs, regardless of size, benefited from increased data, with larger RMs showing greater improvements in gold rewards compared to smaller ones.
    3. Scaling Laws for RM Parameters and Data Size: Overoptimization patterns scaled smoothly with both RM parameter count and data size. Larger RMs demonstrated better alignment with gold rewards and less susceptibility to overoptimization when trained on sufficient data, indicating improved robustness.
    4. Proxy vs. Gold Reward Trends: For small data sizes, proxy reward scores deviated significantly from gold reward scores, highlighting overoptimization risks. As data size increased, the gap between proxy and gold rewards narrowed, reducing overoptimization effects.

Note that the KL divergence term in the RLHF objective is intended to prevent the policy from deviating too much from a reference model, thereby maintaining stability during training. However, it does not fully prevent reward hacking. Reward hacking occurs when an agent exploits flaws or ambiguities in the reward model to achieve high rewards without genuinely aligning with human intentions. The KL divergence penalty does not correct these flaws in the reward model itself, meaning that if the reward model is misaligned, the agent can still find ways to exploit it. KL does not directly address whether the actions align with the true objectives or desired outcomes.

QLoRA is never as simple as a single line of code. Let's start from the scaling law...

Neural Scaling Law

In the context of LLMs, a scaling law refers to an empirical relationship that describes how the performance of a model changes as key resources—such as model size (number of parameters), dataset size, and computational power—are scaled up or down. These laws provide insights into how increasing these factors impacts model accuracy, efficiency, and generalization capabilities.

scalinglaw_3charts

Compute, dataset size, and model size are not independent of each other. Data size and model size together determine compute. The paper "Algorithmic progress in language models" came up with a rule \(C=6ND\) where \(C\) is compute, \(N\) is model size, and \(D\) is data size.
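
As a quick worked example of the \(C=6ND\) rule (the model and token counts below are made up purely for illustration):

# C = 6 * N * D, with N in parameters and D in training tokens
N = 7e9      # a hypothetical 7B-parameter model
D = 1e12     # trained on 1 trillion tokens
C = 6 * N * D
print(f"{C:.1e} FLOPs")   # ~4.2e+22 FLOPs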

According to the paper "Scaling Laws for Neural Language Models" and "Training Compute-Optimal Large Language Models", below are the key takeaways of scaling laws for LLMs:

  • Performance Improvement: Model performance often improves predictably with increases in size, data, and compute, typically following a power-law relationship.
  • Diminishing Returns: Beyond certain thresholds, the benefits of scaling diminish, meaning further increases in resources yield smaller performance gains.
  • Trade-offs: Effective scaling requires balancing resources like model parameters and training data. For example, the "Chinchilla scaling law" highlights that increasing data size can sometimes yield better results than merely increasing model size in compute-constrained settings.

These observations are critical for LLM research:

  • Guidance for Optimization: Scaling laws help researchers allocate resources efficiently and predict the outcomes of scaling efforts, guiding both model design and training strategies. For example, within fixed computational constraints and limited training duration, scaling laws provide a principled approach to determining the optimal model size that minimizes test loss.

  • Predicting model performance: As demonstrated in GPT-4 Technical Report, by fitting the scaling law to the loss of smaller models, the loss of a bigger model can be predicted accurately.

    gpt4_report_scalinglaw

The scaling law overlooks a critical practical consideration, which can lead to misconceptions. While it suggests that larger models yield better performance, in reality, the primary compute bottleneck lies in inference rather than training. Training compute is often less constrained because training time can be extended, but deployment costs are significantly higher. From a practical standpoint, a more efficient approach is to train a smaller model for an extended period, as this substantially reduces inference compute requirements.

Quantization

Background

As the scaling law suggests, when training an LLM, reducing the number of parameters is probably not an optimal way to save computational resources. Luckily, neural nets are robust to low precision, which means lowering precision won't reduce model performance much.

In the GTC March 2024 keynote, NVIDIA CEO Jensen Huang stated that NVIDIA has achieved a 1000X increase in compute power over the past 8 years, faster than Moore's law. Notice that the graph shows TFLOPS at FP8 precision in 2022 and TFLOPS at FP4 precision in 2024. This is a bit of a trick, because it is easier to achieve higher TFLOPS when the precision is lower, and it reflects the hardware industry's trend toward achieving higher TFLOPS at low precision.

nvidia_moorelaw

Data Structure - FP32

The IEEE 754 single-precision floating-point format (FP32) represents a 32-bit number in binary form. It is used for approximating real numbers. The FP32 format has three components:

  1. Sign bit (1 bit): Indicates whether the number is positive (0) or negative (1).
  2. Exponent (8 bits): Encodes the exponent, biased by 127 to allow both positive and negative exponents.
  3. Mantissa or Fraction (23 bits): Represents the significant digits of the number.

The formula for FP32 is as follows: \[ \text{Value} = (-1)^{\text{Sign}} \times (1.\text{Mantissa}) \times 2^{\text{Exponent}-127} \] Following this formula, we can calculate FP32 number. Below is an example:

fp32_example
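
To see the formula in action, here is a small Python sketch that decodes a normal FP32 bit pattern by hand and cross-checks it with struct; the example bit pattern (float32 pi) is arbitrary.

import struct

def decode_fp32(bits: int) -> float:
    """Decode a 32-bit IEEE 754 pattern as (-1)^sign * (1.mantissa) * 2^(exponent - 127).

    Only normal numbers are handled here; subnormals, infinities, and NaNs are ignored.
    """
    sign = (bits >> 31) & 0x1
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    return (-1) ** sign * (1 + mantissa / 2**23) * 2 ** (exponent - 127)

bits = 0x40490FDB                                         # bit pattern of float32 pi
print(decode_fp32(bits))                                  # 3.14159274...
print(struct.unpack('>f', bits.to_bytes(4, 'big'))[0])    # same value via struct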

FP32 provides a wider range and higher precision, making it suitable for tasks requiring numerical accuracy, such as training large-scale deep learning models. FP16, with its lower precision, is optimized for speed and memory efficiency. It is particularly effective for inference tasks or mixed-precision training when paired with FP32 for critical calculations.

However, the overflow problem of FP16 arises due to its limited range of representable values. FP16 has a maximum representable value of 65,504 (\(2^{15} \times (2 - \epsilon)\)), which is much smaller compared to FP32's maximum value of approximately \(3.4 \times 10^{38}\). When computations produce results exceeding this range, an overflow occurs, and the value is replaced by infinity (\(\pm \infty\)). Overflow in FP16 can occur during operations like matrix multiplications or summations in deep learning if the intermediate values exceed the maximum representable range. For example, scaling large tensors or performing high-magnitude computations without normalization can easily result in overflow when using FP16. Overflow leads to loss of numerical accuracy and can destabilize training processes in machine learning. It also affects applications like image processing or scientific simulations where precision and stability are critical.

There are some strategies to mitigate this overflow problem:

  • Use mixed-precision training. FP16 is used for most computations but critical operations (e.g., gradient accumulation) are performed in FP32 to prevent overflow.
  • Normalize inputs and intermediate values to keep them within the representable range of FP16.
  • Use alternative formats like BF16, which have a larger dynamic range while maintaining reduced precision.

Google Brain's BF16 uses the same number of exponent bits as FP32 (8 bits), giving it a much larger dynamic range compared to FP16. This means BF16 can represent very large and very small numbers similar to FP32, avoiding underflows and overflows that are common in FP16. Converting from FP32 to BF16 is straightforward because both formats share the same exponent size. The conversion simply involves truncating the mantissa from 23 bits to 7 bits. BF16 uses only 16 bits per value, reducing memory usage by half compared to FP32. This allows for larger batch sizes and models to fit into limited GPU or TPU memory without sacrificing as much numerical range as FP16 does.
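
Because BF16 shares FP32's exponent layout, the conversion really is just dropping the low 16 mantissa bits. Here is a small sketch using plain truncation (real hardware typically applies round-to-nearest-even instead):

import struct

def fp32_to_bf16_bits(x: float) -> int:
    """Truncate FP32 to BF16 by keeping the top 16 bits: sign, 8-bit exponent, 7-bit mantissa."""
    fp32_bits = struct.unpack('>I', struct.pack('>f', x))[0]
    return fp32_bits >> 16

def bf16_bits_to_fp32(bits: int) -> float:
    """Interpret 16 BF16 bits as FP32 by zero-padding the mantissa."""
    return struct.unpack('>f', struct.pack('>I', bits << 16))[0]

x = 3.1415926
bf16 = fp32_to_bf16_bits(x)
print(bf16_bits_to_fp32(bf16))   # ~3.140625: same dynamic range as FP32, less precision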

Recently, people have started discussing NVIDIA’s FP8 formats (E4M3 and E5M2) as alternatives to BF16 because of their potential to significantly reduce computational and memory costs while maintaining competitive performance in large-scale machine learning tasks. E4M3 offers higher precision, making it suitable for inference and forward-pass computations where precision is critical. E5M2 provides a wider dynamic range, making it ideal for backward-pass computations during training where large gradients can occur. This flexibility allows FP8 to adapt to different stages of training more effectively than BF16.

NVIDIA’s H100 GPUs are specifically designed to support FP8 with optimized Tensor Cores, achieving up to 9x faster training and 30x faster inference compared to previous-generation GPUs using FP16 or BF16. The Hopper architecture dynamically manages precision transitions (e.g., between FP8 and higher-precision formats like FP32), ensuring stability without manual intervention. "Balancing Speed and Stability: The Trade-offs of FP8 vs. BF16 Training in LLMs" shows that FP8 can deliver similar convergence behavior and accuracy as BF16 in many LLM tasks, with minimal degradation in performance. For inference, FP8 quantization (e.g., E4M3 for KV cache) has been shown to minimally impact accuracy while significantly improving memory efficiency.

However, FP8 comes with challenges such as occasional instability during training (e.g., loss spikes) and sensitivity in certain tasks like code generation or mathematical reasoning. As a result, training LLMs with FP8 precision remains an active area of research and exploration.

summary_5_precision
| Feature | IEEE 754 FP32 | IEEE 754 FP16 | Google BF16 | NVIDIA FP8 E4M3 | NVIDIA FP8 E5M2 |
| --- | --- | --- | --- | --- | --- |
| Bit Width | 32 bits | 16 bits | 16 bits | 8 bits | 8 bits |
| Sign Bit | 1 bit | 1 bit | 1 bit | 1 bit | 1 bit |
| Exponent Bits | 8 bits (bias = 127) | 5 bits (bias = 15) | 8 bits (bias = 127) | 4 bits (bias = 7) | 5 bits (bias = 15) |
| Mantissa Bits | 23 bits | 10 bits | 7 bits | 3 bits | 2 bits |
| Dynamic Range | \(\pm(2^{-126} \text{ to } 2^{127})\) | \(\pm(2^{-14} \text{ to } 2^{15})\) | \(\pm(2^{-126} \text{ to } 2^{127})\) | \(\pm(2^{-6} \text{ to } 2^{7})\) | \(\pm(2^{-14} \text{ to } 2^{15})\) |
| Precision | ~7 decimal digits | ~3.3 decimal digits | ~2.3 decimal digits | Lower precision | Lower precision |
| Memory Usage | High | Medium | Medium | Low | Low |
| Performance | Slower | Faster than FP32 | Faster than FP32 | Much faster than FP16/BF16 | Much faster than FP16/BF16 |
| Applications | Training requiring high precision | Inference or mixed-precision training | Mixed-precision training and inference | Optimized for inference | Optimized for training and inference |

Current LLM Training Method in FP8

A training approach has been developed to leverage FP8's efficiency for specific operations while maintaining numerical stability and precision with BF16 for critical components of the model.

During the training process, FP8 is utilized exclusively for computations within the MLP layers, while BF16 is employed for other components of the Transformer architecture, such as Attention, Activation, and Layer Normalization. Both weights and gradients are maintained in BF16 precision.

  • In the forward pass, weights in BF16 are converted to FP8 (E4M3) for matrix multiplications within the MLP layers. Once the computation is completed, the results are immediately converted back to BF16.

  • In the backward pass, gradients in BF16 are temporarily converted to FP8 (E5M2) when passing through the MLP layers. After the computations are performed, the results are promptly converted back to BF16.

fp8_transformer_current

Even when FP8 is used, RAM savings may not be as significant during training because high precision gradients and weights must be maintained in memory to ensure model stability and convergence. The primary benefit of FP8 lies in its ability to reduce memory usage during inference, where weights can be stored in FP8 format, significantly decreasing the memory footprint compared to higher precision formats like FP16 or BF16. Despite this, FP8 is still utilized during training because it allows for faster computations due to its lower precision. This results in accelerated training processes and improved efficiency, especially on hardware optimized for FP8 operations, such as NVIDIA’s H100 GPUs.

Quantization Process

The process of quantization in LLMs refers to a model compression technique that maps high-precision values (e.g., FP32) to lower-precision representations (e.g., INT8 or FP8).

Here is an example of a simple step-by-step quantization from FP16 to INT4:

  1. Range Calculation: Determine the range of FP16 values for the weights or activations. This is typically defined by the minimum and maximum values (\([min, max]\)) in the data.

  2. Scale Factor and Zero-Point Computation: Compute a scaling factor (S) that maps the FP16 range to the INT4 range (\([-8, 7]\) for signed INT4 or \([0, 15]\) for unsigned INT4). Optionally, calculate a zero-point (Z) to handle asymmetric quantization, where zero in FP16 does not align with zero in INT4.

    The formula for quantization is: \[ x_q = \text{round}\left(\frac{x}{S} + Z\right) \] where \(x_q\) is the quantized INT4 value, \(x\) is the original FP16 value, \(S\) is the scaling factor, and \(Z\) is the zero-point.

  3. Quantization: Map each FP16 value to its corresponding INT4 representation using the computed scale factor and zero-point. This step reduces precision but compresses the data significantly.
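
Below is a minimal numpy sketch of these three steps for asymmetric quantization to unsigned INT4; the input tensor is random and purely illustrative.

import numpy as np

def quantize_int4(x):
    """Asymmetric quantization of an FP16 tensor to unsigned INT4 values in [0, 15]."""
    x = x.astype(np.float32)                  # compute the scale in higher precision
    x_min, x_max = x.min(), x.max()           # 1. range calculation
    scale = (x_max - x_min) / 15.0            # 2. scale factor for 16 levels
    zero_point = np.round(-x_min / scale)     #    zero-point for asymmetric mapping
    x_q = np.clip(np.round(x / scale + zero_point), 0, 15).astype(np.uint8)  # 3. quantize
    return x_q, scale, zero_point

def dequantize_int4(x_q, scale, zero_point):
    return (x_q.astype(np.float32) - zero_point) * scale

w = np.random.randn(8).astype(np.float16)
w_q, s, z = quantize_int4(w)
print(w)
print(dequantize_int4(w_q, s, z))             # close to w, up to quantization error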

There are different types of quantization:

  • Asymmetric Quantization vs Symmetric Quantization
  • Uniform Quantization vs Non-uniform Quantization

Quant in General Matrix Multiply (GEMM)

Quantized matrices are stored in memory in their compressed form. During matrix multiplication, these matrices are dequantized back to higher precision (e.g., FP16 or FP32) to perform computations. This process balances memory efficiency with computational precision.

Quantization can be applied at different levels of granularity, which determines how scaling factors are assigned and used. The "SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models" paper introduced several quantization granularity techniques, including per-tensor quantization, per-token quantization, and per-channel quantization:

  1. Per-Tensor Quantization: A single scaling factor is applied to the entire tensor (e.g., a weight matrix or activation matrix). It is highly memory-efficient since only one scaling factor needs to be stored. But it is not recommended in practice because outlier values can dominate the scaling factor, leading to significant quantization errors for the rest of the tensor.
  2. Per-Channel Quantization: Each channel (e.g., each column of a weight matrix or each feature map in activations) has its own scaling factor. Commonly used for weight matrices in neural networks. It mitigates the impact of outliers by isolating them within individual channels, improving quantization accuracy compared to per-tensor methods. But it can introduce computational overhead during dequantization due to the need for multiple scaling factors.
  3. Per-Token Quantization: Each token's activations are assigned a unique scaling factor. Typically used for activations in transformer models. It captures token-specific variations in activations, leading to better precision for tasks with dynamic token distributions. Per-token quantization can be computationally expensive and slower because it requires more scaling factors and additional computations.
  4. Group-Wise Quantization (GWQ): this method groups multiple channels or tokens together and applies a shared scaling factor across the group. It reduces computational overhead compared to per-channel or per-token quantization while maintaining finer granularity than per-tensor methods. It's often used for both weights and activations to strike a balance between accuracy and efficiency.
quant_granularity
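
To see how granularity changes the scaling factors, here is a small numpy sketch comparing per-tensor and per-channel symmetric INT8 scales for a toy weight matrix (random data; absmax scaling is just one common choice):

import numpy as np

W = np.random.randn(4, 8).astype(np.float32)     # toy weights: 4 output channels x 8 inputs

# Per-tensor: a single scale for the whole matrix; one outlier affects every value.
scale_tensor = np.abs(W).max() / 127.0

# Per-channel: one scale per output channel (row); outliers stay isolated per channel.
scale_channel = np.abs(W).max(axis=1, keepdims=True) / 127.0

W_q_tensor = np.round(W / scale_tensor).astype(np.int8)
W_q_channel = np.round(W / scale_channel).astype(np.int8)

print(scale_tensor)            # a single scalar
print(scale_channel.ravel())   # one scale per channel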

QLoRA

Fine-Tuning Cost

Comparing cost of full fine tuning, LoRA fine tuning, and QLoRA fine tuning:

finetune_cost
|  | Full Finetuning | LoRA | QLoRA |
| --- | --- | --- | --- |
| Weight | 16 bits | 16 bits | 4 bits |
| Weight gradient | 16 bits | ~0.4 bits | ~0.4 bits |
| Optimizer state | 64 bits | ~0.8 bits | ~0.8 bits |
| Adapter weights | / | ~0.4 bits | ~0.4 bits |
| Total | 96 bits per parameter | 17.6 bits per parameter | 5.2 bits per parameter |

QLoRA's Contributions

Paper: QLoRA: Efficient Finetuning of Quantized LLMs

4-bit NormalFloat Quantization

4-bit NormalFloat Quantization adopts the idea of Quantile Quantization, which is an information-theoretic method that maps values based on quantiles of the weight distribution. It's a method of data compression where data is quantized (reduced to a smaller set of discrete values) in a way that aims to minimize the information entropy of the resulting data, essentially achieving the most efficient representation possible while still introducing some loss of information, making it a "lossy minimum entropy encoding" technique.

To compute the quantile function for 4-bit NormalFloat (NF4) quantization, the process involves mapping cumulative probabilities to quantization levels optimized for normally distributed data. The quantile function is the inverse of the cumulative distribution function (CDF). For example, as shown in the description, if the probability of \(x < 1.2\) is 0.9, then 1.2 is the corresponding quantile for a cumulative probability of 0.9.

quantile_function

With this quantile function, the probability range from 0 to 1 is divided into 16 equal-sized buckets, as 4 bits can represent \(2^4 = 16\) distinct values. The steps are as follows:

  1. Divide the Probability Range: The range of cumulative probabilities \([0, 1]\) is divided into 16 equal intervals or "buckets." These intervals represent equal portions of the probability mass.
  2. Apply the Quantile Function: For each bucket's cumulative probability value (e.g., \(p_i = \frac{i}{16}\), where \(i \in [1, 15]\)), the corresponding quantile value is computed using the inverse CDF of a standard normal distribution (\(\Phi^{-1}(p_i)\)).
  3. Normalize Quantiles: The resulting quantiles are normalized to fit within a predefined range, typically \([-1, 1]\). This ensures that all quantization levels are symmetrically distributed around zero and fall within a compact range suitable for efficient representation.
  4. Assign NF4 Values: The normalized quantiles become the 16 discrete values used by NF4 to represent weights or activations in a compressed format. These values are spaced closer together near zero (where most of the normal distribution's probability mass lies) and farther apart at the extremes, optimizing precision where it matters most.
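
The 16 NF4 levels can be approximated directly from the inverse normal CDF as described above. Here is a simplified sketch using scipy; note that the actual NF4 construction in bitsandbytes differs slightly (it offsets the probabilities so that zero is represented exactly), so this is only an approximation of the idea.

import numpy as np
from scipy.stats import norm

# 1. Divide the probability range; the small offsets avoid the infinities at p = 0 and p = 1.
probs = np.linspace(0.01, 0.99, 16)

# 2. Apply the quantile function (inverse CDF) of the standard normal distribution.
quantiles = norm.ppf(probs)

# 3. Normalize the quantiles into [-1, 1].
nf4_levels = quantiles / np.abs(quantiles).max()

print(np.round(nf4_levels, 4))   # levels are denser near 0, sparser at the extremes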

Double Quantization

Double Quantization (DQ) as introduced in paper "QLoRA: Efficient Finetuning of Quantized LLMs" is a memory optimization technique that quantizes the quantization constants themselves to further reduce the memory footprint of LLMs. It involves two quantization steps:

  1. The first quantization involves quantizing the original weights of the model into 4-bit NormalFloat (NF4) format. Weights are divided into small blocks (e.g., 64 elements per block), and each block is scaled by a quantization constant (also known as a scaling factor). This constant ensures that the range of values in each block fits within the representable range of NF4. The quantized weights and their corresponding quantization constants are stored. However, these constants (usually in FP32) can add significant memory overhead.

    To calculate the memory overhead: for a block size of 64, storing a 32 bit quantization constant for each block adds \(32/64=0.5\) bits per parameter on average.

  2. The second quantization aims to reduce the memory overhead caused by storing quantization constants. Those quantization constants \(c^{FP32}_2\) are further quantized into 8-bit floating-point values (FP8) with a larger block size (e.g., 256 elements per block). This is a symmetric quantization where the mean of the first-level constants \(c^{FP32}_2\) is subtracted to center their distribution around zero. This reduces their memory footprint while maintaining sufficient precision for scaling operations. Additionally, another set of quantization constants \(c^{FP32}_1\) is introduced to scale these second-level quantized values.

    To calculate the memory savings: after double quantization, the memory footprint per parameter for scaling factors is reduced from \(32/64=0.5\) bits to \(8/64 + 32/(64\times 256)=0.127\) bits per parameter. This results in saving \(0.5-0.127=0.373\) bits per parameter.
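
The per-parameter overhead arithmetic above can be checked in a couple of lines (block sizes 64 and 256 follow the paper):

# Overhead of storing quantization constants, in bits per model parameter.
block1, block2 = 64, 256                        # first- and second-level block sizes

before = 32 / block1                            # one FP32 constant per 64-weight block
after = 8 / block1 + 32 / (block1 * block2)     # FP8 constants + FP32 second-level constants

print(before, after, before - after)            # 0.5, ~0.127, ~0.373 bits per parameter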

double_quant

The authors of the paper "QLoRA: Efficient Finetuning of Quantized LLMs" compared LLaMA models with different 4-bit data types. They show that the NormalFloat data type significantly improves bit-for-bit accuracy compared to regular 4-bit floats. While Double Quantization only leads to minor gains, it allows for more fine-grained control over the memory footprint to fit models of a certain size (33B/65B) into certain GPUs (24/48GB). These empirical results show that using FP8 for second-level quantization does not degrade model performance, making it an effective trade-off between precision and memory efficiency.

nf4_compare_result

Paged Optimizers

As described in the QLoRA paper, paged optimizers are a memory management innovation that leverages NVIDIA Unified Memory to handle the memory spikes that occur during gradient checkpointing or when processing large mini-batches with long sequence lengths. NVIDIA Unified Memory allows seamless memory sharing between the GPU and CPU. When the GPU runs out of memory during training, optimizer states (e.g., gradients, momentum, or scaling factors) are paged out (evicted) to CPU RAM. These states are paged back into GPU memory only when needed for computations like gradient updates.

Forward and Backward Implementation

Forward:

qlora_forward

Backward:

qlora_backward

QLora Usage

QLoRA utilizes bitsandbytes for quantization and is seamlessly integrated with Hugging Face's PEFT and transformers libraries, making it user-friendly. To explore the implementation further, let's dive into the QLoRA code and examine the train() function in qlora.py.

def train():
    hfparser = transformers.HfArgumentParser((
        ModelArguments, DataArguments, TrainingArguments, GenerationArguments
    ))
    model_args, data_args, training_args, generation_args, extra_args = \
        hfparser.parse_args_into_dataclasses(return_remaining_strings=True)
    training_args.generation_config = transformers.GenerationConfig(**vars(generation_args))
    args = argparse.Namespace(
        **vars(model_args), **vars(data_args), **vars(training_args)
    )
    print(args)

    checkpoint_dir, completed_training = get_last_checkpoint(args.output_dir)
    if completed_training:
        print('Detected that training was already completed!')

    model, tokenizer = get_accelerate_model(args, checkpoint_dir)

    ......

The get_accelerate_model() function initializes your model and is a crucial component of implementing QLoRA. Notably, within the AutoModelForCausalLM.from_pretrained() method, it loads the quantization configuration through BitsAndBytesConfig. This setup ensures that the model weights are automatically quantized.

def get_accelerate_model(args, checkpoint_dir):

    if torch.cuda.is_available():
        n_gpus = torch.cuda.device_count()
    if is_ipex_available() and torch.xpu.is_available():
        n_gpus = torch.xpu.device_count()

    max_memory = f'{args.max_memory_MB}MB'
    max_memory = {i: max_memory for i in range(n_gpus)}
    device_map = "auto"

    # if we are in a distributed setting, we need to set the device map and max memory per device
    if os.environ.get('LOCAL_RANK') is not None:
        local_rank = int(os.environ.get('LOCAL_RANK', '0'))
        device_map = {'': local_rank}
        max_memory = {'': max_memory[local_rank]}

    if args.full_finetune: assert args.bits in [16, 32]

    print(f'loading base model {args.model_name_or_path}...')
    compute_dtype = (torch.float16 if args.fp16 else (torch.bfloat16 if args.bf16 else torch.float32))
    model = AutoModelForCausalLM.from_pretrained(
        args.model_name_or_path,
        cache_dir=args.cache_dir,
        load_in_4bit=args.bits == 4,
        load_in_8bit=args.bits == 8,
        device_map=device_map,
        max_memory=max_memory,
        quantization_config=BitsAndBytesConfig(
            load_in_4bit=args.bits == 4,
            load_in_8bit=args.bits == 8,
            llm_int8_threshold=6.0,
            llm_int8_has_fp16_weight=False,
            bnb_4bit_compute_dtype=compute_dtype,
            bnb_4bit_use_double_quant=args.double_quant,
            bnb_4bit_quant_type=args.quant_type,
        ),
        torch_dtype=(torch.float32 if args.fp16 else (torch.bfloat16 if args.bf16 else torch.float32)),
        trust_remote_code=args.trust_remote_code,
        use_auth_token=args.use_auth_token
    )
    ......

Besides necessary components like the tokenizer, train() offers a LoRA option in addition to full fine-tuning. This requires setting up a LoraConfig and calling the get_peft_model function from the peft package.

def train():
    ......
    if not args.full_finetune:
        model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=args.gradient_checkpointing)

    if not args.full_finetune:
        if checkpoint_dir is not None:
            print("Loading adapters from checkpoint.")
            model = PeftModel.from_pretrained(model, join(checkpoint_dir, 'adapter_model'), is_trainable=True)
        else:
            print(f'adding LoRA modules...')
            modules = find_all_linear_names(args, model)
            config = LoraConfig(
                r=args.lora_r,
                lora_alpha=args.lora_alpha,
                target_modules=modules,
                lora_dropout=args.lora_dropout,
                bias="none",
                task_type="CAUSAL_LM",
            )
            model = get_peft_model(model, config)
    ......

Not every layer is quantized. QLoRA only quantizes the linear projection layers. Some layers, like LayerNorm, are sensitive to precision, so they are kept in higher precision.
