2024 May

Substack

To me, the best model going forward is going to be based on the weighted performance per parameter and training token count. Ultimately, a model keeps getting better the longer you train it. Most open model providers could train longer, but it hasn’t been worth their time. We’re starting to see that change.

The most important models will represent improvements in capability density, rather than shifting the frontier.

In some ways, it’s easier to make the model better by training longer compared to anything else, if you have the data.

The core difference between open and closed LLMs on these charts is how undertrained open LLMs often are. The only open model confirmed to be trained on a lot of tokens is DBRX.

― The End of the “Best Open LLM” - Interconnects [Link]

Good analysis of the direction of open LLM development in 2023 and 2024. In 2023, models progressed on MMLU by spending larger compute budgets on scaled-up active parameters and training token counts. In 2024, the direction of progress has shifted to be roughly orthogonal to that: improving on MMLU while keeping compute budgets constant.

The companies that have users interacting with their models consistently have moats through data and habits. The models themselves are not a moat, as I discussed at the end of last year when I tried to predict machine learning moats, but there are things in the modern large language model (LLM) space that open-source will really struggle to replicate. Concretely, that difference is access to quality and diverse training prompts for fine-tuning. While I want open-source to win out for personal philosophical and financial factors, this obviously is not a walk in the park for the open-source community. It’ll be a siege of a castle with, you guessed it, a moat. We’ll see if the moat holds.

― Model commoditization and product moats - Interconnects [Link]

The goal of promoting scientific understanding for the betterment of society has a long history. Recently I was pointed to the essay The Usefulness of Useless Knowledge by Abraham Flexner in 1939 which argued how basic scientific research without clear areas for profit will eventually turn into societally improving technologies. If we want LLMs to benefit everyone, my argument is that we need far more than just computer scientists and big-tech-approved social scientists working on these models. We need to continue to promote openness to support this basic feedback loop that has helped society flourish over the last few centuries.

The word openness has replaced the phrase open-source among most leaders in the open AI movement. It’s the easiest way to get across what your goals are, but it is not better in indicating how you’re actually supporting the open ecosystem. The three words that underpin the one messy word are disclosure (the details), accessibility (the interfaces and infrastructure), and availability (the distribution).

― We disagree on what open-source AI should mean - Interconnects [Link]

Google: “A Positive Moment” [Link]

Reports of Google Search’s death are exaggerated so far. In fact, search advertising has grown faster at Google than at Microsoft. User search behavior is harder to change than people expected. Google is also leading the development of AI-powered tools for Search: 1) “Circle to Search” is a feature allowing a search from an image, text, or video without switching apps; 2) “Point your camera, ask a question” is a feature allowing multisearch with both images and text for complex questions about an image. Overall, SGE (Search Generative Experience) is transforming the search experience (the “10 blue links”) into a dynamic AI-enhanced experience. From what I have observed so far, AI powers Google Search rather than weakens it.

Amazon: Wild Margin Expansion - App Economy Insights [Link]

Amazon’s margin expansion: AWS hit $100 B run rate with a 38% operating margin; Ads is surging; delivery costs have been reduced.

The biggest risk is not correctly projecting demand for end-user AI consumption, which would threaten the utilization of the capacity and capital investments made by tech firms today. This would leave them exposed at the height of the valuation bubble, if and when it bursts, just like Cisco’s growth story that began to unravel in 2000. After all, history may not repeat, but it often rhymes.

At the Upfront Ventures confab mentioned earlier, Brian Singerman, a partner at Peter Thiel’s Founders Fund, was asked about contrarian areas worth investing in given the current landscape. His response: “Anything not AI”.

― AI’s Bubble Talk Takes a Bite Out Of The Euphoria - AI Supremacy [Link]

When we talk about investment, we talk about economic value. AI’s current situation is very similar to Cisco’s in 2000. Cisco, as an internet-infrastructure company, expanded the capacity of the World Wide Web, but people soon realized that the economic value was not in internet infrastructure itself; the opportunities were in e-commerce and the like. AI is a tool much like web technology. Currently, with heightened expectations, people are allocating investment and capital expenditure to AI model development, yet end-user demand is unclear and revenue is relatively minimal. From a long-term perspective, this makes AI look like a bubble.

Steve Jobs famously said that Apple stands at the intersection of technology and liberal arts. Apple is supposed to enhance and improve our lives in the physical realm, not to replace cherished physical objects indiscriminately.

― Apple’s Dystopian iPad Video - The Rational Walk Newsletter [Link]

Key pillars of the new strategy (on gaming):

  • Expanding PC and cloud gaming options.
  • Powerful consoles (still a core part of the vision).
  • Game Pass subscriptions as the primary access point.
  • Actively bringing Xbox games to rival platforms (PS5, Switch).
  • Exploring mobile gaming with the potential for handheld hardware.

Microsoft’s “every screen is an Xbox” approach is a gamble and may take a long time to pay off. But the industry is bound to be device-agnostic over time as it shifts to the cloud and offers cross-play and cross-progression. It’s a matter of when not if.

― Microsoft: AI Inflection - App Economy Insights [Link]

Highlights: Azure’s growth accelerated sequentially thanks to AI services and was the fastest-growing of the big three clouds (Amazon AWS, Google Cloud, Microsoft Azure). On Search, Microsoft is losing market share to Alphabet. Capex on AI is growing roughly 80% YoY. On gaming, Microsoft is diversifying beyond selling consoles. Copilot and Office are succeeding with enterprise customers.

To founders, my advice is to remain laser-focused on building products and services that customers love, and be thoughtful and rational when making capital allocation decisions. Finding product-market fit is about testing and learning from small bets before doubling down, and it is often better to grow slower and more methodically as that path tends to lead to a more durable and profitable business. An axiom that doesn’t seem to be well understood is that the time it takes to build a company is also often its half-life.

― 2023 Annual Letter - Chamath Palihapitiya [Link]

This is a very insightful letter about how economic and tech trends of 2023 have shaped their thinking and investment portfolio. What I have learned from this letter:

  1. The tech industry has shifted its focus from unsustainable “growth at any cost” to more prudent forms of capital allocation. This has resulted in layoffs and in slashing projects that are not relevant to the core business.

  2. Rising interest rates were one of the causes of the banking crisis. During the zero-interest-rate decade, banks sought higher returns by purchasing longer-duration assets, whose value is negatively correlated with interest rates. Once the resulting losses became known to the public, a liquidity crisis ensued.

  3. The advancement of Gen AI has lowered the barriers to starting a software company, lowered capital requirements in biotech and materials science, fundamentally changed the process of building companies, and empowered new entrants to challenge established businesses.

    • The key question is: where will value creation and capture take place? When and where should capital be allocated and companies be started? Some of the author’s opinions:

      • It’s premature to declare winners now. Instead, the author suggests that people deeply understand the underlying mechanisms that will be responsible for value creation over the next few years.

      • There are at least two areas of value creation now

        1. Proprietary data

          Example: recent partnership between Reddit and Google

        2. Infrastructure used to run AI application

          For apps built on top of language models, responsiveness is a critical linchpin. However, GPUs are not well suited to running inference.

          Example: Author’s investment in Groq’s LPU for inference

  4. Heightened geopolitical tensions, due to the Russia-Ukraine conflict, the Israel-Hamas war, and escalating tensions between China and Taiwan, have resulted in a de-globalization trend and a strategic shift in the US. US legislative initiatives aim to fuel a domestic industrial renaissance by incentivizing reshoring and fostering a more secure and resilient supply chain. These include the CHIPS Act, the Infrastructure Investment and Jobs Act, and the Inflation Reduction Act.

    • The author highlights the opportunity for allocators and founders: companies can creatively and strategically tap into different pools of capital (debt, equity, and government funding).

OpenAI’s strategy to get its technology in the hands of as many developers as possible — to build as many use cases as possible — is more important than the bot’s flirty disposition, and perhaps even new features like its translation capabilities (sorry). If OpenAI can become the dominant AI provider by delivering quality intelligence at bargain prices, it could maintain its lead for some time. That is, as long as the cost of this technology doesn’t drop near zero.

A tight integration with Apple could leave OpenAI with a strong position in consumer technology via the iPhone and an ideal spot in enterprise via its partnership with Microsoft.

― OpenAI Wants To Get Big Fast, And Four More Takeaways From a Wild Week in AI News - Big Technology [Link]

As GPT-4o is 2x faster and 50% cheaper, this discourages competitors from developing rival LLMs and encourages companies to build on OpenAI’s model for their business. This shows that OpenAI wants to get big fast. However, making GPT-4o free disincentivizes users from subscribing to the Plus version.

There is a tight and deep bond between OpenAI and Apple. The desktop app debuted on the Mac, and Apple will build OpenAI’s GPT technology into iOS.

“You can borrow someone else’s stock ideas but you can’t borrow their conviction. True conviction can only be obtained by trusting your own research over that of others. Do the work so you know when to sell. Do the work so you can hold. Do the work so you can stand alone.”

Investing isn’t about blindly following the herd. It’s about carving your own path, armed with knowledge, patience, and a relentless pursuit of growth and learning.

― Hedge Funds’ Top Picks in Q1 - App Economy Insights [Link]

As I’ve dug into this in more detail, I’ve become convinced that they are doing something powerful by searching over language steps via tree-of-thoughts reasoning, but it is much smaller of a leap than people believe. The reason for the hyperbole is the goal of linking large language model training and usage to the core components of Deep RL that enabled success like AlphaGo: self-play and look-ahead planning.

To create the richest optimization setting, having the ability to generate diverse reasoning pathways for scoring and learning from is essential. This is where Tree-of-Thoughts comes in. The prompting from ToT gives diversity to the generations, which a policy can learn to exploit with access to a PRM.

Q* seems to be using PRMs to score Tree of Thoughts reasoning data that then is optimized with Offline RL. This wouldn’t look too different from existing RLHF toolings that use offline algorithms like DPO or ILQL that do not need to generate from the LLM during training. The ‘trajectory’ seen by the RL algorithm is the sequence of reasoning steps, so we’re finally doing RLHF in a multi-step fashion rather than contextual bandits!

Let’s Verify Step by Step: a good introduction to PRMs.

― The Q* hypothesis: Tree-of-thoughts reasoning, process reward models, and supercharging synthetic data - Interconnects [Link]
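
My reading of the mechanics: a minimal sketch of the loop the post describes. This is my own illustration, not code from the post; `generate_step_candidates` and `prm_score` are hypothetical stand-ins for an LLM sampler and a trained process reward model, and the search is a simple beam over reasoning steps whose best and worst chains could then feed an offline algorithm like DPO.

```python
import heapq

# Hypothetical stand-ins: a real system would call an LLM sampler
# and a trained process reward model (PRM), respectively.
def generate_step_candidates(question, steps_so_far, k=3):
    """Sample k diverse next reasoning steps (tree-of-thoughts branching)."""
    return [f"step(depth={len(steps_so_far)}, branch={i})" for i in range(k)]

def prm_score(question, steps):
    """Score a partial chain of reasoning steps; higher is better."""
    return len(steps) * 0.1 - len(steps[-1]) * 0.01  # placeholder heuristic

def best_trajectory(question, depth=3, beam=2, k=3):
    """Beam search over reasoning steps, ranked by per-step PRM score."""
    beams = [((), 0.0)]
    for _ in range(depth):
        candidates = []
        for steps, _ in beams:
            for nxt in generate_step_candidates(question, list(steps), k):
                new_steps = steps + (nxt,)
                candidates.append((new_steps, prm_score(question, list(new_steps))))
        beams = heapq.nlargest(beam, candidates, key=lambda c: c[1])
    # Highest-scoring chain; pairs of high/low chains could feed offline RL.
    return beams[0]

steps, score = best_trajectory("What is 17 * 24?")
print(score, steps)
```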

It’s well known on the street that Google DeepMind has split all projects into three categories: Gemini (the large looming model), Gemini-related in 6-12months (applied research), and fundamental research, which is oddly only > 12 months out. All of Google DeepMind’s headcount is in the first two categories, with most of it being in the first.

Everyone on Meta’s GenAI technical staff should spend about 70% of the time directly on incremental model improvements and 30% of the time on ever-green work.

A great read from Francois Chollet on links between prompting LLMs, word2vec, and attention. One of the best ML posts I’ve read in a while.

Slides from Hyung Won Chung’s (OpenAI) talk on LLMs. Great summary of intuitions for the different parts of training. The key point: We can get further with RLHF because the objective function is flexible.

― The AI research job market shit show (and my experience) - Interconnects [Link]

10 Lessons From 2024 Berkshire Hathaway Annual Shareholder Meeting - Capitalist Letters [Link]

What I’ve learned from this article:

  1. Why did Berkshire trim its AAPL position?

    There is no concern about Apple’s earnings potential; it simply makes sense to take some profits now that the valuation is so high.

  2. Right way to look at share buybacks

    A business should pay dividends only if it cannot make good use of the excess capital it has. Good use of capital means earning a return on equity above average, which is about 12% for American companies. If the company is able to allocate capital better than shareholders themselves and provide them with above-average returns, it should retain the earnings and allocate the capital itself.

    Buybacks only make sense at the right price, and buying back shares just to support the stock price is not the best action to take for shareholders. All investment decisions should be price dependent.

  3. How would he invest small sums of money?

    During market crashes or economic downturns, you find exceptional companies trading at ridiculously cheap prices; that’s your opportunity. When those companies are fairly priced or overvalued, you look for special situations while holding onto your positions in those exceptional companies.

  4. Views on capital allocation

    Study picking businesses, not stocks.

  5. Investing in foreign countries

    America has been a great country for building wealth and capitalist democracy is the best system of governance ever invented.

  6. Advice on job picking

    Remember Steve Jobs’ famous words in the Stanford Commencement speech he gave before his death: “Keep looking, don’t settle!”

  7. On the importance of culture

    In the Berkshire culture, shareholders see themselves as the owners of the businesses. Greg Abel will keep the culture alive in the post-Buffett period, and this will automatically attract top talent to a place where people are given full responsibility and trust.

  8. When to sell stocks

    1) A bigger opportunity comes up, 2) something drastically changes in the business, and 3) to raise money.
  9. Effects of consumer behavior on investment decisions

    Two types of businesses have durable competitive advantage: 1) Lowest cost suppliers of products and services, 2) suppliers of unique products and services.

  10. How to live a good life?

    “I’ve written my obituary the way I’ve lived my life.” - Charlie Munger

NVIDIA: Industrial Revolution - App Economy Insights [Link]

Primary drivers of Data Center revenue: 1) strong demand (up 29% sequentially) for the Hopper GPU computing platform used for training and inference with LLMs, recommendation engines, and GenAI apps; 2) InfiniBand end-to-end solutions (down 5% sequentially due to timing of supply) for networking. NVIDIA started shipping its Spectrum-X Ethernet networking solutions optimized for AI.

On the earnings call, three major customer categories were discussed: 1) cloud service providers (CSPs), including the hyperscalers Amazon, Microsoft, and Google; 2) enterprise usage: Tesla expanded its AI training cluster to 35,000 H100 GPUs and used NVIDIA AI for FSD V12; 3) consumer internet companies: Meta’s Llama 3, powering Meta AI, was trained on a cluster of 24,000 H100 GPUs.

Huang explained on the earnings call that AI is no longer only a chip problem but also a systems problem. They build AI factories.

For further growth, the Blackwell platform is coming, Spectrum-X networking is expanding, and new software tools like NIM are in development.

A lot of current research focuses on LLM architectures, data sources, prompting, and alignment strategies. While these can lead to better performance, such developments have 3 inter-related critical flaws:

  1. They mostly work by increasing the computational costs of training and/or inference.
  2. They are a lot more fragile than people realize and don’t lead to the across-the-board improvements that a lot of Benchmark Bros pretend.
  3. They are incredibly boring. A focus on getting published/getting a few pyrrhic victories on benchmarks means that these papers focus on making tweaks instead of trying something new, pushing boundaries, and trying to address the deeper issues underlying these processes.

― Revolutionizing AI Embeddings with Geometry [Investigations] - Devansh [Link]

Very little AI research avoids flaws #1 and #3, and the work that does is really good, hard-core work. Time is required to verify whether such results are generalizable and widely applicable, especially since the process of scientific research today is very different from earlier eras, when there was usually a decade between starting your work and publishing it.

This article highlights some publications on complex embeddings and looks into how they improve embeddings by using complex numbers. Current challenges in embeddings are 1) sensitivity to outliers, 2) limited capacity to capture complex relationships in unstructured text, 3) inconsistency in pairwise rankings of similarities, and 4) computational cost. The next generation of complex embeddings benefits from the following pillars: 1) complex geometry provides a richer space to capture nuanced relationships and handle outliers, 2) orthogonality allows each dimension to be independent and distinct, 3) contrastive learning can be used to minimize the distance between similar pairs and maximize the distance between dissimilar pairs. Complex embeddings have a lot of advantages: 1) increased representation capacity from the two components (real and imaginary) of complex numbers; 2) complex geometry allows for orthogonality, which improves generalization and also lets us reach stable convergence quickly; 3) robust features can be captured, which improves robustness; 4) the limitation of cosine similarity (saturation zones that lead to vanishing gradients during optimization) is solved by angle optimization in complex space.
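
A toy illustration of the angle-optimization idea (my own sketch, not the papers’ code): treat each embedding dimension as a complex number and compare vectors by phase differences, which sidesteps the saturation zones of cosine similarity.

```python
import numpy as np

rng = np.random.default_rng(0)

def complex_embed(real, imag):
    """Pack two real-valued views into one complex embedding."""
    return real + 1j * imag

def angle_difference(z1, z2):
    """Mean absolute phase difference between two complex embeddings,
    wrapped to [0, pi]; 0 means perfectly aligned phases."""
    diff = np.angle(z1) - np.angle(z2)
    diff = np.abs((diff + np.pi) % (2 * np.pi) - np.pi)
    return diff.mean()

a = complex_embed(rng.normal(size=64), rng.normal(size=64))
b = complex_embed(rng.normal(size=64), rng.normal(size=64))
print(angle_difference(a, a))  # 0.0: identical phases
print(angle_difference(a, b))  # ~pi/2 on average for random vectors

# A contrastive objective would then minimize angle_difference for
# similar pairs and maximize it for dissimilar pairs.
```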

Llama 3 8B might be the most interesting all-rounder for fine-tuning as it can be fine-tuned on a single GPU when using LoRA.

Phi-3 is very appealing for mobile devices. A quantized version of it can run on an iPhone 14.

― How Good Are the Latest Open LLMs? And Is DPO Better Than PPO? [Link]

Good paper review article. Highlights key discussions:

  • Mixtral 8x22B: The key idea is to replace each feed-forward module in a transformer architecture with 8 expert layers. It achieves lower active parameters (cost) and higher performance (MMLU).

  • Llama 3: The main differences between Llama 3 and Llama 2 are that 1) the vocab size has been increased, 2) it uses grouped-query attention, and 3) it uses both PPO & DPO. The key research finding is that more data yields better performance, no matter the model size.

    “Llama 3 8B might be the most interesting all-rounder for fine-tuning as it can be fine-tuned on a single GPU when using LoRA.”

  • Phi-3: Key characteristics are 1) it’s based on the Llama architecture, 2) it was trained on 5x fewer tokens than Llama 3, 3) it uses the same tokenizer as Llama 2, with a vocab size of 32,064, much smaller than Llama 3’s, 4) it has only 3.8B parameters, less than half the size of Llama 3 8B, 5) its secret sauce is dataset quality over quantity: it is trained on heavily filtered web data and synthetic data.

    “Phi-3 is very appealing for mobile devices. A quantized version of it can run on an iPhone 14.”

  • OpenELM: Key characteristics are 1) 4 relatively small sizes: 270M, 450M, 1.1B, and 3B; 2) the instruct version was trained with rejection sampling and DPO; 3) it performs slightly better than OLMo, even though it was trained on 2x fewer tokens; 4) the main architecture tweak is a layer-wise scaling strategy; 5) it sampled a relatively small subset of 1.8T tokens from various public datasets, with no clear rationale for the subsampling; 6) one main research finding is that there is no clear difference between LoRA and DoRA for parameter-efficient fine-tuning.

    About the layer-wise scaling strategy: 1) there are N transformer blocks in the model, 2) layers are gradually widened from the early to the later transformer blocks, so for each successive block: a) the number of heads is increased, b) the dimension of each layer is increased.

  • DPO vs PPO: The main difference between DPO and PPO is that “DPO does not require training a separate reward model but uses a classification-like objective to update LLM directly”.

    Key findings of the paper and best practices suggested: 1) PPO is generally better than DPO if you use it correctly. DPO suffers from out-of-distribution data, which means instruction data is different from preference data. The solution could be to “add a supervised instruction fine-tuning round on the preference dataset before following up with DPO fine-tuning.”, 2) If you use DPO, make sure to perform SFT on preference data first, 3) “iterative DPO which involves labeling additional data with an existing reward model is better than DPO on existing preference data.”, 4) “If you use PPO, the key is to use large batch sizes, advantage normalization, and parameter update via exponential moving average.”, 5) though PPO is generally better, DPO is more straightforward and will still be a popular go-to option, 6) both can be used. Recall the pipeline behind Llama3: pretraining -> SFT -> rejection sampling -> PPO -> DPO.
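
For reference, the “classification-like objective” mentioned above is the standard DPO loss. A minimal PyTorch version with made-up log-probabilities (a sketch for intuition, not the paper’s implementation):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO: a classification-like objective on preference pairs.
    Inputs are summed log-probs of each full response under the
    policy being trained and under the frozen reference model."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected implicit rewards.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with made-up log-probabilities for a batch of 2 pairs:
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -11.0]),
                torch.tensor([-12.5, -9.0]), torch.tensor([-13.0, -10.0]))
print(loss)
```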

Google I/O AI keynote updates 2024 - AI Supremacy [Link]

Streaming Wars Visualized - App Economy Insights [Link]

This Week in Visuals - App Economy Insights [Link]

Gig Economy Shakeup - App Economy Insights [Link]

Articles

Musings on building a Generative AI product - LinkedIn Engineering Blog [Link]

This is a very good read about building a Gen AI product for business on top of a pre-trained LLM. The article elaborates how the product is designed, how each part works, what works and what does not, what has been improving, and what has been a struggle. Some takeaways for me:

  1. The supervised fine-tuning step was supported by embedding-based retrieval (EBR), powered by an in-memory database, to inject response examples into prompts (see the sketch after this list).

  2. An organizational structure was designed to ensure communication consistency: one horizontal engineering pod for global templates and styles, and several vertical engineering pods for specific tasks such as summarization, job fit assessment, interview tips, etc.

  3. Tricky work:

    1. Developing an end-to-end automatic evaluation pipeline.

    2. Skills for dynamically discovering and invoking APIs / agents.

      This requires input and output to be ‘LLM friendly’ - JSON or YAML schemas.

    3. Supervised fine-tuning with responses injected from the internal database.

      As evaluation becomes more sophisticated, prompt engineering needs to improve to reach high quality/evaluation scores. The difficulty is that quality scores shoot up fast and then plateau, so it’s hard to reach a very high score in the late improvement stage. This makes prompt engineering more of an art than a science.

    4. Tradeoff of capacity and latency

      Chain of Thought can improve the quality and accuracy of responses but increases latency. TimeToFirstToken (TTFT) & TimeBetweenTokens (TBT) are important to utilization but need to be bounded to limit latency. Besides, they also intend to implement end-to-end streaming and an async non-blocking pipeline.
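
Returning to takeaway 1: a minimal sketch of what EBR-based prompt construction could look like. The `embed` function and the example store below are placeholders of my own, not LinkedIn’s system.

```python
import zlib
import numpy as np

# Placeholder embedding function; a real system would call an
# embedding model over an in-memory vector store.
def embed(text: str) -> np.ndarray:
    rng = np.random.default_rng(zlib.crc32(text.encode()))
    v = rng.normal(size=128)
    return v / np.linalg.norm(v)

# In-memory "database" of curated response examples.
EXAMPLES = [
    ("Summarize this profile", "Example summary response..."),
    ("Assess job fit", "Example job-fit assessment..."),
    ("Give interview tips", "Example interview-tip response..."),
]
EXAMPLE_VECS = np.stack([embed(q) for q, _ in EXAMPLES])

def build_prompt(user_query: str, k: int = 2) -> str:
    """Retrieve the k nearest curated examples and inject them as
    few-shot demonstrations ahead of the user's query."""
    sims = EXAMPLE_VECS @ embed(user_query)
    top = np.argsort(-sims)[:k]
    shots = "\n\n".join(f"Q: {EXAMPLES[i][0]}\nA: {EXAMPLES[i][1]}" for i in top)
    return f"{shots}\n\nQ: {user_query}\nA:"

print(build_prompt("Is this candidate a good fit for the role?"))
```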

The concept of open source was devised to ensure developers could use, study, modify, and share software without restrictions. But AI works in fundamentally different ways, and key concepts don’t translate from software to AI neatly, says Maffulli.

But depending on your goal, dabbling with an AI model could require access to the trained model, its training data, the code used to preprocess this data, the code governing the training process, the underlying architecture of the model, or a host of other, more subtle details.

Which ingredients you need to meaningfully study and modify models remains open to interpretation.

both Llama 2 and Gemma come with licenses that restrict what users can do with the models. That’s anathema to open-source principles: one of the key clauses of the Open Source Definition outlaws the imposition of any restrictions based on use cases.

All the major AI companies have simply released pretrained models, without the data sets on which they were trained. For people pushing for a stricter definition of open-source AI, Maffulli says, this seriously constrains efforts to modify and study models, automatically disqualifying them as open source.

― The tech industry can’t agree on what open-source AI means. That’s a problem. ― MIT Technology Review [Link]

This article argues that current definitions of open-source AI are problematic. ‘Open’ models either have restrictions on usage or don’t release details of their training data, which does not fit the traditional definition of ‘open source’. However, some argue that AI is a special case requiring its own definition of open source. As long as the definition remains vague, it’s problematic, because big tech will define open-source AI to be whatever suits it.

Everything I know about the XZ backdoor [Link]

Some great high-level technical overview of XZ backdoor [Link] [Link] [Link] [Infographic] [Link] [Link]

A backdoor in xz-utils (used for lossless compression) was recently revealed by Andres Freund (Principal Software Engineer at Microsoft). The backdoor only shows up when at least a few specific criteria are met: 1) running a distro that uses glibc, 2) having xz or liblzma version 5.6.0 or 5.6.1 installed. There is a malicious script called build-to-host.m4 which checks various conditions, such as the architecture of the machine. If those conditions are met, the payload is injected into the source tree. The intention of the payload is still under investigation. Lasse Collin, one of the maintainers of the repo, has posted an update and is carefully analyzing the situation. In the article, the author Evan Boehs presents a timeline of the attack and online investigators’ discoveries about the Jia Tan identity (from IP addresses, LinkedIn, commit timings, etc.), and raises our awareness of the human costs of open source.

Having a crisp mental model around a problem, being able to break it down into steps that are tractable, perfect first-principle thinking, sometimes being prepared (and able to) debate a stubborn AI — these are the skills that will make a great engineer in the future, and likely the same consideration applies to many job categories.

― Why Engineers Should Study Philosophy ― Harvard Business Review [Link]

Humans are entering a new stage of learning: smartly asking AI questions to get answers that are as accurate as possible. Prompt engineering is thus a very important skill in the AI era. To master it, we need a divide-and-conquer mindset, first-principles thinking, critical thinking, and skepticism.

If we had infinite capacity for memorisation, it’s clear the transformer approach is better than the human approach - it truly is more effective. But it’s less efficient - transformers have to store so much information about the past that might not be relevant. Transformers (🤖) only decide what’s relevant at recall time. The innovation of Mamba (🐍) is allowing the model better ways of forgetting earlier - it’s focusing by choosing what to discard using Selectivity, throwing away less relevant information at memory-making time.

― Mamba Explained [Link]

A very in-depth explanation of the Mamba architecture. The main difference is that the Transformer stores all past information and decides what is relevant at recall time, while Mamba uses Selectivity to decide what to discard earlier. Mamba achieves both efficiency and effectiveness (space complexity drops from O(n) to O(1), time complexity from O(n^2) to O(n)). If the Transformer has high effectiveness and low efficiency due to its large state, and the RNN has high efficiency and low effectiveness due to its small state, Mamba sits in between: it selectively and dynamically compresses data into the state.
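
To make the contrast concrete, here is a toy selective state-space recurrence (my own simplified discretization, not the Mamba implementation): the state `h` has a fixed size regardless of sequence length, and input-dependent gates decide what to write and what to forget at memory-making time.

```python
import numpy as np

rng = np.random.default_rng(0)
d_state, d_in = 16, 8

# Fixed decay dynamics plus input-dependent ("selective") projections.
A = -np.abs(rng.normal(size=d_state))          # stable decay per state dim
W_B = rng.normal(size=(d_state, d_in))         # what to write
W_C = rng.normal(size=(d_state, d_in))         # how to read out
W_dt = rng.normal(size=d_in)                   # step-size gate

def selective_ssm(xs):
    """Scan over the sequence keeping only a fixed-size state h:
    O(1) memory in sequence length, O(n) time."""
    h = np.zeros(d_state)
    ys = []
    for x in xs:                               # one token at a time
        dt = np.log1p(np.exp(W_dt @ x))        # softplus step size (selectivity)
        a_bar = np.exp(A * dt)                 # input-dependent decay: what to forget
        b = W_B @ x                            # candidate memory content
        h = a_bar * h + (1 - a_bar) * b        # EMA-style simplified update
        c = W_C @ x                            # input-dependent readout
        ys.append(float(c @ h))
    return np.array(ys)

seq = rng.normal(size=(100, d_in))
print(selective_ssm(seq).shape)  # (100,) outputs from a constant-size state
```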

The Power of Prompting ― Microsoft Research Blog [Link]

Basically this study demonstrates that GPT-4 is able to outperform a leading model fine-tuned specifically for medical applications, using Medprompt, a composition of several prompting strategies. This suggests that fine-tuning might not be necessary in the future: though it can boost performance, it is resource-intensive and cost-prohibitive, while simple prompting strategies can transform generalist models into specialists and extend the benefits of models to new domains and applications. A similar study was done in the finance domain by JP Morgan, with similar results.
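
One ingredient of Medprompt is choice-shuffle ensembling. A minimal sketch with a placeholder `ask_model` (the function and parameters are my assumptions, not Microsoft’s code):

```python
import random
from collections import Counter

# Placeholder model call; a real system would query GPT-4 with
# dynamic few-shot examples and chain-of-thought as well.
def ask_model(question: str, options: list[str]) -> str:
    return random.choice(options)  # stand-in for a real answer

def choice_shuffle_ensemble(question, options, n_votes=5, seed=0):
    """Re-shuffle answer options across several samples and
    majority-vote, reducing the model's position bias."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_votes):
        shuffled = options[:]
        rng.shuffle(shuffled)
        votes.append(ask_model(question, shuffled))  # vote by content, not letter
    return Counter(votes).most_common(1)[0][0]

print(choice_shuffle_ensemble("Which drug class treats condition X?",
                              ["ACE inhibitors", "Beta blockers", "Statins"]))
```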

Previously, we made some progress matching patterns of neuron activations, called features, to human-interpretable concepts. We used a technique called “dictionary learning”, borrowed from classical machine learning, which isolates patterns of neuron activations that recur across many different contexts.

In turn, any internal state of the model can be represented in terms of a few active features instead of many active neurons. Just as every English word in a dictionary is made by combining letters, and every sentence is made by combining words, every feature in an AI model is made by combining neurons, and every internal state is made by combining features.

The features are likely to be a faithful part of how the model internally represents the world, and how it uses these representations in its behavior.

― Mapping the Mind of a Large Language Model - Anthropic [Link]

This is amazing work towards AI safety by Anthropic. The main goal is to understand the inner workings of AI models and identify how millions of concepts are represented inside Claude Sonnet, so that developers can better control AI safety. Previous progress in this line of work matched patterns of neuron activations (“features”) to human-interpretable concepts via a technique called “dictionary learning”. Now they are scaling the technique up to vastly larger AI language models. Below is a list of key experiments and findings.

  1. Extracted millions of features from the middle layer of Claude 3.0 Sonnet. Features have a depth, breadth, and abstraction reflecting Sonnet’s advanced capabilities.
  2. Found more abstract features: responding to bugs in code, discussion of gender bias in professions, etc.
  3. Measured a “distance” between features based on which neurons appear in their activation patterns, and found that features with similar concepts are close to each other. This demonstrates that the internal organization of concepts in the AI model corresponds to human notions of similarity.
  4. By artificially amplifying or suppressing features, they observed how Claude’s responses change. This shows that features can be used to change how the model acts.
  5. For the purpose of AI safety, they found features corresponding to capabilities with misuse potential (code backdoors, developing bio-weapons), different forms of bias (gender discrimination, racist claims about crime), and potentially problematic AI behaviors (power-seeking, manipulation, secrecy).
  6. Relevant to earlier concerns about sycophancy, they also found a feature associated with sycophantic praise.

This study proposed a good approach to ensure AI safety: use the technique described here to monitor AI systems for dangerous behaviors and to debias outcomes.
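
For intuition, a minimal sketch of the sparse-autoencoder flavor of dictionary learning (layer sizes illustrative, not Anthropic’s): activations are encoded into a much wider, mostly-zero feature vector, and a sparsity penalty keeps only a few features active per input.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Dictionary learning on neuron activations: represent each
    activation vector as a sparse combination of learned features."""
    def __init__(self, d_model=512, n_features=4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model, bias=False)

    def forward(self, acts):
        features = torch.relu(self.encoder(acts))  # sparse feature activations
        recon = self.decoder(features)
        return recon, features

sae = SparseAutoencoder()
acts = torch.randn(32, 512)  # batch of residual-stream activations
recon, feats = sae(acts)
# Reconstruction loss plus an L1 penalty encourages few active features
# per input ("a few active features instead of many active neurons").
loss = ((recon - acts) ** 2).mean() + 1e-3 * feats.abs().mean()
loss.backward()
print(loss.item(), (feats > 0).float().mean().item())
```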

To qualify as a “Copilot+ PC” a computer needs distinct CPUs, GPUs, and NPUs (neural processing units) capable of >40 trillion operations per second (TOPS), and a minimum of 16 GB RAM and a 256 GB SSD.

All of those analysts who assumed Wal-Mart would squish Amazon in e-commerce thanks to their own mastery of logistics were like all those who assumed Microsoft would win mobile because they won PCs. It turns out that logistics for retail are to logistics for e-commerce as operating systems for a PC are to operating systems for a phone. They look similar, and even have the same name, but require fundamentally different assumptions and priorities.

I then documented a few seminal decisions made to demote windows, including releasing Office on iPad as soon as he took over, explicitly re-orienting Microsoft around services instead of devices, isolating the Windows organization from the rest of the company, killing Windows Phone, and finally, in the decision that prompted that Article, splitting up Windows itself. Microsoft was finally, not just strategically but also organizationally, a services company centered on Azure and Office; yes, Windows existed, and still served a purpose, but it didn’t call the shots for the rest of Microsoft’s products.

That celebration, though, is not because Windows is differentiating the rest of Microsoft, but because the rest of Microsoft is now differentiating Windows. Nadella’s focus on AI and the company’s massive investments in compute are the real drivers of the business, and, going forward, are real potential drivers of Windows.

This is where the Walmart analogy is useful: McMillon needed to let e-commerce stand on its own and drive the development of a consumer-centric approach to commerce that depended on centralized tech-based solutions; only then could Walmart integrate its stores and online services into an omnichannel solution that makes the company the only realistic long-term rival to Amazon.

Nadella, similarly, needed to break up Windows and end Ballmer’s dreams of vertical domination so that the company could build a horizontal services business that, a few years later, could actually make Windows into a differentiated operating system that might, for the first time in years, actually drive new customer acquisition.

― Windows Returns - Stratechery [Link]

Chatbot Arena results are in: Llama 3 dominates the upper and mid cost-performance front (full analysis) ― Reddit [Link]

Efficiently fine-tune Llama 3 with PyTorch FSDP and Q-Lora [Link]

YouTube

I don’t have an answer to peace in the Middle East, I wish I did, but I do have a very strong view that we are not going to get to peace when we are apologizing for or denying crimes against humanity and the crime of mass rape of women. That’s not the path to peace. The path to peace is not saying this didn’t happen; the path to peace is saying this happened, and no matter what side of the fence you are on, no matter what side of the world you are on, whether you are on the far right or the far left, anywhere in the world, we are not going to let this happen again, and we are going to get to peace to make sure of it. - Sheryl Sandberg

― In conversation with Sheryl Sandberg, plus open-source AI gene editing explained - All-In Podcast [Link]

U.N. to Study Reports of Sexual Violence in Israel During Oct. 7 Attack [Link]

Western media concocts ‘evidence’ UN report on Oct 7 sex crimes failed to deliver [Link]

It’s crazy that the protests happening right now at some colleges are not protesting sexual violence as a tool of war by Hamas. This kind of ignorance or denial of sexual violence is horrible. People are so polarized into black and white that if something does not fit their view, they reject it. There are more than two sides to the Middle East story, and one of them is sexual violence: mass rape, genital mutilation of men and women, women tied to trees naked, bloody, legs spread…

There is a long history of women’s bodies being weaponized in wars. It was only 30 years ago that people started to say rape is not a tool of war and should be prosecuted as a war crime against humanity. Feminist, human rights, and civil rights groups made this happen. Now it has happened again, according to the report released by the U.N.; however, there are a lot of difficulties in proving and testifying to the truth, e.g., investigators couldn’t locate a single victim, or they didn’t have the right to take pictures of victims. The victims are dead and cannot speak up. Denying the fact of sexual violence is simply unacceptable. There is also a great documentary shedding light on the unspeakable sexual violence committed on Oct 7, 2023, that I think everyone should watch.

The good news is that eyewitness testimony meets the criteria of any international or global court, so the crimes can certainly be proven by eyewitnesses.

John Schulman - Reinforcement Learning from Human Feedback: Progress and Challenges [Link]

John Schulman is a research scientist and cofounder of OpenAI, focusing on reinforcement learning (RL) algorithms. He gave a talk on making AI more truthful on Apr 24, 2023 at UC Berkeley. The ideas and discussions are still helpful and insightful today.

In this talk, John discussed the issue of hallucination in large language models. He claims that behavior cloning or supervised learning is not enough to fix the hallucination problem; instead, reinforcement learning from human feedback (RLHF) can improve the model’s truthfulness by 1) adjusting the output distribution so the model is allowed to express uncertainty, challenge premises, and admit errors, and 2) learning behavior boundaries. In his conceptual model, fine-tuning leads the model to hallucinate when it lacks knowledge. Retrieval and citing external sources can improve verifiability. John discusses models that can browse the web to answer technical questions, citing relevant sources.

John mentioned three open problems in LLM: 1) how to train models to express uncertainty in natural language, 2) go beyond what human labelers can easily verify (“scalable oversight”), and 3) optimizing for true knowledge rather than human approval.

The 1-Year Old AI Startup That’s Rivaling OpenAI — Redpoint’s AI Podcast [Link]

A great interview with Mistral CEO Arthur Mensch on the topics of sovereignty and open models as a business strategy. Some highlighted points from Arthur:

  1. Open source is going to solidify in the future. It is an infrastructure technology, and at the end of the day it should be modifiable and owned by customers. Mistral now has two offerings, an open-source one and a commercial one, and the aim is to figure out a business model that can sustain the open-source development.
  2. The things Mistral is best at are 1) training models and 2) specializing models.
  3. The way they think about partnership strategy is to look at what enterprises would need, where they were operating, where the developers were operating, and figure out the channels that would facilitate adoption and spread. To be a multiplatform solution and to replicate the solution to different platforms is a strategy that Mistral is following.
  4. There is still an efficiency upper bound to be pushed. Beyond the compute spent on pre-training, there is still research to do on improving model efficiency and strength. On the architecture side, we can be more efficient than a plain Transformer, which spends the same amount of compute on every token. Mistral is making models faster. By making models faster, we open up a lot of applications that involve an LLM as a basic brick, and we can then figure out how to do planning, exploration, etc. By increasing efficiency, we open up areas of research.
  5. Meta has more GPUs than Mistral does. But Mistral has a good concentration of GPUs (number of GPUs per person). This is the way to be as efficient as possible and to come up with creative ways of training models. Unit economics also need to be considered, to make sure that every $1 spent on training compute eventually accrues to more than $1 of revenue.
  6. The Transformer is not an optimal architecture. It’s been out there for 7 years now, and everything is co-adapted to it: training methods, debugging methods, the algorithms, the hardware. It’s challenging to find a better one and beat that baseline. But there is a lot of research on modifying attention to boost memory efficiency, and a lot can be done in that and similar directions.
  7. On AI regulation and the EU AI Act, Arthur states that it does not solve the actual problem of how to make AI safe. Making AI safe is a hard problem (the models are stochastic), different from how we evaluated software before. It’s more of a product problem than a regulation problem. We need to rethink continuous integration, verification, etc., and make sure everything is happening as it should.
  8. Mistral recently released Le Chat to help enterprises start incorporating AI. It gives them an assistant that is contextualized on their enterprise data. It’s a tool to get closer to the end user, to get feedback for the developer platform, and also a tool to bring enterprises into GenAI.

Open Source AI is AI we can Trust — with Soumith Chintala of Meta AI [Link]

Synthetic data is the next rage in LLMs. Soumith pointed out that synthetic data works where we as humans already have good symbolic models; we need to impart that knowledge to neural networks, and synthetic data turns out to be the vehicle for imparting it. Related to synthetic data, but in an unusual way, there is new research on distilling GPT-4 by creating synthetic data from it, creating mock textbooks inspired by Phi-2, and then fine-tuning open-source models like Llama.

Open source means different things to different people, and we haven’t yet settled on a community-norm definition at this very early stage of LLMs. When asked about open source, people in this field are used to clarifying their definition of it up front. On this topic, Soumith pointed out that the most beneficial value of openness is that it makes distribution very wide and frictionless, so that people can do transformative things in a way that is very accessible.

Berkshire Hathaway 2024 Annual Meeting Movie: Tribute to Charlie Munger [Link]

First year that the annual meeting movie is made public. First year that the annual meeting is without Charlie. Already started to miss his jokes.

I think the reason why the car could have been completely reimagined by Apple is that they have a level of credibility and trust that I think probably no other company has, and absolutely no other tech company has. I think this was the third Steve Jobs story that I left out but in 2001, I launched a 99 cent download store and Steve Jobs just ran total circles around us, but the reason he was able to is he had all the credibility to go to the labels and get deals done for licensing music that nobody could get done before. I think that is an example of what Apple’s able to do which is to use their political capital to change the rules. So if the thing that we could all want is safer roads and autonomous vehicles, there are regions in every town and city that could be completely converted to level 5 autonomous zones. If I had to pick one company that had the credibility to go and change those rules, it’s them. Because they could demonstrate that there was a methodical safe approach to doing something. So the point is that even in these categories that could be totally reimagined, it’s not for a lack of imagination, again it just goes back to a complete lack of will. I understand because if you had 200B dollars of capital on your balance sheet, I think it’s probably easy to get fat and lazy. - Chamath Palihapitiya

― In conversation with Sam Altman — All-In Podcast [Link]

If you are a developer, the key thing to understand is where does model innovation end and your innovation begin, because if you get that wrong you will end up doing a bunch of stuff that the model will just obsolete in a few months. - David Sacks

The incentive for these folks is going to be to push this stuff into the open source. Because if you solve a problem that’s operationally necessary for your business but it isn’t the core part of your business, what incentive do you have to really keep investing in this for the next 5 to 10 years to improve it? You are much better off releasing it in the open source and letting the rest of the community take it over so that it’s available to everybody else; otherwise you are going to be stuck supporting it, and then if and when you ever wanted to switch out a model, GPT-4o, Claude, Llama, it’s going to be costly. The incentive to just push towards open source in this market, if you will, is so much more meaningful than in any other market. - Chamath Palihapitiya

I think the other thing that is probably true is that a big measure at Google on the search page, in terms of search engine performance, was the bounceback rate, meaning someone does a search, they go off to another site, and they come back because they didn’t get the answer they wanted. Then one box launched, which shows a short answer at the top, which basically keeps people from having a bad search experience, because they get the result right away. So a key metric is they are going to start to discover which vertical searches will provide the user a better experience than jumping off to a third-party page to get the same content. And then they will be able to monetize that content that they otherwise were not participating in the monetization of. So I think the real victim in all this is that long tail of content on the internet that probably gets cannibalized by the snippet one box experience within the search function. And then I do think that the revenue per search query in some of those categories actually has the potential to go up, not down. You keep people on the page so you get more search volume there, you get more searches because of the examples you gave. And then when people do stay, you now have the ability to better monetize that particular search query, because you otherwise would have lost it to the third-party content page. By keeping more of the experience integrated, they can monetize each search query higher, they are going to have more queries, and the quality of the queries will go up. Going back to our earlier point about precision vs accuracy, my guess is there’s a lot of hedge-fund-type folks doing a lot of this precision type of analysis, trying to break apart search queries by vertical and figure out what the net effect will be of having better AI-driven boxes and snippets. And my guess is that is why there is a lot of buying activity happening. I can tell you Meta and Amazon do not have an Isomorphic Labs and a Waymo sitting inside their business that suddenly pops to a couple hundred billion of market cap, and Google does have a few of those. - David Friedberg

One thing I would say about big companies like Google or Microsoft is that the power of your monopoly determines how many mistakes you get to make. Think about how Microsoft completely missed the iPhone; remember they screwed up the whole smartphone era, and it didn’t matter. Same thing here with Google: they completely screwed up AI. They invented the Transformer, completely missed LLMs. Then they had that fiasco where they have a black George Washington. It doesn’t matter; they can make 10 mistakes, but their monopoly is so strong that they can finally get it right by copying the innovator, and they are probably going to become a $5T company. - David Sacks

― GPT-4o launches, Glue demo, Ohalo breakthrough, Druckenmiller’s bet, did Google kill Perplexity? — All-In Podcast [Link]

Great conversations and insightful discussions as usual. Love it.

When you are over earning so massively, the rational thing to do for other actors in the arena is to come and attack that margin, and give it to people for slightly cheaper slightly faster slightly better so you can take share. So I think what you’re seeing and what you will see even more now is this incentive for Silicon Valley who has been really reticent to put money into chips, really reticent to put money into hardware. They are going to get pulled into investing this space because there is no choice. - Chamath Palihapitiya

Why? It’s not that Intel was a worse company, but that everything else caught up. And the economic value went to things that sat above them in the stack: it went to Cisco for a while, then after Cisco it went to the browser companies for a little bit, then to the app companies, then to the device companies, then to the mobile companies. So you see this natural tendency for value to push up the stack over time. For AI, we’ve done step one, which is that all this value has gone to NVIDIA, and now we are going to see it being reallocated. - Chamath Palihapitiya

The reason why they are asking these questions is that if you go back to the dot-com boom in 1999, you can see that Cisco had this incredible run. And if you overlay the stock price of Nvidia, it seems to be following that same trajectory. And what happened with Cisco is that when the dot-com crash came in 2000, Cisco stock lost a huge part of its value. Obviously Cisco is still around today and it’s a valuable company, but it just hasn’t ever regained the type of market cap it had. The reason this happened is because Cisco got commoditized. The success and market cap of that company attracted a whole bunch of new entrants, and they copied Cisco’s products until they were total commodities. So the question is whether that will happen to Nvidia. I think the difference here is that at the end of the day, network equipment, which Cisco produced, was pretty easy to copy, whereas if you look at Nvidia, these GPU cores are really complicated to make. So it’s a much more complicated product to copy. And on top of that, they are already in the R&D cycle for the next chip. So I think you can make the case that Nvidia has a much better moat than Cisco. - David Sacks

I think Nvidia is going to get pulled into competing directly with the hyperscalers. So if you were just selling chips, you probably wouldn’t, but these are big bulky actual machines, then all of a sudden you are like well why don’t I just create my own physical plant and just stack these things, and create racks and racks of these machines. It’s not a far stretch especially because Nvidia actually has the software interface that everybody uses which is CUDA. I think it’s likely that Nvidia goes on a full frontal assault against GCP and Amazon and Microsoft. That’s going to really complicate the relationship that those folks have with each other, but I think it’s inevitable because how do you defend an enormously large market cap, you are forced to go into businesses that are equally lucrative. Now if I look inside of compute and look at the adjacent categories, they are not going to all of a sudden start a competitor to TikTok or a social network, but if you look at the multi hundred billion revenue businesses that are adjacent to the markets that Nvidia enables, the most obvious ones are the hyperscalers. So they are going to be forced to compete otherwise their market cap will shrink and I don’t think they want that, and then it’s going to create a very complicated set of incentives for Microsoft and Google and Meta and Apple and all the rest. And that’s also going to be an accelerant, they are going to pump so much money to help all of these upstarts. - Chamath Palihapitiya

The economy is bad, without it being recognized that this is an inflationary experience; economists use the definition of “economic growth” being gross product, and so if gross product or gross revenue is going up, they are like, oh, the economy is healthy, we are growing. But the truth is we are funding that growth with leverage at the national and federal level and at the household and domestic level. We are borrowing money to inflate the revenue numbers, and so the GDP goes up but the debt goes higher, and so the ability of folks to support themselves and buy things that they want to buy and continue to improve their condition in life has declined; things are getting worse… The average American’s ability to improve their condition has largely been driven by their ability to borrow, not by their earnings. - David Friedberg

Scarlett Johansson vs OpenAI, Nvidia’s trillion-dollar problem, a vibecession, plastic in our balls [Link]

It’s a fun session and it made my day :). Great discussions about Nvidia’s business, America’s negative economic sentiment, harm of plastics, etc.

Building with OpenAI: What’s Ahead [Link]

Papers and Reports

Large Language Models: A Survey [Link]

This is a must-read paper if you would like to have a comprehensive overview of SOTA LLMs, technical details, applications, datasets, benchmarks, challenges, and future directions.

Little Guide to Building Large Language Models in 2024 - HuggingFace [Link]

Are ChatGPT and GPT-4 General-Purpose Solvers for Financial Text Analytics? A Study on Several Typical Tasks [Link]

Bloomberg built their own model trained on financial data (BloombergGPT), only to find that GPT-4 with 8k context, without any specialized finance fine-tuning, beat it on almost all finance tasks. So is there really a moat? The number of parameters matters and the data size matters, and both require compute and money.

Jamba: A Hybrid Transformer-Mamba Language Model [Link] [Link]

The Mamba paper was rejected while its fruits are being reaped fast: MoE-Mamba, Vision Mamba, and Jamba. It’s funny to see the asymmetric impact in ML sometimes, e.g., FlashAttention has <500 citations but is used everywhere, GitHub repos used by 10k+ projects have <100 citations, etc.

KAN: Kolmogorov-Arnold Networks [Link] [authors-note]

This is a mathematically beautiful idea. The main difference between a traditional MLP and KAN is that KAN has learnable activation functions on its edges (weights), so all “weights” in a KAN are non-linear functions. KAN outperforms MLP in accuracy and interpretability. Whether KAN can replace MLP in the future depends on whether there are suitable learning algorithms (like SGD, AdamW, etc.) and whether it can be made GPU-friendly.
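
A simplified sketch of the core idea (this toy uses a learnable cubic polynomial per edge instead of the paper’s B-splines): every edge carries its own univariate function, and each node just sums its incoming edges.

```python
import torch
import torch.nn as nn

class KANLayer(nn.Module):
    """Simplified KAN layer: every edge (i -> j) carries its own
    learnable univariate function, here a cubic polynomial rather
    than the paper's B-splines; the node sums its incoming edges."""
    def __init__(self, d_in, d_out, degree=3):
        super().__init__()
        self.coeffs = nn.Parameter(torch.randn(d_out, d_in, degree + 1) * 0.1)
        self.degree = degree

    def forward(self, x):  # x: (batch, d_in)
        # powers: (batch, d_in, degree + 1), i.e. [1, x, x^2, x^3]
        powers = torch.stack([x ** k for k in range(self.degree + 1)], dim=-1)
        # phi_{j,i}(x_i) = sum_k c_{j,i,k} * x_i^k, then sum over i.
        return torch.einsum('bip,oip->bo', powers, self.coeffs)

layer = KANLayer(4, 3)
print(layer(torch.randn(8, 4)).shape)  # (8, 3)
```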

The Platonic Representation Hypothesis [Link]

An interesting paper to read if you like philosophy. It argues that there is a platonic representation resulting from the convergence of AI models towards a shared statistical model of reality. The authors show a growing similarity in data representations across different model architectures, training objectives, and data modalities as model size, data size, and task diversity grow. They also propose three hypotheses for the representational convergence: 1) the multitask scaling hypothesis, 2) the capacity hypothesis, and 3) the simplicity bias hypothesis. The counterexamples and limitations are definitely worth reading too.

Frontier Safety Framework - Google DeepMind [Link]

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model [Link]

One main improvement: multi-head latent attention, which compresses keys and values into a latent vector, requires a much smaller KV cache per token while achieving stronger performance. Different heads can take different portions of the compressed latent states, and keys and values can be compressed differently.
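
In spirit, the mechanism looks like this toy sketch (dimensions illustrative; it omits DeepSeek-V2’s decoupled RoPE keys): only the small latent `c` is cached per token, and keys and values are re-expanded from it when attention is computed.

```python
import torch
import torch.nn as nn

d_model, d_latent, n_heads, d_head = 1024, 128, 8, 64

# Down-project each token to a small latent c_t; cache only c_t.
W_down = nn.Linear(d_model, d_latent, bias=False)
# Up-project the cached latent into per-head keys and values at use time.
W_up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)
W_up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)

x = torch.randn(1, 16, d_model)            # 16 tokens
c = W_down(x)                              # (1, 16, 128): the only KV cache needed
k = W_up_k(c).view(1, 16, n_heads, d_head)
v = W_up_v(c).view(1, 16, n_heads, d_head)
print(c.shape, k.shape, v.shape)
# Cache per token: d_latent floats instead of 2 * n_heads * d_head.
```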

What matters when building vision-language models [Link]

The Unreasonable Ineffectiveness of the Deeper Layers [Link]

RecurrentGemma: Moving Past Transformers for Efficient Open Language Models [Link]

This paper, published by Google DeepMind, proposes a language model called RecurrentGemma that can match or exceed the performance of transformer-based models while being more memory-efficient.

Towards Responsible Development of Generative AI for Education: An Evaluation-Driven Approach - Google’s Tech Report of LearnLM [Link]

Chameleon: Mixed-Modal Early-Fusion Foundation Models [Link]

This Meta paper proposes a mixed-modal model that still uses the Transformer architecture under the covers but adds innovations such as query-key normalization to fix the imbalance between text and image tokens, among other changes.
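
A minimal sketch of the query-key normalization idea in a standard PyTorch attention block, with illustrative dimensions (not Meta’s code): normalizing q and k before the dot product keeps attention logits in a stable range even when text and image tokens have very different norms.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    """Attention with query-key normalization: LayerNorm the per-head
    query and key vectors before computing attention scores."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.q_norm = nn.LayerNorm(self.d_head)  # the QK-norm trick
        self.k_norm = nn.LayerNorm(self.d_head)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):  # x: (B, T, d_model)
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        split = lambda t: t.reshape(B, T, self.n_heads, self.d_head).transpose(1, 2)
        q = self.q_norm(split(q))   # normalize per-head queries
        k = self.k_norm(split(k))   # normalize per-head keys
        v = split(v)
        o = F.scaled_dot_product_attention(q, k, v)
        return self.out(o.transpose(1, 2).reshape(B, T, -1))
```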

Simple and Scalable Strategies to Continually Pre-train Large Language Models [[Link](https://arxiv.org/pdf/2403.08763)]

Tricks for successful continued pretraining (see the sketch after the list):

  1. Re-warming and re-decaying the learning rate.
  2. Adding a small portion (e.g., 5%) of the original pretraining data (D1) to the new dataset (D2) to prevent catastrophic forgetting; smaller fractions like 0.5% and 1% were also effective.

Be cautious about whether these results hold for models of larger sizes.
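
A minimal sketch of both tricks, with hypothetical loader names and hyperparameters (not the paper’s code):

```python
import math
import random
from itertools import cycle

REPLAY_FRACTION = 0.05  # ~5% of batches replayed from the original data D1

def rewarmed_cosine_lr(step, warmup_steps, total_steps, max_lr, min_lr):
    """Fresh warmup-then-cosine schedule, restarted for the
    continued-pretraining run (re-warming + re-decaying)."""
    if step < warmup_steps:
        return max_lr * step / max(1, warmup_steps)  # re-warm
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * t))  # re-decay

def mixed_batches(d1_loader, d2_loader):
    """Yield mostly new-data (D2) batches, occasionally replaying D1."""
    d1_it = cycle(d1_loader)  # the old data is replayed, so it may repeat
    for batch in d2_loader:
        yield next(d1_it) if random.random() < REPLAY_FRACTION else batch

# inside a hypothetical training loop:
# for step, batch in enumerate(mixed_batches(d1_loader, d2_loader)):
#     for group in optimizer.param_groups:
#         group["lr"] = rewarmed_cosine_lr(step, 1_000, 100_000, 3e-4, 3e-5)
#     ...
```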

Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study [Link]

Algorithmic Progress in Language Models [Link]

Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws [Link]

Efficient Multimodal Large Language Models: A Survey [Link]

Good overview of multimodal LLMs.

Financial Statement Analysis with Large Language Models [Link]

LoRA Learns Less and Forgets Less [Link]

Lessons from the Trenches on Reproducible Evaluation of Language Models [Link]

Challenges and best practices in evaluating LLMs.

Agent Planning with World Knowledge Model [Link]

GitHub Repos

Google Research Tuning Playbook - GitHub [Link]

ML Engineering - GitHub [Link]

LLM from Scratch [Link]

Prompt Engineering Guide [Link] [Link]

ChatML + chat templates + Mistral v3 7b full example [Link]

Finetune pythia 70M [Link]

Llama3 Implemented from Scratch [Link]

News

Intel Inside Ohio [Link]

Intel Ohio One Campus Video Rendering [Link]

Intel Corp has committed $28B to build a “mega fab” called Ohio One, which could be the biggest chip factory on Earth. The Biden administration has agreed to provide Intel with $19.5B in loans and grants to help finance the project.

EveryONE Medicines: Designing Drugs for Rare Diseases, One at a Time [Link]

Startup EveryONE Medicines aims to develop drugs designed from genetic information for individual children with rare, life-threatening neurological diseases. Since the total number of patients with diseases caused by rare mutations is significant, the market is large if EveryONE can scale its process. Although its costs won’t match those of a standard drugmaker running large clinical trials, the challenge is ensuring safety without a standard clinical-testing protocol. To be responsible to patients, the initial drugs will have a temporary effect and a wide therapeutic window, so any potential toxicity can be minimized or stopped.

Voyager 1’s Communication Malfunctions May Show the Spacecraft’s Age [Link]

In Nov 2023, NASA’s over-46-year-old Voyager 1 spacecraft started sending nonsense back to Earth. Voyager 1 was originally intended to study Jupiter and Saturn and was built to survive only 5 years of flight; however, its trajectory carried it further and further into space, and the mission grew from a two-planet mission into an interstellar mission.

In Dec 2023, the mission team restarted the Flight Data Subsystem (FDS) but failed to return it to a functional state. On Mar 1, 2024, they sent a “poke” command to the probe and received a response on Mar 3. On Mar 10, the team finally determined that the response carried a readout of FDS memory. By comparing the readout with those received before the issue, the team confirmed that 3% of FDS memory was corrupted. On Apr 4, the team concluded the affected code was contained on a single computer chip. To solve the problem, the team decided to divide the affected code into smaller sections and insert those sections into other operative places in FDS memory. During Apr 18-20, the team sent the commands to move some of the affected code and received responses with intelligible systems information.

Editing the Human Genome with AI [Link]

Berkeley-based startup Profluent Bio used an AI protein language model to generate an entirely new library of Cas proteins that do not exist in nature today, eventually finding one, called ‘OpenCRISPR-1’, that can replace or improve on the Cas proteins on the market today. The goal of the model is to learn which DNA sequences generate protein structures that are good at gene editing. The new library was created by simulating trillions of letters. They made OpenCRISPR-1 publicly available under an open-source license, so anyone can use this particular Cas protein.

Sony and Apollo in Talks to Acquire Paramount [Link]

Paramount’s stock declined 44% in 2022 and another 12% in 2023. It is experiencing declining revenue as consumers abandon traditional pay-TV, and it is losing money on streaming. Berkshire sold its entire Paramount stake, and soon after Sony Pictures and Apollo Global Management reached out to Paramount’s board expressing interest in an acquisition. Paramount has now decided to open negotiations with them after exclusive talks with the Hollywood studio Skydance. If successful, this deal would break up Paramount and could transform the media landscape. Otherwise, an “office of the CEO” replacing CEO Bob Bakish will prepare a long-term plan for the company.

AlphaFold 3 predicts the structure and interactions of all of life’s molecules [Link]

Previously, Google DeepMind’s AlphaFold project took 3D structures of proteins and the DNA sequences that code for those proteins and built a model that predicts a protein’s 3D structure from its sequence. What is different in AlphaFold 3 is that small molecules are included: how small molecules bind to the protein is part of the predictive model. This is a breakthrough because off-target effects can be minimized by taking other molecules’ interactions in the biochemical environment into account. Google has a drug-development subsidiary called Isomorphic Labs, which keeps all of the AlphaFold 3 IP. Google published a web viewer so non-commercial scientists can do fundamental research, but only Isomorphic Labs can use it commercially.

Introducing GPT-4o and making more capabilities available for free in ChatGPT [Link]

I missed the live announcement but watched the recording. GPT-4o is amazing.

One of the interesting technical changes is the tokenizer. GPT-4 and GPT-4-Turbo both used a tokenizer with a vocabulary of 100k tokens; GPT-4o uses one with 200k tokens, which works better for native multimodality and multilingualism. A larger vocabulary means fewer tokens are needed for the same text, so generating characters is more efficient.
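
You can see the difference directly with OpenAI’s tiktoken library, which ships both encodings (cl100k_base for GPT-4/GPT-4-Turbo, o200k_base for GPT-4o); exact counts depend on the input text:

```python
# pip install tiktoken
import tiktoken

old_enc = tiktoken.get_encoding("cl100k_base")  # GPT-4 / GPT-4-Turbo, ~100k vocab
new_enc = tiktoken.get_encoding("o200k_base")   # GPT-4o, ~200k vocab

text = "你好，世界"  # non-English text tends to benefit the most
print("cl100k:", len(old_enc.encode(text)), "tokens")
print("o200k: ", len(new_enc.encode(text)), "tokens")
# Fewer tokens for the same text means cheaper, faster generation.
```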

“Our goal is to make it effortless for people to go anywhere and get anything,” said Dara Khosrowshahi, CEO of Uber. “We’re excited that this new strategic partnership with Instacart will bring the magic of Uber Eats to even more consumers, drive more business for restaurants, and create more earnings opportunities for couriers.”

Project Astra: Our vision for the future of AI assistants [Link]

Google Keynote (Google I/O 24’) [Link]

This developer conference was about Google’s AI-related product updates. Highlighted features: 1) AI Overviews for Search, 2) Ask Photos, 3) 2M-token context window, 4) Google Workspace, 5) NotebookLM, 6) Project Astra, 7) Imagen 3, 8) Music AI Sandbox, 9) Veo, 10) Trillium TPU, 11) Google Search, 12) asking questions with videos, 13) Gemini interacting with Gmail and data, 14) Gemini AI Teammate, 15) Gemini App and upgrades, 16) Gemini trip planning.

Leike went public with some reasons for his resignation on Friday morning. “I have been disagreeing with OpenAI leadership about the company’s core priorities for quite some time, until we finally reached a breaking point,” Leike wrote in a series of posts on X. “I believe much more of our bandwidth should be spent getting ready for the next generations of models, on security, monitoring, preparedness, safety, adversarial robustness, (super)alignment, confidentiality, societal impact, and related topics. These problems are quite hard to get right, and I am concerned we aren’t on a trajectory to get there.”

― OpenAI created a team to control ‘superintelligent’ AI — then let it wither, source says [Link]

Other News:

Encampment Protesters Set Monday Deadline for Harvard to Begin Negotiations [Link]

Israel Gaza war: History of the conflict explained [Link]

Cyber Stuck: First Tesla Cybertruck On Nantucket Has A Rough Day [Link]

Apple apologizes after ad backlash [Link]

Apple nears deal with OpenAI to put ChatGPT on iPhone: Report [Link] [Link]

Reddit announces another big data-sharing AI deal — this time with OpenAI [Link]

Apple Will Revamp Siri to Catch Up to Its Chatbot Competitors [Link]

OpenAI strikes deal to bring Reddit content to ChatGPT [Link]