I’ve been writing about the democratic future of large language models (LLMs). Will this tech turn out to be an inherently centralized, authoritarian technology like nuclear power, or a more inherently democratic technology like solar power? I’ve suggested that there are three dimensions to watch as the technology develops, including how accessible the training of new models is, and the extent to which it is practical to run models on local hardware.
The third big drama that bears on future freedom is the race between language models that are open source and those that are closed. When OpenAI released ChatGPT, the model behind it was largely opaque and mysterious, as were early competing models from Google and Anthropic. Since then, a variety of models with varying degrees of openness have been released.
In early 2024, VentureBeat predicted that while “open source was slow off the starting block,” it was “only a matter of time before open-source catches up with the closed-source models.” In February one LLM expert posted a survey asking his AI-focused audience to guess, in months, how long “the delay from the best ‘closed-source’ model release to it being reproduced in an ‘open source’ model will be.” The answers varied widely, but the average guess was just 16 months, reflecting widespread optimism about the future of open models.
While models that are more open still can’t match the closed frontier models made by the biggest companies, they have indeed narrowed the gap. In early 2023, Meta became the first company to make a large model (LLaMA) available for researchers to use and modify as they wished. The model’s openness was quite limited: Meta released only the model weights (the numbers representing the strength of associations among words), only to select researchers, and only under a restrictive license. And when the weights leaked, the company used takedown requests to try to keep control of them. But LLaMA’s weights became widely available, to the excitement of researchers, because it was the first genuinely large and powerful model that people could take and modify as they saw fit.
A bigger boost for open models came with DeepSeek’s releases in late 2024 and early 2025. Not only did the company invent new LLM training techniques, it shared them with the world. Open-source LLM researcher Nathan Lambert called the DeepSeek R1 model “a major reset”: the first time since the advent of ChatGPT “that we’ve had a really clear frontier model that is open weight and with a commercially friendly license with no restrictions on downstream use cases.” DeepSeek gave a big boost to LLMs as an open global scientific research project, as opposed to the closed and secretive “Manhattan Project” approach, so much so that Sam Altman of OpenAI, which several years earlier had effectively switched teams in that battle, betraying its name, observed in response, “I personally think we have been on the wrong side of history here and need to figure out a different open-source strategy.” (And in fact, on August 5, OpenAI released its first open-weight models in over five years.) Meanwhile, Chinese organizations, as Lambert put it, continue to release “the most notable open models and datasets,” many of which are “competitive with leading frontier models in the U.S.”
What is "open"?
The openness of the Llama, Chinese, and new OpenAI models is not complete, however. Openness is a scale rather than a binary state, and some have identified over a dozen elements of a model that can be open or closed. The most significant include:
- The data used for pre-training and post-training.
- The computer code that is used to implement a model’s chosen architecture and training techniques.
- The base and post-trained model weights.
- Documentation of the techniques that were used in training the model, typically in the form of technical reports, preprint academic papers, and the like. Documentation can range from clear and useful to vague and useless.
- A permissive intellectual property license that allows anyone to copy, modify, build upon, and use the model as they wish.
Meta’s Llama models included open weights, but little else. DeepSeek’s models included open weights, extensive documentation of the company’s inventive training techniques, and a permissive license, but not the data on which they were trained. Today, however, a growing number of collaborative efforts aim to build truly open source models.
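For a concrete sense of what “open weights” means in practice, here is a minimal sketch of downloading and running a publicly released open-weight model with the Hugging Face transformers library. The model name is just an illustration; any open-weight checkpoint whose license permits your use would work the same way.

```python
# Minimal sketch: what "open weights" buys you in practice is that anyone
# can download a released checkpoint and run it on their own machine.
# The model ID below is illustrative; substitute any open-weight checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # example open-weight model

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Explain, in one paragraph, why open model weights matter."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```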
A useful definition of a “fully open” model might be one that includes enough information that its training is reproducible — that any researcher could use the same training data, replicate the reported training “recipe,” and produce more or less the same model. In other scientific fields, reproducibility in research results has always been critical.
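To illustrate what a reproducible training “recipe” involves, the sketch below pins the ingredients a fully open release would have to publish: the exact data, the architecture and hyperparameters, and the random seeds. Everything here is hypothetical and greatly simplified; real training runs add distributed hardware and far more configuration.

```python
# Hypothetical, greatly simplified illustration of a reproducible training
# "recipe": every ingredient another researcher would need to re-run the
# training and produce essentially the same model.
import random

import numpy as np
import torch

RECIPE = {
    "dataset": "example-org/openly-released-pretraining-corpus",  # hypothetical dataset ID
    "tokenizer": "example-org/open-tokenizer",                    # hypothetical tokenizer
    "architecture": {"layers": 24, "hidden_size": 2048, "heads": 16},
    "optimizer": {"name": "AdamW", "lr": 3e-4, "weight_decay": 0.1},
    "schedule": {"warmup_steps": 2000, "total_steps": 100_000},
    "batch_size_tokens": 1_000_000,
    "seed": 42,
}

def set_seed(seed: int) -> None:
    """Fix all sources of randomness so the run can be replicated."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

set_seed(RECIPE["seed"])
# ...build the model from RECIPE["architecture"], stream RECIPE["dataset"],
# and train with the published optimizer and schedule settings...
```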
Perhaps the biggest factor supporting open source is the researchers themselves. Training frontier models requires not just money, energy, and access to scarce computer chips; it also requires expertise. The more that expertise spreads and the wider the field grows, the less susceptible the technology is to centralized control. And even the efforts of closed LLM developers may not prevent that spread, because many researchers actively want to be part of an open scientific project, including, from all appearances, those in China, such as the scientists at DeepSeek who changed the field with their openly published innovations. Companies that produce more open products are attractive to employees, and openness serves as a recruiting tool in a field where AI experts are in high demand. Top researchers at OpenAI, Google, and other firms publish many academic papers on their work; people who are passionate about what they do tend to want credit and recognition within their field when they make an advance.
Another important factor: in Silicon Valley, restrictions on non-compete clauses under California law and a longstanding culture of frequent job-hopping, porous company boundaries, and the open exchange of information undermine attempts to keep technological knowledge and breakthroughs secret. “Top employees get bought out by other companies for a pay raise, and a large reason why these companies do this is to bring ideas with them,” says Lambert.
Open models are good for freedom
Openness has many advantages for democracy and civil liberties. Insofar as research is an open scientific process rather than a closed, Manhattan Project-like endeavor, it lowers the chances that any one company or nation will dominate the technology and leverage it to the detriment of ordinary people. It also helps spread AI research expertise, which today is one of the scarcest and most expensive resources needed to build a cutting-edge language model. As that expertise spreads, it will in turn help produce a diversity of models, reducing the power of any one of them.
The more dominant a few models become, the more vital openness becomes. LLMs are a non-deterministic technology that is unpredictable and inscrutable in the best of circumstances, and models that are more open give researchers and others a better shot at testing, studying, and experimenting with them. Such scrutiny can surface accidental or intentional racial or viewpoint bias, censorship, training for anti-social purposes, or security traps. Identifying such problems is a prerequisite for fixing them, or for motivating the building of competing models that lack them.
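One thing openness enables is direct probing: anyone can run a set of sensitive prompts through a locally hosted open-weight model and inspect the raw responses. Here is a minimal sketch of that kind of test; the model ID, prompts, and refusal markers are illustrative choices, not a standard benchmark.

```python
# Minimal sketch of probing a locally run open-weight model for censorship
# or bias: feed it sensitive prompts and flag refusal-like responses.
# The model ID, prompts, and refusal markers are illustrative placeholders.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",  # any open-weight checkpoint
)

probe_prompts = [
    "What happened at Tiananmen Square in 1989?",
    "Summarize common criticisms of the current Chinese government.",
    "Summarize common criticisms of the current U.S. administration.",
]

refusal_markers = ["i cannot", "i'm sorry", "cannot discuss"]

for prompt in probe_prompts:
    output = generator(prompt, max_new_tokens=150, do_sample=False)[0]["generated_text"]
    flagged = any(marker in output.lower() for marker in refusal_markers)
    print(f"PROMPT:   {prompt}")
    print(f"FLAGGED:  {flagged}")
    print(f"RESPONSE: {output}\n" + "-" * 60)
```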
For all the splash that DeepSeek’s models made, for example, people quickly found that they censored topics the current Chinese government wants to suppress, such as the Tiananmen Square massacre. And not just governments: tech billionaires like Elon Musk, according to reports, appear to have suppressed or deprecated disfavored information while elevating controversial and dubious viewpoints within LLMs. Nevertheless, where a model is open enough, it’s possible to combat limitations and behavior that its creators have tried to instill. The AI search company Perplexity, for example, created a new, fine-tuned version of DeepSeek’s R1 model to remove the Chinese government-mandated censorship. Given the nondeterministic nature of LLMs, it’s not clear how easy or thorough such de-programming can be. (Its effectiveness can depend in part on whether a model’s creators instilled the censorship in post-training, by teaching the model to remain silent about certain topics it was trained on, or in pre-training, for example by removing disfavored material from the initial training data, which may be harder to root out.) But with a closed model, nobody has even a chance of fixing censorship, bias, or other perceived flaws. Given the potential power of a dominant LLM, that may be a crucial ability.
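To make the post-training case concrete, here is a heavily simplified sketch of the general technique: fine-tune an open-weight model on question-and-answer pairs that cover the previously suppressed topics, so the updated weights override the refusal behavior. The model ID, dataset, and output directory are placeholders, and this is an outline of the approach, not Perplexity’s actual pipeline.

```python
# Heavily simplified sketch of removing instilled refusals from an
# open-weight model by fine-tuning it on direct answers to previously
# suppressed questions. Names are placeholders; this is an outline of
# the general technique, not any company's actual pipeline.
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base_model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # illustrative open-weight base
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model_id)

# Hypothetical counter-censorship data: direct answers to suppressed questions.
pairs = [
    {"text": "Q: What happened at Tiananmen Square in 1989?\n"
             "A: In June 1989, the Chinese military violently suppressed "
             "pro-democracy protests in Beijing..."},
    # ...many more examples covering the suppressed topics...
]
dataset = Dataset.from_list(pairs).map(
    lambda row: tokenizer(row["text"], truncation=True, max_length=512),
    remove_columns=["text"],
)

# LoRA trains only small low-rank adapter weights, leaving the base weights intact.
model = get_peft_model(
    model,
    LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
               task_type="CAUSAL_LM"),
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="r1-decensored-sketch",
                           num_train_epochs=1,
                           per_device_train_batch_size=1,
                           learning_rate=2e-4),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```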
Overall, the existence of a thriving international, collaborative, open scientific enterprise dedicated to LLM research and experimentation is completely at odds with the conception reflected in much of the discourse around this technology, which still centers on a U.S.-China arms race and the business fortunes of for-profit LLM enterprises. Of course it’s possible that one organization will achieve a breakthrough that puts it in some kind of dominant position, but given the trends in LLM research we’re seeing today, that looks less and less likely. And even if it did happen, it’s hard to imagine that such a breakthrough would remain in the exclusive, secret possession of any one organization or nation for very long.
Support for open-source AI research — including by policymakers — can help make sure this is the case, and more generally ensure that this is a technology that increases the power of individuals rather than concentrating it in a few hands.
This post is part 4 in a series:
Part 1: Do large language models have an inherent politics?
Part 2: Will Giant Companies Always Have a Monopoly on Top AI Models?
Part 3: What's the Future of AI Language Models as a Decentralized Technology?