Vigilance against closed machine learning development
The explosive popularity of generative machine learning models since 2022 has sparked a new wave of enthusiasm for artificial intelligence (AI). Yet their rapid development has also complicated efforts to regulate large technology companies. Existing research has shown that tech giants’ control over digital platforms has granted them not only economic leverage but also political influence. Behind this asymmetry lies their steadily expanding logistical power. The emergence and widespread adoption of generative machine learning models have further entrenched this imbalance and begun to influence the field of machine learning itself. One growing risk is that as models become larger, the field may become increasingly closed.
Logistical power
From the outset, venture capital and other emerging forms of “patient capital” have pursued investments aimed at building monopolistic platforms. Companies that survive the competitive race and establish monopolies are able to leverage their dominant position to engage in unfair competition. In this process, data has become a key new source of profit. The influence of these tech giants now extends far beyond the economic sphere. Some scholars argue that the values embedded in platforms have severely disrupted existing regulatory regimes in European cities. A key precondition for such influence is the platform’s role as a widely adopted infrastructure. The capacity to build such infrastructure has been termed “logistical power” by sociologists. This form of power is highly dependent on specialized expertise and is thus difficult for states to monopolize; in many cases, it also relies on the participation of non-state actors. As generative machine learning models began to demonstrate their potential, this deeply uneven distribution of power started to erode the open knowledge infrastructure that had previously underpinned progress in the field.
The flourishing of machine learning has long depended on open, collaborative exchanges between industry and academia. This interaction has been grounded in open-source software and commercial hardware. The open-source strategy has simultaneously served the needs of both sectors, laying the foundation for productive collaboration. Academic researchers need to accumulate reputational capital—they want their research to be publicly recognized and widely disseminated. They also seek access to data and funding from the industrial sector. In turn, companies depend on researchers’ creative labor to drive innovation and improve their competitive edge in the market.
Closed development
The emergence of generative machine learning models, especially the success of large language models (LLMs), is now beginning to challenge this open ecosystem. What sets LLMs apart is their unprecedented appetite for both data and computational power. In today’s climate of intensifying corporate and geopolitical competition, these demands are reshaping the material and technological ecosystem built on open-source software and commercial hardware. Any account of this shift must therefore attend to specific actors.
The most striking example of machine learning’s shift toward closed development can be seen in the evolution of OpenAI’s corporate structure and project implementation. Originally founded with the mission of making safe general AI openly accessible for the benefit of humanity, OpenAI pledged to release its patents and research to the public and to foster collaboration across disciplines. However, as the computational and data demands of LLMs mounted, OpenAI came under increasing financial pressure. In 2019, the company reorganized to create a for-profit subsidiary and accepted investment from Microsoft. From that point on, OpenAI’s strategic orientation shifted. By the time GPT-4 was released, its technical documentation focused solely on performance benchmarks, omitting details about training methods, datasets, and even the software framework used for deep learning. On the commercial side, OpenAI became deeply integrated into Microsoft’s business ecosystem, granting Microsoft exclusive rights to many of its algorithms and models.
There is now a growing consensus that the defining standard for LLMs lies in their generative capacity: specifically, whether they exhibit emergent capabilities. As model sizes increase, conducting prototype research on local machines has become increasingly impractical. As a result, large-scale cloud computing systems have become indispensable not only for training but also for research. At the same time, LLMs’ voracious appetite for data has diminished the role of public databases. With much of the training data scraped from the web, potential copyright disputes abound, and many tech companies have become increasingly unwilling to disclose their data sources. This trend toward opacity extends beyond data to hardware: while NVIDIA remains the leading manufacturer of machine learning hardware, a more diverse, and increasingly proprietary, hardware ecosystem is taking shape.
These developments have been accelerated by the commercial potential demonstrated by generative models, as well as by intensifying geopolitical competition. AI is now viewed as a key arena of technological and economic rivalry between China and the United States. In this context, the U.S. has imposed restrictions on the circulation of advanced hardware. Beyond curbing overseas chip industries, it has also placed export controls on NVIDIA’s high-end GPU-based systems, even those intended for the consumer market.
Contrary to early fears of a single-company monopoly, the field of LLMs has instead seen a proliferation of open-source efforts, resulting in a diverse and vibrant ecosystem. However, these open-source models still fall short of their closed-source counterparts in both performance and safety. Moreover, many are not truly open-source in the traditional sense. Rather, they should be viewed as tools through which companies assert their logistical power and shape the commercial AI environment. These initiatives tend to share two defining features: first, instead of adopting standard open-source licenses, many firms now release their models under specially tailored terms; second, most of these releases include only model parameters, withholding both the training data and the code used in the training process.
Meta and Alibaba both prohibit the use of outputs from these “open-source” LLMs to train or fine-tune other LLMs, an indirect but effective way to prevent competitors from leveraging these resources. As such, these models function less as open commons than as free products made available to developers. The open-sourcing of LLMs by Meta, Microsoft, and Alibaba is closely tied to their respective business strategies. By embedding LLMs into their proprietary cloud services, Microsoft and Alibaba can increase customer retention, attract developers seeking more convenient access to LLMs, and boost sales of cloud computing resources.
Due to the technical properties and application risks unique to generative LLMs, the traditional open-source paradigm, designed for conventional software, is increasingly inadequate for addressing the challenges these models pose. The data used to train LLMs is often scraped from the internet, meaning that open-sourcing training datasets would raise significant ethical and legal concerns. Traditional open-source practices also fail to bridge the deepening asymmetry in computing power and data access between industry and academia. More importantly, current open-source norms impose few, if any, constraints on how released models are used. For conventional open-source software, the primary risk typically lies in vulnerabilities in the codebase. Because these projects are managed in a decentralized fashion, patching security flaws can be difficult, potentially compromising the systems built upon them. In contrast, for LLMs, the main risk stems from how the models are deployed, an area largely overlooked by the existing open-source paradigm.
Paradigm shift
To address this regulatory blind spot, the licenses accompanying open-source LLMs released by Meta, Alibaba, and various research institutions now typically include clauses on legal and policy compliance. Whether prompted by the disruption companies face or by the broader challenges of generative AI, these adjustments suggest that the knowledge infrastructure underpinning machine learning stands at a pivotal moment of paradigm shift.
The disruptive impact of generative models, especially LLMs, extends beyond their widespread applications and societal consequences; it also deepens existing imbalances in logistical power. Understanding the risks these models pose requires close attention to the elements that comprise the knowledge infrastructure supporting them. Ensuring that technical expertise and production capabilities in machine learning are not monopolized by a handful of corporations is essential if society is to retain awareness and oversight. Yet this is precisely where both governments and the general public often fall short, and where academic institutions, especially universities, are comparatively well positioned. The machine learning field itself emerged through collaboration between academia and industry. In the face of the disruptions brought about by LLMs, research institutions bear added societal responsibilities: they can act as counterweights to the rising logistical power of tech giants, or as bridges facilitating collaboration between the public and industry. At the same time, they have a duty to explain the attendant risks to the public and to offer constructive recommendations to regulators.
Zhang Bolun is a research fellow in the Department of Sociology at Zhejiang University.
Edited by ZHAO YUAN