The AI Art Boom Is Built on a Legal and Cultural Blind Spot

Here is a number worth sitting with: the generative AI art market is projected to exceed $2.5 billion by 2029. And yet the foundation beneath it is, at best, legally contested and culturally narrow. At worst, it is quietly doing harm to the artists, communities, and creative traditions that made it possible.

I spent the last few weeks writing an academic paper on this for my postgraduate module in Emerging Technologies at TU Dublin. What started as a literature review turned into something I kept thinking about long after submission. So I want to share the substance of it here, in plain terms.

How we got here

AI image generation did not appear from nowhere. The field traces back to the 1960s and early 1970s, when A. Michael Noll at Bell Labs and, a little later, Harold Cohen with his program AARON were producing rule-based visual outputs. Impressive for the era, but no learning involved. The machines were executing instructions, not inferring patterns.

The shift came in 2014 and 2015, when generative adversarial networks (GANs) and then neural style transfer introduced data-driven learning. For the first time, a system could develop a visual style by training on images rather than following explicit rules. Then in 2021, CLIP bridged language and vision, making prompt-based generation practical. And in 2022, Stable Diffusion was released publicly as an open model, trained on subsets of a dataset called LAION-5B: roughly five billion image-text pairs scraped from the web.

That open release made something unusually visible: the provenance of the training data. Normally, you do not get to see what a commercial model was trained on. With Stable Diffusion, you could. And what the community found raised serious questions.

Today, in 2026, the production landscape includes GPT Image, Adobe Firefly, FLUX, Midjourney, and Imagen. Each takes a different approach, including a different stance on training data. Adobe, for instance, says Firefly is trained only on licensed and public domain content. That is a deliberate business decision, not a coincidence. Data provenance has become a market variable.

The legal problem

In Europe, the dominant legal framework for training AI on copyrighted work combines the CDSM Copyright Directive (Directive (EU) 2019/790) and the EU AI Act (Regulation (EU) 2024/1689).

The CDSM Directive allows text and data mining (TDM) for commercial purposes under Article 4, but only where rightsholders have not expressly reserved their rights, the so-called opt-out. The opt-out exists on paper. In practice, it is nearly unenforceable. There is no standard machine-readable format, no central registry, and no consistent way for a rightsholder to signal “do not train on my work” in a way that AI developers are technically required to honour.
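
To make that gap concrete, here is a minimal sketch of what a machine-readable opt-out check could look like if a crawler chose to treat robots.txt as the signal. That is one informal convention, not something the Directive mandates, and the crawler name `ExampleTrainingBot`, the site, and the helper `may_collect` are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt an artist might publish on a portfolio site.
ROBOTS_TXT = """\
User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Allow: /
"""

def may_collect(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt rules permit this user agent to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

image_url = "https://artist.example/portfolio/painting.jpg"

# The named training crawler is refused; an unnamed crawler is not.
print(may_collect(ROBOTS_TXT, "ExampleTrainingBot", image_url))
print(may_collect(ROBOTS_TXT, "SomeOtherBot", image_url))
```

Two weaknesses show up even in this toy version. The artist has to know each crawler's name in advance, and the check only matters if the training pipeline bothers to run it.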

The EU AI Act, whose obligations for general-purpose AI providers have applied since August 2025, requires those providers to publish a “sufficiently detailed summary” of their training data under Article 53(1)(d). But a 2025 European Parliament study concluded that this summary requirement is completely inadequate: it cannot help individual creators identify whether their specific work was used.

The courts are catching up, but slowly. In Getty Images v. Stability AI [2025], the UK High Court found that Stable Diffusion does not store copies of its training images and rejected the secondary infringement claim. The German case Kneschke v. LAION tested whether the TDM exceptions cover dataset creation. Andersen v. Stability AI continues in the US. Each case turns on different facts, different jurisdictions, different legal doctrines. The territoriality problem alone is significant: training can be relocated to a jurisdiction with weaker protections.

The scholars I reviewed mostly agree on the problem. They persistently disagree on the solution.

The cultural problem

The legal gap gets more coverage. The cultural one may be more consequential in the long run.

LAION-5B, the dataset behind much of the diffusion era, reflects the web as it existed circa 2021. The web skews heavily toward English-language content, Western imagery, and dominant aesthetic conventions. You train on that data, you amplify those biases. This is not speculation. It is now documented in peer-reviewed research.

A 2023 study by Bianchi and colleagues, published at the FAccT conference, found that easily accessible AI image systems amplify demographic stereotypes at scale. A 2024 CHI study by Zhang and colleagues found that disadvantaged cultures are systematically neglected in text-to-image outputs. A 2025 systematic review by Elsharif and colleagues, covering 58 studies, found that bias is inherent in training data and that non-Western visual traditions are persistently underrepresented.

The outputs are not culturally neutral. When you prompt a model for “a wedding,” “a doctor,” “a traditional home,” or “a celebration,” you get outputs shaped by whatever was most prevalent in the training corpus. Western faces, Western aesthetics, Western contexts.

There is also a homogenisation effect. Research from 2025 found measurable racial homogenisation in AI-generated faces. A separate study found that audiences tend to prefer AI-generated artworks, even while still detecting them above chance. That last finding has a quietly alarming economic implication: if AI-generated work is preferred, the market incentive to sustain diverse human creative production weakens.

Why both problems matter together

The frustrating thing, writing the paper, was how siloed the research is. The technical literature on bias metrics rarely references the legal literature on opt-out mechanisms. The legal scholars rarely engage with the cultural representativeness data. And the policy documents sit in a third silo.

These are not separate problems. They are the same problem at different layers. If training data is legally contested, that creates pressure to document provenance. If provenance is documented, that creates an opportunity to also measure cultural coverage. If coverage is measured, that creates a procurement criterion: not just “is this dataset legally clean?” but “is it representative?”
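
A minimal sketch of that chain, assuming each dataset record carries a licence field and a caption language code. The field names, the thresholds, and the use of caption language as a crude stand-in for cultural coverage are my own assumptions for illustration, not an established audit method, and the sample records are made up.

```python
import math
from collections import Counter

def coverage_report(records):
    """Summarise provenance and language spread for a sample of records.

    Each record is a dict with two hypothetical keys:
      "licence":  a licence label, or None if provenance is undocumented
      "language": a caption language code such as "en"
    """
    if not records:
        raise ValueError("need at least one record")
    total = len(records)
    documented = sum(1 for r in records if r["licence"] is not None)
    counts = Counter(r["language"] for r in records)
    shares = {lang: n / total for lang, n in counts.items()}
    # Shannon entropy of the language shares, scaled to the range 0..1.
    # 1.0 means every language present is equally represented;
    # a single-language sample scores 0.0.
    entropy = -sum(p * math.log(p) for p in shares.values())
    evenness = entropy / math.log(len(shares)) if len(shares) > 1 else 0.0
    return {
        "documented_share": documented / total,
        "language_shares": shares,
        "evenness": evenness,
    }

def passes_procurement(report, min_documented=0.95, min_evenness=0.6):
    """Apply both criteria: provenance documented AND languages reasonably spread."""
    return (report["documented_share"] >= min_documented
            and report["evenness"] >= min_evenness)

# Made-up sample: every record is licensed, but captions are mostly English.
sample = (
    [{"licence": "CC-BY", "language": "en"}] * 8
    + [{"licence": "CC-BY", "language": "hi"}]
    + [{"licence": "CC-BY", "language": "yo"}]
)
report = coverage_report(sample)
print(report)
print(passes_procurement(report))
```

The toy sample is fully licensed and still fails, because eight of its ten captions are in English. A check that only asks about licensing would have passed it.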

There are existing tools trying to address the consent side: Glaze (style cloaking), Nightshade (data poisoning), C2PA Content Credentials (provenance metadata). All have meaningful limitations. Metadata and watermarks can be stripped. Compliance is voluntary. The adversarial arms race between artist protections and training pipelines is real, and artists are not winning it.

The two questions I could not resolve

My paper ends with two research questions, because I think they are genuinely open and deserve more rigorous investigation than a single literature review can provide.

The first: do the EU AI Act and the CDSM Directive adequately protect the intellectual property rights of visual artists whose work is used to train AI image generation systems, and what specific legislative reforms would close the gaps?

The second: to what extent do Western-centric training datasets systematically reproduce and amplify dominant aesthetic norms, and what technical, curatorial, or policy interventions could foster greater cultural diversity in AI-generated visual art?

Neither of these is rhetorical. Both have practical implications for anyone building, procuring, or regulating AI image systems.

The $2.5 billion market is real. The creative output is impressive. But the foundation is contested in ways that most of the industry is not yet reckoning with honestly.

What is your read on this? Are legal and cultural concerns around AI training data getting enough serious attention, or are they being drowned out by the pace of product launches?

Some recommended reading if you want to dig deeper into this topic:

Bianchi et al. (FAccT 2023) on stereotype amplification in AI image systems: https://arxiv.org/abs/2211.03759
Elsharif et al. (IEEE Access 2025), a 58-study systematic review of bias in text-to-image generation: https://doi.org/10.1109/ACCESS.2025.3585745
Zhang et al. (ACM CHI 2024) on cultural representativeness in text-to-image outputs: https://doi.org/10.1145/3613904.3642877