
Machine learning models need training data to improve their accuracy—something I know from my many years in biometrics.
And it’s difficult to get that training data—something else I know from my many years in biometrics. Consider the acronyms GDPR, CRPA, and especially BIPA. It’s very hard to get data to train biometric algorithms, so they are trained on relatively limited data sets.
At the same time that biometric algorithm training data is limited, Kevin Indig believes that generative AI large language models are ALSO going to encounter limited accessibility to training data. Actually, they are already.
The lawsuits have already begun
A few months ago, generative AI models like ChatGPT were going to solve all of humanity’s problems and allow us to lead lives of leisure as the bots did all our work for us. Or potentially the bots would get us all fired. Or something.
But then people began to ask HOW these large language models work…and where they get their training data.
Just like biometric training models that just grab images and associated data from the web without asking permission (you know the example that I’m talking about), some are alleging that LLMs are training their models on copyrighted content in violation of the law.
I am not a lawyer and cannot meaningfully discuss what is “fair use” and what is not, but suffice it to say that alleged victims are filing court cases.
Sarah Silverman et al and copyright infringement
Comedian and author Sarah Silverman, as well as authors Christopher Golden and Richard Kadrey — are suing OpenAI and Meta each in a US District Court over dual claims of copyright infringement.
The suits alleges, among other things, that OpenAI’s ChatGPT and Meta’s LLaMA were trained on illegally-acquired datasets containing their works, which they say were acquired from “shadow library” websites like Bibliotik, Library Genesis, Z-Library, and others, noting the books are “available in bulk via torrent systems.”
From https://www.theverge.com/2023/7/9/23788741/sarah-silverman-openai-meta-chatgpt-llama-copyright-infringement-chatbots-artificial-intelligence-ai
This could be a big mess, especially since copyright laws vary from country to country. This description of copyright law LLM implications, for example, is focused upon United Kingdom law. Laws in other countries differ.
And now the technical blocks are beginning
Just today, Kevin Indig highlighted another issue that could limit LLM access to online training data.
Some sites are already blocking the LLM crawlers
Systems that get data from the web, such as Google, Bing, and (relevant to us) ChatGPT, use “crawlers” to gather the information from the web for their use. ChatGPT, for example, has its own crawler.

Guess what Indig found out about ChatGPT’s crawler?
An analysis of the top 1,000 sites on the web from Originality AI shows 12% already block Chat GPT’s crawler. (source)
From https://www.kevin-indig.com/most-sites-will-block-chat-gpt/
But that only includes the sites that blocked the crawler when Originality AI performed its analysis.
More sites will block the LLM crawlers
Indig believes that in the future, the number of the top 1000 sites that will block ChatGPT’s crawler will rise significantly…to 84%. His belief is based on analyzing the business models for the sites that already block ChatGPT and assuming that other sites that use the same business models will also find it in their interest to block ChatGPT.
The business models that won’t block ChatGPT are assumed to include governments, universities, and search engines. Such sites are friendly to the sharing of information, and thus would have no reason to block ChatGPT or any other LLM crawler.
The business models that would block ChatGPT are assumed to include publishers, marketplaces, and many others. Entities using these business models are not just going to turn it over to an LLM for free.
As Indig explains regarding the top two blocking business models:

For publishers, content is the product. Giving it away for free to generative AI means foregoing most if not all, ad revenue. Publishers remember the revenue drops caused by social media and modern search engines in the late 2,000s.
Marketplaces build their own AI assistants and don’t want competition.
From https://www.kevin-indig.com/most-sites-will-block-chat-gpt/
What does this mean for LLMs?
One possibility is that LLMs will run into the same training issues as biometric algorithms.
- In biometrics, the same people that loudly exclaim that biometric algorithms are racist would be horrified at the purely technical solution that would solve all inaccuracy problems—let the biometric algorithms train on ALL available biometric data. In the activists’ view (and in the view of many), unrestricted access to biometric data for algorithmic training would be a privacy nightmare.
- Similarly, those who complain that LLMs are woefully inaccurate would be horrified if the LLM accuracy problem were solved by a purely technical solution: let the algorithms train themselves on ALL available data.
Could LLMs buy training data?
Of course, there’s another solution to the problem: have the companies SELL their data to the LLMs.

In theory, this could provide the data holders with a nice revenue stream while allowing the LLMs to be extremely accurate. (Of course the users who actually contribute the data to the data holders would probably be shut out of any revenue, but them’s the breaks.)
But that’s only in theory. Based upon past experience with data holders, the people who want to use the data are probably not going to pay the data holders sufficiently.
Google and Meta to Canada: Drop dead / Mourir

Even today, Google and Meta (Facebook et al) are greeting Canada’s government-mandated Bill C-18 with resistance. Here’s what Google is saying:
Bill C-18 requires two companies (including Google) to pay for simply showing links to Canadian news publications, something that everyone else does for free. The unprecedented decision to put a price on links (a so-called “link tax”) breaks the way the web and search engines work, and exposes us to uncapped financial liability simply for facilitating access to news from Canadian publications….
As a result, we have informed them that we have made the difficult decision that, when the law takes effect, we will be removing links to Canadian news publications from our Search, News, and Discover products.
From https://blog.google/canada-news-en/#overview
But wait, it gets better:
In addition, we will no longer be able to operate Google News Showcase – our product experience and licensing program for news – in Canada.
From https://blog.google/canada-news-en/#overview
Google News Showcase is the program that gives money to news organizations in Canada. Meta has a similar program. Peter Menzies notes that these programs give tens of millions of (Canadian) dollars to news organizations, but that could end, despite government threats.
The federal and Quebec governments pulled their advertising spends, but those moves amount to less money than Meta will save by ending its $18 million in existing journalism funding.
From https://thehub.ca/2023-09-15/peter-menzies-the-media-is-boycotting-meta-and-nobody-cares/
What’s next?
Bearing in mind that Big Tech is reluctant to give journalistic data holders money even when a government ORDERS that they do so…
…what is the likelihood that generative AI algorithm authors (including Big Tech companies like Google and Microsoft) will VOLUNTARILY pay funds to data holders for algorithm training?
If Kevin Indig is right, LLM training data will become extremely limited, adversely affecting the algorithms’ use.


1 Comment