Keeping the internet open is crucial, and part of being open means Reddit content needs to be accessible to those fostering human learning and researching ways to build community, belonging, and empowerment online. Reddit is a uniquely large and vibrant community that has long been an important space for conversation on the internet. Additionally, using LLMs, ML, and AI allows Reddit to improve the user experience for everyone.
In line with this, Reddit and OpenAI today announced a partnership to benefit both the Reddit and OpenAI user communities…
Perhaps some members of the Reddit user community won't feel the benefits when OpenAI is training on their data.
While people who joined Reddit presumably understood that anyone could view their data, they never imagined that a third party would then process that data for its own purposes.
Oh, but wait a minute. Reddit clarifies things:
This partnership…does not change Reddit’s Data API Terms or Developer Terms, which state content accessed through Reddit’s Data API cannot be used for commercial purposes without Reddit’s approval. API access remains free for non-commercial usage under our published threshold.
If you’re a biometric product marketing expert, or even if you’re not, you’re presumably analyzing the possible effects of the proposed changes to the Biometric Information Privacy Act (BIPA) on your identity/biometric product.
As of May 16, the Illinois General Assembly (House and Senate) passed a bill (SB2979) to amend BIPA. It awaits the Governor’s signature.
What is the amendment? Other than defining an “electronic signature,” the main purpose of the bill is to limit damages under BIPA. The new text regarding the “Right of action” codifies the concept of a “single violation.”
(T)he amended law DOES NOT CHANGE “Private Right of Action” so BIPA LIVES!
Companies that violate the strict requirements of BIPA aren’t off the hook. It’s just that the trial lawyers—whoops, I mean the affected consumers make a lot less money.
It discussed both large language models and large multimodal models. In this case “multimodal” is used in a way that I normally DON’T use it, namely to refer to the different modes in which humans interact (text, images, sounds, videos). Of course, I gravitated to a discussion in which an image of a person’s face was one of the modes.
In this post I will look at LMMs…and I will also look at LMMs. There’s a difference. And a ton of power when LMMs and LMMs work together for the common good.
When Google announced its Gemini series of AI models, it made a big deal about how they were “natively multimodal.” Instead of having different modules tacked on to give the appearance of multimodality, they were apparently trained from the start to be able to handle text, images, audio, video, and more.
Other AI models are starting to function in a TRULY multimodal way, rather than using separate models to handle the different modes.
So now that we know that LLMs are large multimodal models, we need to…
…um, wait a minute…
Introducing the Large Medical Model (LMM)
It turns out that the health people have a DIFFERENT definition of the acronym LMM. Rather than using it to refer to a large multimodal model, they refer to a large MEDICAL model.
Our first of a kind Large Medical Model or LMM for short is a type of machine learning model that is specifically designed for healthcare and medical purposes. It is trained on a large dataset of medical records, claims, and other healthcare information including ICD, CPT, RxNorm, Claim Approvals/Denials, price and cost information, etc.
I don’t think I’m stepping out on a limb if I state that medical records cannot be classified as “natural” language. So the GenHealth.AI model is trained specifically on those attributes found in medical records, and not on people hemming and hawing and asking what a Pekingese dog looks like.
But there is still more work to do.
What about the LMM that is also an LMM?
Unless I’m missing something, the Large Medical Model described above is designed to work with only one mode of data, textual data.
But what if the Large Medical Model were also a Large Multimodal Model?
Rather than converting a medical professional’s voice notes to text, the LMM-LMM would work directly with the voice data. This could lead to increased accuracy: compare the tone of voice of an offhand comment “This doesn’t look good” with the tone of voice of a shocked comment “This doesn’t look good.” They appear the same when reduced to text format, but the original voice data conveys significant differences.
Rather than just using the textual codes associated with an X-ray, the LMM-LMM would read the X-ray itself. If the image model has adequate training, it will again pick up subtleties in the X-ray data that are not present when the data is reduced to a single medical code.
In short, the LMM-LMM (large medical model-large multimodal model) would accept ALL the medical outputs: text, voice, image, video, biometric readings, and everything else. And the LMM-LMM would deal with all of it natively, increasing the speed and accuracy of healthcare by removing the need to convert everything to textual codes.
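To make the idea concrete, here is a minimal sketch of what such an interface might look like. Everything in it is hypothetical: the class name, the method, and the modality labels are invented for illustration, since no LMM-LMM exists today.

```python
# Hypothetical sketch only: the class, method, and modality labels are
# invented to illustrate the idea of native multimodal medical input.
from dataclasses import dataclass
from typing import Any


@dataclass
class MedicalInput:
    """One piece of patient evidence, kept in its native modality."""
    modality: str  # e.g. "text", "voice", "image", "video", "biometric"
    payload: Any   # raw note text, waveform bytes, pixel data, readings
    source: str    # e.g. "radiology", "clinician_dictation", "wearable"


class LargeMedicalMultimodalModel:
    """Imagined LMM-LMM: consumes every modality directly, with no
    intermediate conversion to textual medical codes."""

    def predict(self, inputs: list[MedicalInput]) -> dict:
        # A real model would fuse the modalities in a shared representation;
        # this stub just reports what it received to show the interface shape.
        return {"modalities_seen": [item.modality for item in inputs]}


# The caller never reduces anything to a text code before submitting it.
encounter = [
    MedicalInput("voice", b"<raw dictation audio>", "clinician_dictation"),
    MedicalInput("image", b"<raw X-ray pixels>", "radiology"),
    MedicalInput("biometric", [72, 74, 71], "wearable_heart_rate"),
]
print(LargeMedicalMultimodalModel().predict(encounter))
```

The point is the interface, not the model: the caller hands over raw voice, image, and sensor data and lets the model do the fusing.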
A tall order, but imagine how healthcare would be revolutionized if you didn’t have to convert everything into text format to get things done. And if you could use the actual image, video, audio, or other data rather than someone’s textual summation of it.
Obviously you’d need a ton of training data to develop an LMM-LMM that could perform all these tasks. And you’d have to obtain the training data in a way that conforms to privacy requirements: in this case protected health information (PHI) requirements such as HIPAA requirements.
But if someone successfully pulls this off, the benefits are enormous.
You’ve come a long way, baby.
Robert Young (“Marcus Welby”) and Jane Wyatt (“Margaret Anderson” on a different show). ABC Television photo, uploaded to en.wikipedia by We hope and transferred to Wikimedia Commons by SreeBot. Public Domain, https://commons.wikimedia.org/w/index.php?curid=16472486.
The Digital Trust & Safety Partnership (DTSP) consists of “leading technology companies,” including Apple, Google, Meta (parent of Facebook, Instagram, and WhatsApp), Microsoft (and its LinkedIn subsidiary), TikTok, and others.
DTSP appreciates and shares Ofcom’s view that there is no one-size-fits-all approach to trust and safety and to protecting people online. We agree that size is not the only factor that should be considered, and our assessment methodology, the Safe Framework, uses a tailoring framework that combines objective measures of organizational size and scale for the product or service in scope of assessment, as well as risk factors.
We’ll get to the “Safe Framework” later. DTSP continues:
Overly prescriptive codes may have unintended effects: Although there is significant overlap between the content of the DTSP Best Practices Framework and the proposed Illegal Content Codes of Practice, the level of prescription in the codes, their status as a safe harbor, and the burden of documenting alternative approaches will discourage services from using other measures that might be more effective. Our framework allows companies to use whatever combination of practices most effectively fulfills their overarching commitments to product development, governance, enforcement, improvement, and transparency. This helps ensure that our practices can evolve in the face of new risks and new technologies.
But remember that the UK’s neighbors in the EU recently prescribed that USB-C cables are the way to go. This not only forced DTSP member Apple to abandon the Lightning cable worldwide, but it affects Google and others because there will be no efforts to come up with better cables. Who wants to fight the bureaucratic battle with Brussels? Or alternatively we will have the advanced “world” versions of cables and the deprecated “EU” standards-compliant cables.
So forget Ofcom’s so-called overbearing approach and just adopt the Safe Framework. Big tech will take care of everything, including all those age assurance issues.
Incorporating each characteristic comes with trade-offs, and there is no one-size-fits-all solution. Highly accurate age assurance methods may depend on collection of new personal data such as facial imagery or government-issued ID. Some methods that may be economical may have the consequence of creating inequities among the user base. And each service and even feature may present a different risk profile for younger users; for example, features that are designed to facilitate users meeting in real life pose a very different set of risks than services that provide access to different types of content….
Instead of a single approach, we acknowledge that appropriate age assurance will vary among services, based on an assessment of the risks and benefits of a given context. A single service may also use different approaches for different aspects or features of the service, taking a multi-layered approach.
Before you can fully understand the difference between personally identifiable information (PII) and protected health information (PHI), you need to understand the difference between biometrics and…biometrics. (You know sometimes words have two meanings.)
Designed by Google Gemini.
The definitions of biometrics
To address the difference between biometrics and biometrics, I’ll refer to something I wrote over two years ago, in late 2021. In that post, I quoted two paragraphs from the International Biometric Society that illustrated the difference.
Since the IBS has altered these paragraphs in the intervening years, I will quote from the latest version.
The terms “Biometrics” and “Biometry” have been used since early in the 20th century to refer to the field of development of statistical and mathematical methods applicable to data analysis problems in the biological sciences.
Statistical methods for the analysis of data from agricultural field experiments to compare the yields of different varieties of wheat, for the analysis of data from human clinical trials evaluating the relative effectiveness of competing therapies for disease, or for the analysis of data from environmental studies on the effects of air or water pollution on the appearance of human disease in a region or country are all examples of problems that would fall under the umbrella of “Biometrics” as the term has been historically used….
The term “Biometrics” has also been used to refer to the field of technology devoted to the identification of individuals using biological traits, such as those based on retinal or iris scanning, fingerprints, or face recognition. Neither the journal “Biometrics” nor the International Biometric Society is engaged in research, marketing, or reporting related to this technology. Likewise, the editors and staff of the journal are not knowledgeable in this area.
In brief, what I call “broad biometrics” refers to analyzing biological sciences data, ranging from crop yields to heart rates. Contrast this with what I call “narrow biometrics,” which (usually) refers only to human beings, and only to those characteristics that identify human beings, such as the ridges on a fingerprint.
The definition of “personally identifiable information” (PII)
Information that can be used to distinguish or trace an individual’s identity, either alone or when combined with other information that is linked or linkable to a specific individual.
Note the key words “alone or when combined.” The ten digits “909 867 5309” are not sufficient on their own to identify an individual, but they can identify someone when combined with information from another source, such as a telephone book.
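As a toy illustration of “alone or when combined,” consider a few lines of Python; the directory contents below are fabricated for the example.

```python
# A bare phone number identifies nobody on its own, but joined with a
# directory it points to one person. The directory entry is fabricated.
directory = {
    "909 867 5309": {"name": "J. Example", "address": "123 Any Street"},
}

observed_number = "909 867 5309"           # not very useful by itself...
identity = directory.get(observed_number)  # ...until it is linked
print(identity)
```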
What types of information can be combined to identify a person? The U.S. Department of Defense’s Privacy, Civil Liberties, and Freedom of Information Directorate provides multifarious examples of PII, including:
Social Security Number.
Passport number.
Driver’s license number.
Taxpayer identification number.
Patient identification number.
Financial account number.
Credit card number.
Personal address.
Personal telephone number.
Photographic image of a face.
X-rays.
Fingerprints.
Retina scan.
Voice signature.
Facial geometry.
Date of birth.
Place of birth.
Race.
Religion.
Geographical indicators.
Employment information.
Medical information.
Education information.
Financial information.
Now you may ask yourself, “How can I identify someone by a non-unique birthdate? A lot of people were born on the same day!”
But the combination of information is powerful, as researchers discovered in a 2015 study cited by the New York Times.
In the study, titled “Unique in the Shopping Mall: On the Reidentifiability of Credit Card Metadata,” a group of data scientists analyzed credit card transactions made by 1.1 million people in 10,000 stores over a three-month period. The data set contained details including the date of each transaction, amount charged and name of the store.
Although the information had been “anonymized” by removing personal details like names and account numbers, the uniqueness of people’s behavior made it easy to single them out.
In fact, knowing just four random pieces of information was enough to reidentify 90 percent of the shoppers as unique individuals and to uncover their records, researchers calculated. And that uniqueness of behavior — or “unicity,” as the researchers termed it — combined with publicly available information, like Instagram or Twitter posts, could make it possible to reidentify people’s records by name.
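Here is a deliberately tiny sketch of the “unicity” idea, not the researchers’ actual methodology: knowing a couple of (store, date) facts about one shopper filters an “anonymized” transaction table down to a single pseudonymous record. All of the data below is fabricated.

```python
# Toy reidentification example; every record here is fabricated.
transactions = [
    {"user": "A", "store": "bakery", "date": "2015-01-03"},
    {"user": "A", "store": "gym",    "date": "2015-01-05"},
    {"user": "B", "store": "bakery", "date": "2015-01-03"},
    {"user": "B", "store": "cinema", "date": "2015-01-05"},
]

# Two things an outsider might know about one shopper, say from social media.
known_points = [("bakery", "2015-01-03"), ("gym", "2015-01-05")]

candidates = {t["user"] for t in transactions}
for store, date in known_points:
    matches = {t["user"] for t in transactions
               if t["store"] == store and t["date"] == date}
    candidates &= matches

print(candidates)  # {'A'} -- the "anonymous" record has been singled out
```

Scale the table up to 1.1 million people and the same intersection trick is what made four data points enough in the study.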
Now biometrics only form part of the multifarious list of data cited above, but clearly biometric data can be combined with other data to identify someone. An easy example is taking security camera footage of the face of a person walking into a store, and combining that data with the same face taken from a database of driver’s license holders. In some jurisdictions, some entities are legally permitted to combine this data, while others are legally prohibited from doing so. (A few do it anyway. But I digress.)
Because narrow biometric data used for identification, such as fingerprint ridges, can be combined with other data to personally identify an individual, organizations that process biometric data must undertake strict safeguards to protect that data. If personally identifiable information (PII) is not adequately guarded, people could be subject to fraud and other harms.
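One basic safeguard, among the many such organizations need, is never storing biometric templates in the clear. The sketch below is only an illustration of that one layer; it uses the third-party Python cryptography package, and the “template” bytes are a placeholder, not a real fingerprint template.

```python
# Encrypt a stored biometric template at rest. Requires the third-party
# "cryptography" package (pip install cryptography). The template bytes
# below are a placeholder, not real biometric data.
from cryptography.fernet import Fernet

key = Fernet.generate_key()   # in practice, held in a key manager or HSM
vault = Fernet(key)

fingerprint_template = b"<minutiae template bytes>"
stored_ciphertext = vault.encrypt(fingerprint_template)  # what the database holds

# Only a holder of the key can recover the template for matching.
assert vault.decrypt(stored_ciphertext) == fingerprint_template
```

Encryption at rest is only one layer; access controls, retention limits, and consent requirements (as in BIPA) still apply.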
The definition of “protected health information” (PHI)
Protected Health Information. The Privacy Rule protects all “individually identifiable health information” held or transmitted by a covered entity or its business associate, in any form or media, whether electronic, paper, or oral. The Privacy Rule calls this information “protected health information (PHI).”
“Individually identifiable health information” is information, including demographic data, that relates to:
the individual’s past, present or future physical or mental health or condition,
the provision of health care to the individual, or
the past, present, or future payment for the provision of health care to the individual,
and that identifies the individual or for which there is a reasonable basis to believe it can be used to identify the individual. Individually identifiable health information includes many common identifiers (e.g., name, address, birth date, Social Security Number).
The Privacy Rule excludes from protected health information employment records that a covered entity maintains in its capacity as an employer and education and certain other records subject to, or defined in, the Family Educational Rights and Privacy Act, 20 U.S.C. §1232g.
Now there’s obviously an overlap between personally identifiable information (PII) and protected health information (PHI). For example, names, dates of birth, and Social Security Numbers fall into both categories. But I want to highlight two things that are explicitly mentioned as PHI but aren’t usually cited as PII.
Physical or mental health data. This could include information that a medical professional captures from a patient, including biometric (broad biometric) information such as heart rate or blood pressure.
Health care provided to an individual. This not only includes written information such as prescriptions, but oral information (“take two aspirin and call my chatbot in the morning”). Yes, chatbot. Deal with it. Dr. Marcus Welby and his staff retired a long time ago.
Robert Young (“Marcus Welby”) and Jane Wyatt (“Margaret Anderson” on a different show). ABC Television photo, uploaded to en.wikipedia by We hope and transferred to Wikimedia Commons by SreeBot. Public Domain, https://commons.wikimedia.org/w/index.php?curid=16472486.
Because broad biometric data used for analysis, such as heart rates, can be combined with other data to personally identify an individual, organizations that process biometric data must undertake strict safeguards to protect that data. If protected health information (PHI) is not adequately guarded, people could be subject to fraud and other harms.
Simple, isn’t it?
Actually, the parallels between identity/biometrics and healthcare have fascinated me for decades, since the dedicated hardware to capture identity/biometric data is often similar to the dedicated hardware to capture health data. And now that we’re moving away from dedicated hardware to multi-purpose hardware such as smartphones, the parallels are even more fascinating.
My decision making process relies on extensive data analysis and aligning with the company’s strategic objectives. It’s devoid of personal bias ensuring unbiased and strategic choices that prioritize the organization’s best interests.
Mika was brought to my attention by accomplished product marketer/artist Danuta (Dana) Debogorska. (She’s appeared in the Bredemarket blog before, though not by name.) Dana is also Polish (but not Colombian) and clearly takes pride in the artificial intelligence accomplishments of this Polish-headquartered company. You can read her LinkedIn post to see her thoughts, one of which was as follows:
Data is the new oxygen, and we all know that we need clean data to innovate and sustain business models.
There’s a reference to oxygen again, but it’s certainly appropriate. Just as people cannot survive without oxygen, Generative AI cannot survive without data.
But the need for data predates AI models. From 2017:
Reliance Industries Chairman Mukesh Ambani said India is poised to grow…but to make that happen the country’s telecoms and IT industry would need to play a foundational role and create the necessary digital infrastructure.
Calling data the “oxygen” of the digital economy, Ambani said the telecom industry had the urgent task of empowering 1.3 billion Indians with the tools needed to flourish in the digital marketplace.
Of course, the presence or absence of data alone is not enough. As Debogorska notes, we don’t just need any data; we need CLEAN data, without error and without bias. Dirty data is like carbon monoxide, and as you know carbon monoxide is harmful…well, most of the time.
That’s been the challenge not only with artificial intelligence, but with ALL aspects of data gathering.
The all-male board of directors of a fertilizer company in 1960. Fair use. From the New York Times.
In all of these cases, someone (Amazon, Enron’s shareholders, or NIST) asked questions about the cleanliness of the data, and then set out to answer those questions.
In the case of Amazon’s recruitment tool and the company Enron, the answers caused Amazon to abandon the tool and Enron to abandon its existence.
Despite the entreaties of so-called privacy advocates (who prefer the privacy nightmare of physical driver’s licenses to the privacy-preserving features of mobile driver’s licenses), we have not abandoned facial recognition, but we’re definitely monitoring it in a statistical (not an anecdotal) sense.
The cleanliness of the data will continue to be the challenge as we apply artificial intelligence to new applications.
Why did I mention the “future implementation” of the UK Online Safety Act? Because the passage of the UK Online Safety Act is just the FIRST step in a long process. Ofcom still has to figure out how to implement the Act.
Ofcom started to work on this on November 9, but it’s going to take many months to finalize—I mean finalise things. This is the UK Online Safety Act, after all.
This is the first of four major consultations that Ofcom, as regulator of the new Online Safety Act, will publish as part of our work to establish the new regulations over the next 18 months.
It focuses on our proposals for how internet services that enable the sharing of user-generated content (‘user-to-user services’) and search services should approach their new duties relating to illegal content.
On November 9 Ofcom published a slew of summary and detailed documents. Here’s a brief excerpt from the overview.
Mae’r ddogfen hon yn rhoi crynodeb lefel uchel o bob pennod o’n hymgynghoriad ar niwed anghyfreithlon i helpu rhanddeiliaid i ddarllen a defnyddio ein dogfen ymgynghori. Mae manylion llawn ein cynigion a’r sail resymegol sylfaenol, yn ogystal â chwestiynau ymgynghori manwl, wedi’u nodi yn y ddogfen lawn. Dyma’r cyntaf o nifer o ymgyngoriadau y byddwn yn eu cyhoeddi o dan y Ddeddf Diogelwch Ar-lein. Mae ein strategaeth a’n map rheoleiddio llawn ar gael ar ein gwefan.
Oops, I seem to have quoted from the Welsh version. Maybe you’ll have better luck reading the English version.
This document sets out a high-level summary of each chapter of our illegal harms consultation to help stakeholders navigate and engage with our consultation document. The full detail of our proposals and the underlying rationale, as well as detailed consultation questions, are set out in the full document. This is the first of several consultations we will be publishing under the Online Safety Act. Our full regulatory roadmap and strategy is available on our website.
And if you need help telling your firm’s UK Online Safety Act story, Bredemarket can help. (Unless the final content needs to be in Welsh.) Click below!
Having passed, eventually, through the UK’s two houses of Parliament, the bill received royal assent (October 26)….
[A]dded in (to the Act) is a highly divisive requirement for messaging platforms to scan users’ messages for illegal material, such as child sexual abuse material, which tech companies and privacy campaigners say is an unwarranted attack on encryption.
This not only opens up issues regarding encryption and privacy, but also specific identity technologies such as age verification and age estimation.
This post looks at three types of firms that are affected by the UK Online Safety Act, the stories they are telling, and the stories they may need to tell in the future. What is YOUR firm’s Online Safety Act-related story?
What three types of firms are affected by the UK Online Safety Act?
As of now I have been unable to locate the full text of the final Act, but presumably the provisions from this July 2023 version (PDF) have only undergone minor tweaks.
Among other things, this version discusses “User identity verification” in section 65, “Category 1 service” in section 96(10)(a), “United Kingdom user” in section 228(1), and a multitude of other terms that affect how companies will conduct business under the Act.
I am focusing on three different types of companies:
Technology services (such as Yoti) that provide identity verification, including but not limited to age verification and age estimation.
User-to-user services (such as WhatsApp) that provide encrypted messages.
User-to-user services (such as Wikipedia) that allow users (including United Kingdom users) to contribute content.
What types of stories will these firms have to tell, now that the Act is law?
For ALL services, the story will vary as Ofcom decides how to implement the Act, but we are already seeing the stories from identity verification services. Here is what Yoti stated after the Act became law:
We have a range of age assurance solutions which allow platforms to know the age of users, without collecting vast amounts of personal information. These include:
Age estimation: a user’s age is estimated from a live facial image. They do not need to use identity documents or share any personal information. As soon as their age is estimated, their image is deleted – protecting their privacy at all times. Facial age estimation is 99% accurate and works fairly across all skin tones and ages.
Digital ID app: a free app which allows users to verify their age and identity using a government-issued identity document. Once verified, users can use the app to share specific information – they could just share their age or an ‘over 18’ proof of age.
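To show how a multi-layered approach drawing on a menu of options like Yoti’s might be wired together, here is a hypothetical sketch. The function names, the buffer, and the thresholds are all invented for illustration; this is not any vendor’s actual API.

```python
# Hypothetical multi-layered age gate; functions and thresholds are invented.
MIN_AGE = 18
ESTIMATION_BUFFER = 5  # require the estimate to clear the limit by a margin


def estimate_age_from_selfie(image_bytes: bytes) -> float:
    """Stand-in for a facial age estimation service."""
    return 24.0  # placeholder result


def verify_age_with_document(user_id: str) -> bool:
    """Stand-in for a document-backed (digital ID) age check."""
    return True  # placeholder result


def age_gate(image_bytes: bytes, user_id: str) -> bool:
    # Layer 1: privacy-preserving estimation, with a safety margin.
    if estimate_age_from_selfie(image_bytes) >= MIN_AGE + ESTIMATION_BUFFER:
        return True
    # Layer 2: near the boundary, fall back to a document-backed check.
    return verify_age_with_document(user_id)


print(age_gate(b"<selfie bytes>", "user-123"))
```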
MailOnline has approached WhatsApp’s parent company Meta for comment now that the Bill has received Royal Assent, but the firm has so far refused to comment.
[T]o comply with the new law, the platform says it would be forced to weaken its security, which would not only undermine the privacy of WhatsApp messages in the UK but also for every user worldwide.
‘Ninety-eight per cent of our users are outside the UK. They do not want us to lower the security of the product, and just as a straightforward matter, it would be an odd choice for us to choose to lower the security of the product in a way that would affect those 98 per cent of users,’ Mr Cathcart has previously said.
Companies, from Big Tech down to smaller platforms and messaging apps, will need to comply with a long list of new requirements, starting with age verification for their users. (Wikipedia, the eighth-most-visited website in the UK, has said it won’t be able to comply with the rule because it violates the Wikimedia Foundation’s principles on collecting data about its users.)
All of these firms have shared their stories either before or after the Act became law, and those stories will change depending upon what Ofcom decides.
Approximately 2,700 years ago, the Greek poet Hesiod is recorded as saying “moderation is best in all things.” This applies to government regulations, including encryption and age verification regulations. As the United Kingdom’s House of Lords works through drafts of its Online Safety Bill, interested parties are seeking to influence the level of regulation.
In his assessment, Allan wondered whether the mandated encryption and age verification regulations would apply to all services, or just critical services.
Allan considered a number of services, but I’m just going to home in on two of them: WhatsApp and Wikipedia.
The Online Safety Bill and WhatsApp
WhatsApp is owned by a large American company called Meta, which causes two problems for regulators in the United Kingdom (and in Europe):
Meta is a large company.
Meta is an American company.
WhatsApp itself causes another problem for UK regulators:
WhatsApp encrypts messages.
Because of these three truths, UK regulators are not necessarily inclined to play nice with WhatsApp, which may affect whether WhatsApp will be required to comply with the Online Safety Bill’s regulations.
Allan explains the issue:
One of the powers the Bill gives to OFCOM (the UK Office of Communications) is the ability to order services to deploy specific technologies to detect terrorist and child sexual exploitation and abuse content….
But there may be cases where a provider believes that the technology it is being ordered to deploy would break essential functionality of its service and so would prefer to leave the UK rather than accept compliance with the order as a condition of remaining….
If OFCOM does issue this kind of order then we should expect to see some encrypted services leave the UK market, potentially including very popular ones like WhatsApp and iMessage.
Speaking during a UK visit in which he will meet legislators to discuss the government’s flagship internet regulation, Will Cathcart, Meta’s head of WhatsApp, described the bill as the most concerning piece of legislation currently being discussed in the western world.
He said: “It’s a remarkable thing to think about. There isn’t a way to change it in just one part of the world. Some countries have chosen to block it: that’s the reality of shipping a secure product. We’ve recently been blocked in Iran, for example. But we’ve never seen a liberal democracy do that.
“The reality is, our users all around the world want security,” said Cathcart. “Ninety-eight per cent of our users are outside the UK. They do not want us to lower the security of the product, and just as a straightforward matter, it would be an odd choice for us to choose to lower the security of the product in a way that would affect those 98% of users.”
In passing, the March Guardian article noted that WhatsApp requires UK users to be 16 years old. This doesn’t appear to be an issue for Meta, but could be an issue for another very popular online service.
The Online Safety Bill and Wikipedia
So how does the Online Safety Bill affect Wikipedia?
It depends on how the Online Safety Bill is implemented via the rulemaking process.
As in other countries, the true effects of legislation aren’t apparent until the government writes the rules that implement the legislation. It’s possible that the rulemaking will carve out an exemption allowing Wikipedia to NOT enforce age verification. Or it’s possible that Wikipedia will be mandated to enforce age verification for its writers.
If they do not (carve out exemptions) then there could be real challenges for the continued operation of some valuable services in the UK given what we know about the requirements in the Bill and the operating principles of services like Wikipedia.
For example, it would be entirely inconsistent with Wikipedia’s privacy principles to start collecting additional data about the age of their users and yet this is what will be expected from regulated services more generally.
Left unsaid is the same issue that affects encryption: age verification for Wikipedia may be required in the United Kingdom, but may not be required for other countries.
(Wales) used the example of Wikipedia, in which none of its 700 staff or contractors plays a role in content or in moderation.
Instead, the organisation relies on its global community to make democratic decisions on content moderation, and have contentious discussions in public.
By contrast, the “feudal” approach sees major platforms make decisions centrally, erratically, inconsistently, often using automation, and in secret.
By regulating all social media under the assumption that it’s all exactly like Facebook and Twitter, Wales said that authorities would impose rules on upstart competitors that force them into that same model.
One common thread between these two cases is that implementation of the regulations results in a privacy threat to the affected individuals.
For WhatsApp users, the privacy threat is obvious. If WhatsApp is forced to fully or partially disable encryption, or is forced to use an encryption scheme that the UK Government could break, then the privacy of every message (including messages between people outside the UK) would be threatened.
For Wikipedia users, anyone contributing to the site would need to undergo substantial identity verification so that the UK Government would know the ages of Wikipedia contributors.
This is yet another example of different government agencies working at cross purposes with each other, as the “catch the pornographers” bureaucrats battle with the “preserve privacy” advocates.
Meta, Wikipedia, and other firms would like the legislation to explicitly carve out exemptions for their firms and services. Opponents say that legislative carve outs aren’t necessary, because no one would ever want to regulate Wikipedia.
Yeah, and the U.S. Social Security Number isn’t an identification number either. (Not true.)