Is Zoom using your video calls to train AI?
Somebody went through Zoom’s terms of service with a fine-tooth comb, as one should, and the internet went ablaze at the prospect of Zoom claiming the right to use customer content to train AI. How bad is it?
The terms, section 10.2 in particular, outline Zoom’s rights to collect and use “Service Generated Data” (telemetry, product usage, and diagnostic data collected while users interact with their services) to develop their services and AI. Zoom considers the data they generate to be their property, and claims the right to use it as they wish.
As people shared their concerns, Zoom’s PR team scrambled to respond, claiming that users still own their content, that generated data is used to tweak the provided service, and that customer content is processed only by generative AI features that provide services such as meeting summaries. Zoom also added the following line to their terms:
Notwithstanding the above, Zoom will not use audio, video or chat Customer Content to train our artificial intelligence models without your consent.
August 7, 2023 – Zoom Terms of Service
Zoom insists that they do not use customer content to train AI, and that the extensive rights they claim to user content are mentioned for transparency’s sake and only used to provide and better their service.
It’s common for companies to reserve as many legal rights as possible in their terms of service, even if they do not intend to use them. That might be the case here as well. Still, it’s good to remember that this is the same company that claimed to provide strong end-to-end encryption without actually doing so and installed a web server on users’ machines, all in order to provide and better their service. The terms of service set the limits of what they can do, no matter what they communicate publicly. To their credit, the terms are quite extensive and transparent about what they use and how.
Grammarly is using user writings to train their AI
As revealed by a Mastodon post by @[email protected], Grammarly is feeding all their users’ writings to their AI, and the only way to opt out is to pay for 500+ business licenses. Grammarly claims to disassociate texts from user accounts.
Hypothetically, if your therapist uses Grammarly to write a summary of your sessions and includes your name in the text, there is still a possibility that creative prompt engineers could extract sensitive information from the resulting AI.
While having your spelling and grammar checker evolve based on how you write is not unexpected, and can even be desirable, the generative qualities of the resulting AI can leak information from the training dataset—in fact, imitating the original content is exactly what text-generative AI is meant to do.
Everyone is training their AI on published works on the internet
OpenAI launched a new web crawler, GPTBot, which browses the public internet. If you wish, you can ask GPTBot not to crawl your website by adding the following lines to the robots.txt file at the root of your domain:
User-agent: GPTBot
Disallow: /
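To check how a standards-following parser reads these two directives, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The URL `https://example.com/blog/post` and the user agent `SomeOtherBot` are hypothetical placeholders, not names from OpenAI's documentation. The sketch only shows how the rules are interpreted. robots.txt is advisory, so whether a given crawler actually honors it is up to that crawler.

```python
from urllib.robotparser import RobotFileParser

# The two directives from the article, as they would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical page on a hypothetical site (assumption for illustration).
PAGE = "https://example.com/blog/post"

gptbot_allowed = is_allowed(ROBOTS_TXT, "GPTBot", PAGE)
other_allowed = is_allowed(ROBOTS_TXT, "SomeOtherBot", PAGE)

print("GPTBot allowed:", gptbot_allowed)
print("SomeOtherBot allowed:", other_allowed)
```

Under these rules GPTBot is refused for every path, while crawlers that are not named remain unaffected. Blocking other crawlers would need their own `User-agent` entries.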
Notably, this setting only takes effect now, after GPT-4 has already been trained, presumably on publicly available information. Nobody knows exactly what data OpenAI’s GPT-4 and ChatGPT have been trained on.
Brave has launched an AI Summarizer for web content, and is also selling data for AI training. Their crawler respects Googlebot user-agent instructions. They also use user browsers to crawl the internet, though enrollment in the Web Discovery Project (WDP) is opt-in.
Google also uses their internet index to train Bard. Bard is based on LaMDA, a language model trained with a dataset called Infiniset. The composition of this dataset is detailed in the original LaMDA research paper, and it consists of 2.97B documents, including Wikipedia (12.5%) and Q&A Sites (12.5%), as well as 1.12B dialogs, of which 50% come from public forums.
AI imitation is the new norm
Scraping user content from public sources has rapidly become the industry standard. Companies are amassing vast amounts of user data, from published texts and images to personal interactions, to feed into new AI models. However, this data aggregation often occurs without explicit consent or awareness from users, compromising their privacy.
The ethical considerations are vast. Without informed consent, users and their information are treated as commodities—means to an end without regard to their inherent worth and autonomy. From a virtue ethics perspective, this is the very definition of unethical action. Training AI models with potentially sensitive information creates the possibility of this information leaking to the public, either through misuse or mistake. Without transparency on what information is used and how, it is nearly impossible to ensure that the models do not have inherent bias.
Striking a balance between technological advancement and safeguarding user privacy has become paramount. There is an urgent need for transparent data usage policies and stringent ethical guidelines in the evolving landscape of AI-driven applications.