AI and My Content

Use of any content on this site for artificial intelligence and machine learning training is forbidden. It doesn’t matter whether whoever is building the AI model is willing to attribute all data in their dataset in a public forum and to release the resulting model under a Creative Commons or similarly permissive licence with no commercial gain. I am not permitting anyone to use my content for the purpose of training AI and ML models.

No exceptions. Even my family is not allowed to create a pretend Murteza trained on any content on this site, regardless of what stage of life I am in.

I like sharing, reading others’ opinions on topics and talking about them. And I want to encourage that. The very purpose of this site is to encourage dialogue among people, not between humans and machines. There is enough social tension in this world, and I see conversation as the only medicine. Preventing conversation by replacing one of the parties with a machine undermines that purpose.

The second reason I am taking this stand is the bias problem in artificial intelligence models. We all have biases. We develop them throughout our lives, and everything we build carries our bias to some degree. I want to avoid adding more fuel to that fire.

My third concern is impersonation. If my content goes into a language model, it will be able to imitate my writing style to a very convincing level. This can be used against me or against other people in my life. Scams are the most common and most obvious example. Scammers have upped their game by using large language models such as ChatGPT to produce convincing fake receipts that trick people into “paying a debt” they never had. I don’t want to contribute to that fire either.

This is hypocrisy, you might say

I don’t believe I am being hypocritical by not wanting my website to be used in AI training. My reason for blocking AI is that I want humans to communicate more, and I don’t want that human interaction to be replaced by robots. I actually believe I would be a hypocrite if I contributed to the development of AI chatbots and whatnot.


Telling that to robots in robot language

Computer programs that go from website to website and extract data from each site they visit are called “web crawlers” or “spiders”. Those programs won’t understand that I don’t want them to collect data from my site for the purpose of AI training. Instead, they read a file called robots.txt that every website can provide. That file defines which parts of a website they are disallowed from “crawling”. Even though you are a human, I assume, you can read this site’s robots.txt here.
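
If you have never seen one, a robots.txt file is nothing more than a plain text list of rules. Below is a minimal sketch of the format; “SomeBot” is a made-up placeholder, not a real crawler, and the real identifiers you would use are listed further down this page.

# Hypothetical example: "SomeBot" is a placeholder, not a real crawler.
# "Disallow: /" tells that one bot to stay away from the entire site;
# an empty "Disallow:" line would instead let it crawl everything.
User-agent: SomeBot
Disallow: /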

Below you will find some code blocks you can add to your site’s robots.txt if you want to block web crawlers that gather data for AI projects. Because every website can have a different file structure, I assumed your site is a static site like mine. If you have a dynamic site, such as one powered by WordPress, remember to also add disallow rules for directories such as /bin and /admin for the search engines you allow (see the sketch below). I strongly advise you to read your site’s current robots.txt before overriding it.
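
To make that concrete, this is roughly what such a rule could look like for one of the allowed search engine bots. It is only a sketch: /bin and /admin are just the example directories from above, so replace them with whatever private paths your own site actually uses.

# Sketch only: /bin and /admin are example paths, adjust them to your site.
# This lets googlebot crawl everything except those two directories.
User-agent: googlebot
Disallow: /bin/
Disallow: /admin/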

If you are paying someone else to manage your site for you, you may need to contact them to have robots.txt updated with these rules.

Method 1: Allow only search engines, disallow everything else

This is an easy and quick but dirty method, and I do not recommend it. It tells search engine bots to crawl the entire website while telling all other bots not to crawl anything, which is probably only desirable on a static site such as mine. It also blocks bots not related to AI, such as Internet Archive, which is why I do not recommend it.

# Be aware of bad bots. robots.txt is considered voluntary and
# there are crawler bots which may not obey disallow rules
# defined here. To block bad bots from overloading your website,
# you will need to use shell scripts and firewall rules.
#
# This setup will allow the entire site to search engines, which
# is probably only desirable on static sites. If your site is not
# a static site, add disallow rules for directories such as /bin
# and /admin.
#
# If there are more search engines you would like to allow, you
# will need to dig around and find out more about their crawler
# bots. A good place to start would be:
# https://en.wikipedia.org/wiki/List_of_search_engines
#
# Remember to add your sitemap at the end if you have one.


###   SPECIAL CRAWLERS - You may want to allow these   ###
## Internet Archive
# Unlike Common Crawl, Internet Archive allows anyone to view
# archived webpages, but it does not allow mass download of the
# archives, so an AI project would still need to crawl your site
# itself anyway.
# https://archive.org/details/webwidecrawl?tab=about
# 
# To allow Internet Archive, remove "/" in Disallow line.
User-agent: ia-archiver
User-agent: ia-archiver-web.archive.org
User-agent: ia_archiver
User-agent: ia_archiver-web.archive.org
User-agent: archive.org_bot
Disallow: /


###   ALLOWED   ###
## Google
User-agent: googlebot
Disallow:
User-agent: googlebot-image
Disallow:
User-agent: googlebot-mobile
Disallow:
User-agent: googlebot-news
Disallow:

## Microsoft Bing
User-agent: bingbot
Disallow:

## Yahoo
User-agent: Slurp
Disallow:

## Brave Search
# Brave hasn't disclosed much about their crawler yet, thus we
# don't know enough to allow Brave Search. Sad.
# User-agent: 
# Disallow:

## Yandex 
# Comment out the 3 lines below to disallow Yandex from crawling your site
User-agent: Yandex
User-agent: YandexBot 
Disallow:


###   DISALLOWED   ###
# Disallow entire website to every other bot
User-agent: *
Disallow: /


###   SITEMAP   ###
# Remember to add your sitemap to let allowed robots know
# where new and updated pages are and where they should focus.
# This is considered important for SEO.
Sitemap: 
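
If you are unsure what that last line should look like once filled in, it is simply the full address of your sitemap file. The address below is only a placeholder; use your own domain and sitemap path.

# Placeholder example only, replace with your own domain and sitemap path:
Sitemap: https://example.com/sitemap.xml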

Method 2: Add individual rules for each AI project

If you want to find each web crawler’s “User-agent” identifier, you may need to dig hard on the internet. Below are the AI crawler bots I could find. If you know of other projects and can find their identifiers, please inform me on Mastodon or via email. My contact information is in the footer.

Common Crawl

Common Crawl is not specifically an AI project. But since Common Crawl makes the data it collects from our sites available to everyone for free, it has become an indispensable data bank for everybody with an AI project.

User-agent: CCBot
Disallow: /

ChatGPT by OpenAI

According to OpenAI’s documentation, ChatGPT does honour robots.txt disallow rules.

User-agent: ChatGPT-User
Disallow: /

Bard by Google

Bard’s official website has no documentation about what its crawler is or will be called. How the data for the experimental build of Bard was gathered is also kept secret.

Meta/Facebook AI

Meta is keeping the name of their AI crawler secret, as far as I know. If you know more about it, please share so that I can update this section.

Comments

Reply either on the Fediverse or via email. I will publicly share your comment here only with your consent.