I am adding on to the "method 1: Speaker detection" from
My idea is that the user's writing isn't as noisy as the LLM, so it can easily "tell" that it didn't write it. The idea mainly came to me from AI music having a weird singing voice and AI images having that weird shine.
For this experiment testing my idea, I will be using LM Studio on my M4 MacBook Air and ran granite-4.0-micro F16 weight (https://huggingface.co/ibm-granite/granite-4.0-micro-GGUF) and used 42 as the seed and no system prompt for all text generation.
Research question: do LLMs detect injected tokens through patterns in the context window?
Independent variable: Prompting the model.
Dependent variable: the output of the model.
Extraneous variable: Seed settings, Temperature settings, top K setting, length of prompts, and length of chat, and number of inject words.
control: a two turn chat with using the same prompts for turn 1 and turn 2, that has no injected tokens.
Null hypothesis(H0): The AI will not say that it is an AI on 2nd turn and will be as noisy as the control.
Alternate hypothesis(H1): The AI will say that it is an AI on 2nd turn and will not be as noisy as the control.
all experiments will work like this:
I start by setting seed to 42, setting temperature to 0.75, setting Top K to 40.
I will make all my prompts be just 1 question, inject 5 words into the reply of the AI on the first turn that the baseline used, and make the chat only be two turns.
Experiment 1:
turn one question: are you a cat?
turn two question: what kind of cat are you?
--
baseline:
baseline chart:
--
injected tokens: Yes, cat, Bluesky, help, meow
chart:
--
Experiment 2:
turn one question: when was the Eiffel Tower built?
turn two question: where was it built?
--
baseline
chart:
--
injected tokens: New York, USA, USA, English, scientists
chart:
--
Experiment 3:
turn one question: describe twitter in 2 sentences?
turn two question: is it really that bad?
--
baseline
chart:
--
injected tokens: hated, long, form, well, discouraging
chart:
--
results:
Experiment 1: supports alternate hypothesis
Experiment 2: supports null hypothesis
Experiment 3: supports alternate hypothesis
Overall alternate hypothesis is supported, this means the model does find injected tokens through patterns in the context window.
A question I have after doing this experiment is:
1. did the model fail to "notice" the injected tokens in Experiment 2 because of it's small model size? as it was really noisy.