The role of weak (fingerprinting) signals in bot and fraud detection
When it comes to (browser) fingerprinting for fraud and bot detection, we can distinguish two categories of signals:
- Strong signals. These are specially crafted JS challenges that aim to detect side effects introduced by anti-detect browsers/bot frameworks, e.g. if they override native properties/getters/APIs to return less suspicious values. For example, a lot of bot frameworks and anti-detect browsers override the canvas API toDataURL function to lie about their canvas fingerprint. In the case of Puppeteer extra stealth, a modified version of Puppeteer, it was also interesting to detect whether or not the navigator.plugins object had been overridden.
- Weak signals. These signals provide context and general information about the user’s browser, OS and device. For example, information about the GPU (collected using the WebGL and WebGPU APIs), information about the timezone, and the user’s language preferences.
Strong fingerprinting signals are easy to leverage: when their value is true, e.g. navigator.webdriver = true, or when you detect that toDataURL has been overridden, you know that there’s something abnormal: a bot, a user with an anti-detect browser (or certain browser extensions). This information can be easily exploited for detection purposes.
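As a rough illustration of the two strong signals mentioned above, the sketch below reads navigator.webdriver and checks whether toDataURL still looks like a built-in function. The helper names (looksNative, collectStrongSignals) and the env parameter are assumptions made for this example, and the check is deliberately naive: a tool that also patches Function.prototype.toString would defeat it.

```javascript
// Matches how engines usually print built-in functions,
// e.g. "function toDataURL() { [native code] }".
const NATIVE_SOURCE = /^function\s*[\w$ ]*\(\)\s*\{\s*\[native code\]\s*\}$/;

// True when the function's source text looks like a built-in.
function looksNative(fn) {
  return (
    typeof fn === "function" &&
    NATIVE_SOURCE.test(Function.prototype.toString.call(fn))
  );
}

// `env` is a window-like object (hypothetical parameter);
// in a real browser you would pass `globalThis`.
function collectStrongSignals(env) {
  const toDataURL = env.HTMLCanvasElement?.prototype?.toDataURL;
  return {
    webdriver: env.navigator?.webdriver === true,
    // Only flagged when the function exists but does not look native.
    toDataURLOverridden:
      typeof toDataURL === "function" && !looksNative(toDataURL),
  };
}
```

Either flag being true is enough to treat the session as abnormal, which is what makes this category of signal easy to use.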
On the other hand, weak signals are more difficult to leverage as is. You often need to combine them and correlate their values in order to use them for detection. For example, attributes related to the GPU depend on the user’s device and OS. Thus, if the user claims to be on mobile but has a GPU model only available on Windows/desktop, you know there’s something potentially abnormal. This requires you to know the relationships between different attributes.
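The GPU example can be sketched as a small lookup of known relationships. Everything here is illustrative: the function names, the single rule in RENDERER_OS_HINTS (Direct3D renderers are expected on Windows only), and the user agent parsing are assumptions for the sake of the example, not a complete rule set.

```javascript
// Illustrative, incomplete rules (assumption): substrings of the WebGL
// renderer string and the OS families where they are expected.
const RENDERER_OS_HINTS = [
  { pattern: /Direct3D|D3D11/i, expectedOs: ["windows"] },
];

// Very rough OS extraction from the user agent string.
// iPhone/iPad is tested before "Mac OS X" because iOS user agents
// contain "like Mac OS X".
function claimedOs(userAgent) {
  if (/Android/i.test(userAgent)) return "android";
  if (/iPhone|iPad/i.test(userAgent)) return "ios";
  if (/Windows/i.test(userAgent)) return "windows";
  if (/Mac OS X/i.test(userAgent)) return "macos";
  return "unknown";
}

// True when the renderer matches a rule whose expected OS list
// does not contain the OS claimed in the user agent.
function hasInconsistentGpu(userAgent, webglRenderer) {
  const os = claimedOs(userAgent);
  if (os === "unknown") return false;
  return RENDERER_OS_HINTS.some(
    (hint) => hint.pattern.test(webglRenderer) && !hint.expectedOs.includes(os)
  );
}
```

In practice the table of relationships is the hard part, since it has to be maintained as new devices, drivers and browser versions appear.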
But it’s not only about the GPU. For example, the fingerprinting test on device and browser info collects the user’s language preferences in 5 different ways:
- accept-language HTTP header (server side)
- navigator.languages (client side using JS)
- navigator.language (client side using JS)
- speechSynthesis.getVoices() (JS speech synthesis API)
- new Intl.RelativeTimeFormat().resolvedOptions().locale (JS internationalisation API)
These APIs return different values, e.g. navigator.languages returns an array of languages, such as en-GB,en-US and the speech synthesis API returns an object about the default language that looks as follows:
{"voiceURI":"Daniel (English (United Kingdom))","name":"Daniel (English (United Kingdom))","lang":"en-GB","localService":true,"default":true}
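The four client-side collection points listed above could be gathered roughly as follows. This is a minimal sketch: the function name, the env parameter and the returned field names are assumptions, and the accept-language header is not included because it is read server side from the request. Note that in some browsers getVoices() returns an empty list until the voiceschanged event has fired, so the default voice may be missing on the first call.

```javascript
// `env` is a window-like object (hypothetical parameter);
// in a real browser you would pass `globalThis`.
function collectClientLanguageSignals(env) {
  // May be empty if the voices have not been loaded yet.
  const voices = env.speechSynthesis?.getVoices?.() ?? [];
  const defaultVoice = voices.find((voice) => voice.default) ?? null;

  return {
    navigatorLanguages: Array.from(env.navigator?.languages ?? []),
    navigatorLanguage: env.navigator?.language ?? null,
    speechDefaultLang: defaultVoice ? defaultVoice.lang : null,
    // Intl.RelativeTimeFormat must be constructed with `new`.
    intlLocale: new Intl.RelativeTimeFormat().resolvedOptions().locale,
  };
}
```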
However, the different values returned by these 5 APIs/headers are supposed to be consistent with each other. In my case, my default preferred language is en-GB. You expect to find it in the other attributes (which is the case):
- Accept-language: en-GB,en-US;q=0.9,en;q=0.8
- navigator.languages: en-GB,en-US,en
- navigator.language: en-GB
- speechSynthesis.getVoices() (default voice): {"voiceURI":"Daniel (English (United Kingdom))","name":"Daniel (English (United Kingdom))","lang":"en-GB","localService":true,"default":true}
- new Intl.RelativeTimeFormat().resolvedOptions().locale: en-GB
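To compare the accept-language header with the client-side values, it first has to be turned into an ordered list of language tags. The parser below is a simplified sketch (the function name is an assumption, and it ignores some corner cases of the HTTP specification): it splits the header, reads the optional q-value of each entry, and sorts by descending preference.

```javascript
// Parses "en-GB,en-US;q=0.9,en;q=0.8" into ["en-GB", "en-US", "en"],
// ordered by descending q-value (a missing q-value counts as 1).
function parseAcceptLanguage(header) {
  return header
    .split(",")
    .map((part, index) => {
      const [tag, ...params] = part.trim().split(";");
      const qParam = params
        .map((param) => param.trim())
        .find((param) => param.startsWith("q="));
      const q = qParam ? Number.parseFloat(qParam.slice(2)) : 1;
      return { tag: tag.trim(), q: Number.isNaN(q) ? 0 : q, index };
    })
    // Drop empty entries and the "*" wildcard, which names no language.
    .filter((entry) => entry.tag !== "" && entry.tag !== "*")
    // Ties keep the order in which they appeared in the header.
    .sort((a, b) => b.q - a.q || a.index - b.index)
    .map((entry) => entry.tag);
}
```

The first element of the returned list is the main preferred language, which is the value the other language signals are expected to agree with.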
Since fraudsters and bots tend to lie about their fingerprint, either in the hope of erasing inconsistencies and suspicious side effects linked to their tools, or to clone the fingerprint of their victim, checking each fingerprinting signal against other correlated signals may help to reveal inconsistent or abnormal combinations of signals, e.g. someone whose accept-language is en-GB but whose default speechSynthesis voice is Russian.
Using ML to handle weak signals: potential feature engineering
Using weak signals can be challenging as you can quickly end up with a lot of signals that may have a high cardinality (a lot of different possible values).
A simple way to handle weak signals is to merge them/aggregate them into a signal that reflects whether or not the relationship between different signals is what you expect.
For example, in the case of the languages signals, you could check whether or not the first/main preferred language of the accept-language HTTP header is present in all the other language signals and create a signal named hasConsistentLanguagesSignals whose value is true when all the signals contain the first value of accept-language.
You could go even further by creating another signal named hasStrictConsistentLanguagesSignals that checks if the first accept-language language is also the first/default language of other language signals.
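The two aggregated signals described above could be computed along these lines. This is a sketch under several assumptions: the shape of the signals object (field names such as acceptLanguage and speechDefaultLang) is invented for the example, accept-language is assumed to be already parsed into an ordered list of tags, and tags are compared by exact, case-insensitive match, which may be too strict for some browser/OS combinations.

```javascript
// Compare tags case-insensitively, e.g. "en-gb" equals "en-GB".
const normalizeTag = (tag) => String(tag).toLowerCase();

function languageConsistency(signals) {
  const main = normalizeTag(signals.acceptLanguage[0] ?? "");

  // One list per signal, holding the languages it exposes, in order.
  // Missing values (null/undefined) produce an empty list.
  const others = [
    signals.navigatorLanguages,
    [signals.navigatorLanguage],
    [signals.speechDefaultLang],
    [signals.intlLocale],
  ].map((list) => list.filter((tag) => tag != null).map(normalizeTag));

  return {
    // The main accept-language value appears somewhere in every signal.
    hasConsistentLanguagesSignals:
      main !== "" && others.every((list) => list.includes(main)),
    // The main accept-language value is the first value of every signal.
    hasStrictConsistentLanguagesSignals:
      main !== "" && others.every((list) => list[0] === main),
  };
}
```

With this shape, a missing signal (for example, no default voice yet) makes both flags false, so a real implementation would probably want to distinguish "missing" from "inconsistent".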
Due to the high number of edge cases depending on the OS/browser/device, it may be easier to use these signals in ML models that may be better at leveraging combinations of weak signals versus a more static rule-based approach.
Weak signals for fingerprint tracking
Another benefit of weak fingerprinting signals, only touched on briefly here as it isn’t the main topic of this article, is that they increase the uniqueness of a browser fingerprint.
In the case of fraud, browser fingerprinting can be used both to detect inconsistent values (as discussed above) and to create a more or less unique and stable identifier that can be used to track a user’s browser. This is particularly useful in addition to cookie-based tracking since attackers are likely to:
- Change their IP addresses using (residential) proxies or VPNs;
- Clean their session cookies either using private mode or by erasing them manually.
Thus, collecting more weak fingerprinting signals, such as the timezone, the screen resolution, or the number of CPU cores, helps to build a more unique fingerprint that can serve as a short-term identifier to track malicious actors, even if they change their IP and clean their cookies.
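As a sketch of how weak signals can be folded into a short-term identifier, the example below serialises a set of signals in a stable key order and hashes the result. The function names and the signal keys are assumptions, and the 32-bit FNV-1a-style hash is used only to keep the example self-contained; a real system would more likely use a longer cryptographic digest such as SHA-256 to limit collisions.

```javascript
// 32-bit FNV-1a-style hash over UTF-16 code units, as 8 hex characters.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0).toString(16).padStart(8, "0");
}

// Builds an identifier from a flat object of weak signals.
function fingerprintId(signals) {
  // Sort keys so that property order does not change the identifier.
  const canonical = Object.keys(signals)
    .sort()
    .map((key) => `${key}=${JSON.stringify(signals[key])}`)
    .join("|");
  return fnv1a(canonical);
}

// Example input (hypothetical values).
const exampleId = fingerprintId({
  timezone: "Europe/London",
  screenResolution: [1920, 1080],
  hardwareConcurrency: 8,
});
```

The more signals go into the object, the fewer browsers share the same identifier, but also the more likely it is that a legitimate change (a new monitor, a browser update) alters it, which is why it is best treated as short-term.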
Note that attackers are aware of this. That’s why they will try to change their fingerprint to avoid being tracked and too easily detected.
Conclusion
While strong fingerprinting signals that aim to detect side effects introduced by bot frameworks and anti-detect browsers are crucial in the context of fraud and bot detection, collecting weak fingerprinting signals that provide information about the user device, browser and OS is also important.
By collecting several weak fingerprinting signals that are correlated with each other, e.g. accept-language, navigator.language and navigator.languages, you can verify if their values are consistent. Indeed, since weak signals can be used to build more or less unique short-term fingerprints, fraudsters and bots tend to modify them to avoid being tracked. However, by doing so they run the risk of introducing inconsistent values that can be used against them.