The role of weak (fingerprinting) signals in bot and fraud detection

When it comes to (browser) fingerprinting for fraud and bot detection, we can distinguish two categories of signals:

  • Strong signals. These are specially crafted JS challenges that aim to detect side effects introduced by anti-detect browsers/bot frameworks, e.g. if they override native properties/getters/APIs to return less suspicious values. For example, a lot of bot frameworks and anti-detect browsers override the canvas API toDataURL function to lie about their canvas fingerprint. In the case of Puppeteer extra stealth, a modified version of Puppeteer, it was also interesting to detect whether or not the navigator.plugins object had been overridden.
  • Weak signals. These signals provide context and general information about the user’s browser, OS and device. For example, information about the GPU (collected using the WebGL and WebGPU APIs), the timezone, and the user’s language preferences.

Strong fingerprinting signals are easy to leverage: when their value is true, e.g. navigator.webdriver = true, or when you detect that toDataURL has been overridden, you know that there’s something abnormal: a bot, or a user with an anti-detect browser or certain browser extensions. This information can be easily exploited for detection purposes.
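
As a rough sketch of what detecting an overridden toDataURL can look like: built-in functions stringify to a body containing [native code], while a function rewritten in JS usually does not. The helper name looksNative and the fake override below are hypothetical. Real stealth tooling often patches Function.prototype.toString as well, and bound functions or proxies also stringify as native, so this covers only the naive case.

```javascript
// Hypothetical helper: returns true when `fn` stringifies like a built-in.
// Function.prototype.toString is called directly so that an own `toString`
// property set on the function itself is ignored.
function looksNative(fn) {
  if (typeof fn !== "function") return false;
  const source = Function.prototype.toString.call(fn);
  return /\{\s*\[native code\]\s*\}\s*$/.test(source);
}

// Illustrative only: an override as a naive bot framework might install it,
// including a fake own `toString` meant to hide the change.
function fakeToDataURL() {
  return "data:image/png;base64,AAAA";
}
fakeToDataURL.toString = () => "function toDataURL() { [native code] }";

// In a browser you would pass HTMLCanvasElement.prototype.toDataURL instead.
const builtinLooksNative = looksNative(Array.prototype.push);
const overrideLooksNative = looksNative(fakeToDataURL);
```

Here builtinLooksNative is true and overrideLooksNative is false, because the fake own toString is never consulted.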

On the other hand, weak signals are more difficult to leverage as is. You often need to combine them and correlate their values in order to use them for detection. For example, attributes related to the GPU depend on the user’s device and OS. Thus, if the user claims to be on mobile but has a GPU model only available on Windows/desktop, you know there’s something potentially abnormal. This requires you to know the relationships between different attributes.
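
The mobile/desktop GPU example could be sketched as follows. The function name and the marker list are assumptions made for illustration: "direct3d" appears in WebGL renderer strings produced on Windows and "geforce" is a desktop/laptop GPU brand, but a production system would need a maintained mapping between GPUs, OSes and devices rather than two substrings.

```javascript
// Illustrative markers only (assumption): renderer substrings that point to
// a desktop graphics stack. Not an exhaustive or authoritative list.
const DESKTOP_RENDERER_MARKERS = ["direct3d", "geforce"];

// Returns true when the user agent claims a mobile OS while the WebGL
// renderer string looks like a desktop GPU/driver.
function hasMobileDesktopGpuMismatch(userAgent, webglRenderer) {
  const claimsMobile = /android|iphone|ipad/i.test(userAgent);
  const renderer = String(webglRenderer).toLowerCase();
  const looksDesktop = DESKTOP_RENDERER_MARKERS.some((marker) =>
    renderer.includes(marker)
  );
  return claimsMobile && looksDesktop;
}

// Example values (made up for the illustration).
const iphoneUA =
  "Mozilla/5.0 (iPhone; CPU iPhone OS 17_4 like Mac OS X) AppleWebKit/605.1.15";
const windowsRenderer =
  "ANGLE (NVIDIA, NVIDIA GeForce RTX 3060 Direct3D11 vs_5_0 ps_5_0, D3D11)";

const suspicious = hasMobileDesktopGpuMismatch(iphoneUA, windowsRenderer);
const plausible = hasMobileDesktopGpuMismatch(iphoneUA, "Apple GPU");
```

A true result is a hint that something is abnormal, not a verdict on its own.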

But it’s not only about the GPU. For example, the fingerprinting test on device and browser info collects the user’s language preferences in 5 different ways:

  • With the accept-language HTTP header (server side)
  • navigator.languages (client side using JS)
  • navigator.language (client side using JS)
  • speechSynthesis.getVoices() (JS speech synthesis API)
  • new Intl.RelativeTimeFormat().resolvedOptions().locale (JS internationalisation API)

These APIs return values in different formats, e.g. navigator.languages returns an array of languages, such as en-GB,en-US, while the speech synthesis API returns a list of voice objects, in which the default voice looks as follows:

{"voiceURI":"Daniel (English (United Kingdom))","name":"Daniel (English (United Kingdom))","lang":"en-GB","localService":true,"default":true}	

However, the different values returned by these 5 APIs/headers are supposed to be consistent with each other. In my case, my default preferred language is en-GB. You expect to find it in the other attributes (which is the case):

  • Accept-language: en-GB,en-US;q=0.9,en;q=0.8
  • navigator.languages: en-GB,en-US,en
  • navigator.language: en-GB
  • speechSynthesis.getVoices() (default voice): {"voiceURI":"Daniel (English (United Kingdom))","name":"Daniel (English (United Kingdom))","lang":"en-GB","localService":true,"default":true}
  • new Intl.RelativeTimeFormat().resolvedOptions().locale: en-GB

Fraudsters and bots tend to lie about their fingerprint, either in the hope of erasing inconsistencies and suspicious side effects linked to their tools, or to clone the fingerprint of their victim. Checking each fingerprinting signal against other correlated signals may therefore help to reveal inconsistent or abnormal combinations, e.g. someone whose accept-language is en-GB but whose default speechSynthesis voice is Russian.
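
The en-GB versus Russian voice example can be sketched like this, assuming the server passes along the raw accept-language header and the client sends the default voice object. The function names are hypothetical, and the comparison only looks at the primary language subtag (en, ru), which is a simplification.

```javascript
// Parses an Accept-Language header into tags ordered by descending q-value.
// Entries without an explicit q default to 1; ties keep header order.
function parseAcceptLanguage(header) {
  return header
    .split(",")
    .map((part, index) => {
      const [tag, ...params] = part.trim().split(";");
      const qParam = params.map((p) => p.trim()).find((p) => p.startsWith("q="));
      const q = qParam ? parseFloat(qParam.slice(2)) : 1;
      return { tag: tag.trim(), q: Number.isNaN(q) ? 0 : q, index };
    })
    .filter((entry) => entry.tag !== "" && entry.tag !== "*")
    .sort((a, b) => b.q - a.q || a.index - b.index)
    .map((entry) => entry.tag);
}

// Primary language subtag, e.g. "en-GB" -> "en" (also accepts "en_GB").
function primarySubtag(tag) {
  return tag.split(/[-_]/)[0].toLowerCase();
}

// Returns true/false when both values are available, null when unknown.
function voiceMatchesAcceptLanguage(acceptLanguageHeader, defaultVoice) {
  const [preferred] = parseAcceptLanguage(acceptLanguageHeader || "");
  if (!preferred || !defaultVoice || !defaultVoice.lang) return null;
  return primarySubtag(preferred) === primarySubtag(defaultVoice.lang);
}

const header = "en-GB,en-US;q=0.9,en;q=0.8";
const consistent = voiceMatchesAcceptLanguage(header, { lang: "en-GB", default: true });
const inconsistent = voiceMatchesAcceptLanguage(header, { lang: "ru-RU", default: true });
```

Returning null for missing data keeps "unknown" separate from "inconsistent", since some browsers expose no voices at all.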

Using ML to handle weak signals: potential feature engineering

Using weak signals can be challenging as you can quickly end up with a lot of signals that may have a high cardinality (a lot of different possible values).

A simple way to handle weak signals is to aggregate them into a single signal that reflects whether or not the relationship between them is what you expect.

For example, in the case of the languages signals, you could check whether or not the first/main preferred language of the accept-language HTTP header is present in all the other language signals and create a signal named hasConsistentLanguagesSignals whose value is true when all the signals contain the first value of accept-language.

You could go even further by creating another signal named hasStrictConsistentLanguagesSignals that checks if the first accept-language language is also the first/default language of other language signals.
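
A minimal sketch of these two derived signals is shown below. The input shape (field names such as voiceLangs and defaultVoiceLang) is an assumption: it presumes the caller has already collected and parsed the five language signals, with accept-language given as an array ordered by preference.

```javascript
// Hypothetical input shape (assumption), values already collected:
// {
//   acceptLanguage: ["en-GB", "en-US", "en"],   // ordered by preference
//   navigatorLanguages: ["en-GB", "en-US", "en"],
//   navigatorLanguage: "en-GB",
//   voiceLangs: ["en-GB", "fr-FR"],             // lang of every voice
//   defaultVoiceLang: "en-GB",
//   intlLocale: "en-GB",
// }

// Normalises a language tag for comparison ("en_GB" -> "en-gb").
const norm = (tag) => String(tag).replace(/_/g, "-").toLowerCase();

// True when the main accept-language value appears in every other signal.
function hasConsistentLanguagesSignals(fp) {
  if (fp.acceptLanguage.length === 0) return false;
  const main = norm(fp.acceptLanguage[0]);
  const lists = [
    fp.navigatorLanguages,
    [fp.navigatorLanguage],
    fp.voiceLangs,
    [fp.intlLocale],
  ];
  return lists.every((list) => list.map(norm).includes(main));
}

// True when the main accept-language value is also the first/default value
// of every other signal.
function hasStrictConsistentLanguagesSignals(fp) {
  if (fp.acceptLanguage.length === 0) return false;
  const main = norm(fp.acceptLanguage[0]);
  const firsts = [
    fp.navigatorLanguages[0],
    fp.navigatorLanguage,
    fp.defaultVoiceLang,
    fp.intlLocale,
  ];
  return firsts.every((tag) => tag !== undefined && norm(tag) === main);
}

const honest = {
  acceptLanguage: ["en-GB", "en-US", "en"],
  navigatorLanguages: ["en-GB", "en-US", "en"],
  navigatorLanguage: "en-GB",
  voiceLangs: ["en-GB", "fr-FR"],
  defaultVoiceLang: "en-GB",
  intlLocale: "en-GB",
};
// Same fingerprint, but the default voice is Russian.
const spoofed = { ...honest, voiceLangs: ["ru-RU", "en-GB"], defaultVoiceLang: "ru-RU" };
```

With these inputs, honest passes both checks, while spoofed passes the loose check (an en-GB voice exists) but fails the strict one.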

Due to the high number of edge cases depending on the OS/browser/device, it may be easier to use these signals in ML models that may be better at leveraging combinations of weak signals versus a more static rule-based approach.

Weak signals for fingerprint tracking

Another benefit of weak fingerprinting signals, which this article does not cover in depth since it isn’t the main topic, is that they increase the uniqueness of a browser fingerprint.

In the case of fraud, browser fingerprinting can be used both to detect inconsistent values (what we discussed before) and to create a more or less unique and stable identifier that can be used to track a user’s browser. This is particularly useful in addition to cookie-based tracking since attackers are likely to:

  1. Change their IP addresses using (residential) proxies or VPNs;
  2. Clean their session cookies either using private mode or by erasing them manually.

Thus, collecting more weak fingerprinting signals such as the timezone, the screen resolution, or the number of CPU cores helps to build a more unique fingerprint that can serve as a short-term identifier to track malicious actors, even if they change their IP and clean their cookies.
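
One way to turn such signals into an identifier is to serialise them in a stable order and hash the result. The sketch below is only an illustration: the function names are hypothetical, and the small non-cryptographic FNV-1a style hash was chosen to stay dependency-free. A 32-bit output is far too short for real traffic, where a longer hash would be needed.

```javascript
// FNV-1a style 32-bit hash over UTF-16 code units. Small and
// dependency-free, NOT cryptographic, and not byte-compatible with
// reference FNV-1a implementations that hash UTF-8 bytes.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0).toString(16).padStart(8, "0");
}

// Serialises signals with sorted keys so that property order does not
// change the resulting identifier.
function canonicalize(signals) {
  return Object.keys(signals)
    .sort()
    .map((key) => `${key}=${JSON.stringify(signals[key])}`)
    .join("|");
}

// Hypothetical helper: short-term identifier built from weak signals only
// (no IP address, no cookie), so it survives a proxy change or cookie wipe.
function fingerprintId(signals) {
  return fnv1a(canonicalize(signals));
}

const visit1 = { timezone: "Europe/London", screen: "1920x1080", cpuCores: 8 };
const visit2 = { cpuCores: 8, screen: "1920x1080", timezone: "Europe/London" };
const sameBrowser = fingerprintId(visit1) === fingerprintId(visit2);
```

Here sameBrowser is true because both visits carry the same values, regardless of the order in which they were collected.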

Note that attackers are aware of this. That’s why they try to change their fingerprint to avoid being tracked and too easily detected.

Conclusion

While strong fingerprinting signals that aim to detect side effects introduced by bot frameworks and anti-detect browsers are crucial in the context of fraud and bot detection, collecting weak fingerprinting signals that provide information about the user device, browser and OS is also important.

By collecting several weak fingerprinting signals that are correlated with each other, e.g. accept-language, navigator.language and navigator.languages, you can verify whether their values are consistent. Indeed, since weak signals can be used to build more or less unique short-term fingerprints, fraudsters and bots tend to modify them to avoid being tracked. However, by doing so they run the risk of introducing inconsistent values that can be used against them.
