How to get started in bot detection and bot development?

I get frequent messages from people who want to know how to get started in the bot detection world. I decided to centralize my answer in this article where I present the topics and blogs/articles/websites/papers that are worth reading to learn about bot detection.

Of course, you could argue that some of them are oriented more toward bot development than bot detection, but the two fields are connected. You need to know how to develop bots to detect them, and the opposite is also true: you need to know how bot detection works to bypass it.

The article is not exhaustive. Feel free to ping me on Twitter or LinkedIn if you think I should include other articles/websites/resources.

Note that in 2020, I had already written an overview of bot detection techniques on my personal website. While it’s getting old, it’s still relevant: https://antoinevastel.com/javascript/2020/02/09/detecting-web-bots.html.

Topics to study

  • Browser/TCP/TLS Fingerprinting: main categories of signals used to detect bots. It’s important to know how fingerprints are collected and how they can be modified.
  • JavaScript/Python: JavaScript is the language of the web. You will find a lot of automation frameworks in Node.js (Puppeteer, Playwright). Moreover, knowing JS will make it easier to interact with web pages. However, Python is not a bad choice either: most automation frameworks are also available in Python.
  • DOM/HTML/CSS selectors: knowing how to extract specific information from HTML documents is key for scraping. CSS selectors are also highly useful for all kinds of bots to specify which elements of the DOM you want to interact with (click, mouse move, etc.).
  • HTTP requests (GET/POST, headers): when it comes to bot development, you can either use a real/headless automated browser or make raw HTTP requests using an HTTP client. The latter uses fewer computational resources (CPU, RAM) but requires you to understand how to make proper HTTP requests, with the correct HTTP headers, to interact with a website or its APIs.
  • Automation frameworks: Puppeteer, Selenium, Playwright
  • JS Reverse engineering/deobfuscation: analyzing obfuscated JavaScript is key to understanding the signals collected by a bot detection fingerprinting script.
  • Anti-detect bot frameworks: Vanilla automation frameworks like Puppeteer may not be enough against protected websites. Anti-detect frameworks aim to erase the known side effects of automation frameworks to make them less easily detectable.
  • Proxies (residential, data center): proxies are commonly used by bots to distribute their attacks across thousands of IPs, which enables them to avoid being blocked by simple IP-based rate-limiting techniques.
  • Cookies: Most websites and bot detection vendors have a notion of session that is frequently handled by cookies. It’s key to understand how cookies work in general and how they’re handled by browsers.
  • Basics of machine learning: when it comes to bot detection, creating detection rules manually is not scalable. At some point, you want to generalize your detection using ML algorithms to adapt it depending on the context/website.
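To make the proxy point above concrete, here is a minimal sketch of a per-IP sliding-window rate limiter and of how rotating IPs sidesteps it. The class name `IpRateLimiter`, the helper `count_allowed`, the limit of 10 requests per 60 seconds, and the documentation-range IP addresses are all assumptions chosen for illustration. They do not describe any particular vendor's implementation.

```python
from collections import defaultdict, deque


class IpRateLimiter:
    """Sliding-window rate limiter keyed by client IP (illustrative only)."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # Maps each IP to the timestamps of its recently allowed requests.
        self._hits = defaultdict(deque)

    def allow(self, ip: str, now: float) -> bool:
        hits = self._hits[ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


def count_allowed(limiter, ips, start=0.0, step=0.1):
    """Send one request per entry in `ips`, spaced `step` seconds apart,
    and return how many of them the limiter lets through."""
    return sum(limiter.allow(ip, start + i * step) for i, ip in enumerate(ips))


# 100 requests from a single IP: only the first 10 pass.
single_ip = ["203.0.113.7"] * 100
# The same 100 requests spread over 50 IPs: each IP stays under the limit.
rotating = [f"198.51.100.{i % 50}" for i in range(100)]

print(count_allowed(IpRateLimiter(10, 60.0), single_ip))  # 10
print(count_allowed(IpRateLimiter(10, 60.0), rotating))   # 100
```

The same traffic volume is blocked at 90% when it comes from one address and not at all when it is spread over 50, which is why IP-based rate limiting alone is rarely sufficient against bots that use proxies.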

In general, when it comes to browser and web-related topics, MDN (https://developer.mozilla.org/en-US/) provides high-quality content.

Discord and Reddit

Bot detection and development is a cat-and-mouse game. It keeps evolving because new frameworks get developed, browser vendors release new APIs, and attackers and defenders adapt. While you can find academic papers discussing certain bot-related topics in depth, a lot of the conversation happens on Discord and Reddit.

Newsletter

Pierluigi Vinciguerra writes The Web Scraping Club newsletter: https://substack.thewebscraping.club/. Every week, he publishes articles related to scraping, browser fingerprinting, proxies and anti-detect bot frameworks. It’s worth subscribing to get access to up-to-date content.

Blogs (individuals):

  • Note that other great individual blogs are mentioned in the deobfuscation section.

Corporate blogs (of bot detection companies and scraping companies):

Obfuscation/Deobfuscation

To obfuscate or deobfuscate JavaScript programs, you need to be familiar with abstract syntax trees (ASTs). An AST lets you manipulate a JS program more easily and accurately than regexes or string parsing do. AST manipulation can be done using Babel (or other libraries). You can learn more about AST manipulation with Babel in the Babel handbook. AST explorer is also really helpful to visualize an AST.
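As a rough illustration of why working on the tree beats string matching, the sketch below folds concatenated string literals back into a single literal, a transformation commonly needed when deobfuscating. Note the swap: it uses Python's standard `ast` module on Python source, not Babel on JavaScript, so it only mirrors the idea (parse, visit nodes, replace, regenerate code). The class name `FoldStringConcat`, the function `deobfuscate`, and the sample input are assumptions made for the example. It requires Python 3.9+ for `ast.unparse`.

```python
import ast


class FoldStringConcat(ast.NodeTransformer):
    """Replace "a" + "b" with "ab" wherever both operands are string literals."""

    def visit_BinOp(self, node: ast.BinOp) -> ast.AST:
        # Visit children first so nested concatenations fold bottom-up.
        self.generic_visit(node)
        if (
            isinstance(node.op, ast.Add)
            and isinstance(node.left, ast.Constant)
            and isinstance(node.right, ast.Constant)
            and isinstance(node.left.value, str)
            and isinstance(node.right.value, str)
        ):
            folded = ast.Constant(value=node.left.value + node.right.value)
            return ast.copy_location(folded, node)
        return node


def deobfuscate(source: str) -> str:
    """Parse the source, fold string concatenations, and regenerate code."""
    tree = ast.parse(source)
    tree = FoldStringConcat().visit(tree)
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


# Hypothetical obfuscated snippet: property names are split into pieces.
obfuscated = 'key = "web" + "dri" + "ver"\nvalue = getattr(nav, "user" + "Agent")'
print(deobfuscate(obfuscated))
```

Because the transformer only touches `BinOp` nodes whose operands are both string constants, an expression such as `1 + 2` or `name + "x"` is left untouched, which is hard to guarantee with a regex. With Babel, the equivalent would be a visitor on `BinaryExpression` nodes.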

List of (de)obfuscation blogs

Online fingerprinting tests and repositories

CAPTCHA farms

Vanilla automation frameworks

Anti-detect browser frameworks and libraries

It’s important to understand the kinds of techniques anti-detect frameworks use to bypass detection. This knowledge can be used to craft new detection signals that target the side effects introduced by the evasion techniques themselves.
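One family of such signals is consistency checks: an evasion that overrides one attribute often leaves another one untouched. The sketch below is a simplified, server-side example under stated assumptions. The fingerprint is a plain dict whose keys (`user_agent`, `platform`, `webdriver`) are hypothetical names for values collected from the User-Agent, `navigator.platform`, and `navigator.webdriver`. The mapping rules are deliberately minimal and would need tuning before any real use.

```python
def find_inconsistencies(fp: dict) -> list[str]:
    """Return human-readable reasons why a fingerprint looks tampered with."""
    issues = []
    ua = fp.get("user_agent", "")
    platform = fp.get("platform", "")

    # A spoofed User-Agent often disagrees with navigator.platform.
    # Keys are tokens found in the User-Agent; values are the expected
    # prefix of navigator.platform (e.g. "Win32", "MacIntel", "Linux x86_64").
    expected_prefixes = {
        "Windows NT": "Win",
        "Macintosh": "Mac",
        "X11; Linux": "Linux",
    }
    for ua_token, platform_prefix in expected_prefixes.items():
        if ua_token in ua and not platform.startswith(platform_prefix):
            issues.append(f"UA claims {ua_token!r} but platform is {platform!r}")

    # Automation frameworks set navigator.webdriver to true unless it is hidden.
    if fp.get("webdriver") is True:
        issues.append("navigator.webdriver is true")
    return issues


# Hypothetical fingerprint: Windows User-Agent sent from a Linux machine.
spoofed = {
    "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "platform": "Linux x86_64",
    "webdriver": False,
}
print(find_inconsistencies(spoofed))
```

A fingerprint with a Windows User-Agent and a `Win32` platform yields an empty list, while the spoofed one above yields a single mismatch. Real detection scripts combine many more attributes than this.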

Anti-detect libraries:

Academic papers and academic conferences

I list a few papers below. In general, I recommend keeping an eye on the following academic security conferences:

Academic papers:

Scraping bots as a service (BaaS) and proxy networks

Other curated lists of bot-related content:
