How to get started in bot detection and bot development?

I get frequent messages from people who want to know how to get started in the bot detection world. I decided to centralize my answer in this article where I present the topics and blogs/articles/websites/papers that are worth reading to learn about bot detection.

Of course, you can argue that some of them are more oriented toward bot development than bot detection, but both fields are connected. You need to know how to develop bots to detect them, and the opposite is also true: you need to know how bot detection works to bypass it.

The article is not exhaustive. Feel free to ping me on Twitter or LinkedIn if you think I should include other articles/websites/resources.

Note that in 2020, I had already written an overview of bot detection techniques on my personal website. While it’s getting old, it’s still relevant: https://antoinevastel.com/javascript/2020/02/09/detecting-web-bots.html.

Topics to study

Browser/TCP/TLS Fingerprinting: main categories of signals used to detect bots. It’s important to know how fingerprints are collected and how they can be modified.

JavaScript/Python: JavaScript is the language of the web. You will find a lot of automation frameworks in Node.js (Puppeteer, Playwright). Moreover, knowing JS will make it easier to interact with web pages. However, Python is not a bad choice either, since most automation frameworks are also available in Python.

DOM/HTML/CSS selectors: knowing how to extract specific information from HTML documents is key for scraping. CSS selectors are also highly useful for all kinds of bots to specify which elements of the DOM you want to interact with (click, mouse move, etc.).
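To make the extraction idea concrete, here is a minimal sketch that pulls the links nested under `<li class="product">` out of an HTML snippet, which is roughly what the CSS selector `li.product a` expresses. The HTML, the class names and the helper names are all made up for this example, and it only uses Python's standard library; in practice you would more likely rely on a library with real CSS selector support.

```python
from html.parser import HTMLParser

# Hypothetical HTML snippet, made up for this example.
SAMPLE_HTML = """
<ul>
  <li class="product"><a href="/item/1">Red shoes</a></li>
  <li class="product sale"><a href="/item/2">Blue shoes</a></li>
  <li class="ad"><a href="/promo">Sponsored</a></li>
</ul>
"""


class ProductLinkParser(HTMLParser):
    """Collects (href, text) for <a> tags nested in <li class="product">."""

    def __init__(self):
        super().__init__()
        self.in_product = False   # True while inside a matching <li>
        self.current_href = None  # href of the <a> being read, if any
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li":
            # A class attribute can hold several space-separated classes.
            classes = (attrs.get("class") or "").split()
            self.in_product = "product" in classes
        elif tag == "a" and self.in_product:
            self.current_href = attrs.get("href")

    def handle_data(self, data):
        # Keep the first non-empty text found inside the current link.
        if self.current_href is not None and data.strip():
            self.links.append((self.current_href, data.strip()))
            self.current_href = None

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_product = False
        elif tag == "a":
            self.current_href = None


def extract_product_links(html):
    parser = ProductLinkParser()
    parser.feed(html)
    return parser.links
```

Calling `extract_product_links(SAMPLE_HTML)` keeps the two product links and skips the one in the `ad` item. The sketch does not handle nested `<li>` elements, which is one reason a proper selector engine is preferable on real pages.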

HTTP requests (GET/POST, headers): When it comes to bot development, you can either use a real/headless automated browser or make raw HTTP requests using an HTTP client. The latter uses fewer computational resources (CPU, RAM) but requires you to understand how to make proper HTTP requests, with the correct HTTP headers, to interact with a website or its APIs.
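As a rough illustration of the raw HTTP approach, the sketch below builds a GET and a POST request with explicit headers using Python's standard `urllib`. The header values, the URLs and the helper names are assumptions chosen for the example, not values any particular website expects, and nothing is sent over the network until you pass the request to `urllib.request.urlopen`.

```python
import urllib.parse
import urllib.request

# Hypothetical header values, chosen for illustration only.
BROWSER_LIKE_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}


def build_get_request(url, extra_headers=None):
    """Build (but do not send) a GET request with browser-like headers."""
    headers = dict(BROWSER_LIKE_HEADERS)
    headers.update(extra_headers or {})
    return urllib.request.Request(url, headers=headers, method="GET")


def build_post_request(url, form_fields):
    """Build (but do not send) a form-encoded POST request."""
    body = urllib.parse.urlencode(form_fields).encode("utf-8")
    headers = dict(BROWSER_LIKE_HEADERS)
    headers["Content-Type"] = "application/x-www-form-urlencoded"
    return urllib.request.Request(url, data=body, headers=headers, method="POST")
```

Keep in mind that headers are only one of the signals a server sees, so a request with plausible headers can still stand out through the other fingerprints listed above.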

Automation frameworks: Puppeteer, Selenium, Playwright

JS Reverse engineering/deobfuscation: analyzing obfuscated JavaScript is key to understanding the signals collected by a bot detection fingerprinting script.

Anti-detect bot frameworks: Vanilla automation frameworks like Puppeteer may not be enough against protected websites. Anti-detect frameworks aim to erase the known side effects of automation frameworks to make them less easily detectable.

(residential, data center) Proxies: proxies are commonly used by bots to distribute their attacks across thousands of IPs, which enables them to avoid being blocked by simple IP-based rate-limiting techniques.
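To see why spreading traffic over many IPs defeats simple rate limiting, here is a minimal sketch of a per-IP sliding-window limiter. The class name, the thresholds and the IP addresses (taken from the documentation ranges) are assumptions for the example, not how any specific vendor implements it.

```python
from collections import defaultdict, deque


class IpRateLimiter:
    """Sliding window: at most max_requests per window_seconds for each IP."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.history = defaultdict(deque)  # ip -> timestamps of accepted requests

    def allow(self, ip, now):
        timestamps = self.history[ip]
        # Forget requests that are older than the window.
        while timestamps and now - timestamps[0] >= self.window_seconds:
            timestamps.popleft()
        if len(timestamps) >= self.max_requests:
            return False
        timestamps.append(now)
        return True


def count_blocked(limiter, requests):
    """requests is an iterable of (ip, timestamp) pairs."""
    return sum(1 for ip, ts in requests if not limiter.allow(ip, ts))


# 100 requests in 10 seconds, first from one IP, then spread over 50 IPs.
single_ip = [("203.0.113.1", i * 0.1) for i in range(100)]
many_ips = [("198.51.100.%d" % (i % 50), i * 0.1) for i in range(100)]
```

With a limit of 10 requests per 60 seconds, `count_blocked(IpRateLimiter(10, 60), single_ip)` blocks 90 of the 100 requests, while the same volume spread over 50 IPs is not blocked at all because each IP only sends 2 requests.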

Cookies: Most websites and bot detection vendors have a notion of session that is frequently handled by cookies. It’s key to understand how cookies work in general and how they’re handled by browsers.
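As a small illustration of what a session cookie looks like on the wire, the sketch below parses a `Set-Cookie` value with Python's standard `http.cookies` module and builds the `Cookie` header a client would send back. The cookie name, its value and the helper names are made up for the example.

```python
from http.cookies import SimpleCookie

# Hypothetical Set-Cookie value; the name and attributes are made up.
SET_COOKIE = "session_id=abc123; Path=/; Max-Age=3600; Secure; HttpOnly; SameSite=Lax"


def parse_set_cookie(header_value):
    """Return {cookie name: {value and a few attributes}}."""
    cookie = SimpleCookie()
    cookie.load(header_value)
    parsed = {}
    for name, morsel in cookie.items():
        parsed[name] = {
            "value": morsel.value,
            "path": morsel["path"],
            "max_age": morsel["max-age"],          # kept as a string
            "secure": bool(morsel["secure"]),      # only sent over HTTPS
            "http_only": bool(morsel["httponly"]), # hidden from page JavaScript
            "same_site": morsel["samesite"],
        }
    return parsed


def cookie_header(parsed):
    """Build the Cookie request header: only name=value pairs are sent back."""
    return "; ".join(f"{name}={attrs['value']}" for name, attrs in parsed.items())
```

Note that attributes such as `Path` or `HttpOnly` are instructions for the client and are not echoed back to the server. A browser applies them automatically, whereas a bot built on a raw HTTP client has to handle them itself or use a cookie jar.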

Basics of machine learning: when it comes to bot detection, creating detection rules manually is not scalable. At some point, you want to generalize your detection using ML algorithms to adapt it depending on the context/website.

Browser/Chrome dev tools (look at requests, debugger, set breakpoints etc): https://developer.chrome.com/docs/devtools/tips

In general, when it comes to browser and web-related topics, https://developer.mozilla.org/en-US/ (MDN) provides high-quality content.

Discord and Reddit

Bot detection and development is a sort of cat-and-mouse game. It keeps on evolving because new frameworks get developed, browser vendors release new APIs, and attackers and defenders adapt. While you can find academic papers discussing certain bot-related topics in-depth, a lot of the conversation happens on Discord and Reddit:

https://discord.com/invite/vz7PeKk: scraping enthusiasts, related to puppeteer extra stealth

https://discord.gg/sneakerdev: sneaker bot dev (but not only)

https://www.reddit.com/r/webscraping/: mostly related to scraping, but the concepts can be applied to other types of bots

Pierluigi Vinciguerra writes The Web Scraping Club newsletter: https://substack.thewebscraping.club/. Every week, he publishes articles related to scraping, browser fingerprinting, proxies and anti-detect bot frameworks. It’s worth subscribing to get access to up-to-date content.

Blogs (individuals):

https://deviceandbrowserinfo.com/learning_zone (this website)

https://incolumitas.com/

Note that other great individual blogs are mentioned in the deobfuscation section.

Corporate blogs (of bot detection companies and scraping companies):

https://datadome.co/threat-research/ (I also write on this blog)

https://fingerprint.com/blog/tag/engineering/

https://blog.cloudflare.com/tag/bots/

https://www.zenrows.com/blog/

https://www.scrapingbee.com/blog/

https://brightdata.com/blog

https://rebrowser.net/blog (they recently published interesting articles about CDP bypasses: https://rebrowser.net/blog/how-to-fix-runtime-enable-cdp-detection-of-puppeteer-playwright-and-other-automation-libraries-61740)

Obfuscation/Deobfuscation

To obfuscate or deobfuscate JavaScript programs, you need to be familiar with the notion of an AST (abstract syntax tree). It enables you to manipulate a JS program more easily and accurately than with regexes or string parsing. AST manipulation can be done using Babel (or other libraries). You can learn more about AST manipulation with Babel in their handbook. AST Explorer is also really helpful to visualize an AST.
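To give a feel for the parse, transform and regenerate workflow, here is a minimal sketch of one classic deobfuscation pass: folding string concatenations such as `'nav' + 'iga' + 'tor'` back into a single literal. For JavaScript you would do this with Babel. The sketch below uses Python's built-in `ast` module on Python source only because it is self-contained; it is an analogy for the technique, not Babel's API, and the class and function names are made up.

```python
import ast


class StringConcatFolder(ast.NodeTransformer):
    """Folds 'a' + 'b' into 'ab' wherever both sides are string literals."""

    def visit_BinOp(self, node):
        # Transform children first so that nested chains collapse bottom-up.
        self.generic_visit(node)
        if (
            isinstance(node.op, ast.Add)
            and isinstance(node.left, ast.Constant)
            and isinstance(node.right, ast.Constant)
            and isinstance(node.left.value, str)
            and isinstance(node.right.value, str)
        ):
            folded = ast.Constant(node.left.value + node.right.value)
            return ast.copy_location(folded, node)
        return node


def fold_strings(source):
    tree = ast.parse(source)                 # source code -> AST
    tree = StringConcatFolder().visit(tree)  # rewrite the tree
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)                 # AST -> source code (Python 3.9+)
```

The transformer only touches `+` nodes whose operands are both string literals, so an expression like `a + 'b'` is left alone. That kind of precision is hard to get with a regex, which cannot reliably tell a string literal from text inside a comment or another string.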

List of (de)obfuscation blogs

https://steakenthusiast.github.io/

https://jwillbold.com/posts/obfuscation/2019-06-16-the-secret-guide-to-virtualization-obfuscation-in-javascript/

https://ibiyemiabiodun.com/projects/reversing-tiktok-pt2/

https://www.trickster.dev/post/javascript-ast-manipulation-with-babel-transform-prototyping-and-plugin-development/

Live streaming JavaScript deobfuscation & reverse engineering by Jarrod Overson

Online fingerprinting tests and repositories

https://abrahamjuliot.github.io/creepjs/

https://deviceandbrowserinfo.com/info_device

https://fingerprint-scan.com/

https://niespodd.github.io/browser-fingerprinting/

https://browserleaks.com/

https://github.com/HMaker/HMaker.github.io/blob/master/selenium-detector/chromedriver.js

https://github.com/kaliiiiiiiiii/driverless-fp-collector (by the developer of Selenium driverless)

https://github.com/kaliiiiiiiiii/brotector (by the developer of Selenium driverless)

TCP fingerprinting test (and blog post): https://incolumitas.com/2021/03/13/tcp-ip-fingerprinting-for-vpn-and-proxy-detection/

CAPTCHA farms

https://www.capsolver.com/

https://anti-captcha.com/

https://2captcha.com/

Vanilla automation frameworks

https://pptr.dev/

https://playwright.dev/

https://www.selenium.dev/

Anti-detect browser frameworks and libraries

It’s important to understand the kinds of techniques used by anti-detect frameworks to bypass detection. This knowledge can be used to craft new detection signals that target the side effects introduced by the evasion techniques.