Simple Selenium Chrome Crawler (Python)
In this article, we show how you can create a simple crawler in Python that leverages Google Chrome and Selenium. The crawler we create will be able to take as input a list of urls to crawl, and to save as output the list of links it encountered during the crawl.
Installing Selenium and Chromedriver
First, go to https://chromedriver.chromium.org/downloads to download the chromedriver version that matches your Google Chrome version. If you don't know which version of Chrome you are using, click on the Chrome menu ("..." icon on the right of your screen), then click on "Help" and "About Google Chrome".
The file you download is a zip file containing the chromedriver executable. Go to the directory you have created for your crawler, and extract the zip file in it. After it has been extracted, you should obtain a file called chromedriver or chromedriver.exe depending on your OS.
To install Selenium, you can simply run one of the following commands:
pip install selenium
pip3 install selenium
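Before moving on, it can help to confirm that the extracted executable is where the crawler expects it. The sketch below is a hypothetical helper (the name `find_chromedriver` is ours, not part of Selenium) that looks for chromedriver or chromedriver.exe in a given directory, assuming the zip was extracted directly into it.

```python
from pathlib import Path


def find_chromedriver(directory="."):
    """Return the path to the chromedriver executable in `directory`, or None.

    Hypothetical helper: it only checks the two file names mentioned above.
    """
    for name in ("chromedriver", "chromedriver.exe"):
        candidate = Path(directory) / name
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    path = find_chromedriver(".")
    if path is None:
        print("chromedriver not found in the current directory")
    else:
        print("Found chromedriver at {}".format(path))
```

If the helper reports that nothing was found, the archive was probably extracted into a subdirectory or a different folder.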
Loading a page with Selenium and Chrome
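To ensure everything's installed properly, we create a simple version of our crawler that only loads a page, waits 5 seconds, and then closes the browser.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import time
# Selenium 4 receives the driver path through a Service object
driver = webdriver.Chrome(service=Service('./chromedriver')) # or chromedriver.exe on windows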
driver.get('http://www.device-info.fr/')
time.sleep(5) # waits 5 seconds
driver.quit() # closes the browser
Run your program and verify that it opens http://www.device-info.fr/.
Crawling a list of urls with Selenium Chrome
Now that we have verified that Selenium and chromedriver are properly installed, we modify our crawler to add more features. We create a `read_urls` function that takes as input the path to a file containing a list of urls to crawl.
def read_urls(file_path):
    urls = []
    with open(file_path, 'r') as file_urls:
        for line in file_urls:
            url = line.replace("\n", "")
            urls.append(url)
    return urls
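The function above keeps every line as-is, so a blank line or a trailing space in the input file ends up in the list. A slightly more defensive variant, shown here only as a sketch (the name `read_urls_clean` and the choice to treat lines starting with # as comments are our assumptions), could look like this:

```python
def read_urls_clean(file_path):
    """Read one url per line, ignoring blank lines and '#' comment lines."""
    urls = []
    with open(file_path, 'r') as file_urls:
        for line in file_urls:
            url = line.strip()  # drops the newline and surrounding spaces
            if not url or url.startswith('#'):
                continue
            urls.append(url)
    return urls
```

It returns the same kind of list as `read_urls`, so it can be swapped in without changing the rest of the crawler.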
We also create a text file called urls.txt that contains 5 urls to crawl:
https://google.com
https://facebook.com
https://www.ft.com
https://www.booking.com
https://news.ycombinator.com
We modify the code of our crawler so that it reads the file containing the urls, and then, for each url, visits the page, gets all the links present on the page, and saves them in a file.
To get the list of links in a page, we use driver.find_elements(By.TAG_NAME, 'a'). Older Selenium versions exposed the same lookup as driver.find_elements_by_tag_name('a'), which was removed in later Selenium 4 releases.
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
urls = read_urls('./urls.txt') # read the list of urls to crawl
driver = webdriver.Chrome(service=Service('./chromedriver'))
links_crawled = []
for idx, url in enumerate(urls):
    print("Crawling {} ({}/{})".format(url, idx + 1, len(urls)))
    driver.get(url)
    a_elts = driver.find_elements(By.TAG_NAME, 'a')
    for a_elt in a_elts:
        links_crawled.append(a_elt.get_attribute('href'))
driver.quit()
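One detail worth noting: get_attribute('href') can return None for an <a> element that has no href attribute, and the same link often appears several times on a page. The following sketch shows one possible way to tidy the list before saving it; `clean_links` is a hypothetical helper, not part of Selenium.

```python
def clean_links(links):
    """Drop missing/empty links and duplicates, keeping first-seen order."""
    seen = set()
    cleaned = []
    for link in links:
        if not link:  # None or empty string
            continue
        if link in seen:
            continue
        seen.add(link)
        cleaned.append(link)
    return cleaned


example = ["https://www.ft.com/moral-money", None, "https://www.ft.com/moral-money", ""]
print(clean_links(example))
```

In the crawler, this would amount to calling `links_crawled = clean_links(links_crawled)` once the loop has finished.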
Finally, we create a function to save the links crawled in a file:
def save_links_crawled(links_crawled, file_path):
    with open(file_path, 'w+') as file_links:
        for link in links_crawled:
            file_links.write("{}\n".format(link))
We just need to call it after we close the browser.
save_links_crawled(links_crawled, './links_crawled.txt')
It generates a file called links_crawled.txt that contains the list of links crawled:
https://www.ft.com/globetrotter
https://www.ft.com/tech-scroll-asia
https://www.ft.com/moral-money
…
http://www.ycombinator.com/legal/
http://www.ycombinator.com/apply/
mailto:hn@ycombinator.com
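Once links_crawled.txt exists, the links can be post-processed with the standard library alone. As an illustration (the function name `count_links_per_host` is ours), this sketch counts how many links point to each host and skips entries such as mailto: links that have no host.

```python
from collections import Counter
from urllib.parse import urlparse


def count_links_per_host(links):
    """Count links per host name; entries without a host are skipped."""
    counts = Counter()
    for link in links:
        host = urlparse(link.strip()).hostname
        if host:
            counts[host] += 1
    return counts


links = [
    "https://www.ft.com/moral-money",
    "http://www.ycombinator.com/legal/",
    "http://www.ycombinator.com/apply/",
    "mailto:hn@ycombinator.com",
]
for host, count in count_links_per_host(links).most_common():
    print(host, count)
```

To run it on the real output, the `links` list could be replaced by the lines read from links_crawled.txt.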