Simple Selenium Chrome Crawler (Python)

In this article, we show how you can create a simple crawler in Python that leverages Google Chrome and Selenium. The crawler we create will be able to take as input a list of urls to crawl, and to save as output the list of links it encountered during the crawl.

Installing Selenium and Chromedriver

First, go to https://chromedriver.chromium.org/downloads to download the chromedriver version that matches your Google Chrome version. If you don't know which version of Chrome you are using, click on the Chrome menu (the "..." icon at the top right of the window), then click on "Help" and "About Google Chrome".

The file you download is a zip file containing the chromedriver executable. Go to the directory you have created for your crawler, and extract the zip file into it. After extraction, you should obtain a file called chromedriver or chromedriver.exe, depending on your OS.

To install Selenium, you can simply run one of the following commands:
pip install selenium
pip3 install selenium
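Before going further, it can help to confirm that the package is visible to the interpreter you plan to use. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of Selenium.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("selenium")
    if version is None:
        print("selenium is not installed for this interpreter")
    else:
        print("selenium {} is installed".format(version))
```

If the script reports that Selenium is missing even though `pip install` succeeded, `pip` and `python` probably point to different interpreters; trying the `pip3` variant is a reasonable next step.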

Loading a page with Selenium and Chrome

To ensure everything's installed properly, we create a simple version of our crawler that only loads a page, waits 5 seconds, and then closes the browser.
from selenium import webdriver
import time
 
driver = webdriver.Chrome('./chromedriver') # or chromedriver.exe on windows
driver.get('http://www.device-info.fr/') 
time.sleep(5) # waits 5 seconds 
driver.quit() # closes the browser

Run your program and verify that it opens http://www.device-info.fr/.
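If `driver.get` raises an exception (for instance because the page is unreachable), the script above exits before reaching `driver.quit()` and the browser window stays open. A common way around this is a `try`/`finally` block. The sketch below assumes a hypothetical helper `visit_and_quit` that receives a function creating the driver, so the same code works however the driver is constructed.

```python
import time


def visit_and_quit(create_driver, url, wait_seconds=5):
    """Open url, wait, and always close the browser, even if loading fails.

    create_driver is a hypothetical zero-argument callable returning a
    Selenium driver, e.g. lambda: webdriver.Chrome('./chromedriver').
    """
    driver = create_driver()
    try:
        driver.get(url)
        time.sleep(wait_seconds)
        return driver.title  # page title, handy as a quick sanity check
    finally:
        driver.quit()  # runs whether or not driver.get() raised
```

With the setup from this article it could be called as `visit_and_quit(lambda: webdriver.Chrome('./chromedriver'), 'http://www.device-info.fr/')`. In recent Selenium 4 releases the driver path is passed through a `Service` object instead, i.e. `webdriver.Chrome(service=Service('./chromedriver'))`, with `Service` imported from `selenium.webdriver.chrome.service`.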

Crawling a list of urls with Selenium Chrome

Now that we have verified that Selenium and chromedriver are properly installed, we modify our crawler to add more features. We create a `read_urls` function that takes as input the path to a file containing a list of urls to crawl.
def read_urls(file_path):
    urls = []
    with open(file_path, 'r') as file_urls:
        for line in file_urls:
            url = line.replace("\n", "")
            urls.append(url)
    return urls
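`read_urls` as written keeps empty lines and surrounding whitespace, which would later be passed to `driver.get`. A slightly more defensive variant, given here as a sketch under the hypothetical name `read_urls_clean`, strips each line and skips blanks and lines starting with `#`. Treating `#` as a comment marker is an assumption, not something the file format in this article defines.

```python
def read_urls_clean(file_path):
    """Read one url per line, ignoring blank lines and '#' comment lines."""
    urls = []
    with open(file_path, 'r') as file_urls:
        for line in file_urls:
            url = line.strip()  # removes "\n", "\r" and stray spaces
            if not url or url.startswith('#'):
                continue
            urls.append(url)
    return urls
```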

We also create a text file called urls.txt that contains 5 urls to crawl:
https://google.com
https://facebook.com
https://www.ft.com
https://www.booking.com
https://news.ycombinator.com

We modify the code of our crawler so that it reads the file containing the urls, and then, for each url, visits the page, gets all the links present on the page, and saves them in a file. In order to get the list of links in a page, we use driver.find_elements_by_tag_name('a').
urls = read_urls('./urls.txt') # read the list of urls to crawl
driver = webdriver.Chrome('./chromedriver')

links_crawled = []
for idx, url in enumerate(urls):
    print("Crawling {} ({}/{})".format(url, idx + 1, len(urls)))
    driver.get(url)
    a_elts = driver.find_elements_by_tag_name('a')
    for a_elt in a_elts:
        links_crawled.append(a_elt.get_attribute('href'))

driver.quit()
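Two details of the loop above are worth handling explicitly. `get_attribute('href')` returns `None` for `<a>` elements that have no `href`, and the `find_elements_by_*` helpers were removed in later Selenium 4 releases in favour of `find_elements(by, value)`. The sketch below wraps the link collection in a hypothetical helper `extract_hrefs`. It passes the locator strategy as the plain string `"tag name"`, which is the value of `By.TAG_NAME` in `selenium.webdriver.common.by`.

```python
TAG_NAME = "tag name"  # same value as selenium.webdriver.common.by.By.TAG_NAME


def extract_hrefs(driver):
    """Return the href of every <a> element on the current page, skipping empty ones."""
    hrefs = []
    for a_elt in driver.find_elements(TAG_NAME, 'a'):
        href = a_elt.get_attribute('href')
        if href:  # None or '' means the anchor has no usable target
            hrefs.append(href)
    return hrefs
```

Inside the crawl loop, `links_crawled.extend(extract_hrefs(driver))` could then stand in for the inner `for` loop.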

Finally, we create a function to save the links crawled in a file:

def save_links_crawled(links_crawled, file_path):
    with open(file_path, 'w+') as file_links:
        for link in links_crawled:
            file_links.write("{}\n".format(link))

We just need to call it after we close the browser.

save_links_crawled(links_crawled, './links_crawled.txt')

It generates a file called links_crawled.txt that contains the list of links crawled:
https://www.ft.com/globetrotter
https://www.ft.com/tech-scroll-asia
https://www.ft.com/moral-money
…
http://www.ycombinator.com/legal/
http://www.ycombinator.com/apply/
mailto:hn@ycombinator.com
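As the sample output shows, the file can contain non-web targets such as `mailto:` addresses, and the same link may appear several times across pages. If only distinct web links are wanted, they can be filtered before saving. The following sketch uses a hypothetical helper `unique_http_links`, built on the standard `urllib.parse` module.

```python
from urllib.parse import urlparse


def unique_http_links(links):
    """Keep http/https links only, dropping duplicates while preserving order."""
    seen = set()
    result = []
    for link in links:
        if not link:  # skip None and empty strings
            continue
        if urlparse(link).scheme not in ('http', 'https'):
            continue  # e.g. mailto:, javascript:, tel:
        if link not in seen:
            seen.add(link)
            result.append(link)
    return result
```

It would be applied just before saving, for example `save_links_crawled(unique_http_links(links_crawled), './links_crawled.txt')`.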
