Nazli Ander

Using Selenium With Python in a Docker Container

web-scraping, python, docker, selenium, testing · 3 min read

While web scraping, I came across many useful applications, such as listing the historical prices of financial assets or finding current news topics. Although these examples are interesting in themselves, there was usually one main goal at the end: creating a database with the scraped information.

Whenever I went a bit further with scraping, I ended up on websites that use JavaScript to display the data I needed. That is how I bumped into Selenium, a web testing and automation tool. In this short write-up, I aim to list some steps that I found quite useful while setting up Selenium within a Docker container.

Introduction to Selenium WebDriver

Selenium WebDriver is a web automation and testing tool. It was created by Simon Stewart in 2006 as the first cross-platform testing framework that could control the browser from the OS level.

So with Selenium, I can run automated actions in browsers (clicks, hovers, and filling forms) by communicating with them directly. Bindings are available for Java, C#, PHP, Python, Perl, Go, and Ruby. Since I am most familiar with Python, that is what I will be using here.

To work with a browser, I need to choose among a set of browser options such as Firefox, Chrome (Chromium), Edge, and Safari. In my experience, Chrome with the headless option (not rendering a user interface) is the most performant, so I will be sticking with that.

Pulling the Image and Setting Up Google Chrome

To start with my custom Selenium-Python image, I need a base Python image; in this write-up I picked version 3.8.

Then I can install Google Chrome on top of it. Remember, without Chrome itself, Selenium has nothing to drive. There are a few steps to set up Google Chrome on Linux:

  1. Adding Google Chrome trusting keys to apt
  2. Adding Google Chrome stable version to the repositories
  3. Updating the repositories to see the stable version in apt
  4. Installing google-chrome-stable
```dockerfile
FROM python:3.8

# Adding trusting keys to apt for repositories
RUN wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add -

# Adding Google Chrome to the repositories
RUN sh -c 'echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google-chrome.list'

# Updating apt to see and install Google Chrome
RUN apt-get -y update

# Magic happens
RUN apt-get install -y google-chrome-stable
```
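As an optional sanity check (my own addition, not part of the original Dockerfile), printing the version during the build fails fast if Chrome did not install correctly:

```dockerfile
# Optional: fail the build early if Chrome is missing or broken
RUN google-chrome-stable --version
```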

Installing Chrome Driver

Selenium requires a driver interface to work with the chosen browser. Hence, I need a way to install Chrome Driver in the Linux image. Here are the steps to follow:

  1. Installing unzip, which we will need for the zipped Chrome Driver
  2. Downloading the Chrome Driver into a file called /tmp/chromedriver.zip (the name can be changed)
  3. Unzipping /tmp/chromedriver.zip into the Linux executable path

After those steps, I need to set the display port to 99, as Selenium uses it. This will avoid some crashes.

```dockerfile
# Installing Unzip
RUN apt-get install -yqq unzip

# Download the Chrome Driver
RUN wget -O /tmp/chromedriver.zip http://chromedriver.storage.googleapis.com/`curl -sS chromedriver.storage.googleapis.com/LATEST_RELEASE`/chromedriver_linux64.zip

# Unzip the Chrome Driver into /usr/local/bin directory
RUN unzip /tmp/chromedriver.zip chromedriver -d /usr/local/bin/

# Set display port as an environment variable
ENV DISPLAY=:99
```

Preparing the Docker for a Run

All the steps above were only for setting up Chrome in the Dockerfile. To run my Python application (app.py) with Docker, I need the following lines in the Dockerfile.

```dockerfile
COPY . /app
WORKDIR /app

RUN pip install --upgrade pip

RUN pip install -r requirements.txt

CMD ["python", "./app.py"]
```
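With the Dockerfile complete, the image can be built and run as sketched below; the image tag `selenium-scraper` is my own placeholder, not a name from the original post:

```shell
# Build the image from the directory containing the Dockerfile
docker build -t selenium-scraper .

# Run the container; it executes app.py via the CMD instruction
docker run --rm selenium-scraper
```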

Apart from those Docker settings, I would like to briefly mention some Docker-specific Chrome options for setting up the Chrome Driver via Python. I want to show those few options explicitly in one function, set_chrome_options. Here is the example code, set up as a function. I need 4 specific arguments to run the Chrome Driver inside Docker:

  1. Explicitly saying that this is a headless application with --headless

  2. Explicitly bypassing the security level in Docker with --no-sandbox . There is a nice Stack Overflow thread about this: apparently, as the Docker daemon always runs as the root user, Chrome crashes without this flag.

  3. Explicitly disabling the usage of /dev/shm/ . The /dev/shm partition is too small in certain VM environments, causing Chrome to fail or crash.

  4. Disabling the images with chrome_prefs["profile.default_content_settings"] = {"images": 2} .

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options


def set_chrome_options() -> Options:
    """Sets chrome options for Selenium.

    Chrome options for headless browser is enabled.
    """
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument("--disable-dev-shm-usage")
    chrome_prefs = {}
    chrome_options.experimental_options["prefs"] = chrome_prefs
    chrome_prefs["profile.default_content_settings"] = {"images": 2}
    return chrome_options


if __name__ == "__main__":
    driver = webdriver.Chrome(options=set_chrome_options())
    # Do stuff with your driver
    driver.close()
```
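As a usage sketch (assuming the Chrome and Chrome Driver setup from the image above; the URL and the `h1` lookup are my own illustrative choices, not from the original post), the driver can then load a JavaScript-rendered page and read content out of it:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

# Same Docker-specific flags as in set_chrome_options above
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-dev-shm-usage")

driver = webdriver.Chrome(options=chrome_options)
driver.get("https://example.com")  # placeholder URL
# The page source here includes whatever JavaScript has rendered
heading = driver.find_element(By.TAG_NAME, "h1").text
print(heading)
driver.quit()
```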

Last Words

Here is the Dockerfile that I used as an example. While creating it, I used the links that I shared to solve the problems that I faced. There might be other solutions to those problems; I am curious to hear about them.

Until now, I have used this setup to scrape web archives for asset prices, books, yellow pages, and judgment texts. Although Selenium is not designed for web scraping, I leveraged this nice tool to tame JavaScript-heavy websites. But I should admit that if the information I was looking for were not hidden behind JavaScript, I would definitely have been happier using only Requests, BeautifulSoup4, and/or Scrapy for Python, because all of those are simpler to set up and more performant.

Happy Scraping!

This post is also available on DEV.