Wednesday, March 20, 2019

Selenium + Firefox on Headless Debian

Today I ran into an issue while working on a project for a client. I was tasked with writing a web scraper to grab data off a website because no decent APIs exist that are "good enough", and the client wanted the data directly from the best-known source.

After spending twelve hours writing this thing, I finally got it working flawlessly. I sent it to the client, who couldn't get it to work, and we were both a little bewildered. It turns out that in order to use Firefox as your scraping browser, you... well, you need to have Firefox installed on your server, headless though it may be.

For my own future reference I figured I'd note how to configure that here. Perhaps this post will help someone else out as well. These instructions are intended particularly for Debian (not everyone uses Ubuntu!).

First, you need Selenium and Firefox itself:

$ sudo -i
# apt update ; apt install python3-selenium firefox-esr
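
Since the whole problem was a missing browser, a quick sanity check after the install doesn't hurt. This is just a sketch of how I'd do it, not part of the original setup: the function name check_prerequisites is made up for illustration, and I'm assuming the browser shows up on the PATH as either firefox or firefox-esr.

```python
import importlib.util
import shutil


def check_prerequisites(which=shutil.which, find_spec=importlib.util.find_spec):
    """Return a list of missing prerequisites (empty when everything is found).

    The `which` and `find_spec` parameters exist only so the lookups can be
    swapped out; by default the real PATH and module search are used.
    """
    missing = []
    # Assumption: the browser binary is named "firefox" or "firefox-esr".
    if not any(which(name) for name in ("firefox", "firefox-esr")):
        missing.append("firefox")
    # The Selenium bindings must be importable by this Python interpreter.
    if find_spec("selenium") is None:
        missing.append("selenium")
    return missing


if __name__ == "__main__":
    problems = check_prerequisites()
    if problems:
        print("Missing: " + ", ".join(problems))
    else:
        print("Firefox and Selenium both look installed.")
```

Save it under any name you like and run it with python3 on the server. An empty result only means the pieces were found, not that they work together.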

Next, you might need geckodriver. I am not sure because I installed it manually, but you might as well grab the latest version (v0.24.0 as of this writing):

# cd /usr/local/src
# wget https://github.com/mozilla/geckodriver/releases/download/v0.24.0/geckodriver-v0.24.0-linux64.tar.gz
# tar -xzf geckodriver-v0.24.0-linux64.tar.gz
# cp geckodriver /usr/local/bin
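
To confirm that the copy in /usr/local/bin is the one actually being picked up, you can ask geckodriver for its version. Here is a rough sketch. The helper names are my own, and the parsing assumes the first line of the output looks like "geckodriver 0.24.0 ...", which may differ between releases.

```python
import re
import subprocess


def parse_geckodriver_version(output):
    """Pull an 'X.Y.Z' version out of `geckodriver --version` output, or None."""
    # Assumption: the output contains the word "geckodriver" followed by X.Y.Z.
    match = re.search(r"geckodriver\s+(\d+\.\d+\.\d+)", output)
    return match.group(1) if match else None


def installed_geckodriver_version(binary="geckodriver"):
    """Run the binary found on the PATH and return its version, or None."""
    try:
        result = subprocess.run(
            [binary, "--version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # Binary not found, not executable, or it exited with an error.
        return None
    return parse_geckodriver_version(result.stdout)


if __name__ == "__main__":
    print(installed_geckodriver_version())
```

If this prints None, geckodriver is probably not on the PATH of the user running the scraper.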

Finally, let's fire up a test webdriver and see if it works:

# exit
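$ cat > test_selenium.py << 'EOF'
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

# Start Firefox through geckodriver (both must be on the PATH).
driver = webdriver.Firefox()
try:
    driver.get("http://www.python.org")
    assert "Python" in driver.title
    # Type "pycon" into the site's search box and submit it.
    elem = driver.find_element(By.NAME, "q")
    elem.clear()
    elem.send_keys("pycon")
    elem.send_keys(Keys.RETURN)
    assert "No results found." not in driver.page_source
    print("If you see this, it worked.")
finally:
    # quit() closes the browser and also stops the geckodriver process.
    driver.quit()
EOF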

$ python3 test_selenium.py
If you see this, it worked.
That's all there is to it.
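
One caveat: if the server has no display at all, Firefox may refuse to start unless it is told to run headless. One way to do that without touching the test script is the MOZ_HEADLESS environment variable, which Firefox reads at startup. The wrapper below is only a sketch under that assumption, and headless_env and run_headless are names I made up for it.

```python
import os
import subprocess
import sys


def headless_env(base=None):
    """Return a copy of the environment with MOZ_HEADLESS=1 added."""
    # Copy so the caller's mapping (or os.environ) is never modified.
    env = dict(os.environ if base is None else base)
    env["MOZ_HEADLESS"] = "1"
    return env


def run_headless(script):
    """Run a Python script with Firefox forced headless; return its exit code."""
    # geckodriver and Firefox inherit the environment of the child process.
    completed = subprocess.run([sys.executable, script], env=headless_env())
    return completed.returncode
```

Calling run_headless("test_selenium.py") should behave like running MOZ_HEADLESS=1 python3 test_selenium.py from the shell.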