PhantomJS and Selenium -- Headless Browser Spider

This article introduces how to use PhantomJS and Selenium for headless browser testing and web crawling.

When writing a web spider, we often run into annoying websites whose data is hard to crawl, so we need to simulate a browser. Selenium is a very powerful tool for this kind of crawling, but it also has some shortcomings. For example, on Linux servers and other cloud systems it is not easy to install a browser for it to drive. Starting a full browser is also much less efficient for scraping. This article introduces PhantomJS together with Selenium, which helps developers do browser testing quickly and crawl the web efficiently.

install the required software

first, install selenium:
pip install selenium

for PhantomJS, we can use brew or npm (Node.js) to install it:
npm -g install phantomjs-prebuilt

Note that my global node modules are in “C:\Users\username\AppData\Roaming\npm\node_modules”
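If Selenium cannot find the phantomjs executable, it helps to check where it actually ended up. The following is only a minimal sketch using the standard library: `find_phantomjs` and its `extra_dirs` parameter are names made up for this example, and the folders you pass in depend on your own installation.

```python
import os
import shutil


def find_phantomjs(extra_dirs=()):
    """Return the full path to a phantomjs executable, or None if not found.

    extra_dirs is a hypothetical list of additional folders to search,
    for example a folder somewhere under the npm global node_modules.
    """
    # First try the normal PATH lookup.
    found = shutil.which("phantomjs")
    if found:
        return found
    # Then search only the extra folders, if any were given.
    if extra_dirs:
        return shutil.which("phantomjs", path=os.pathsep.join(extra_dirs))
    return None


print(find_phantomjs() or "phantomjs not found on PATH")
```

With Selenium 3.x, the returned path could then be passed as `webdriver.PhantomJS(executable_path=...)` instead of copying the executable next to the script.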

Once this is done, we can use PhantomJS freely from Selenium:

## python 3.5, Selenium 3.x (webdriver.PhantomJS was removed in Selenium 4)
from selenium import webdriver
driver = webdriver.PhantomJS()  # phantomjs must be on PATH, or put PhantomJS.exe in the same directory
driver.set_window_size(1120, 550)
driver.get("https://realpython.com/blog/python/headless-selenium-testing-with-python-and-phantomjs/")
temp = driver.find_element_by_xpath("//h3/following-sibling::p")  # first <p> that is a following sibling of an <h3>
driver.save_screenshot('screen.png')  # save a screenshot to disk
print(temp.text)
driver.quit()  # shut down the PhantomJS process
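One thing to watch with a script like this: if the page load or the XPath lookup raises an exception, driver.quit() is never reached and the PhantomJS process keeps running in the background. A possible way around that, shown here only as a sketch, is to wrap the work in try/finally. The helper name `fetch_text` and its `make_driver` parameter are made up for this example, and it assumes the Selenium 3.x `find_element_by_xpath` method used in the script.

```python
def fetch_text(make_driver, url, xpath):
    """Open url with a fresh driver and return the text of the first XPath match.

    make_driver is any zero-argument callable that returns a WebDriver-like
    object, e.g. webdriver.PhantomJS in Selenium 3.x (an assumption here).
    """
    driver = make_driver()
    try:
        driver.get(url)
        return driver.find_element_by_xpath(xpath).text
    finally:
        # Always runs, so the browser process is closed even on errors.
        driver.quit()
```

With Selenium 3.x installed, it could be called as `fetch_text(webdriver.PhantomJS, url, "//h3/following-sibling::p")`.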

We can see that it is a headless browser and gives us the results directly. (Personally, I do not think it is very fast. It still seems to take a long time.)
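To see where the time actually goes, each step can be timed separately. This is just a sketch with the standard library: `timed` is a helper name invented for this example, and `fake_page_load` is a stand-in so the snippet runs without Selenium.

```python
import time


def timed(step, *args, **kwargs):
    """Call step(*args, **kwargs) and return (result, seconds_elapsed)."""
    start = time.perf_counter()
    result = step(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in for a slow browser call so the sketch runs without Selenium.
def fake_page_load(url):
    time.sleep(0.05)
    return "loaded " + url


result, seconds = timed(fake_page_load, "https://example.com")
print("%s in %.3f s" % (result, seconds))
```

With a real driver, the same helper could wrap the calls from the script, for example `timed(webdriver.PhantomJS)` for startup and `timed(driver.get, url)` for the page load, to check whether the startup or the page itself is the slow part.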

reference