May. 23, 2013

Web Scraping with BeautifulSoup

Web Scraping

"Web scraping (web harvesting or web data extraction) is a computer software
technique of extracting information from websites."

HTML parsing is easy in Python, especially with the help of the BeautifulSoup
library. In this post we will scrape a website (our own) to extract all of its URLs.

Getting Started

To begin with, make sure that you have the necessary modules installed. 

In the example below, we use Beautiful Soup 4 and Requests on a system with
Python 2.7 installed.

Both can be installed with pip:

$ pip install requests
$ pip install beautifulsoup4

What is Beautiful Soup?

At the top of its website, you can read: "You didn't write that awful page.
You're just trying to get some data out of it. Beautiful Soup is here to help.
Since 2004, it's been saving programmers hours or days of work on quick-turnaround
screen scraping projects."

Beautiful Soup Features:

Beautiful Soup provides a few simple methods and Pythonic idioms for navigating,
searching, and modifying a parse tree: a toolkit for dissecting a document and
extracting what you need. It doesn't take much code to write an application.

Beautiful Soup automatically converts incoming documents to Unicode and outgoing
documents to UTF-8. You don't have to think about encodings, unless the document
doesn't specify an encoding and Beautiful Soup can't autodetect one. Then you
just have to specify the original encoding.

Beautiful Soup sits on top of popular Python parsers like lxml and html5lib,
allowing you to try out different parsing strategies or trade speed for
flexibility. 
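
The parser is chosen with the second argument to BeautifulSoup. The sketch below
tries the parsers in a preferred order and falls back to Python's built-in
"html.parser" when the others are not installed. The helper name make_soup and
the sample markup are illustrative assumptions; lxml and html5lib are separate
packages that must be installed with pip.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Parse markup with the first parser in `preferred` that is installed."""
    for name in preferred:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise FeatureNotFound("none of the requested parsers is available")

# Hypothetical sample with an unclosed <p> tag.
html = "<p>Unclosed paragraph <a href='/about/'>About</a>"

soup, parser_name = make_soup(html)
print(parser_name)
print(soup.find("a")["href"])
```

Different parsers can build different trees from invalid markup, so it is worth
naming one explicitly if you need the same result on every machine.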

Extracting URLs from any website

Now that we know what BS4 is and have installed it on our machine,
let's see what we can do with it.

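from bs4 import BeautifulSoup
import requests

# Ask for a bare host name such as www.example.com (without "http://").
# raw_input is Python 2; on Python 3, use input() instead.
url = raw_input("Enter a website to extract the URL's from: ")

# Fetch the page; the scheme is added here.
r = requests.get("http://" + url)
data = r.text

# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(data, "html.parser")

# Print the href attribute of every <a> tag (None if the tag has no href).
for link in soup.find_all('a'):
    print(link.get('href'))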

When we run this program, it will ask us for a website to extract the URLs from:

Enter a website to extract the URL's from: www.pythonforbeginners.com
http://www.pythonforbeginners.com
http://www.pythonforbeginners.com/python-overview-start-here/
http://www.pythonforbeginners.com/dictionary/
http://www.pythonforbeginners.com/python-functions-cheat-sheet/
http://www.pythonforbeginners.com/python-lists-cheat-sheet/
http://www.pythonforbeginners.com/loops/
http://www.pythonforbeginners.com/python-modules/
http://www.pythonforbeginners.com/strings/
http://www.pythonforbeginners.com/sitemap/
http://www.pythonforbeginners.com/feed/
http://www.pythonforbeginners.com
....
....
....
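
A list like the one above can contain duplicates, and on other sites it may also
contain relative paths or None for <a> tags without an href. The following is a
rough sketch of one way to clean that up, not part of the original program: the
function name extract_links and the sample markup are made up for illustration.

```python
from bs4 import BeautifulSoup

try:
    from urllib.parse import urljoin   # Python 3
except ImportError:
    from urlparse import urljoin       # Python 2.7

def extract_links(html, base_url):
    """Return the absolute, de-duplicated link targets found in html."""
    soup = BeautifulSoup(html, "html.parser")
    seen = set()
    links = []
    # href=True skips <a> tags that have no href attribute at all.
    for a in soup.find_all("a", href=True):
        absolute = urljoin(base_url, a["href"])  # resolve relative paths
        if absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links

# Hypothetical markup standing in for a downloaded page.
sample = """
<a href="/loops/">Loops</a>
<a name="top">No href here</a>
<a href="http://www.pythonforbeginners.com/feed/">Feed</a>
<a href="/loops/">Loops again</a>
"""

links = extract_links(sample, "http://www.pythonforbeginners.com")
for link in links:
    print(link)
```
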

I recommend reading our introductory article, "Beautiful Soup 4 Python", to
learn more about Beautiful Soup.

More Reading

http://www.crummy.com/software/BeautifulSoup/
http://docs.python-requests.org/en/latest/index.html

