PythonForBeginners.com

Scraping websites with Python

Author: PFB Staff Writer
Last Updated: August 28, 2020

What is BeautifulSoup?

BeautifulSoup is a third-party Python library from crummy.com.

The library is designed for quick turnaround projects like screen-scraping.

What can it do?

Beautiful Soup parses anything you give it and does the tree traversal for you.

You can use it to:

  • Find all the links on a website
  • Find all the links whose URLs match “foo.com”
  • Find the table heading that’s got bold text, then extract that text
  • Find every “a” element that has an href attribute
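Each of those queries can be sketched against a small, made-up HTML snippet (the document below is illustrative, not from a real site):

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML document to demonstrate the queries above
html = """
<table><tr><th><b>Price</b></th><th>Item</th></tr></table>
<a href="http://foo.com/page">foo link</a>
<a href="http://bar.com/page">bar link</a>
<a name="anchor-without-href">no href</a>
"""
soup = BeautifulSoup(html, "html.parser")

# All the links on the page
all_links = soup.find_all("a")

# Links whose URLs match "foo.com"
foo_links = [a for a in all_links if a.get("href") and "foo.com" in a["href"]]

# The table heading that's got bold text, and that text
bold_heading = soup.find("th").find("b").get_text()

# Every "a" element that has an href attribute
links_with_href = soup.find_all("a", href=True)

print(len(all_links), len(foo_links), bold_heading, len(links_with_href))
# → 3 1 Price 2
```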

What do I need?

You need to first install the BeautifulSoup module and then import it into your script.

You can install it with pip install beautifulsoup4 (or with the now-deprecated easy_install beautifulsoup4).

It’s also available as the python3-bs4 package in recent versions of Debian and Ubuntu.

Beautiful Soup 4 works on Python 3; older 4.x releases also supported Python 2 (2.6+).
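After installing, you can confirm the module is importable. One thing that trips up beginners: the package installs as beautifulsoup4 but is imported under the name bs4:

```python
# The PyPI package is "beautifulsoup4", but the import name is "bs4"
import bs4
from bs4 import BeautifulSoup

print(bs4.__version__)
```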

BeautifulSoup Examples

Before we start, we have to import two modules: bs4 (BeautifulSoup) and urllib.request (the Python 3 successor to Python 2’s urllib2).

urllib.request is used to open the URL we want.

Since BeautifulSoup does not fetch the web page for you, you will have to use the urllib.request module to do that.

#import the library used to query a website
import urllib.request

Search and find all html tags

We will use the soup.find_all method (written findAll in older versions of Beautiful Soup) to search the soup object for text and HTML tags within the page.

from bs4 import BeautifulSoup
import urllib.request

# Open the page and read its HTML
url = urllib.request.urlopen("https://www.python.org")
content = url.read()

# Parse the HTML into a searchable soup object
soup = BeautifulSoup(content, "html.parser")

# Find every element with an "a" tag
links = soup.find_all("a")
print(links)

That will print out all the elements on python.org with an “a” tag.

That is the tag that defines a hyperlink, which is used to link from one page
to another.
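Each result in that list is a Tag object, so you can pull the actual URL out of its href attribute. A small offline sketch (the HTML string here is made up for illustration, standing in for a fetched page):

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page
content = '<a href="/about">About</a> <a href="/downloads">Downloads</a>'
soup = BeautifulSoup(content, "html.parser")

# Extract just the URL from each "a" tag
urls = [link["href"] for link in soup.find_all("a")]
print(urls)
# → ['/about', '/downloads']
```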

Find all links on Reddit

Fetch the Reddit page’s HTML by using Python’s built-in urllib.request module.

Once we have the actual HTML for the page, we create a new BeautifulSoup
object to take advantage of its simple API.

from bs4 import BeautifulSoup
import urllib.request

# Download the page and close the connection when done
pageFile = urllib.request.urlopen("https://www.reddit.com")
pageHtml = pageFile.read()
pageFile.close()

# Parse the downloaded HTML
soup = BeautifulSoup(pageHtml, "html.parser")

# Find every "a" element and print it
sAll = soup.find_all("a")
for link in sAll:
    print(link)

Scraping the Huffington Post website

Here is another example I saw on newthinktank.com

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

# Copy all of the content from the provided web page
webpage = urlopen('http://feeds.huffingtonpost.com/huffingtonpost/LatestNews').read().decode('utf-8')

# Grab everything that lies between the title tags using a REGEX
patFinderTitle = re.compile('<title>(.*?)</title>')

# Grab the link to the original article using a REGEX
patFinderLink = re.compile('<link>(.*?)</link>')

# Store all of the titles and links found in 2 lists
findPatTitle = re.findall(patFinderTitle, webpage)

findPatLink = re.findall(patFinderLink, webpage)

# Cycle through articles 2 through 15, skipping the first few entries
listIterator = range(2, 16)

soup2 = BeautifulSoup(webpage, "html.parser")

titleSoup = soup2.find_all("title")

linkSoup = soup2.find_all("link")

for i in listIterator:
    print(titleSoup[i])
    print(linkSoup[i])
    print()
More Reading

http://www.crummy.com/software/BeautifulSoup/
http://www.newthinktank.com/2010/11/python-2-7-tutorial-pt-13-website-scraping/
http://kochi-coders.com/?p=122


Filed Under: Beautiful Soup, Python On The Web, urllib2 Author: PFB Staff Writer

Copyright © 2012–2025 · PythonForBeginners.com
