PythonForBeginners.com

Learn By Example


Web Scraping with BeautifulSoup

Author: PFB Staff Writer
Last Updated: January 31, 2021

Web Scraping

“Web scraping (web harvesting or web data extraction) is a computer software technique of extracting information from websites.”

HTML parsing is easy in Python, especially with the help of the BeautifulSoup library. In this post we will scrape a website (our own) to extract all of its URLs.

Getting Started

To begin, make sure that you have the necessary modules installed. The example below uses Beautiful Soup 4 and Requests on a system with Python 2.7 installed. Both BeautifulSoup and Requests can be installed with pip:


$ pip install requests

$ pip install beautifulsoup4
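
To confirm that both packages were installed, one option is to ask Python for their versions. This is a minimal sketch that assumes Python 3.8 or newer (not the Python 2.7 setup mentioned above); the helper name installed_version is made up for this illustration.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Package names as they are passed to pip, not the import names.
for name in ("requests", "beautifulsoup4"):
    print(name, installed_version(name) or "not installed")
```

If either line prints "not installed", run the matching pip command again.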

What is Beautiful Soup?

At the top of the Beautiful Soup website, you can read: “You didn’t write that awful page. You’re just trying to get some data out of it. Beautiful Soup is here to help. Since 2004, it’s been saving programmers hours or days of work on quick-turnaround screen scraping projects.”

Beautiful Soup Features:

Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. It doesn’t take much code to write an application.
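
As a rough illustration of navigating, searching, and modifying a parse tree, here is a small sketch. The HTML string is a hypothetical sample made up for this example, and the code assumes Python 3 with the built-in html.parser.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, made up for this illustration.
html = """
<html><head><title>Sample page</title></head>
<body>
<p class="intro">Read the <a href="/docs/">docs</a> or the <a href="/blog/">blog</a>.</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Navigating: attribute access walks down to the first matching tag.
page_title = soup.title.string

# Searching: find_all() returns every matching tag in the document.
hrefs = [a.get("href") for a in soup.find_all("a")]

# Modifying: a tag's attributes can be read and assigned like dictionary keys.
soup.p["class"] = "lead"
first_paragraph_class = soup.p["class"]

print(page_title)
print(hrefs)
print(first_paragraph_class)
```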

Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don’t have to think about encodings, unless the document doesn’t specify an encoding and Beautiful Soup can’t autodetect one. Then you just have to specify the original encoding.
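
The case where you have to specify the original encoding might look like the sketch below. It assumes Python 3, and the Latin-1 input bytes are a hypothetical example; from_encoding is the BeautifulSoup constructor argument for naming the encoding yourself.

```python
from bs4 import BeautifulSoup

# Hypothetical input: Latin-1 bytes with no encoding declared in the markup.
raw = "<p>café</p>".encode("latin-1")

# from_encoding tells Beautiful Soup how to decode the incoming bytes.
soup = BeautifulSoup(raw, "html.parser", from_encoding="latin-1")

# Inside the parse tree, text is held as Unicode strings.
text = soup.p.string

# On the way out, encode() produces bytes; UTF-8 is requested explicitly here.
utf8_bytes = soup.p.encode("utf-8")

print(text)
print(utf8_bytes)
```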

Beautiful Soup sits on top of popular Python parsers like lxml and html5lib, allowing you to try out different parsing strategies or trade speed for flexibility.
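
One way to try different parsers is to pass the parser name as the second argument and fall back when a parser is not installed. The helper make_soup and its preference order are assumptions made for this sketch, not part of Beautiful Soup; html.parser ships with Python, so it serves as the last resort.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Parse markup with the first parser in `preferred` that is installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is available")


soup, parser_used = make_soup("<p>Hello <a href='/x'>link</a>")
print(parser_used)
print(soup.a["href"])
```

Different parsers can build slightly different trees from invalid markup, so it is worth settling on one parser for a given project.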

Extracting URLs from any website

Now that we know what BS4 is and have installed it on our machine, let’s see what we can do with it.


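from bs4 import BeautifulSoup
import requests

# raw_input() is Python 2; on Python 3, use input() instead.
url = raw_input("Enter a website to extract the URL's from: ")

# Download the page and take the response body as text.
r = requests.get("http://" + url)
data = r.text

# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(data, "html.parser")

# Print the href attribute of every <a> tag on the page.
for link in soup.find_all('a'):
    print(link.get('href'))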
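
The script above prints None for anchors without an href and leaves relative links as they are. A possible variant, assuming Python 3, wraps the extraction in a function that skips such anchors and resolves relative links against the page address. The function name extract_links and the sample markup are made up for this sketch.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return absolute URLs for every <a> tag that has an href attribute."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # href=True skips anchors that have no href attribute at all.
    for anchor in soup.find_all("a", href=True):
        # urljoin turns relative links such as /loops/ into absolute URLs.
        links.append(urljoin(base_url, anchor["href"]))
    return links


# Hypothetical markup standing in for a downloaded page.
sample = (
    '<a href="/loops/">Loops</a> '
    '<a name="top">Top</a> '
    '<a href="https://example.com/x">X</a>'
)
print(extract_links(sample, "https://www.pythonforbeginners.com/"))
```

To use it on a live page, pass in the text of a Requests response, for example extract_links(requests.get(url, timeout=10).text, url).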

When we run this program, it asks for a website to extract the URLs from:


Enter a website to extract the URL's from: www.pythonforbeginners.com
https://www.pythonforbeginners.com/python-overview-start-here/
https://www.pythonforbeginners.com/dictionary/
https://www.pythonforbeginners.com/python-functions-cheat-sheet/
https://www.pythonforbeginners.com/loops/
https://www.pythonforbeginners.com/python-modules/
https://www.pythonforbeginners.com/strings/
https://www.pythonforbeginners.com/sitemap/
https://www.pythonforbeginners.com/feed/
....
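
A list like this often contains repeated links and links to other sites. As a rough sketch, assuming Python 3, the URLs could be de-duplicated and limited to one domain with the standard library. The helper internal_links and the sample list are made up for this illustration.

```python
from urllib.parse import urlparse


def internal_links(urls, domain):
    """Keep each URL on `domain` once, preserving the original order."""
    seen = set()
    result = []
    for url in urls:
        # Skip links to other hosts and links we have already kept.
        if urlparse(url).netloc != domain or url in seen:
            continue
        seen.add(url)
        result.append(url)
    return result


# Hypothetical scraped links, including a duplicate and an external URL.
found = [
    "https://www.pythonforbeginners.com/loops/",
    "https://www.pythonforbeginners.com/strings/",
    "https://example.com/elsewhere/",
    "https://www.pythonforbeginners.com/loops/",
]
print(internal_links(found, "www.pythonforbeginners.com"))
```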

To learn more about Beautiful Soup, I recommend reading our introductory article, Beautiful Soup 4 Python.

More Reading

http://www.crummy.com/software/BeautifulSoup/

http://docs.python-requests.org/en/latest/index.html





Copyright © 2012–2025 · PythonForBeginners.com
