Sep. 22, 2012

Python Code : Get all the links from a website

Overview

In this script, we are going to use the re module to get all links from any website. 

One of the most powerful functions in the re module is "re.findall()".

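While re.search() is used to find the first match for a pattern, re.findall() finds *all*
the matches and returns them as a list of strings, with each string representing one match.
(If the pattern contains more than one capturing group, each match is returned as a tuple
of the groups instead of a single string.)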
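
To make the difference concrete, here is a small sketch that runs both functions on the same text. The sample string and the pattern are made up for illustration and are not part of the original script; the example is written for Python 3.

```python
import re

# Hypothetical sample text containing two quoted URLs
text = 'See "http://example.com" and "https://example.org/docs".'

# One capturing group: the URL between the double quotes
pattern = r'"(https?://.*?)"'

# re.search() stops at the first match and returns a match object (or None)
first = re.search(pattern, text)
print(first.group(1))

# re.findall() scans the whole string and returns every match as a list
all_matches = re.findall(pattern, text)
print(all_matches)
```

Because the pattern has exactly one capturing group, re.findall() returns a plain list of strings here.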

Get all links from a website

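This example extracts the links from a web page's HTML code. Note that the pattern
used here only finds absolute http, https, ftp, and ftps URLs wrapped in double quotes;
relative links are not matched.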

To find all the links, this example uses the urllib2 module (Python 2) together
with the re module:
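import urllib2
import re

# URL of the page to scan (placeholder; replace it with the site you want)
url = "http://www.example.com"

# connect to the URL
website = urllib2.urlopen(url)

# read the HTML code
html = website.read()

# use re.findall to get all the quoted absolute links;
# (?:...) is a non-capturing group, so each match is returned as one string
links = re.findall(r'"((?:http|ftp)s?://.*?)"', html)

print links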
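
The script above targets Python 2, where urllib2 is available. On Python 3 that module no longer exists, so the following is a rough sketch of how the same idea might look there, using urllib.request instead. The function names fetch_html and extract_links, the sample HTML, and the UTF-8 decoding are assumptions made for this illustration, not part of the original script.

```python
import re
from urllib.request import urlopen

# Non-capturing group (?:...) so findall() returns plain strings, not tuples
LINK_PATTERN = re.compile(r'"((?:http|ftp)s?://.*?)"')


def fetch_html(url):
    """Download a page and decode it to text (UTF-8 is assumed here)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_links(html):
    """Return every double-quoted absolute http(s)/ftp(s) URL found in html."""
    return LINK_PATTERN.findall(html)


# Hypothetical HTML snippet, used so the example runs without a network call
sample = (
    '<a href="http://example.com/a">A</a> '
    '<a href="ftp://example.com/f">F</a> '
    '<a href="/relative">R</a>'
)
print(extract_links(sample))
```

To scan a live page, pass the result of fetch_html(url) to extract_links(). As the sample shows, the relative link is skipped because the pattern only matches URLs that start with a scheme.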
Happy scraping!
