Oct. 15, 2012

Fetching data from the Internet

What is urllib2?

urllib2 is a Python module for fetching URLs. 

What can it do?

It offers a very simple interface, in the form of the urlopen function. 

urlopen can fetch URLs using a variety of protocols, such as http, ftp and
file. 
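As a rough sketch of the point about protocols, the snippet below fetches a local file through a file: URL, so it runs without network access. The fetch helper and the temporary file are made up for this illustration, and the import fallback assumes that on Python 3 the same functions are available from urllib.request.

```python
import os
import tempfile

try:  # Python 2, as used in this article
    import urllib2
    from urllib import pathname2url
except ImportError:  # assumption: on Python 3 the same API lives in urllib.request
    import urllib.request as urllib2
    from urllib.request import pathname2url


def fetch(url):
    """Hypothetical helper: return the raw body behind a URL that urlopen can handle."""
    response = urllib2.urlopen(url)
    try:
        return response.read()
    finally:
        response.close()


# Write a small local file to stand in for a remote resource.
handle, path = tempfile.mkstemp(suffix=".txt")
os.write(handle, b"hello from a local file\n")
os.close(handle)

# Build a file: URL for that path and fetch it the same way as an http URL.
local_url = "file:" + pathname2url(path)
body = fetch(local_url)
os.remove(path)
```

The same fetch call should work with an http or ftp URL; only the scheme at the front of the URL changes.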

It also offers an interface for handling basic authentication, cookies, proxies
and so on. 

These are provided by objects called handlers and openers. 
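To make handlers and openers a little more concrete, here is a minimal sketch that builds an opener with cookie support. The cookie jar is just an example choice. The import fallback assumes that on Python 3 urllib2 and cookielib correspond to urllib.request and http.cookiejar.

```python
try:  # Python 2, as used in this article
    import urllib2
    import cookielib
except ImportError:  # assumption: Python 3 equivalents of the same modules
    import urllib.request as urllib2
    import http.cookiejar as cookielib

# A handler knows how to deal with one concern, here storing and sending cookies.
cookie_jar = cookielib.CookieJar()
cookie_handler = urllib2.HTTPCookieProcessor(cookie_jar)

# build_opener chains our handler together with the default ones (http, ftp, file...).
opener = urllib2.build_opener(cookie_handler)

# List the handlers the opener ended up with.
handler_names = [h.__class__.__name__ for h in opener.handlers]

# opener.open(url) would now fetch a URL with cookie handling enabled, and
# urllib2.install_opener(opener) would make plain urlopen use this opener too.
```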

HTTP Requests

HTTP is based on requests and responses, in that the client makes requests and
the servers send responses. 

urlopen returns the response as a file-like object, which means you can, for
example, call .read() on it. 
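Here is a small sketch of what "file-like" means in practice. It reads one line and then the rest of a response, using a temporary local file as a stand-in for a web page so that it runs offline. The file contents are invented for the example, and the import fallback assumes Python 3 exposes the same functions in urllib.request.

```python
import os
import tempfile

try:  # Python 2, as used in this article
    import urllib2
    from urllib import pathname2url
except ImportError:  # assumption: on Python 3 the same API lives in urllib.request
    import urllib.request as urllib2
    from urllib.request import pathname2url

# Stand-in for a web page: a two-line local file served through a file: URL.
handle, path = tempfile.mkstemp(suffix=".html")
os.write(handle, b"<title>demo</title>\n<p>body</p>\n")
os.close(handle)

response = urllib2.urlopen("file:" + pathname2url(path))
try:
    first_line = response.readline()  # read a single line, like a file
    rest = response.read()            # read whatever is left
finally:
    response.close()
    os.remove(path)
```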

How can I use it?

import urllib2
response = urllib2.urlopen('http://python.org/')
html = response.read()

User Agents

You can also add your own headers with urllib2. 

Some websites dislike being browsed by programs. 

By default urllib2 identifies itself as Python-urllib/x.y (where x and y are
the major and minor version numbers of the Python release), 
which may confuse the site, or just plain not work. 

The way a browser identifies itself is through the User-Agent header.

This post describes how to set the User-Agent header in a program: 
https://www.pythonforbeginners.com/network/python-modules-urllib2-user-agent/ 
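As a quick sketch of the idea, a custom User-Agent can be attached by building a Request object with a headers dictionary. The user agent string below is a made-up example, and the import fallback assumes Python 3's urllib.request offers the same Request class.

```python
try:  # Python 2, as used in this article
    import urllib2
except ImportError:  # assumption: on Python 3 the same API lives in urllib.request
    import urllib.request as urllib2

# Hypothetical identifier; pick one that describes your own program.
user_agent = "Mozilla/5.0 (compatible; ExampleBot/0.1)"

# Extra headers go into the Request object, not into urlopen itself.
request = urllib2.Request("http://python.org/", headers={"User-Agent": user_agent})

# Request stores header names capitalized, so the key becomes "User-agent".
stored = request.get_header("User-agent")

# urllib2.urlopen(request) would now send the request with this header.
```

Nothing is sent over the network until the request is passed to urlopen, so the header can be checked first.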

Get HTTP Headers

Let's write a small script that gets the HTTP headers from a website. 

import urllib2

# Open the URL; the response carries both the body and the headers.
response = urllib2.urlopen("http://www.python.org")
print "-" * 20
print "URL : ", response.geturl()

# info() returns the response headers as a dictionary-like object.
headers = response.info()
print "-" * 20
print "This prints the header: ", headers
print "-" * 20
# Header names are matched case-insensitively; indexing raises KeyError
# if the server did not send that header.
print "Date :", headers['date']
print "-" * 20
print "Server Name: ", headers['Server']
print "-" * 20
print "Last-Modified: ", headers['Last-Modified']
print "-" * 20
print "ETag: ", headers['ETag']
print "-" * 20
print "Content-Length: ", headers['Content-Length']
print "-" * 20
print "Connection: ", headers['Connection']
print "-" * 20
print "Content-Type: ", headers['Content-Type']
print "-" * 20

This gives output similar to the following:

--------------------
URL :  http://www.python.org
--------------------
This prints the header:  Date: Fri, 12 Oct 2012 08:09:40 GMT
Server: Apache/2.2.16 (Debian)
Last-Modified: Thu, 11 Oct 2012 22:36:55 GMT
ETag: "105800d-4de0-4cbd035514fc0"
Accept-Ranges: bytes
Content-Length: 19936
Vary: Accept-Encoding
Connection: close
Content-Type: text/html

--------------------
Date : Fri, 12 Oct 2012 08:09:40 GMT
--------------------
Server Name:  Apache/2.2.16 (Debian)
--------------------
Last-Modified:  Thu, 11 Oct 2012 22:36:55 GMT
--------------------
ETag:  "105800d-4de0-4cbd035514fc0"
--------------------
Content-Length:  19936
--------------------
Connection:  close
--------------------
Content-Type:  text/html
--------------------
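Not every server sends every header, so a script like this can stop with a KeyError. A possible variation, sketched here, looks headers up with .get(), which returns None for a missing header. The pick_headers helper is made up for this example, a temporary local file stands in for the website so the snippet runs offline, and the import fallback assumes Python 3's urllib.request.

```python
import os
import tempfile

try:  # Python 2, as used in this article
    import urllib2
    from urllib import pathname2url
except ImportError:  # assumption: on Python 3 the same API lives in urllib.request
    import urllib.request as urllib2
    from urllib.request import pathname2url


def pick_headers(headers, names):
    """Hypothetical helper: (name, value) pairs, with None for missing headers."""
    return [(name, headers.get(name)) for name in names]


# A five-byte local file stands in for the website.
handle, path = tempfile.mkstemp(suffix=".txt")
os.write(handle, b"hello")
os.close(handle)

response = urllib2.urlopen("file:" + pathname2url(path))
picked = pick_headers(response.info(), ["Content-Type", "Content-Length", "ETag"])
response.close()
os.remove(path)

# A local file has no ETag, so that value comes back as None instead of failing.
for name, value in picked:
    print("%s: %s" % (name, value))
```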
