Oct. 14, 2013

Using Feedparser in Python

Overview

In this post we will take a look at how to download and parse syndicated
feeds with Python.

The Python module we will use for that is "Feedparser".

The complete documentation can be found here.

What is RSS?

RSS stands for Rich Site Summary and uses standard web feed formats to publish
frequently updated information: blog entries, news headlines, audio, video. 

An RSS document (called "feed", "web feed", or "channel") includes full or
summarized text, and metadata, like publishing date and author's name. [source]

What is Feedparser?

Feedparser is a Python library that parses feeds in all known formats, including 
Atom, RSS, and RDF. It runs on Python 2.4 all the way up to 3.3. [source]

RSS Elements

Before we install the feedparser module and start to code, let's take a look
at some of the available RSS elements.

The most commonly used elements in RSS feeds are "title", "link", "description",
"publication date", and "entry ID". 

The less commonly used elements are "image", "categories", "enclosures",
and "cloud".
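
To make these elements concrete, here is a minimal sketch of a hand-written RSS 2.0 document. The feed content (titles, URLs, date) is invented for illustration. It uses the standard library's xml.etree.ElementTree only to list the element names, not feedparser. In RSS 2.0 the "publication date" is the pubDate tag and the "entry ID" is the guid tag.

```python
import xml.etree.ElementTree as ET

# A hand-written RSS 2.0 document; every title and URL here is made up.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <link>http://example.com/</link>
    <description>An illustrative feed</description>
    <item>
      <title>First post</title>
      <link>http://example.com/first</link>
      <description>Summary of the first post</description>
      <pubDate>Mon, 14 Oct 2013 09:00:00 GMT</pubDate>
      <guid>http://example.com/first</guid>
      <category>python</category>
    </item>
  </channel>
</rss>"""

# The channel holds feed-level elements plus one <item> per entry.
channel = ET.fromstring(SAMPLE_RSS).find("channel")
item = channel.find("item")

# Collect the tag names of the item's child elements, in document order.
item_tags = [child.tag for child in item]
print(item_tags)
```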

Install Feedparser

To install feedparser on your computer, open your terminal and install it using
"pip" (a tool for installing and managing Python packages):

sudo pip install feedparser

To verify that feedparser is installed, you can run "pip list".

You can also enter Python's interactive mode and import the feedparser module
there.

If the import returns to the prompt without an error, as shown below, feedparser
is installed.

>>> import feedparser
>>>

Now that we have installed the feedparser module, we can begin to work with it.

Getting the RSS feed

You can use any RSS feed that you want. Since I like to read Reddit, I will use
that for my example. 

Reddit is made up of many sub-reddits. The one I am particularly interested in
for now is the "Python" sub-reddit.

To get the RSS feed, look up the URL of that sub-reddit and add ".rss" to it.

The RSS feed that we need for the python sub-reddit would be:
http://www.reddit.com/r/python/.rss
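
The same pattern can be wrapped in a small helper. This is only a sketch: the function name subreddit_feed_url is made up for this post, and it assumes Reddit keeps serving feeds at this URL pattern.

```python
def subreddit_feed_url(name):
    """Build the RSS feed URL for a sub-reddit by appending ".rss" to its URL."""
    return "http://www.reddit.com/r/%s/.rss" % name


print(subreddit_feed_url("python"))
```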

Using Feedparser

Start your program by importing the feedparser module.

import feedparser

Parse the feed. Pass in the URL of the RSS feed that you want.

d = feedparser.parse('http://www.reddit.com/r/python/.rss')

The channel elements are available in d.feed (remember the "RSS Elements" above).

The items are available in d.entries, which is a list.

You access items in the list in the same order in which they appear in the
original feed, so the first item is available in d.entries[0].

Print the title of the feed

print(d['feed']['title'])

>>> Python

Print the link of the feed (feedparser resolves relative links)

print(d['feed']['link'])

>>> http://www.reddit.com/r/Python/

Print the subtitle of the feed (feedparser parses escaped HTML)

print(d.feed.subtitle)

>>> news about the dynamic, interpreted, interactive, object-oriented, extensible
programming language Python

See the number of entries

print(len(d['entries']))

>>> 25

Each entry in the feed is a dictionary. Use [0] to access the first entry and
print its title.

print(d['entries'][0]['title'])

>>> Functional Python made easy with a new library: Funcy

Print the link of the first entry

print(d.entries[0]['link'])

>>> http://www.reddit.com/r/Python/comments/1oej74/functional_python_made_easy_with_a_new_library/

Use a for loop to print all posts and their links.

for post in d.entries:
    print(post.title + ": " + post.link + "\n")

>>>
Functional Python made easy with a new library: Funcy: http://www.reddit.com/r/Python/comments/1oej74/functional_python_made_easy_with_a_new_library/

Python Packages Open Sourced: http://www.reddit.com/r/Python/comments/1od7nn/python_packages_open_sourced/

PyEDA 0.15.0 Released: http://www.reddit.com/r/Python/comments/1oet5m/pyeda_0150_released/

PyMongo 2.6.3 Released: http://www.reddit.com/r/Python/comments/1ocryg/pymongo_263_released/

...

Print the feed type and version

print(d.version)

>>> rss20

Access all HTTP headers

print(d.headers)

>>>
{'content-length': '5393', 'content-encoding': 'gzip', 'vary': 'accept-encoding',
'server': "'; DROP TABLE servertypes; --", 'connection': 'close',
'date': 'Mon, 14 Oct 2013 09:13:34 GMT', 'content-type': 'text/xml; charset=UTF-8'}

Get just the content-type from the headers

print(d.headers.get('content-type'))

>>> text/xml; charset=UTF-8

Using feedparser is an easy and fun way to parse RSS feeds.

Sources

http://www.slideshare.net/LindseySmith1/feedparser
http://code.google.com/p/feedparser/

