Posted November 28, 2009 by Spyros in Python Programming
 
 

How to Use The Urllib Python Library to Fetch URL Data and More


I will say it outright: if you're not using Python, you are missing out on a lot. I understand that some of you will object, pointing out how nice Perl is or how well things work in Ruby on Rails. Well, get to know all the scripting worlds if you like. Then come back to Python programming.

I'm not trying to preach about how good Python is (I hope you know that already :P), and being knowledgeable in other scripting languages is perfectly fine. Bear with me, though: it would be a really good idea to take up Python if you've never done so before. I can assure you it will be well worth it. Alas, if you wanted more preaching you could have read the 5 reasons why you should learn Python, you say angrily. You're right, sorry.

To the point now. One of the very good things about Python is that you can find lots of premade libraries for almost anything you may want to do. I must admit that the documentation is often not as well organized as Perl's CPAN, but it still offers plenty of great reusable code that makes your job much easier.

In your scripting endeavours, there will be many times when you need to fetch some data from a website. In my case, one of the best uses for that was a script that automated the login and the whole playing procedure for a well-known browser game. Yes, I am a cheater. But come on, I did it for practice (not really, but I have to defend myself somehow).

In that case, I needed to do two things. First, I needed to log in to the game. Then, after doing that, I needed to grab certain web pages and send back certain requests that would automate procedures like building an army, starting new constructions and such. Generally speaking, getting a premium service at my own expense. The tricky part is cookies. A session cookie is what identifies your session to the server between requests. Thus, if the server notices inactivity for a certain period of time (usually 15 minutes), your session expires and you get a message like "please log in".

Therefore, you need to use urllib in such a way that it creates one or more cookies and handles them. But before messing with cookies, let's first look at a much simpler example of using urllib2 to fetch data from a URL, without creating any cookies:


import urllib2

def getPage():
    url = "http://www.whatever.com"

    # build a Request object for the URL and open it
    req = urllib2.Request(url)
    response = urllib2.urlopen(req)

    # the response behaves like a file object; read() returns the whole body
    return response.read()

if __name__ == "__main__":
    namesPage = getPage()
    print namesPage

This is a very simple but effective example. The first thing we need to do is call the urllib2 function named Request(). We invoke it with the URL as a parameter and get a request object back. This is Request() used in its simplest form; you could also specify HTTP headers and POST data, but we will keep it simple for this first example. Then we call the urlopen() function, which returns a file-like object holding the server's response. After that, we simply use the standard Python read() method to read the whole response body and print it out.
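Incidentally, the object returned by urlopen() offers a bit more than read(). Here is a minimal sketch of the extra information it carries, using the same placeholder URL as above:

import urllib2

# the URL is just a placeholder, as in the example above
response = urllib2.urlopen(urllib2.Request("http://www.whatever.com"))

print response.geturl()   # the final URL, after any redirects
print response.info()     # the HTTP response headers
print response.read(200)  # read() also accepts an optional byte count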

Messing With Form Parameters, Cookies and HTTP Headers

While this is a good example of the core use of urllib2, it doesn't use the library to its full potential. Let's see a more complicated example:

import urllib
import urllib2
import ClientCookie

def login():
    url = "http://domain.com/login.php"  # example url

    # the form fields the login page expects
    opts = {
        'user': 'yourUSERNAME',
        'password': 'yourPASS',
        'server': 'serverName'
    }

    data = urllib.urlencode(opts)

    # pretend to be a regular Firefox browser
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.0; en-GB; rv:1.8.1.12) Gecko/20080201 Firefox/2.0.0.12',
        'Accept': 'text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5',
        'Accept-Language': 'en-gb,en;q=0.5',
        'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.7',
        'Connection': 'keep-alive'
    }

    req = urllib2.Request(url, data, headers)

    # ClientCookie.urlopen stores the session cookie for us
    response = ClientCookie.urlopen(req)
    return response.read()

This is the actual login function (edited a bit) that I was using for the script I mentioned before. It's a bit more complicated, but it resembles a real-world script. Let's unpack it together. Notice that this time we pass two more parameters to urllib2's Request() function. The first one is the form data. Since we want to log in to the game, we have to provide our credentials: our name, our password and the server we play on. You can easily find the names of the actual form elements using Firefox's great Web Developer plugin.

As you can see, opts is actually a dictionary with the name and value of each parameter that we need to pass to the server. This dictionary is then passed to urlencode(), which does what its name says: it URL-encodes the data so that it can be properly passed on to the server. Then, to make this a bit more professional, we disguise ourselves as Mozilla Firefox. Nobody would suspect that we are a bot, right? Firefox headers look much more casual, don't you think?
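If you're curious what urlencode() actually produces, here's a tiny sketch using the same hypothetical form fields as above:

import urllib

# hypothetical form fields, matching the example above
opts = {
    'user': 'yourUSERNAME',
    'password': 'yourPASS',
    'server': 'serverName'
}

print urllib.urlencode(opts)
# prints something like: server=serverName&password=yourPASS&user=yourUSERNAME
# (dictionary order is arbitrary in Python 2, so the field order may vary)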

After that, back in the login function, the procedure is almost the same. We just use ClientCookie instead of urllib2 to invoke urlopen() and fetch the page. What this does is allow Python to store a session cookie for us. That's it. We are now effectively logged in to the game and can do pretty much everything :)
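Since ClientCookie keeps its cookie jar between calls (which is exactly what made it convenient here), any later request made through it reuses the session cookie from login(). A small sketch, assuming a made-up overview page URL:

import ClientCookie

def getOverviewPage():
    # this request is sent with the session cookie stored during login(),
    # so the server treats us as logged in; the URL is only a placeholder
    response = ClientCookie.urlopen("http://domain.com/overview.php")
    return response.read()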

Multiple Cookies Handling

There was a situation where I needed to handle multiple cookies and ClientCookie just wasn't doing it. I thought it would be good to let you know about that case as well. So here is the final example of this tutorial, showing how you can go about handling multiple cookies:


import urllib
import urllib2
import cookielib

def login():
    url = "https://whatever.com/login.php"
    opts = {
        'email': 'emailaddr',
        'pass': 'password',
    }

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.0; en-GB; rv:1.8.1.12) Gecko/20080201 Firefox/2.0.0.12',
        'Accept': 'text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5',
        'Accept-Language': 'en-gb,en;q=0.5',
        'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.7',
        'Connection': 'keep-alive'
    }

    data = urllib.urlencode(opts)
    request = urllib2.Request(url, data, headers)

    # a cookie jar that can hold multiple cookies at once
    cookies = cookielib.CookieJar()

    # the cookie processor extracts cookies from every response into the jar
    # and sends them back on every request, so we don't have to do it by hand
    cookie_handler = urllib2.HTTPCookieProcessor(cookies)
    redirect_handler = urllib2.HTTPRedirectHandler()
    opener = urllib2.build_opener(redirect_handler, cookie_handler)

    response = opener.open(request)

    return response.read()

if __name__ == "__main__":
    print login()

As you can see, this is quite different from the previous examples. We use cookielib to create what is called a cookie jar, which will store multiple cookies for us. Then build_opener() is used to combine the redirect handler with the cookie handler, and after that things are back to normal. Using the opener object, we just call open() on it in order to get a file-like object with the server's response.
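If you want every plain urllib2.urlopen() call in your script to share that same cookie jar, you can also install the opener globally with install_opener(). A small sketch, with a made-up members page URL:

import urllib2
import cookielib

# build the same kind of cookie-aware opener and install it globally,
# so ordinary urllib2.urlopen() calls reuse this cookie jar from now on
cookies = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPRedirectHandler(),
                              urllib2.HTTPCookieProcessor(cookies))
urllib2.install_opener(opener)

# this request now carries (and stores) cookies automatically;
# the URL is just a placeholder
page = urllib2.urlopen("https://whatever.com/members.php").read()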

Make sure that you use this responsibly and don't go about creating bots! Or maybe you think it's a good way to practice, huh? I second that.


Spyros