How to Use The Urllib Python Library to Fetch URL Data and More

I will claim it: if you're not using Python, you are missing out on a lot. I understand that some of you will argue the point, saying how nice Perl is or how great things work with Ruby on Rails. Well, get to know all the scripting worlds if you like. Then come back to Python programming.
I'm not trying to preach to you about how good Python is (I hope you know that already :P), and being knowledgeable in other scripting languages is fair enough. Bear with me, though: it would be a really good idea to take up Python if you've never done so before, and I can assure you it will be well worth it. Alas, if you wanted more preaching you could have read the 5 reasons why you should learn Python, you say angrily. You're right, sorry.
To the point now. One of the very good things about Python is that you can find lots of premade libraries for all sorts of things you may want to do. I must admit that the documentation is often not as well organized as Perl's CPAN, but it still offers some great reusable code that makes your job much easier.
In your scripting endeavours, there will be many times when you need to fetch some data from a website. In my case, one of the best uses for that was creating a script that automated the login and the whole playing procedure for a well-known browser game. YES, I am a cheater. But come on, I did it for practice (not really, but I have to defend myself somehow).
In that case, I needed to do two things. First, I needed to log in to the game. Then, after doing that, I needed to grab certain webpages and send back certain requests that would automate procedures like building an army, starting new constructions and so on. Generally speaking, getting a premium service at my own expense. The tricky part about it is cookies. The cookie, in this sense, is how the server keeps track of your session. If it notices inactivity for a certain period of time (usually 15 minutes), your session expires and you get a message like "please log in".
Therefore, you need to use urllib in such a way that it accepts one or more cookies and handles them for you. But before messing with cookies, let's first check a much simpler example of using urllib to fetch data from a certain URL, without creating any cookies:
import urllib2

def getPage():
    url = "http://www.whatever.com"
    req = urllib2.Request(url)
    response = urllib2.urlopen(req)
    return response.read()

if __name__ == "__main__":
    namesPage = getPage()
    print namesPage
This is a very simple but effective example. The first thing we need to do is call urllib2's Request(). We invoke it with the URL as a parameter and get a request object back. This is Request() used in its simplest form; you could also specify HTTP headers and POST data, but we will keep it simple for this first example. Then you call the urlopen() function, which returns a file-like object wrapping the actual response. After that, you simply use the standard Python read() method to read the whole response body, return it and print it out.
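As a side note, the file-like object that urlopen() returns carries a bit more than the body. Here is a minimal sketch of what else you can pull out of it (same placeholder URL as above):

import urllib2

response = urllib2.urlopen("http://www.whatever.com")
print response.geturl()   # the final URL, after any redirects
print response.info()     # the response headers, e.g. Content-Type
body = response.read()    # the page body itself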
Messing With Form Parameters, Cookies and HTTP Headers
While this is a good example of the core use of urllib2, it doesn't use the library to its full potential. Let's see a more complicated example:
import urllib
import urllib2
import ClientCookie

def login():
    url = "http://domain.com/login.php"  # example url
    opts = {
        'user': 'yourUSERNAME',
        'password': 'yourPASS',
        'server': 'serverName'
    }
    data = urllib.urlencode(opts)
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.0; en-GB; rv:1.8.1.12) Gecko/20080201 Firefox/2.0.0.12',
        'Accept': 'text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5',
        'Accept-Language': 'en-gb,en;q=0.5',
        'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.7',
        'Connection': 'keep-alive'
    }
    req = urllib2.Request(url, data, headers)
    response = ClientCookie.urlopen(req)
    return response.read()
This is the actual login function (a bit edited) that I was using for the script I mentioned beforehand. It's a bit more complicated, but it resembles a real-world script. Let's unscramble it together. Notice that this time we pass two more parameters to urllib2's Request(). The first one is the POST data: since we want to log in to the game, we have to provide our credentials, namely our name, password and the server on which we play. You can easily get the names of the actual form elements using Firefox's great Web Developer plugin.
As you can see, opts is actually a dictionary with the name and value of each important parameter that we need to pass to the server. This dictionary is then passed to urlencode(), which does what it says: it URL-encodes the data so it can be properly sent to the server. Then, to make this a bit more professional, we disguise ourselves as Mozilla Firefox. Nobody would suspect that we are a bot, right? Firefox headers look much more casual, don't you think?
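To make the encoding step concrete, here is roughly what urlencode() produces for a dictionary like ours (the exact key order may vary, since dictionaries are unordered):

import urllib

opts = {'user': 'yourUSERNAME', 'password': 'yourPASS', 'server': 'serverName'}
print urllib.urlencode(opts)
# prints something like: server=serverName&password=yourPASS&user=yourUSERNAME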
After that, the procedure is almost the same. We just use ClientCookie as the means to fetch the webpage, invoking its urlopen(). What this does is let Python store the session cookie for us. That's it: we are now effectively logged in to the actual game and can do pretty much everything.
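Since ClientCookie keeps that session cookie in its own jar between calls, any later request made through it stays logged in. A small sketch of a hypothetical follow-up fetch (barracks.php is made up for illustration):

import ClientCookie

def getBarracksPage():
    # ClientCookie automatically resends the session cookie set during login()
    response = ClientCookie.urlopen("http://domain.com/barracks.php")
    return response.read()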
Handling Multiple Cookies
There was a situation where I needed to handle multiple cookies, and ClientCookie was just not doing that. I thought it would be good to let you know about that case as well. Thus, this is the final example of this tutorial, which shows how you can go about handling multiple cookies:
import urllib
import urllib2
import cookielib

def login():
    url = "https://whatever.com/login.php"
    opts = {
        'email': 'emailaddr',
        'pass': 'password',
    }
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.0; en-GB; rv:1.8.1.12) Gecko/20080201 Firefox/2.0.0.12',
        'Accept': 'text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5',
        'Accept-Language': 'en-gb,en;q=0.5',
        'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.7',
        'Connection': 'keep-alive'
    }
    data = urllib.urlencode(opts)
    request = urllib2.Request(url, data, headers)
    # The jar stores every cookie the server sets; the HTTPCookieProcessor
    # extracts them from responses and resends them automatically.
    cookies = cookielib.CookieJar()
    cookie_handler = urllib2.HTTPCookieProcessor(cookies)
    redirect_handler = urllib2.HTTPRedirectHandler()
    opener = urllib2.build_opener(redirect_handler, cookie_handler)
    response = opener.open(request)
    return response.read()

if __name__ == "__main__":
    print login()
As you can see, this is quite different from the previous examples. We use cookielib to create what is called a cookie jar, which will store multiple cookies for us. Then build_opener() is used to combine the redirect handler with the cookie handler, and after that things are back to normal. We just call open() on the opener object to get a file-like object containing the server's response.
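If you are curious about what the server actually set, the jar is iterable. A quick sketch, assuming the cookies jar and opener from the example above:

# after: response = opener.open(request)
for cookie in cookies:
    # each entry is a cookielib.Cookie object
    print cookie.name, '=', cookie.value, 'for domain', cookie.domain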
Make sure that you use this in a good manner and don't go about creating bots! Or maybe you think it's a good way to practice, huh? I second that.
Practice, Right…
I knew urllib2 could be used to retrieve web page information, but I had no idea it could be used to create bots.
Great article.
Thanks Krow. Yes, it can be used for pretty much anything related to remote data retrieval and manipulation, and it's actually a pretty good way to create such bots.
When I try your example, I get the following error:
UnboundLocalError: local variable ‘response’ referenced before assignment
What if the information is in JavaScript? Say there's a table of statistics, for example basketball player results, and this table is rendered with JavaScript. How could I fetch the info in that table and save the values I want to the variables I created?
@borja:
If you wanted to parse some data first based on what the server sends back to you, you would have to use a packet sniffer like Wireshark. You would then inspect the packets to understand what data is returned. Then, you would be able to parse that data the way you want.
@Spyros:
And is that hard? I've done some googling on Wireshark but don't exactly understand how I could use Python to grab the specific info for me.
@borja: Not at all. What you need to do is fire up Wireshark and start executing whatever you want to inspect in your browser. The requests that you send to the server and the responses that the server sends back are what you will be capturing in Wireshark.
Then, you check the packets that you receive (the data) and find out what requests were made and what responses were returned to you. You're effectively analyzing the protocol. If the server returns stuff to be displayed via JavaScript, you will see everything there.
If, on the other hand, the JavaScript is predefined and there is no interaction with the server, please take a look at:
http://groups.csail.mit.edu/uid/chickenfoot/
http://www.riverbankcomputing.co.uk/static/Docs/PyQt4/html/qtwebkit.html
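To make the first case concrete: once you know what the response looks like, extracting values from it in Python is straightforward with a regular expression. A rough sketch (the URL and the pattern are made up for illustration):

import re
import urllib2

html = urllib2.urlopen("http://whatever.com/stats.php").read()
# hypothetical pattern: grab every number sitting inside a <td> cell
scores = re.findall(r'<td>(\d+)</td>', html)
print scores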
@Spyros: Wow, thanks a lot, I will try it later. Right now it's 7 am and I have to sleep! One last question: is there any security issue with Wireshark? Thanks loads!
@Spyros: OMG, the Chickenfoot extension for Firefox is badass! That is really useful. Too bad I don't know JavaScript, but it really does look useful.
@borja: JavaScript is pretty easy. If you would like to take a look at it, here is a great resource:
http://eloquentjavascript.net/
Excuse my kindergartenesque computer knowledge, but I have a podcast and I am trying to understand my statistics. In trying to understand the statistic on USER AGENTS (?), it shows Python-urllib with 10660 downloads and iTunes with 1809 downloads. I don't know what the heck Python-urllib is, means, etc. Is there a way to explain it so a 46-year-old NOT computer literate person (me) can understand?
Thanks
nooj, Python-urllib is just a Python library used to fetch URL data. So if someone visits a website (or downloads a podcast) using Python, their program can get everything that a person would see and do anything it wants with that data.
I am new to Python. This article was very useful for me. Thanks.
Please go ahead and add more content to the site, I love it!
@aviation courses, thank you for your support. I will shortly be adding new posts.
Hi,
I want to create a script which opens browsers like Firefox, IE, Opera, Safari and Chrome at the same time, or in a loop, then opens a local HTML file, reads the content in a frame and compares the browsers' results to each other.
Could you please briefly point me in the right direction, if possible?
I am pretty new to Python, so I would be so grateful if you could help me with this.
Thanks,
– Shaira
@Shaira, please open up this discussion in the forum. This blog post is about a different topic.
You, my friend are a badass. Thanks for this!
Are you on twitter by any chance?
Hi, thanks.
I don't actually have a Twitter account. I used to have one in the past, but I am still not really into it.
thank you for this tutorial. bookmarked!
AMAZED, I am. 😯 Can you please explain what "cookie handling" exactly is?
Hello Bharat, a cookie is just a way to persist data in a web browser. “Cookie handling” refers to the way urllib can handle these values.
How would I use this code if I use Google Chrome?
It doesn't have anything to do with Chrome, Bharat; it's a standalone Python program that you can execute. The cookies are stored locally by the Python library urllib.
You mean there's no need to change the header code that mentions Mozilla?
Indeed, that is not needed, since it's just a way to forge the user agent and pretend that the request comes from Mozilla Firefox.
I get you now. It doesn't have anything to do with the web browser. We have to fake it all. 😉 Thank you.
Hey dude, I have got a question. Is it possible to write a program that logs into my Facebook profile, searches randomly for people, say girls, then visits their profiles and downloads their profile pictures?
As long as the images are public, it's definitely possible, yes.
Thank you. I am on my way with that. I will come back here if I need your help. 😉
Is there any way I can contact you? Because I have so many questions. I would have asked them here, but they are a bit off-topic.
Dude, I really need some serious help. Please, tell me a way to contact you.