#.think.in
learn.create.enjoy

Scraping this blog

March 18, 2010 23:31 by tarn

I have created a monster and this post is about killing it off by scraping the contents of this blog into structured Python objects. Sometime later I will convert the HTML content to markdown and download the images and other resources locally.

I want to put the contents into a ZODB object database to get a feel for working with an object database. A greater goal is to migrate the content to a new blog engine. I don't want to go into why I felt I needed to scrape it or why I want to migrate to another blog engine, as it's depressing.

Moving on, I wanted to put the content into these classes

class Post():
    title = ''
    author = ''
    content = ''
    date = ''
    tags = []
    comments = []


class Comment():
    content = ''
    author = ''
    date = ''
    website = ''
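
One thing to watch with classes written this way: lists declared at class level are shared by every instance until they are reassigned. The scraper below reassigns tags and comments on each post, so it is not bitten by this, but anything that appends instead would be. A minimal sketch of the same two classes with per-instance attributes; the __init__ signatures are an assumption for illustration, not something the original code defines.

```python
class Comment(object):
    def __init__(self, author='', content='', date=None, website=''):
        self.author = author
        self.content = content
        self.date = date
        self.website = website


class Post(object):
    def __init__(self, title='', content='', date=None):
        self.title = title
        self.content = content
        self.date = date
        # Created inside __init__, so each post gets its own lists.
        self.tags = []
        self.comments = []


first = Post(title='first')
second = Post(title='second')
first.tags.append('python')  # only touches the first post
```

With the class-level version, the same append would have shown up on both posts.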

The scraping code is not elegant but was quite fun to write as I could write it all from an interactive console session. I found BeautifulSoup was fantastic in making HTML into something that was easy to work with, although I would have liked to have used jQuery/CSS style selectors.

from BeautifulSoup import BeautifulSoup
from datetime import datetime
import urllib2
import re

def ParseComment(soup):
    comment = Comment()
    # First tag inside the author paragraph: holds the name and, when the
    # commenter left one, the link to their website.
    author = soup.find('p',{"class":"author"}).first()
    comment.author = author.string.strip()
    content = soup.find('p',{"class":"content"})
    if content:
        comment.content = content.prettify()
    if author.has_key('href'):
        comment.website = author['href']
    # Pull out the day/month/year hour:minute part that strptime expects.
    r = re.compile(r'\d+/\d+/\d+ \d+:\d+')
    date = r.findall(soup.find('p',{"class":"date"}).renderContents())[0]
    comment.date = datetime.strptime(date,'%d/%m/%Y %H:%M')
    return comment


def ParsePost(postSoup):
    post = Post()
    post.title = postSoup.find('a',{"class":re.compile('posthead.*')}).string
    print post.title
    post.content = postSoup.find('div', {"class":"entry"})
    # The descr text presumably reads like "March 18, 2010 23:31 by ";
    # drop the trailing " by " (4 characters) before parsing the date.
    date = postSoup.find('div',{"class":"descr"}).contents[0][:-4]
    post.date = datetime.strptime(date,'%B %d, %Y %H:%M')
    post.author = postSoup.find('div',{"class":"descr"}).first().string
    post.tags = map(lambda x: x.string, postSoup('a',{"rel":"tag"}))
    comments = postSoup.find('div',{"id":"commentlist"})('div')
    post.comments = [ParseComment(commentSoup) for commentSoup in comments]
    return post


def DownloadPost(url):
    postHtml = urllib2.urlopen('http://blog.sharpthinking.com.au/' + url).read()
    postSoup = BeautifulSoup(postHtml)
    return ParsePost(postSoup)


def GetPosts():
    page = urllib2.urlopen("http://blog.sharpthinking.com.au/archive.aspx")
    soup = BeautifulSoup(page)
    postUrls = map(lambda x: x['href'], soup('a', href=re.compile('/post/.*')))
    return [DownloadPost(url) for url in postUrls[:-10]]



I'm sure there is a better way, but this was better than any way I've used previously. Anyway, I've done a lot of work untangling the mess I created.

>>> posts = GetPosts()
>>> for post in posts[:5]:
...     print post.date, post.title
...
2010-03-17 22:16:00 OMG. It's a JavaScript Rhino
2010-03-12 13:53:00 Devevenings Presentation - IOC/Unit Testing/Mocking in ASP.NET MVC
2010-02-20 17:18:00 Revisiting Pygments in the browser with Silverlight, now with BackgroundWorker
2010-02-17 19:25:00 Revisiting Modal Binding an Interface, now with DictionaryAdapterFactory
2010-02-16 20:34:00 Modal Binding an Interface with DynamicProxy
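
The dates in that output come from the strptime call in ParsePost. Here is a rough sketch of that one step on its own, which is handy for checking the format string at the console. The descr_text value is an assumption based on the byline this blog shows under each title, not text captured from the real page.

```python
from datetime import datetime

# Assumed shape of the first text node in the "descr" div, based on the
# byline shown on the blog ("March 17, 2010 22:16 by tarn").
descr_text = 'March 17, 2010 22:16 by '

# Drop the trailing " by " (4 characters), then parse what is left.
date_text = descr_text[:-4]
post_date = datetime.strptime(date_text, '%B %d, %Y %H:%M')

print(post_date)  # prints 2010-03-17 22:16:00
```

Note that %B matches the full month name in the current locale, so this assumes an English locale.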


I wanted to put the contents into the object database tonight, but I have pickled it to be revisited later.
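
Pickling for later can be as small as a dump and a load. This is a minimal sketch, assuming a cut-down Post with only a title and a date and a hypothetical file name, posts.pickle, written to a temporary directory. If content is still a BeautifulSoup object, it is probably safer to convert it to a string before pickling.

```python
import os
import pickle
import tempfile


class Post(object):
    """Cut-down stand-in for the scraped Post (assumption for this sketch)."""
    def __init__(self, title, date):
        self.title = title
        self.date = date


def save_posts(posts, path):
    # Binary mode is required for pickle files.
    with open(path, 'wb') as handle:
        pickle.dump(posts, handle, protocol=2)


def load_posts(path):
    with open(path, 'rb') as handle:
        return pickle.load(handle)


# Hypothetical file name; a temp directory keeps the sketch side-effect free.
path = os.path.join(tempfile.mkdtemp(), 'posts.pickle')
save_posts([Post("OMG. It's a JavaScript Rhino", '2010-03-17 22:16')], path)
restored = load_posts(path)
```

The class has to be importable under the same name when the file is loaded again, which is worth remembering before renaming anything.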


OMG. It's a JavaScript Rhino

March 17, 2010 22:16 by tarn

JavaScript is a slightly flawed language but it's got elegant parts too. All languages do to some degree; JavaScript just seems to have both in extremes. Whatever you think of it, history has made it the language for scripting the client-side web. It has become a mainstream language that shows no sign of falling off.

The excellent book JavaScript: The Good Parts by Douglas Crockford, working with the jQuery library and learning a little Lisp have led me to really embrace JavaScript.

It's no secret that I like programming with interactive consoles, so I wanted to find out if there was an interactive console for JavaScript. A language that only lives inside a web browser environment didn't seem right to me.

Rhino is a JavaScript implementation on the JVM. It has a compiler, a debugger and an interactive console.

To get started you obviously need a version of the JVM. That's not too difficult. On Windows I just downloaded a Sun Java 6 installer. On my Ubuntu install I installed openjdk but found Rhino didn't work. So I installed sun-java6, which worked.

sudo apt-get install sun-java6-bin sun-java6-jre sun-java6-jdk

You can find what version you've installed by running

$ java -version
java version "1.6.0_15"
Java(TM) SE Runtime Environment (build 1.6.0_15-b03)
Java HotSpot(TM) Client VM (build 14.1-b02, mixed mode, sharing)

Excellent. The Rhino binaries include js.jar, which is needed for the console. Now you should be able to run the jar

$ java -jar js.jar


When all goes well this will take you into the Rhino shell

Rhino 1.7 release 2 2009 03 22
js>

We can start playing with the language.

js> get_counter = function() { var counter = 0; return function() { print(counter); counter++; } };
..
js> counter1 = get_counter();
.. 
js> counter1();
0
js> counter1();
1
js> counter2 = get_counter();
..
js> counter2()
0
js> counter1()
2


Which is cool. Then there is some of the weirdness

js> '5' + 3
53
js> '5' - 2
3

And some interesting features

js> parseInt('06')
6
js> parseInt('08')
NaN
js> parseInt('10')
10
js> parseInt('010')
8
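
These results look like what you get when a leading zero makes the string be read as octal (base 8) because no radix was given. That is an interpretation of the session above, not something it states. The same arithmetic can be sketched in Python, where the base has to be spelled out; parse_octal is a hypothetical helper that returns None where the Rhino session shows NaN.

```python
def parse_octal(text):
    """Read text as base 8, returning None when a digit is out of range."""
    try:
        return int(text, 8)
    except ValueError:
        # '8' and '9' are not valid octal digits.
        return None


results = {text: parse_octal(text) for text in ('06', '08', '010')}
```

Here '06' gives 6, '010' gives 8 and '08' gives None. In JavaScript, passing the radix explicitly, as in parseInt('08', 10), avoids the guesswork.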


I'm looking forward to learning more about writing code in JavaScript.



Devevenings Presentation - IOC/Unit Testing/Mocking in ASP.NET MVC

March 12, 2010 13:53 by tarn

Here are the slides and the mysterious code that was never shown from my DevEvening presentation.

Apologies it's taken a while to get them up. I was hoping to write a bit of a post about what I covered and some of the discussion that came up. That never happened.

Devevening_Presentation.pptx (565.68 kb)

Guestbook.zip (3.42 mb)

I think the presentation went well. The group was really good and we got some good discussions happening before being interrupted by delicious paramas of the world.

I'm looking forward to the next meeting, except it appears I've signed up to represent NoSQL (about which I currently know very little) in an ORM smackdown.

That's what happens when you have meetings at a pub. Anyway, it should be fun and I'm looking forward to learning enough about NoSQL to adequately represent it in the smackdown.

See ya there.