#.think.in
learn.create.enjoy

Map-Reduce on Mongo

May 12, 2010 19:00 by tarn

I'm doing a presentation on non-relational databases at DDD Melbourne this weekend, where I am going to demonstrate a map-reduce example with MongoDB and server-side JavaScript. I've been interested in both independently recently, and it's been fun getting them working together, with some JavaScript TDD to boot.

I needed a good example to demonstrate map-reduce, and decided that finding word occurrences across a series of documents was a simple enough scenario that is well suited to a map-reduce query.

Below is an example of how we might solve this in plain C#:

using System;
using System.Collections.Generic;

using System.Linq;
using System.Text;

class wordCounts {

    static void Main(string[] args) {

        // Setup some data

        List<string> lines = new List<string>() 
            { 
              "Peter Piper picked a peck of pickled peppers", 
              "A peck of pickled peppers Peter Piper picked",
              "If Peter Piper picked a peck of pickled peppers",
              "Where's the peck of pickled peppers Peter Piper picked?"

            };

        // select all words, group, count
        var wordCounts = lines.SelectMany(m => m.Split())
                              .GroupBy(m => m.ToLower())
                              .Select(m => new KeyValuePair<string, int>(m.Key, m.Count()));

        // Print out the results

        foreach (var wordCount in wordCounts) 
        {
            Console.WriteLine(string.Format("{0} {1}", wordCount.Value, wordCount.Key));     
        }
    }
}

This prints each distinct word across all the lines, along with the number of times it occurs. For this example, the collection of strings stands in for a collection of documents in MongoDB.

SelectMany flattens the lists of words from each line into a single list of words, and GroupBy provides a key for each word. Together this is very similar to what the map function in a map-reduce query does.

The Select function is similar to the reduce function, but as we will see some additional considerations need to be made to allow it to be distributed.

I saw a good diagram Ayende published on his blog, but I didn't understand why he had multiple instances of the same document being mapped.

I created my own low key diagram to help demonstrate how a functional map-reduce could be distributed. The diagram shows the initial items can be split in half and reduced completely independently. This is interesting as it means our query can be distributed, but it also means we have to handle reducing a little differently.

It's also worth noting that this example shows a balanced tree, but it could be unbalanced and even introduce some redundancy.
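To make the split-and-reduce idea concrete, here is a rough sketch in plain Python (no MongoDB involved). The names word_map, word_reduce, map_reduce and merge are my own for illustration. It maps and reduces each half of the lines independently, then reduces the two partial results again, and should arrive at the same counts as doing it all in one go. Note it mirrors the \w+ matching of the JavaScript map function rather than the C# Split, so "Where's" becomes two words and case is not folded.

```python
import re
from collections import defaultdict

def word_map(text):
    """Yield (word, {"count": 1}) pairs, mirroring the emit calls of a map step."""
    for word in re.findall(r"\w+", text):
        yield word, {"count": 1}

def word_reduce(key, values):
    """Sum partial counts; the output has the same shape as each input value."""
    return {"count": sum(v["count"] for v in values)}

def map_reduce(lines):
    """Map every line, group the emitted values by key, then reduce each group."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in word_map(line):
            groups[key].append(value)
    return {key: word_reduce(key, values) for key, values in groups.items()}

def merge(left, right):
    """Re-reduce two partial results, as a parent node in the tree would."""
    merged = {}
    for key in set(left) | set(right):
        partials = [part[key] for part in (left, right) if key in part]
        merged[key] = word_reduce(key, partials)
    return merged

lines = [
    "Peter Piper picked a peck of pickled peppers",
    "A peck of pickled peppers Peter Piper picked",
    "If Peter Piper picked a peck of pickled peppers",
    "Where's the peck of pickled peppers Peter Piper picked?",
]

whole = map_reduce(lines)                                      # one big reduce
halves = merge(map_reduce(lines[:2]), map_reduce(lines[2:]))   # two halves, then re-reduce
```

The second path only works because the reduce output can be fed back in as reduce input, which is the extra consideration the distributed case adds.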

MongoDB allows clients to send JavaScript map and reduce functions that will get eval'd and run on the server. Here is the map function.

function wordMap() {

    // try find words in document text
    var words = this.text.match(/\w+/g);

    if (words === null) { 
        return;
    }

    // loop every word in the document 

    for (var i = 0; i < words.length; i++) {
        // emit every word, with count of one

        emit(words[i], { count : 1 });
    }

}

The often misunderstood JavaScript "this" will be the context from which the function is called. Mongo will call the function for each document in the collection we are querying, and we can call it from a test context. Unlike SelectMany, the map function doesn't return a list; instead it calls an emit function, which it expects to be defined.

We can write unit tests for this function by calling it from a mock test context with a mock emit function (using JavaScript as our mocking framework, wow).

eval(loadFile("src/js/wordMap.js"));


var emit;
var results;
var context;

testCases(test,

    function setUp() {
        emit = function (key, value) { 
            results.push({ key : key, value : value });
        };
        context = { text : "", map : wordMap };
        results = []; 
    },

    function empty_string_emits_nothing() {
        context.text = "";
        context.map();
        assert.that(results.length, eq(0));
    },

    function single_word_emits_single_word() {
        context.text = "findme";
        context.map();
        assert.that(results.length, eq(1));
        assert.that(results[0].key, eq("findme"));
        assert.that(results[0].value.count, eq(1));
    },

    function two_different_words_emits_twice() {
        context.text = "for bar";
        context.map();
        assert.that(results.length, eq(2));
    },

    function two_same_words_emits_twice() {
        context.text = "test test";
        context.map();
        assert.that(results.length, eq(2));
    },

    function tearDown() {
    }
);


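The reduce function must reduce a list of values of a chosen type to a single value of that same type. It must also be associative and commutative (what the tests below loosely call transitive), so that it doesn't matter how the mapped items are grouped or whether partial results are reduced again.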

function wordReduce(key, values) {
    var total = 0;
    for (var i = 0; i < values.length; i++) {
        total += values[i].count;
    }
    return { count : total };
}


Similarly, we can test that this function does exactly what we expect it to.

eval(loadFile("src/js/wordReduce.js"));

testCases(test,

    function reduce_one_items_returns_count_of_one() {
        var result = wordReduce("test", [{ count : 1 }]);
        assert.that(result.count, eq(1));
    },

    function reduce_multiple_items_returns_item_count() {
        var result = wordReduce("test", [{ count : 1 }, { count : 1 }, { count : 1 }]);
        assert.that(result.count, eq(3));
    },

    function reduce_sums_counts() {
        var result = wordReduce("test", [{ count : 2 }, { count : 3 }]);
        assert.that(result.count, eq(5));
    },

    function reduce_is_transitive() {
        var result = wordReduce("test", [{ count : 1 }].concat(
                        wordReduce("test", [{ count : 1 }, { count : 1 }])
                     ));
        assert.that(result.count, eq(3));
    }
);


I'm using Rhino to run the JavaScript, so I used RhinoUnit as the test runner: it also runs on the JVM and is invoked as an Ant scriptdef task. The setup was pretty painless. Here are the relevant sections of the Ant script.

<scriptdef name="rhinounit"

           src="lib/rhinoUnitAnt.js"
           language="javascript">
    <attribute name="options"/>
    <attribute name="ignoredglobalvars"/>

    <attribute name="haltOnFirstFailure"/>
    <attribute name="rhinoUnitUtilPath"/>
    <element name="fileset" type="fileset"/>

</scriptdef>

<target name="javascript-tests">
    <rhinounit options="{verbose:true, stackTrace:true}" 
               haltOnFirstFailure="false" 
               rhinoUnitUtilPath="lib/rhinoUnitUtil.js">

        <fileset dir="test">
            <include name="*.js"/>
        </fileset>
    </rhinounit>

</target>


Here is the word count example recreated in Mongo, using a Python client and passing the map/reduce functions to the server.

from pymongo import Connection
from pymongo.code import Code


# open connection and connect to 'ddd' database
connection = Connection()
db = connection.ddd

# remove any existing data
db.drop_collection("messages")


# insert some data
lines = open("data/peter_piper.txt").readlines()

for line in lines:
    db.messages.insert( { "text" : line } )


# load map and reduce functions
map = Code(open("src/js/wordMap.js","r").read())
reduce = Code(open("src/js/wordReduce.js","r").read())


# run the map-reduce query
result = db.messages.map_reduce(map, reduce)

# print the results    
for doc in result.find():
    print doc["value"]["count"],doc["_id"]


And it worked! I'd like to run the query on a larger result-set, but there isn't much point on this tiny low-spec'd netbook.


DevEvening NoSql/MongoDB Presentation

April 7, 2010 21:12 by tarn

Here are the slides and demo I'll show in my NoSQL presentation for tomorrow's DevEvening Melbourne ORM Smackdown. I hope to take the time to write a more considered post of my findings and opinions, as I've found it very interesting.

Here is a link to the slides.

First I'll use some Python and the PyMongo module to connect to MongoDB, list databases, insert documents and get them out again.

Python 2.6.4 (r264:75706, Dec  7 2009, 18:45:15) 
[GCC 4.4.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from pymongo import Connection
>>> connection = Connection()
>>> connection.database_names()
[u'files', u'working', u'demo', u'downloads', u'posts', u'local', u'admin']
>>> db = connection.demo
>>> from datetime import datetime
>>> db.messages.insert( { 'author' : 'tarn', 'date': datetime.now(), 'message' : 'Hello Mongo' } )
ObjectId('4bbc4beec73d721445000003')
>>> db.messages.find_one()
{u'date': datetime.datetime(2010, 4, 7, 19, 10, 6, 355000), u'message': u'Hello Mongo', 
u'_id': ObjectId('4bbc4beec73d721445000003'), u'author': u'tarn'}

Then have a look at some content I have in the database

>>> db = connection.working
>>> db.posts.count()
106
>>> for post in db.posts.find()[:5]:
...     print post["date"],post["title"],"by",post["author"]
... 
2010-01-25 18:06:00 Python Silverlight/Moonlight 2 Xapping by tarn
2010-03-12 13:53:00 Devevenings Presentation - IOC/Unit Testing/Mocking in ASP.NET MVC by tarn
2010-02-17 19:25:00 Revisiting Modal Binding an Interface, now with DictionaryAdapterFactory by tarn
2009-12-02 18:35:00 Creating Silverlight apps in the browser by tarn
2009-10-02 23:08:00 #.think.in infoDose #43 (11th September - 22nd September) by brodie

Next is file storage using the GridFS class from the gridfs module. I list some files and then write one out to the file system.

>>> from gridfs import GridFS
>>> fs = GridFS(connection.files)
>>> len(fs.list())
116
>>> for file in fs.list()[:5]:
...     print file
... 
post/debugging-ironpython-with-my-excalibur/image.png
post/debugging-ironpython-with-my-excalibur/image_thumb.png
post/devevenings-presentation---iocunit-testingmocking-in-asp.net-mvc/20102f32fdevevening_presentation.pptx
post/devevenings-presentation---iocunit-testingmocking-in-asp.net-mvc/20102f32fguestbook.zip
post/think.in-infodose-40-5th-august---16th-august/image.png
>>>
>>> with open('image.png','wb') as out_file:
...     with fs.open('post/debugging-ironpython-with-my-excalibur/image.png') as in_file:
...             out_file.write(in_file.read())
...

Now to some C# (Mono) and a controller class for a basic web application to view the data. It serves files found in the database, but always sends back the MIME type "image/png", which is only correct for PNG images. Lazy.

public class HomeController : Controller
{
    BlogRepository _blogRepository;

    public HomeController()
    {
        _blogRepository = new BlogRepository(); 
    }

    public ActionResult Index ()
    {
        ViewData["posts"] = _blogRepository.GetPosts();
        return View ();
    }

    public ActionResult Entry(string id)
    {
        ViewData["post"] = _blogRepository.GetById(id);
        return View ();
    }

    public ActionResult Resource(string slug, string fileName)
    {
        return new FileStreamResult(_blogRepository.GetFile( "post/" + slug + "/" + fileName ), "image/png");
    }

}
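A less lazy version would pick the content type from the file name. The controller above is C#, but as a rough sketch of the idea in Python (the language I use for the rest of the demo), the standard mimetypes module can guess a type from the extension. The helper name content_type_for and the fallback to "application/octet-stream" are my own assumptions, not something the demo application does.

```python
import mimetypes

def content_type_for(file_name, default="application/octet-stream"):
    """Guess a MIME type from the file extension, falling back to a generic binary type."""
    guessed, _encoding = mimetypes.guess_type(file_name)
    return guessed or default

# Example lookups using file names in the style of the ones stored in GridFS.
png_type = content_type_for("post/some-slug/image.png")
zip_type = content_type_for("post/some-slug/guestbook.zip")
unknown_type = content_type_for("post/some-slug/notes.zzzunknown")
```

The same lookup could happen once when a file is stored, with the result saved alongside it, so the controller would not need to guess on every request.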

This is a very basic repository that provides the data for the demo application. I did only enough with the C# provider to get it working, and tried to make sure I disconnect my connections.

public class BlogRepository
{
    Mongo _mongo;

    public BlogRepository()
    {
        string connstr = ConfigurationManager.AppSettings["connectionString"];
        _mongo = new Mongo(connstr);
    }

    public Stream GetFile(string name)
    {
        try
        {
            _mongo.Connect();
            var db = _mongo["files"];
            var fs = new GridFile(db);
            Stream data = fs.Open(name, FileMode.Open, FileAccess.Read);
            Stream output = new MemoryStream();
            CopyStream(data,output);
            output.Seek(0,SeekOrigin.Begin);
            return output;
        }
        finally
        {
            _mongo.Disconnect();    
        }
    }

    public List<Document> GetPosts()
    {
        try
        {
            _mongo.Connect(); 
            var db = _mongo["working"];
            var posts = db["posts"];
            using(ICursor all = posts.Find(new Document())){
                return all.Documents.ToList();
            }
        }
        finally
        {
            _mongo.Disconnect();
        }
    }

    public Document GetById(string id)
    {   
        try
        {
            _mongo.Connect();
            var db = _mongo["working"];
            var posts = db["posts"];
            Document doc = posts.FindOne( new Document() {{ "_id" , new Oid(id) }} );
            return doc;
        }
        finally
        {
            _mongo.Disconnect();    
        }
    }

    public static void CopyStream(Stream input, Stream output)
    {
        byte[] buffer = new byte[32768];
        while (true)
        {
            int read = input.Read (buffer, 0, buffer.Length);
            if (read <= 0)
                return;
            output.Write (buffer, 0, read);
        }
    }
}

So that's where I got for my demo for the DevEvenings ORM Smackdown. No doubt I will continue looking into MongoDB and other object/document databases.


Scraping this blog

March 18, 2010 23:31 by tarn

I have created a monster and this post is about killing it off by scraping the contents of this blog into structured Python objects. Sometime later I will convert the HTML content to markdown and download the images and other resources locally.

I want to put the contents into a ZODB object database to get a feel for working with an object database. A greater goal is to migrate the content to a new blog engine. I don't want to go into why I felt I needed to scrape it or why I want to migrate to another blog engine, as it's depressing.

Moving on, I wanted to put the content into these classes:

class Post():
    title = ''

    content = ''
    date = ''
    tags = []
    comments = []


class Comment():
    content = ''
    author = ''
    date = ''

    website = ''

The scraping code is not elegant but was quite fun to write as I could write it all from an interactive console session. I found BeautifulSoup was fantastic in making HTML into something that was easy to work with, although I would have liked to have used jQuery/CSS style selectors.

from BeautifulSoup import BeautifulSoup

from datetime import datetime
import urllib2 
import re

def ParseComment(soup):
    comment = Comment()
    comment.author = soup.find('p',{"class":"author"}).first().string.strip()
    content = soup.find('p',{"class":"content"})
    if content:
        comment.content = content.prettify()
    website = soup.find('p',{"class":"author"}).first()    
    if website.has_key('href'):
        comment.website = soup.find('p',{"class":"author"}).first()['href']
    r = re.compile('\d*/\d*/\d* \d*.\d*')    
    date = r.findall(soup.find('p',{"class":"date"}).renderContents())[0]
    comment.date = datetime.strptime(date,'%d/%m/%Y %H:%M')
    return comment


def ParsePost(postSoup):
    post = Post()    
    post.title =  postSoup.find('a',{"class":re.compile('posthead.*')}).string
    print post.title
    post.content = postSoup.find('div', {"class":"entry"})
    date = postSoup.find('div',{"class":"descr"}).contents[0][:-4]
    post.date = datetime.strptime(date,'%B %d, %Y %H:%M')
    post.author = postSoup.find('div',{"class":"descr"}).first().string
    post.tags = map(lambda x: x.string, postSoup('a',{"rel":"tag"}))
    comments = postSoup.find('div',{"id":"commentlist"})('div')
    post.comments = [ParseComment(commentSoup) for commentSoup in comments]
    return post


def DownloadPost(url):
    postHtml = urllib2.urlopen('http://blog.sharpthinking.com.au/' + url).read()
    postSoup = BeautifulSoup(postHtml)
    return ParsePost(postSoup);


def GetPosts():
    page = urllib2.urlopen("http://blog.sharpthinking.com.au/archive.aspx")
    soup = BeautifulSoup(page)
    postUrls = map(lambda x: x['href'], soup('a', href=re.compile('/post/.*')))
    return [DownloadPost(url) for url in postUrls[:-10]]
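The date handling is the fiddly part of the scraper, so here it is pulled out as a small standalone sketch. The helper names and the sample strings are made up for illustration. I have also tightened the regular expression slightly (one or more digits, and a literal colon in the time), which is an assumption about what the comment dates on the page look like.

```python
import re
from datetime import datetime

# day/month/year hour:minute, e.g. "18/03/2010 23:31"
COMMENT_DATE = re.compile(r"\d+/\d+/\d+ \d+:\d+")

def parse_comment_date(text):
    """Find the first dd/mm/yyyy hh:mm date in the text, or return None."""
    match = COMMENT_DATE.search(text)
    if match is None:
        return None
    return datetime.strptime(match.group(0), "%d/%m/%Y %H:%M")

def parse_post_date(text):
    """Parse a post date such as 'March 18, 2010 23:31'."""
    return datetime.strptime(text.strip(), "%B %d, %Y %H:%M")

comment_date = parse_comment_date("Posted on 18/03/2010 23:31 #")
post_date = parse_post_date("March 18, 2010 23:31 ")
missing = parse_comment_date("no date here")
```

Returning None when nothing matches avoids the IndexError that indexing an empty findall result would raise on a comment without a date.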



I'm sure there is a better way, but this was better than any way I've used previously. Anyway, I've done a lot of work untangling the mess I created.

>>> posts = GetPosts()
>>> for post in posts[:5]:
...     print post.date, post.title
...
2010-03-17 22:16:00 OMG. It's a JavaScript Rhino
2010-03-12 13:53:00 Devevenings Presentation - IOC/Unit Testing/Mocking in ASP.NET MVC
2010-02-20 17:18:00 Revisiting Pygments in the browser with Silverlight, now with BackgroundWorker
2010-02-17 19:25:00 Revisiting Modal Binding an Interface, now with DictionaryAdapterFactory
2010-02-16 20:34:00 Modal Binding an Interface with DynamicProxy


I wanted to put the contents into the object database tonight, but I have pickled it to be revisited later.


Revisiting Pygments in the browser with Silverlight, now with BackgroundWorker

February 20, 2010 17:18 by tarn

A couple of weeks ago I blogged about using Pygments to do live syntax highlighting in the browser using Silverlight.

A major problem with the sample was that it did the pygmentizing on the UI thread which caused most browsers to become unresponsive. Today I wanted to fix that by using the BackgroundWorker to do the pygmentizing in a background thread.

Firstly I refactored the pygmentizing into a method that didn't interact with the UI.

def pygmentize_text(self, text, language):
    # attempt to pygmentize input with current language 
    try:

        from pygments import highlight
        from pygments.lexers import get_lexer_by_name
        from pygments.formatters import HtmlFormatter

        lexer = get_lexer_by_name( language, stripall=True)
        formatter = HtmlFormatter(linenos=False, cssclass="source")
        markup = highlight(text, lexer, formatter)

        return markup

    except:

        return "Error Generating Markup"



I then added a method that could be passed into a DoWorkEventHandler. It gets its arguments as a tuple from the event arguments and then sets the event argument result to the marked up HTML. The lack of explicit typing and the use of tuples is a good example of how some Python idioms can be used when working with the .NET framework.

def worker(self, sender, e):

    # do work off UI thread. 
    e.Result = self.pygmentize_text(e.Argument[0],e.Argument[1])


The required BackgroundWorker and DoWorkEventHandler can be simply imported from the System.ComponentModel namespace.

from System.ComponentModel import BackgroundWorker, DoWorkEventHandler


The BackgroundWorker can then be setup and started. Again it's syntactically nice how the tuple can be created and passed as a RunWorkerAsync parameter.

def start_pygmentize(self):

    # update application state
    self.input_changed = False        
    self.pygmentizing = True
    self.show_message("pygmentizing..")

    # get parameters
    input = self.input.GetProperty("value")
    language = self.language.value

    # setup background worker
    worker = BackgroundWorker()
    worker.DoWork += DoWorkEventHandler(self.worker)
    worker.RunWorkerCompleted += self.complete

    # start the worker
    worker.RunWorkerAsync( (input,language) )


The completed event handler is responsible for taking the markup generated by the BackgroundWorker and updating the DOM. It also fires off another worker if the source has changed since the last worker started.

def complete(self, sender, e):

    if e.Error:

        # handle errors/exceptions in worker
        self.source.SetProperty("innerHTML",e.Error.Message)

    else:

        # show the result
        self.source.SetProperty("innerHTML",e.Result)

    if self.input_changed:

        # input has changed, start pygmentize again
        self.start_pygmentize()

    else:

        # no work queued
        self.pygmentizing = False
        self.hide_message()
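The "run again if the input changed while I was busy" pattern isn't specific to Silverlight. Here is a rough sketch of the same idea in plain Python using the threading module rather than BackgroundWorker. The class name BackgroundRunner and its methods are hypothetical, and unlike RunWorkerCompleted the callback here runs on the worker thread, not the UI thread.

```python
import threading

class BackgroundRunner(object):
    """Runs work off the calling thread, keeping only the latest pending input."""

    def __init__(self, work, on_complete):
        self.work = work
        self.on_complete = on_complete
        self.lock = threading.Lock()
        self.running = False
        self.has_pending = False
        self.pending = None
        self.thread = None

    def submit(self, value):
        with self.lock:
            # remember the newest input; older unprocessed input is dropped
            self.pending = value
            self.has_pending = True
            if self.running:
                return  # the active worker will pick this up when it finishes
            self.running = True
        self.thread = threading.Thread(target=self._run)
        self.thread.start()

    def _run(self):
        while True:
            with self.lock:
                if not self.has_pending:
                    # no work queued
                    self.running = False
                    return
                value = self.pending
                self.has_pending = False
            self.on_complete(self.work(value))


results = []
runner = BackgroundRunner(lambda text: text.upper(), results.append)
runner.submit("first")
runner.submit("second")
runner.thread.join()  # wait for the worker so the results can be inspected
```

Depending on timing the first input may be skipped entirely, which is fine for a live preview where only the latest text matters.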



The update has made the sample much more responsive, however it appears downloading the Silverlight application is still causing some browsers to become a little unresponsive, which is annoying. I will be interested to find out if this effect can be mitigated.

The actual pygmentizing could possibly be made a little faster by reusing the BackgroundWorker and only doing the Pygments imports once, but the responsiveness of the browser has improved the sample enormously.

Check out the updated demo here.



Scripting your Data Model

February 14, 2010 14:13 by tarn

I really like the way you can script your data model from a Python REPL console in the Django and Google App Engine web frameworks.

For me it is a hands down better way of working with data in your domain model than writing SQL.

I've since wanted to do it in .NET projects I work on, but it wasn't till I was playing with Castle ActiveRecord yesterday that I decided I'd try it out.

The ActiveRecord pattern is an intuitive way of programming with data persistence, so it's also nice to script with.

It turned out to be really easy in the project I was playing with. I didn't need to write a single additional line of C# as I already had the configuration decoupled for integration testing with an in-memory SQLite database.

namespace SimpleBlog.Data
{
    public class Configuration : IBootstrapperTask
    {
        public void Execute()
        {
            Configure(GetDefaultSettings());
        }

        public static void Configure(IDictionary<string, string> properties)
        {
            InPlaceConfigurationSource source = new InPlaceConfigurationSource();
            source.Add(typeof(ActiveRecordBase), properties);
            ActiveRecordStarter.Initialize(source);
            ActiveRecordStarter.RegisterAssemblies(Assembly.GetExecutingAssembly());
        }

        ...
    }
}


I then just wrote this little script to help with the configuration. It uses the static method above and passes in the properties for working with a development database.

import clr
clr.AddReferenceToFile("SimpleBlog.Data.dll")

from System.Collections.Generic import Dictionary
from SimpleBlog.Data import Configuration

# NHibernate settings

properties = Dictionary[str,str]()

properties.Add("connection.driver_class",
               "NHibernate.Driver.SqlClientDriver");

properties.Add("dialect",
           "NHibernate.Dialect.MsSql2005Dialect");

properties.Add("connection.provider",
           "NHibernate.Connection.DriverConnectionProvider");

properties.Add("connection.connection_string",
           "Data Source=[CONNECTION_STRING]"); # Add

properties.Add("proxyfactory.factory_class",
           "NHibernate.ByteCode.Castle.ProxyFactoryFactory, NHibernate.ByteCode.Castle");

Configuration.Configure(properties)


Using the helper script it's pretty easy to get in and start working with the data in the data model.

IronPython 2.6 (2.6.10920.0) on .NET 2.0.50727.4927
Type "help", "copyright", "credits" or "license" for more information.
>>> import ActiveRecord
>>> from SimpleBlog.Data.Models import *
>>>
>>> post = Post()
>>> post.Title = "Working with SQL sucks!"
>>> post.Content = "Try using a scripting language instead. It rocks!"
>>> post.Author = "tarn"
>>> post.Save()
>>>
>>> post.Id
2
>>>
>>> posts = Post().FindAll()
>>>    
>>> for p in posts:
...     print p.Title, "by", p.Author
...
Hey, It's alive by tarn
Working with SQL sucks! by tarn
>>>


This is a simple example of a database agnostic data script using your domain model and a powerful scripting language. I think scripting data models like this could add a lot of value in many .NET development scenarios.


Pygments in the browser with Silverlight

January 31, 2010 23:40 by tarn

17-02-2010 I've updated the demo to use the BackgroundWorker and posted about the update

I decided it might be fun to try to get Python Markdown and Pygments running in the browser, to enhance a markdown preview experience by eliminating the server-side round trips and providing a more responsive preview.

I managed to get it working entirely in Python, but I found the application size excessively large (almost 3 MB) and, more annoyingly, it seems to block the entire browser when it initially loads the pygments module. I think there is some sort of Silverlight background thread I should be using.

I think it would work better with MarkdownSharp, as pure C# Silverlight applications are a fair bit leaner in size and probably run a little quicker than dynamic language applications. But this was for fun and I prefer coding in Python when not working.

17-02-2010 I've since found that MarkdownSharp doesn't do syntax highlighting

I ended up having to write so little Python code to get this working that I can include it all here, syntax highlighted with Pygments of course.

from System.Windows import Application
from System.Windows.Controls import UserControl
from System.Windows.Browser import HtmlPage
from System import EventHandler

class App:

    def __init__(self):

        # load relevant HTML DOM elements
        self.input = HtmlPage.Document.GetElementById("input")
        self.source = HtmlPage.Document.GetElementById("output")
        self.language = HtmlPage.Document.GetElementById("lang")

        # fire javascript function to indicate the application has been loaded
        HtmlPage.Window.CreateInstance("silverlight_loaded");

        # pygmentize initial 
        self.pygmentize()

        # register events
        self.input.AttachEvent('onkeyup', EventHandler( self.update_handler )) 
        self.language.AttachEvent('onchange', EventHandler( self.update_handler ))

        # fire javascript function to indicate that pygments has been loaded
        HtmlPage.Window.CreateInstance("pygments_loaded");

    # handle language or input changes by pygmentizing
    def update_handler(self, sender, e):

        self.pygmentize()

    def pygmentize(self):
        input = self.input.GetProperty("value")

        # attempt to pygmentize input with current language 
        try:

            from pygments import highlight
            from pygments.lexers import get_lexer_by_name
            from pygments.formatters import HtmlFormatter

            lexer = get_lexer_by_name(self.language.value, stripall=True)
            formatter = HtmlFormatter(linenos=False, cssclass="source")
            markup = highlight(input, lexer, formatter)

            # update the preview
            self.source.SetProperty("innerHTML",markup)

        except:

            # indicate there was an error in pygmentize
            self.source.SetProperty("innerHTML", "Error Generating Markup" )

# Do it!    
App()


Despite the fact that there isn't much code, the development experience of writing Silverlight applications in Python is a bit of a pain in the arse. Granted it's much better with the Python console in the browser and better error reporting in the most recent SDK, but it still sucks; debugging and logging support is very limited, and on some errors the application dies without reporting anything.

Another difficulty is having to manually copy all the python standard library modules required by the module from the library folders into the application (which explains something about the bloated application size). And even though the code in the demo works, some very similar code from the pygments quick start doesn't.

Check out the live demo.

