#.think.in
learn.create.enjoy

Map-Reduce on Mongo

May 12, 2010 19:00 by tarn

I'm doing a presentation on non-relational databases at DDD Melbourne this weekend, where I am going to demonstrate a map-reduce example with MongoDB and server-side Javascript. I've been interested in both independently recently, and it's been fun getting them working together with some Javascript TDD to boot.

I needed a good example to demonstrate map-reduce and decided that finding word occurrences across a series of documents was a simple enough scenario, and one well suited to a map-reduce query.

Below is an example of how we might solve this in plain C#:

using System;
using System.Collections.Generic;

using System.Linq;
using System.Text;

class wordCounts {

    static void Main(string[] args) {

        // Setup some data

        List<string> lines = new List<string>() 
            { 
              "Peter Piper picked a peck of pickled peppers", 
              "A peck of pickled peppers Peter Piper picked",
              "If Peter Piper picked a peck of pickled peppers",
              "Where's the peck of pickled peppers Peter Piper picked?"

            };

        // select all words, group, count
        var wordCounts = lines.SelectMany(m => m.Split())
                              .GroupBy(m => m.ToLower())
                              .Select(m => new KeyValuePair<string, int>(m.Key, m.Count()));

        // Print out the results

        foreach (var wordCount in wordCounts) 
        {
            Console.WriteLine(string.Format("{0} {1}", wordCount.Value, wordCount.Key));     
        }
    }
}

This prints each distinct word across all the lines and the number of times it occurs. For this example, the collection of strings stands in for a collection of documents in MongoDB.

The SelectMany flattens the lists of words from each line into a single list of words, and the GroupBy provides a key for each word. This is very similar to what the map function in the map-reduce query does.

The Select function is similar to the reduce function, but as we will see some additional considerations need to be made to allow it to be distributed.
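
For readers more at home in Javascript than C#, here is a rough sketch of the same flatten, group and count steps. The countWords helper is a hypothetical name used only for illustration; it is not part of the presentation code.

```javascript
// Hypothetical helper mirroring the C# query: flatten, group, count.
function countWords(lines) {
    var counts = Object.create(null); // no inherited keys to clash with words
    for (var i = 0; i < lines.length; i++) {
        // flatten: split each line on whitespace (like SelectMany + Split)
        var words = lines[i].split(/\s+/);
        for (var j = 0; j < words.length; j++) {
            if (words[j] === "") { continue; }
            // group: the lower-cased word is the key (like GroupBy)
            var key = words[j].toLowerCase();
            // count: one more occurrence for this key (like Select + Count)
            counts[key] = (counts[key] || 0) + 1;
        }
    }
    return counts;
}

var counts = countWords([
    "Peter Piper picked a peck of pickled peppers",
    "A peck of pickled peppers Peter Piper picked"
]);
```

Here counts maps each lower-cased word to the number of times it appears across both lines.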

I saw a good diagram Ayende published on his blog, but I didn't understand why he had multiple instances of the same document being mapped.

I created my own low key diagram to help demonstrate how a functional map-reduce could be distributed. The diagram shows the initial items can be split in half and reduced completely independently. This is interesting as it means our query can be distributed, but it also means we have to handle reducing a little differently.

It's also worth noting that this example shows a balanced tree, but it could be unbalanced and even introduce some redundancy.
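
The idea that the items can be split and reduced independently can be sketched in a few lines. The sumCounts function here is a hypothetical stand-in for a reduce function, and the data is made up for illustration.

```javascript
// Hypothetical reducer: sums { count : n } values, same shape in and out.
function sumCounts(values) {
    var total = 0;
    for (var i = 0; i < values.length; i++) {
        total += values[i].count;
    }
    return { count : total };
}

// Four mapped items for one key, as if emitted from four documents.
var mapped = [{ count : 1 }, { count : 1 }, { count : 1 }, { count : 1 }];

// Reduce everything in one pass...
var allAtOnce = sumCounts(mapped);

// ...or split in half, reduce each half independently,
// then reduce the two partial results.
var left = sumCounts(mapped.slice(0, 2));
var right = sumCounts(mapped.slice(2));
var inTwoSteps = sumCounts([left, right]);
```

Both routes give a count of 4, but only because the reducer returns the same shape it accepts, so partial results can be fed back in.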

MongoDB allows clients to send JavaScript map and reduce functions that will get eval'd and run on the server. Here is the map function.

function wordMap() {

    // try find words in document text
    var words = this.text.match(/\w+/g);

    if (words === null) { 
        return;
    }

    // loop every word in the document 

    for (var i = 0; i < words.length; i++) {
        // emit every word, with count of one

        emit(words[i], { count : 1 });
    }

}

The often misunderstood Javascript "this" will be the context from which the function is called. Mongo will call the function for each document in the collection we are querying, and we can call it from a test context. Unlike the SelectMany, the map function doesn't return a list; instead it calls an emit function which it expects to be defined.

We can write unit tests for this function by calling the function from a test mock context, calling a mock emit function (using Javascript as our mocking framework, wow).

eval(loadFile("src/js/wordMap.js"));


var emit;
var results;
var context;

testCases(test,

    function setUp() {
        emit = function (key, value) { 
            results.push({ key : key, value : value });
        };
        context = { text : "", map : wordMap };
        results = []; 
    },

    function empty_string_emits_nothing() {
        context.text = "";
        context.map();
        assert.that(results.length, eq(0));
    },

    function single_word_emits_single_word() {
        context.text = "findme";
        context.map();
        assert.that(results.length, eq(1));
        assert.that(results[0].key, eq("findme"));
        assert.that(results[0].value.count, eq(1));
    },

    function two_different_words_emits_twice() {
        context.text = "for bar";
        context.map();
        assert.that(results.length, eq(2));
    },

    function two_same_words_emits_twice() {
        context.text = "test test";
        context.map();
        assert.that(results.length, eq(2));
    },

    function tearDown() {
    }
);


The reduce function must reduce a list of a chosen type to a single value of that same type. It must also be associative (the tests below call this "transitive"), so it doesn't matter how the mapped items are grouped or whether some of them have already been reduced.

function wordReduce(key, values) {
    var total = 0;
    for (var i = 0; i < values.length; i++) {
        total += values[i].count;
    }
    return { count : total };
}


Similarly we can test this method does exactly what we expect it to.

eval(loadFile("src/js/wordReduce.js"));

testCases(test,

    function reduce_one_items_returns_count_of_one() {
        var result = wordReduce("test", [{ count : 1 }]);
        assert.that(result.count, eq(1));
    },

    function reduce_multiple_items_returns_item_count() {
        var result = wordReduce("test", [{ count : 1 }, { count : 1 }, { count : 1 }]);
        assert.that(result.count, eq(3));
    },

    function reduce_sums_counts() {
        var result = wordReduce("test", [{ count : 2 }, { count : 3 }]);
        assert.that(result.count, eq(5));
    },

    function reduce_is_transitive() {
        // reduce two items first, then reduce that partial result with a third
        var result = wordReduce("test", [{ count : 1 }].concat(
                        wordReduce("test", [{ count : 1 }, { count : 1 }])
                     ));
        assert.that(result.count, eq(3));
    }
);
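
To see how the two functions fit together without a server, here is a minimal in-memory sketch. The runLocally helper is hypothetical and only approximates what Mongo does: a real server may call reduce several times for one key with partial results, and may skip it entirely for keys with a single value. The two functions are repeated so the sketch stands alone.

```javascript
// Copies of the post's functions so the sketch is self-contained.
function wordMap() {
    var words = this.text.match(/\w+/g);
    if (words === null) { return; }
    for (var i = 0; i < words.length; i++) {
        emit(words[i], { count : 1 });
    }
}

function wordReduce(key, values) {
    var total = 0;
    for (var i = 0; i < values.length; i++) {
        total += values[i].count;
    }
    return { count : total };
}

var emit; // assigned by runLocally before mapping starts

// Hypothetical stand-in for the server:
// map every document, group the emitted values by key, reduce each group.
function runLocally(docs, map, reduce) {
    var groups = Object.create(null);
    emit = function (key, value) {
        if (!groups[key]) { groups[key] = []; }
        groups[key].push(value);
    };
    for (var i = 0; i < docs.length; i++) {
        map.call(docs[i]); // "this" inside map is the document
    }
    var results = Object.create(null);
    for (var key in groups) {
        results[key] = reduce(key, groups[key]);
    }
    return results;
}

var results = runLocally(
    [{ text : "Peter Piper picked" }, { text : "Piper picked a peck" }],
    wordMap, wordReduce);
```

Note that the keys are case sensitive here, because wordMap emits the words exactly as they appear in the text.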


I'm using Rhino to run the Javascript, so I used RhinoUnit as a test runner: it also runs on the JVM and is invoked as an Ant scriptdef task. The setup was pretty painless. Here are the relevant Ant script sections:

<scriptdef name="rhinounit"

           src="lib/rhinoUnitAnt.js"
           language="javascript">
    <attribute name="options"/>
    <attribute name="ignoredglobalvars"/>

    <attribute name="haltOnFirstFailure"/>
    <attribute name="rhinoUnitUtilPath"/>
    <element name="fileset" type="fileset"/>

</scriptdef>

<target name="javascript-tests">
    <rhinounit options="{verbose:true, stackTrace:true}" 
               haltOnFirstFailure="false" 
               rhinoUnitUtilPath="lib/rhinoUnitUtil.js">

        <fileset dir="test">
            <include name="*.js"/>
        </fileset>
    </rhinounit>

</target>


Here is the word count example recreated in Mongo, using a Python client and passing the map/reduce functions to the server.

from pymongo import Connection
from pymongo.code import Code


# open connection and connect to 'ddd' database
connection = Connection()
db = connection.ddd

# remove any existing data
db.drop_collection("messages")


# insert some data
lines = open("data/peter_piper.txt").readlines()

for line in lines:
    db.messages.insert( { "text" : line } )


# load map and reduce functions
map = Code(open("src/js/wordMap.js","r").read())
reduce = Code(open("src/js/wordReduce.js","r").read())


# run the map-reduce query
result = db.messages.map_reduce(map, reduce)

# print the results    
for doc in result.find():
    print doc["value"]["count"],doc["_id"]


And it worked! I'd like to run the query on a larger data set, but there isn't much point on this tiny low-spec'd netbook.


OMG. It's a JavaScript Rhino

March 17, 2010 22:16 by tarn

JavaScript is a slightly flawed language but it's got elegant parts too. All languages do to some degree; JavaScript just seems to have both in extremes. Whatever you think of it, history has made it the language for scripting the client-side web. It has become a mainstream language that shows no sign of falling off.

The excellent book JavaScript: The Good Parts by Douglas Crockford, working with the jQuery library and learning a little Lisp have led me to really embrace JavaScript.

It's no secret that I like programming with interactive consoles, and I decided I wanted to find out if there was an interactive console for JavaScript. A language that only lives inside a web browser didn't seem right to me.

Rhino is a JavaScript implementation on the JVM. It has a compiler, a debugger and an interactive console.

To get started you obviously need a version of the JVM. That's not too difficult. On Windows I just downloaded a Sun Java 6 installer. On my Ubuntu install I installed openjdk but found Rhino didn't work. So I installed sun-java6, which worked.

sudo apt-get install sun-java6-bin sun-java6-jre sun-java6-jdk

You can find what version you've installed by running

$ java -version
java version "1.6.0_15"
Java(TM) SE Runtime Environment (build 1.6.0_15-b03)
Java HotSpot(TM) Client VM (build 14.1-b02, mixed mode, sharing)

Excellent. The Rhino binaries include js.jar, which is needed for the console. Now we should be able to run the jar:

$ java -jar js.jar


When all goes well this will take you into the Rhino shell

Rhino 1.7 release 2 2009 03 22
js>

We can start playing with the language.

js> get_counter = function() { var counter = 0; return function() { print(counter); counter++; } };
..
js> counter1 = get_counter();
.. 
js> counter1();
0
js> counter1();
1
js> counter2 = get_counter();
..
js> counter2()
0
js> counter1()
2


Which is cool. There is also some of the weirdness:

js> '5' + 3
53
js> '5' - 2
3

And some interesting features

js> parseInt('06')
6
js> parseInt('08')
NaN
js> parseInt('10')
10
js> parseInt('010')
8


I'm looking forward to learning more about writing code in JavaScript.



AdRotator: Injecting scripts

June 15, 2008 18:55 by tarn

This article is part of a series of posts about various aspects of writing web controls for ASP.Net using an ad rotator as an example. The AdRotator WebControl Example post has links to related posts and downloads.

The AdRotator needs to insert a class definition script in the output once, even if multiple AdRotator controls are declared on the page. It also needs to insert a script that instantiates that class prototype for every AdRotator control on the page. Fortunately there are methods on the page's ClientScript property (a ClientScriptManager) that will do this for us.

Page.ClientScript.RegisterStartupScript(Type type, string key, string script)

The keys here are crucial: only one script is rendered for each key. If we inject the definition script with a static key, then no matter how many instances of the control try to register it, only one copy of the script will be sent to the output.

To insert a unique script for every instance of the AdRotator we need to use a unique key. I use the server-side control's ClientID property, which will be different for every instance of the control on any page. You could also use a GUID.
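
The keying behaviour can be modelled in a few lines of Javascript. This is only an illustrative model, not how ASP.Net implements it: the ScriptRegistry name, its register method and the control ids are all made up for the sketch.

```javascript
// Hypothetical model of keyed script registration (not ASP.Net's implementation).
function ScriptRegistry() {
    this.keys = Object.create(null);
    this.scripts = [];
}

// Registers a script once per (type, key) pair; later duplicates are ignored.
ScriptRegistry.prototype.register = function (type, key, script) {
    var fullKey = type + "|" + key;
    if (this.keys[fullKey]) { return false; }
    this.keys[fullKey] = true;
    this.scripts.push(script);
    return true;
};

var registry = new ScriptRegistry();

// Two controls on one page: a shared static key for the definition,
// and a unique key (the made-up client id) for each instance script.
registry.register("AdRotator", "definition", "function AdRotator() { /* ... */ }");
registry.register("AdRotator", "ctl00_rotator1", "new AdRotator('ctl00_rotator1');");
registry.register("AdRotator", "definition", "function AdRotator() { /* ... */ }");
registry.register("AdRotator", "ctl00_rotator2", "new AdRotator('ctl00_rotator2');");
```

Four registrations leave three scripts in the output: one definition and one instance script per control.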

This is clearly the weakest of all the posts about the AdRotator. I just kind of got this working and never really got very inspired to learn anything else about it. I posted it anyway as I felt I did need to cover this topic for completeness of the series. I'll try to update it soon.


AdRotator: Json serialization

June 15, 2008 18:46 by tarn

This article is part of a series of posts about various aspects of writing web controls for ASP.Net using an ad rotator as an example. The AdRotator WebControl Example post has links to related posts and downloads.

The AdRotator needs to pass some data to the client-side code, so I use a Json serializer. I use the same class we use on the server side. The Json serializer converts an instance of it, or in our case a generic list of it, to a string of Json we can emit in the output. Json is JavaScript Object Notation and literally describes an object in Javascript that we can use in our client-side Javascript code.

The object is very simple and only contains three strings and constructors. I have removed some of the attributes described in other articles as they are not relevant in this context.

public class ImageItem
{
    public ImageItem()
        : this(string.Empty, string.Empty, string.Empty)
    {
    }

    public ImageItem(string linkUrl, string imageUrl, string displayTime)
    {
        LinkUrl = linkUrl;
        ImageUrl = imageUrl;
        DisplayTime = displayTime;
    }

    public string LinkUrl { get; set; }
    public string ImageUrl { get; set; }
    public string DisplayTime { get; set; }
}

 

The AdRotator has a public member Images that is a List of the ImageItem type above.

public List<ImageItem> Images;

 

In the RenderContents method of the AdRotator we can serialize this list into a string of Json.

JavaScriptSerializer serializer = new JavaScriptSerializer(); 
string json = serializer.Serialize(Images); 
 

The Json will look something like this.

[{"ImageUrl":"Images/Winter.jpg","LinkUrl":"#","DisplayTime":"1000"},
{"ImageUrl":"Images/Sunset.jpg","LinkUrl":"#","DisplayTime":"4000"}]

The following script and corresponding output will hopefully demonstrate how we can use the Json on the client side as data.

var imageList = [{"ImageUrl":"Images/Winter.jpg","LinkUrl":"#","DisplayTime":"1000"},
                 {"ImageUrl":"Images/Sunset.jpg","LinkUrl":"#","DisplayTime":"4000"}];

for(var i=0; i<imageList.length; i++)
{
    // index with i so each image is written, not just the first one
    document.writeln(imageList[i].ImageUrl);
}

Output:

Images/Winter.jpg 
Images/Sunset.jpg
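
In the AdRotator the Json is written straight into the page as a literal, so the browser parses it as part of the script. If the same Json arrived as a string instead, it would need parsing first. A minimal sketch, assuming an environment that provides JSON.parse:

```javascript
// The same data as a Json string, as it might arrive from the server.
var json = '[{"ImageUrl":"Images/Winter.jpg","LinkUrl":"#","DisplayTime":"1000"},' +
           '{"ImageUrl":"Images/Sunset.jpg","LinkUrl":"#","DisplayTime":"4000"}]';

var images = JSON.parse(json);

// DisplayTime is serialized as a string, so convert it before doing arithmetic.
var totalTime = 0;
for (var i = 0; i < images.length; i++) {
    totalTime += parseInt(images[i].DisplayTime, 10);
}
```

The conversion matters because adding the raw DisplayTime strings together would concatenate them instead of summing them.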

 

Well that's all the information I'm going to include in this article. I'd like to go further into how the Json is used in the AdRotator but I've rambled on long enough. I think there is enough here to see that Json is a pretty cool format for communicating data between the server and client-side Javascript.

I should also note that the JavaScriptSerializer is marked as deprecated and points towards the DataContractJsonSerializer. I briefly tried to get this working, but I couldn't seem to find the class in my system assemblies. To read more about that, check out DataContractJsonSerializer in .NET 3.5, which discusses using the DataContractJsonSerializer.

I hope you've found the article interesting. If you have, you might be interested in reading more articles about the AdRotator example, or you might want to download and check out the example project.


AdRotator: Client side code

June 15, 2008 18:23 by tarn

This article is part of a series of posts about various aspects of writing web controls for ASP.Net using an ad rotator as an example. The AdRotator WebControl Example post has links to related posts and downloads.

The Prototype

I wanted the client-side code to be as object orientated as possible. I implemented the behaviour of our AdRotator as a prototyped Javascript object, equivalent to a class definition in other OO languages. This means every AdRotator on a page only needs to be an instance of the ad rotator prototype.

I create the anchor and image elements on the fly. I need the client-side object to know about these elements to change them when the image changes. I could have created them on the server side and passed their ids to find them in the DOM, or found them by looking through our container's child elements. In the end creating them on the client side was just slightly easier.

The ads parameter is an array of objects that have ImageUrl, LinkUrl and DisplayTime properties. I pass it in as Json (Javascript Object Notation) that I generated on the server using a Json serializer. You'll see the Json later in this article, and I've also added an article about generating Json script on the server side.

You'll notice the strange way setTimeout is called in the RotateAd function. It is all about Javascript scope and is far too big an issue to go into here. All I will say is that if I had used "this" instead of "thisObject" (which is set to "this" anyway), it might not point to this instance of the class. It might actually point to the document, the window or whatever the current "this" happens to be when the timeout fires.

function AdRotator(id, ads, height, width)
{
   this.id = id;
   this.ads = ads;
   this.index = 0;
   this.container = document.getElementById(id);

   this.anchorElement = document.createElement('a');
   this.imageElement = document.createElement('img');
   this.imageElement.setAttribute('height', height);   
   this.imageElement.setAttribute('width', width);
   this.anchorElement.appendChild(this.imageElement);
   this.container.appendChild(this.anchorElement);
   this.RotateAd();
}

AdRotator.prototype.RotateAd = function()
{
    var currentAd = this.NextAd();
    this.imageElement.setAttribute('src', currentAd.ImageUrl);
    this.anchorElement.setAttribute('href', currentAd.LinkUrl);
    var thisObject = this;
    setTimeout(function() { thisObject.RotateAd(); }, currentAd.DisplayTime);
}

AdRotator.prototype.NextAd = function()
{
    var ad = this.ads[this.index];
    this.index ++;
    if (this.index == this.ads.length) this.index = 0;
    return ad;
}
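
The "this" problem with setTimeout can be shown without a browser. In this sketch callLater is a hypothetical stand-in for setTimeout that calls its callback immediately as a plain function, and Counter is a made-up class used only for illustration.

```javascript
// Hypothetical stand-in for setTimeout: it calls the callback as a plain function.
function callLater(callback) {
    return callback();
}

function Counter() {
    this.count = 0;
}

Counter.prototype.increment = function () {
    // "this" depends on how the function is called, not where it was defined
    if (!this || typeof this.count !== "number") {
        return "lost this";
    }
    this.count++;
    return "ok";
};

var counter = new Counter();

// Passing the method itself detaches it from the instance.
var detached = callLater(counter.increment);

// Wrapping the call in a closure keeps the instance, as thisObject does in RotateAd.
var thisObject = counter;
var wrapped = callLater(function () { return thisObject.increment(); });
```

Only the wrapped call reaches the instance, which is why RotateAd captures "this" in a local variable before scheduling the next rotation.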

 

The Instance

For each instance of the ad rotator we simply use this script. You'll immediately notice that the code is filled with placeholders. I replace these placeholders on the server side before I inject it into the output. I'll describe these in more detail, and finally I'll include the source of a rendered page so you can see it as the browser sees it.

You might also notice that we are not actually assigning this object to anything. Hopefully, because it will always be referenced by a pending setTimeout callback, it will never be garbage collected. If I ever do have any problem with this I could easily create an array in the definition and push each instance into the array.

(new AdRotator('$ElementId', $Images, $Height, $Width));
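
The fallback of keeping every instance in an array could look something like this. Rotator is a hypothetical stand-in for AdRotator so the sketch runs without a DOM, and the ids are made up.

```javascript
// Hypothetical constructor standing in for AdRotator (no DOM needed here).
function Rotator(id) {
    this.id = id;
    // keep a reference so the instance stays reachable without a pending timeout
    Rotator.instances.push(this);
}

// Shared list on the constructor itself, created once with the definition.
Rotator.instances = [];

new Rotator("rotator1");
new Rotator("rotator2");
```

Because the list hangs off the constructor, it is created once alongside the definition and shared by every instance on the page.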

 

The $ElementId placeholder is replaced with the server control's ClientID property. This means we can reference the DOM object (a span element) that the server outputs as a container for the control. The ClientID property is very important when creating elements on the server and interacting with them on the client side. Often elements are not output with the ID of the server control. One of the reasons is that it's possible in ASP.Net to have two controls with the same ID in different placeholders, or the same ID in a master page and in the content page. I won't go into detail about how this naming works, but it's pretty clear when you view the source of the pages rendered by an ASP.Net server. The value is quoted as a string because it is passed to the getElementById function to find the element.

The $Images placeholder is replaced with Json: effectively a Javascript literal describing an array of objects containing ImageUrl, LinkUrl and DisplayTime properties. I have written an article about how this Json is generated on the server side.

The $Height and $Width place holders are replaced by the corresponding values on the AdRotator server control.
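
The substitution itself happens on the server in C#, but the idea is easy to sketch in Javascript. The fillTemplate helper and the values passed to it are hypothetical, chosen only to illustrate the technique.

```javascript
// Hypothetical template filler; the post does this on the server in C#.
function fillTemplate(template, values) {
    // replace each $Name token with the matching value, leaving unknown tokens alone
    return template.replace(/\$(\w+)/g, function (match, name) {
        return values.hasOwnProperty(name) ? String(values[name]) : match;
    });
}

var template = "(new AdRotator('$ElementId', $Images, $Height, $Width));";

var script = fillTemplate(template, {
    ElementId : "ctl00_rotator1",
    Images : JSON.stringify([{ ImageUrl : "Images/Winter.jpg", LinkUrl : "#", DisplayTime : "1000" }]),
    Height : 200,
    Width : 200
});
```

A function is used as the replacement so that any "$" characters inside the Json are inserted literally and not treated as replacement patterns.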

An example of how this script is actually output by the server should demonstrate how this client-side script works, and hopefully how it would work for multiple controls.

<script type="text/javascript">
(new AdRotator('ctl00_MainContentPlaceHolder_rotator1', 
               [{"ImageUrl":"Images/Winter.jpg","LinkUrl":"Winter.aspx","DisplayTime":"1000"},
                {"ImageUrl":"Images/Sunset.jpg","LinkUrl":"Sunset.aspx","DisplayTime":"4000"}],
                200, 200))
</script>

 

This is just one article in a series describing various aspects of writing ASP.Net server side controls. You can download the example code or assembly and see how it works.


AdRotator WebControl Example

June 15, 2008 18:01 by tarn

I didn't really want to write another ad rotator, but I did think it would be a good example control to implement. It has the scope to cover some aspects of writing web controls I wanted to learn more about and discuss. I decided to build an ASP.Net web control that cycles through a list of images on the client side. There are a few areas I want to discuss.

I wanted the control completely encapsulated within an assembly. So all you'd need to use the control would be a copy of the assembly. To do this I embedded some resources in the assembly and read them out at runtime. I discuss this in the article.

I wanted the control to be able to be created completely declaratively, and I thought providing the sort of intellisense you get when using standard ASP.Net controls would be nice.

I wanted the server-side control to inject Javascript into the output. I wanted the control to be smart enough to only output the Javascript prototype, or class definition once, even if there were multiple controls on the page. I wanted the client side code to be as object orientated as possible. 

You can download the source code which includes an example website in the solution.

I would like to add a post about supporting different storage mechanisms and using the control, but I have other projects demanding my attention. Hopefully I'll get back to it sometime.

I hope you've found some of these articles useful or interesting. I'll keep an eye on the comments, contribute to further discussion, and update the posts where errors or omissions are noted.

Cheers