MapReduce can be a tricky concept to get your head around initially, especially if you are from a SQL background as I am. The first ever MapReduce I needed to write was to count both total and unique (by user ID) occurences of URLs in a click tracking collection in a MongoDB database. I read a number of blog posts on the subject but with the penny not quite having dropped I thought I'd put my question on Stack Overflow. I got a great reply from Ren, his/her very first SO post in fact.
Ren's reply is useful as is, but while trying to explain it to others I've found it useful to include some extra information. It's been on my to-do list to write this up (and indeed, start a blog at all) for some time so here goes.
Example collection
[
{
_id: ObjectId("4f33f53542b413f602000008"),
redirect: "http://emberjs.com/",
userId: 1,
remoteAddr: "0.0.0.0"
},
{
_id: ObjectId("4f33f2d042b413f602000005"),
redirect: "http://emberjs.com/",
userId: 1,
remoteAddr: "0.0.0.0"
},
{
_id: ObjectId("4f33f3b742b413f602000007"),
redirect: "http://emberjs.com/",
userId: 2,
remoteAddr: "0.0.0.0"
},
{
_id: ObjectId("4f33f28842b413f602000004"),
redirect: "http://documentcloud.github.com/backbone/",
userId: 3,
remoteAddr: "0.0.0.0"
},
{
_id: ObjectId("4f33f12942b413f602000003"),
redirect: "http://documentcloud.github.com/backbone/",
userId: 3,
remoteAddr: "0.0.0.0"
}
]
Map function
This function will run for every object in the collection. In Mongo you can also pass simple filter conditions to be run before the MapReduce takes place. But more on that towards the end of the article.
function() {
if (this.redirect) { // only emit if the redirect property has a value
var tempDoc = {};
tempDoc[this.userId] = 1;
emit(this.redirect, {
user: tempDoc,
click: 1
});
}
};
Map output / Reduce input
The reduce function will run for each unique key
reduce(key, value)
So think of the data being emitted from the Map function like this. Notice how we have thrown away all data that we aren't interested in using. It's gone.
[
{
"http://emberjs.com/": [
{
user: {
"1": 1
},
click: 1
},
{
user: {
"1": 1
},
click: 1
},
{
user: {
"2": 1
},
click: 1
}
]
},
{
"http://documentcloud.github.com/backbone/": [
{
user: {
"3": 1
},
click: 1
},
{
user: {
"3": 1
},
click: 1
}
]
}
]
Reduce function
function(key, values) {
var summary = {
users: {},
total: 0
};
values.forEach(function (doc) {
// increment total for every value
summary.total += doc.click;
// Object.extend() will only add keys from the right object that do not exist on the left object
Object.extend(summary.users, doc.user);
});
return summary;
};
Reduce output
Each key now has a total and an object containing the unique list of user ID's as keys.
[
{
"http://emberjs.com/": {
users: {
"1": 1,
"2": 1
},
total: 3
}
},
{
"http://documentcloud.github.com/backbone/": {
users: {
"3": 1
},
total: 2
}
}
]
Finalize function
Since in this case I only actually want the unique count, rather than the data of which user clicked, the finalize function is used to make one more pass of the data and tidy up a bit.
Since I don't want to cut out any credit, even for a simple function, I should point out that Ren credited this post for the countKeys() function
function(key, value) {
// you can specify helper functions within any step
var countKeys = function(obj) {
var count = 0;
for(var i in obj) {
if (obj.hasOwnProperty(i)) {
count++;
}
}
return count;
}
return {
redirect: value.redirect,
total: value.total,
unique: countKeys(value.users)
}
}
Final output
[
{
"http://emberjs.com/": {
redirect: "http://emberjs.com/",
total: 3,
unique: 2
}
},
{
"http://documentcloud.github.com/backbone/": {
redirect: "http://documentcloud.github.com/backbone/",
total: 2,
unique: 1
}
}
]
It might seem a bit odd to return the URL twice, but in my particular use case I have a web service looping through the result and outputting only an array of the values for easy use with a JavaScript webapp.
Implementing in PHP
Most of my connections to Mongo databases are wrapped in PHP web services (pure JS one day!) so I might as well include a simple PHP example. I've just included stub heredocs since the functions themselves are included above
<?php
$conn = new Mongo();
$db = $conn->maildb;
$map = <<<EOJS
function(){}
EOJS;
$reduce = <<<EOJS
function(){}
EOJS;
$finalize = <<<EOJS
function(){}
EOJS;
$command = array(
'mapreduce' => 'click_tracker' // collection name
'map' => new MongoCode($map),
'reduce' => new MongoCode($reduce),
'finalize' => new MongoCode($finalize),
'out' => array('inline' => 1) // return output as opposed to creating a new collection
);
$results = $db->command($command);
$conn->close();
// For the situation I described at the end of the last section
// I'm 'flattening' the results for use with a webapp
$output = array();
foreach($results['results'] as $value) {
$output[] = $value['value'];
}
return $output;
?>
Filtering the collection before performing the MapReduce
You can set query criteria to control the input to the MapReduce
<?php
$command['query'] = array(
'remoteAddr' => '0.0.0.0'
);
// and/or
$command['query'] = array(
'redirect' => new MongoRegex('/ember/i')
);
// etc...
?>
I hope that all makes sense. This blog is brand new in the last couple of days and I haven't added commenting functionality yet, so if you've found this somehow and have any corrections I will receive them gladly to @joevallender
comments powered by Disqus