Peter Schulam

Jul 8, 2013


Statistics is a big field. At the highest level, we can organize its concepts under the frequentist or Bayesian schools of thought, but what typically differentiates the two is not the aim of performing an analysis. Instead, the differences tend to be philosophical. At the most fundamental level, for example, frequentists interpret the probability of an event as the relative proportion of times that the event occurs if the "experiment" is repeated an infinite number of times. Bayesians, alternatively, view probability as a measure of one's belief that an event will occur---how we quantify this precisely is difficult, I think, and might be an interesting topic to explore in another post.
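To make the contrast concrete, here is a toy coin-flip sketch (the counts are invented for illustration): a frequentist might report the maximum-likelihood estimate of the coin's bias, justified by its behavior under hypothetical repeated sampling, while a Bayesian updates a prior to a posterior distribution over the bias and can report, say, its mean.

```python
# Toy data: 7 heads in 10 flips (numbers invented for illustration).
heads, flips = 7, 10

# Frequentist: the maximum-likelihood estimate, a single number whose
# justification appeals to repeated sampling.
mle = heads / flips

# Bayesian: beliefs about the bias, encoded as a Beta posterior obtained
# by updating a uniform Beta(1, 1) prior with the observed counts.
alpha, beta = 1 + heads, 1 + (flips - heads)
posterior_mean = alpha / (alpha + beta)

print(mle)             # 0.7
print(posterior_mean)  # 8/12, approximately 0.667
```

The two numbers are close here, but they are answers to different questions: one is a point estimate with frequentist guarantees, the other a summary of a distribution over beliefs.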

To return to the idea that the aim of an analysis is not what differentiates frequentist from Bayesian statistics, I think it's natural to ask two questions. First, how can we match or equate procedures in the two camps that are essentially answering the same inferential questions? And second, how do the philosophical differences that I outlined above affect the mathematical analyses that support the procedures, and what assumptions are we implicitly accepting as a result? In this essay, I'll explore these two questions, and at the end I'd like to reflect on which set of assumptions is more appropriate for the problems typically solved by computer scientists working in machine learning. This will be a living document, updated as I continue to read and reevaluate my understanding.

Statistical Inference

Jun 28, 2013

Eric S. Raymond describes orthogonality in the context of programming languages:

Orthogonality is one of the most important properties that can help make even complex designs compact. In a purely orthogonal design, operations do not have side effects; each action (whether it's an API call, a macro invocation, or a language operation) changes just one thing without affecting others. There is one and only one way to change each property of whatever system you are controlling.

I think that a similar principle can be applied to the choices a computational scientist makes when selecting her tools. Learning a new programming language, text editor, IDE, etc. requires a significant investment of time and effort before it becomes something that actually helps her become more productive.

If we look at productivity as a high-dimensional space, a researcher wants to select tools that span the space. That is, we want a productivity "basis." For example, in my mind, it doesn't make sense to invest lots of time learning both Ruby and Python. They both serve similar purposes as high-level scripting languages for writing general, short programs. In my case, Python makes more sense because the well-developed numpy stack allows the language to play the role of an additional basis vector as an environment for rapidly prototyping numerical algorithms (i.e. a MATLAB replacement).

My current lineup:

  • bash is my primary shell and doubles as a scripting language for automating tasks that require creating, renaming, and deleting files and directories.

  • python is my workhorse. As a general-purpose scripting language it's excellent. The syntax is lightweight and natural, which minimizes the cognitive load of remembering syntax and semantics. The numpy/scipy stack used within IPython makes it a great environment for prototyping ideas.

  • R together with RStudio is my modeling and visualization environment. Nothing beats R as a statistics and machine learning workbench, and the ggplot2 graphics library is easy to use and produces some of the most beautiful graphics around. The vibrant community and active CRAN library repository is ideal for research.

  • C is my compiled language of choice when speed matters. Both python and R have great support for interfacing with C, which means I can smoothly transition from a high-level language to something closer to the metal. C is ideal for this purpose because of its conceptual simplicity. It's not often that I need to program at such a low level, and when I do it's nice to avoid the trouble of recalling the intricacies of something as complex as C++ (that's not meant as a knock on C++).

May 30, 2013

I'll be using this document to hash out and help clarify some ideas that Roni and I have been discussing about uniform language model evaluation schemes. This idea came from our work on building language models for source code. We were trying to think about how to design an intuitive development and evaluation framework for software engineering researchers interested in building their own language models, and we realized that it's actually not entirely clear how "experts" consistently evaluate models. Read on if this sounds interesting to you.


Language models are a vital component of many modern language technologies. To the extent that a language model is able to capture the "naturalness" of a sequence of words, such models can be used in machine translation for reranking sentences hypothesized by the translation formalism. They are also used to help resolve phonetic ambiguities in a speech recognition system by preferring transcriptions that are more probable under the language model.

Despite their importance in modern language processing, progress in language modeling methodology has not advanced far beyond using simple n-gram models trained on massive amounts of text. In part, this may be due to the fact that claimed advances in language modeling techniques are difficult to confirm because there are many factors that can affect the performance of a model in a particular evaluation scenario. For example, n-gram models that use different vocabularies are not comparable, even when evaluated on the same test set.
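To see how vocabulary choice alone can move the numbers, here is a toy sketch (my own illustration, not a real benchmark): the same add-one-smoothed unigram model, trained and tested on identical tokens, reports different perplexities depending only on which words are in the vocabulary, because out-of-vocabulary tokens collapse into a single <unk> symbol.

```python
import math
from collections import Counter

def unigram_perplexity(train_tokens, test_tokens, vocab):
    # Map out-of-vocabulary tokens to <unk> and apply add-one smoothing.
    def norm(t):
        return t if t in vocab else "<unk>"
    counts = Counter(norm(t) for t in train_tokens)
    total = sum(counts.values())
    V = len(vocab) + 1  # +1 for the <unk> symbol
    logprob = sum(math.log((counts[norm(t)] + 1) / (total + V))
                  for t in test_tokens)
    return math.exp(-logprob / len(test_tokens))

train = "the cat sat on the mat".split()
test = "the cat sat".split()

# Same data, same model family -- only the vocabulary differs.
small_vocab = {"the", "cat"}
large_vocab = set(train)
print(unigram_perplexity(train, test, small_vocab))  # smaller number
print(unigram_perplexity(train, test, large_vocab))  # larger number
```

The smaller vocabulary looks "better" here only because <unk> soaks up probability mass, which is exactly why perplexities computed over different vocabularies cannot be compared directly.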

We propose a framework in which any model that adheres to a very simple interface can be correctly and uniformly evaluated. Underlying the framework is the idea of scoring a sequence of characters, with no knowledge of what forms a token. A language model is queried with a particular history, and responds with a set of predictions that are checked by the evaluation framework.
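One way the interface might look in code (the class and method names here are my own placeholders, not a finalized spec): the evaluator feeds the model one character of history at a time and scores its predictions, so it never needs to know how, or whether, the model tokenizes its input.

```python
import math

class CharLanguageModel:
    """Any model that can predict the next character given a history."""

    def predict(self, history: str) -> dict:
        """Return a mapping from candidate next characters to probabilities."""
        raise NotImplementedError

class UniformModel(CharLanguageModel):
    """A trivial baseline: every character in a fixed alphabet is equally likely."""

    def __init__(self, alphabet="abcdefghijklmnopqrstuvwxyz "):
        self.alphabet = alphabet

    def predict(self, history):
        p = 1.0 / len(self.alphabet)
        return {c: p for c in self.alphabet}

def evaluate(model, text):
    # The evaluator drives the model one character at a time; a tiny floor
    # probability guards against unpredicted characters.
    logprob = 0.0
    for i, char in enumerate(text):
        predictions = model.predict(text[:i])
        logprob += math.log(predictions.get(char, 1e-12))
    return logprob

print(evaluate(UniformModel(), "the cat"))  # 7 * log(1/27), about -23.07
```

Because the contract is just predict(history), any model, from an n-gram model to something far more exotic, can be dropped in and scored the same way.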

Our goal is to build a web-based API through which anyone in the world can evaluate a language model on a particular dataset. We hope to organize and maintain a database of entries by different contestants, allowing language modeling researchers to easily evaluate new ideas and share their performance with others. Our hope is that such an interface will make language modeling evaluations uniform across research groups, and allow useful contributions to be more effectively demonstrated and shared.