Naive Bayes and Frequency Table in Javascript Math Toolkit

I’ve started the process of adding Bayesian analysis into the Javascript Math Toolkit.  As you might expect, the first addition to the Bayesian family is the naive algorithm, which is driven by a frequency table.  FrequencyTable is a separate class that has some use in its own right, but the primary benefit of the class is to feed the naive Bayes analysis.

The FrequencyTable treats classes as rows and predictors as columns, with the ability to assign all the data from arrays or to add data incrementally.  The latter is useful for interactive experiments and real-time analysis.

Bayes is known as a classifier, whose function is to make a distinction between two or more classes based on predictors.  Naive Bayes presumes independence among the predictors, a simplifying assumption that relaxes memory and computational requirements.  Bayes does very well when the independence presumption is accurate and there are well-defined distinctions between classes, i.e. spam or not-spam, or action-A or not-action-A.

Many programmers are unfamiliar with this algorithm, especially the difference between prior and posterior probabilities, so it is best to look at a very simple example to illustrate the operation of the two classes.  The following data was taken from this online example.

Suppose the goal is to classify various types of fruit as Banana, Orange, or Other based on whether or not the fruit is observed to be Long, Sweet, or Yellow (or any combination thereof).  Suppose 1000 observations are made and the following data is collected:

Banana: Long (400), Sweet (350), Yellow (450) – Total (500)

Orange: Long (0), Sweet (150), Yellow (300) – Total (300)

Other: Long (100), Sweet (150), Yellow (50) – Total (200)

The totals are the number of observations deemed to fit a specific class, while the individual predictor values are the number of times that characteristic was observed.  So, a single tally may be added to the Banana class, while that ‘sample’ is observed to be both Long and Yellow, incrementing both predictor counts.
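The relationship between these counts and conditional probabilities can be sketched in a few lines of plain JavaScript (this is independent of the toolkit's API; the names are illustrative only):

```javascript
// Plain-JS sketch (not toolkit API): each cell count divided by its
// class total yields a conditional probability, e.g. P(Long | Banana).
var counts = {
  Banana: { Long: 400, Sweet: 350, Yellow: 450, total: 500 },
  Orange: { Long: 0,   Sweet: 150, Yellow: 300, total: 300 },
  Other:  { Long: 100, Sweet: 150, Yellow: 50,  total: 200 }
};

function conditional(predictor, klass) {
  return counts[klass][predictor] / counts[klass].total;
}

console.log( conditional("Long", "Banana") );  // 400/500 = 0.8
console.log( conditional("Sweet", "Other") );  // 150/200 = 0.75
```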

The FrequencyTable for this data would be assigned as follows:

var rows = [ "Banana", "Orange", "Other" ];
var cols = [ "Long", "Sweet", "Yellow" ];
var rowTotals = [ 500, 300, 200 ];
var data = [];
data.push( [400, 350, 450] );
data.push( [0,   150, 300] );
data.push( [100, 150, 50] );
__table.fromArray( data, rows, cols, rowTotals );

Suppose the goal is to use the predictors Long, Sweet, and Yellow to classify whether a new observation is a Banana, based only on the above-supplied information, i.e. P(B | L,S,Y).  The fundamental assumption is that Long, Sweet, and Yellow are completely independent; that is, observing that a fruit is Long carries no expectation that it is also Sweet or Yellow.  Although this may hold in practice for a small number of predictors, dependence has a nasty habit of creeping in as more predictors are added.  It is tempting to ‘overfit’ the analysis by adding too many predictors in order to force compliance with preconceived notions.
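As a concrete illustration of what the independence assumption buys, here is the textbook calculation of P(B | L,S,Y) in plain JavaScript, using the frequency data above.  This is a sketch, not the toolkit's API, and the evidence term in the denominator is assumed to factor the same way under independence:

```javascript
// Sketch of the naive Bayes posterior for Banana (plain JS, not toolkit API).
// Numerator:   P(B) * P(L|B) * P(S|B) * P(Y|B)
// Denominator: P(L) * P(S) * P(Y), also factored under independence.
var N = 1000;  // total observations

var prior      = 500 / N;                                  // P(Banana) = 0.5
var likelihood = (400 / 500) * (350 / 500) * (450 / 500);  // 0.8 * 0.7 * 0.9
var evidence   = (500 / N) * (650 / N) * (800 / N);        // column totals / N

var posterior = (prior * likelihood) / evidence;
console.log( prior * likelihood );  // numerator, ~0.252
console.log( posterior );           // ~0.97 -- strongly favors Banana
```

Note that the column totals (500 Long, 650 Sweet, 800 Yellow) come from summing each predictor column across all three classes.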

The JS Math Toolkit FrequencyTable has a method that allows for quick column-to-column tests for correlation.  If correlation is found, it is easy to remove the offending columns.  The table is also auto-validating, i.e. checks are made for correctness of the data, and any occurrence of zero frequency is set to one occurrence.  The validate() method returns an integer value indicating the result of validation.
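The zero-frequency adjustment described above (a Laplace-style correction) amounts to bumping any zero cell to a count of one.  A minimal sketch of the idea, not the toolkit's internal implementation:

```javascript
// Sketch of the zero-frequency correction: a zero count would make every
// product containing it zero, so validation bumps it to one occurrence.
function correctZeroFrequencies(table) {
  return table.map(function(row) {
    return row.map(function(cell) { return cell === 0 ? 1 : cell; });
  });
}

var corrected = correctZeroFrequencies([
  [400, 350, 450],
  [0,   150, 300],   // Orange was never observed to be Long
  [100, 150, 50]
]);

console.log( corrected[1][0] );  // 1 -- the zero cell was bumped
```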

The naive Bayesian analysis described above is accomplished by:

prob = __bayes.naive( "Banana", ["Long", "Sweet", "Yellow"] );

The frequency table is first validated.  In this example, we feel confident of the general quality of data, but wish to run the validation to implement the Laplace correction.  The table is then assigned as a data provider to the Bayes class instance.  Since a reference to that table is used, it is possible to update the table in real-time and recompute classifiers on-the-fly.  When finished, call the clear() method to prepare the table for garbage collection.

The table may be validated inside the Bayes class at the caller’s discretion.  Another option is to compute only the numerator of the underlying equation, which is useful when multiple classes are checked against the same ‘evidence’.  In that case, the denominator is identical for every class and only the numerators need be compared.

For example, suppose the question is whether or not Long, Sweet, and Yellow indicates a Banana or an Other fruit.  Two calls to the naive() method would cause the exact same denominator to be computed twice.  Since only the numerators need to be compared, the method would be called with optional arguments,

var prob1 = __bayes.naive( "Banana", ["Long", "Sweet", "Yellow"], false, true );
var prob2 = __bayes.naive( "Other", ["Long", "Sweet", "Yellow"], false, true );

This generates the same results as in the referenced example, from which it may be concluded that there is very strong evidence that Long, Sweet, Yellow indicates a Banana. The return Object from the naive() method contains two properties.  The ‘p’ property is the requested probability (or numerator-only) and the ‘gain’ property is the Kononenko information gain.
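A numerator-only comparison of this kind can be sketched directly from the frequency data (plain JS mirroring what the two naive() calls compute; the function name is illustrative only):

```javascript
// Sketch: compare unnormalized posteriors (Bayes numerators) for two classes.
var N = 1000;
var classes = {
  Banana: { Long: 400, Sweet: 350, Yellow: 450, count: 500 },
  Other:  { Long: 100, Sweet: 150, Yellow: 50,  count: 200 }
};

function bayesNumerator(name) {
  var c = classes[name];
  return (c.count / N) *
         (c.Long / c.count) * (c.Sweet / c.count) * (c.Yellow / c.count);
}

console.log( bayesNumerator("Banana") );  // ~0.252
console.log( bayesNumerator("Other") );   // ~0.019
```

Since the denominator would be identical for both classes, the larger numerator (Banana's, by more than a factor of ten) identifies the more likely class.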

There is another optional argument that is very useful for larger examples.  The textbook formula involves multiplication of probabilities, all of which are one or smaller, and some of which may be very small.  There can be numerical issues (including underflow) with multiplication of many small numbers.  If the user is worried about such issues, then an alternate formulation involving a summation of logarithmic terms may be substituted for the standard equation.
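The idea behind the log-domain alternative is simple to sketch: a product of probabilities becomes a sum of their logarithms, which keeps intermediate values in a comfortable numerical range:

```javascript
// Sketch: product of small probabilities vs. sum of logs (plain JS).
var probs = [0.5, 0.8, 0.7, 0.9];  // prior and likelihoods from the example

var direct  = probs.reduce(function(p, q) { return p * q; }, 1);
var logSum  = probs.reduce(function(s, q) { return s + Math.log(q); }, 0);
var viaLogs = Math.exp(logSum);

console.log( direct );   // ~0.252
console.log( viaLogs );  // ~0.252 -- same value, computed more stably
```

When only comparing classes, the log-sums can be compared directly without ever exponentiating back, which sidesteps underflow entirely.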

This is, of course, only the first step in a longer path of Bayesian techniques; plenty to keep me busy on evenings and weekends :)
