Tabular Analysis In Javascript Math Toolkit

I recently added polynomial least squares to the data analytics capability of the Javascript Math Toolkit.  Next in line is analysis of tabular data.  A data table in JS Math Toolkit is organized by columns, with the first row holding a character label for each category of data.  Taking the usedcars.csv file from the book ‘Machine Learning with R’ by Lantz as an example, the column headers would be,

“year”, “model”, “price”, “mileage”, “color”, “transmission”

Tabular data may be character, numeric, or boolean.  The data set may be loaded by any means desired into a two-dimensional array (with the first row consisting of the column headers) and then passed to the fromArray method of the Table class, along with the type of each column.

var types = [ __table.NUMERIC, __table.CHARACTER, __table.NUMERIC, __table.NUMERIC, __table.CHARACTER, __table.CHARACTER ];

__table.fromArray( data, types );
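
As a minimal sketch of the layout described above: the rows below are made up for illustration, the two type constants are stand-ins whose values are assumptions (the toolkit's actual constant values are not shown here), and checkLayout is a hypothetical helper for sanity-checking the array before handing it to the table, not part of the toolkit.

```javascript
// Stand-ins for the toolkit's type constants; the real values are not shown
// in this post, so these strings are assumptions.
const NUMERIC = "numeric";
const CHARACTER = "character";

// Header row first, then one array per record (values here are made up).
const data = [
  ["year", "model", "price", "mileage", "color", "transmission"],
  [2010, "SE", 12000, 40000, "Silver", "AUTO"],
  [2008, "SEL", 9500, 75000, "Red", "MANUAL"]
];
const types = [NUMERIC, CHARACTER, NUMERIC, NUMERIC, CHARACTER, CHARACTER];

// Hypothetical helper: returns a list of layout problems (empty when clean).
// Only numeric and character columns are handled in this sketch.
function checkLayout(data, types) {
  const problems = [];
  const width = data[0].length;
  if (types.length !== width) {
    problems.push("types length does not match the number of headers");
  }
  data.slice(1).forEach((row, i) => {
    if (row.length !== width) {
      problems.push("row " + (i + 1) + " has " + row.length + " cells");
      return;
    }
    row.forEach((cell, j) => {
      const ok = types[j] === NUMERIC
        ? typeof cell === "number" && Number.isFinite(cell)
        : typeof cell === "string";
      if (!ok) problems.push("row " + (i + 1) + ", column " + data[0][j]);
    });
  });
  return problems;
}
```

An empty result from checkLayout(data, types) means every row has one cell per header and every cell matches its declared type.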

In addition to a variety of common statistics on columns of data, quantiles may also be directly computed. Quintiles, for example, are computed by

var q = __table.get_quantiles("price", 0.2);
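
For readers who want to see what such a computation involves, here is a standalone sketch.  It assumes linear interpolation between sorted values (the same convention as R's default quantile type); the toolkit's own interpolation rule is not stated here and may differ.  The function name quantiles is hypothetical.

```javascript
// Returns the values at probabilities 0, step, 2*step, ..., 1.
// Assumes linear interpolation between adjacent sorted values.
function quantiles(values, step) {
  const sorted = values.slice().sort((a, b) => a - b);
  const count = Math.round(1 / step); // e.g. 5 intervals for quintiles
  const result = [];
  for (let k = 0; k <= count; k++) {
    // Fractional index into the sorted array for probability k/count.
    const h = (sorted.length - 1) * k / count;
    const lo = Math.floor(h);
    const hi = Math.min(lo + 1, sorted.length - 1);
    result.push(sorted[lo] + (h - lo) * (sorted[hi] - sorted[lo]));
  }
  return result;
}
```

With a step of 0.2 the result holds six values: the minimum, the four interior quintile boundaries, and the maximum.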

One-way tables, which count the number of occurrences of each unique item in a data column, are very easy.  The method returns an Object that can be converted into a 2D table for convenience,

var obj = __table.oneWayTable("year");
var tbl = __table.__tblToArray(obj);

from which the output is

2000, 3
2001, 1
2002, 1
2003, 1
2004, 3
2005, 2
2006, 6
2007, 11
2008, 14
2009, 42
2010, 49
2011, 16
2012, 1

If you prefer output in percentages,

obj = __table.oneWayTable("color", true);

Yellow 2.00
Gray 10.67
Silver 21.33
White 10.67
Blue 11.33
Black 23.33
Green 3.33
Red 16.67
Gold 0.67
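
The idea behind a one-way table can be sketched in a few lines of plain JavaScript.  This is not the toolkit's implementation; the function names and the choice to round percentages to two decimals are assumptions made for illustration.

```javascript
// Counts occurrences of each unique value; with asPercent set, each count is
// converted to a percentage of the column length, rounded to two decimals.
function oneWayCounts(values, asPercent) {
  const counts = {};
  values.forEach(v => { counts[v] = (counts[v] || 0) + 1; });
  if (!asPercent) return counts;
  const percents = {};
  Object.keys(counts).forEach(key => {
    percents[key] = Math.round(counts[key] / values.length * 10000) / 100;
  });
  return percents;
}

// Converts the result Object into a 2D array of [item, value] rows.
function countsToArray(obj) {
  return Object.keys(obj).map(key => [key, obj[key]]);
}
```

For a column such as ["Red", "Blue", "Red", "Red"], the counts are Red 3 and Blue 1, and the percentages are Red 75 and Blue 25.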

One-way tables are cool, but the real fun is cross-tab analysis.  Two methods are provided for this type of analysis and both produce the same output.  The crossTable() method allows one column to be analyzed vs. another (both are character data).  The dependent category may be further organized into groups.  For example, Lantz analyzes car model vs. groups of colors.  In the book, colors were separated into conservative and non-conservative colors to determine if there was a possible correlation between model of vehicle and color selected.  In JS Math Toolkit, this same analysis can be performed with a single method call,

var output = __table.crossTable("model", "color", 
["Black Silver White Gray", "Blue Gold Green Red Yellow"], 
["Simple-Color", "Bold-Color"] );

The entire collection of colors was divided into two groups and the cross-table analysis was done by group, not by unique color. The output column names are provided at the end of the argument list.

The output consists of four properties:

chi2 – Total table chi-squared value
df – Table degrees of freedom
q – Q-value from chi-squared or probability that table results were obtained by chance
table – Output table with cell count, row and column percentages, and percentage of cell count vs. entire table count.

The cell chi-squared may be added in the future as part of the output.
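
To make the grouping step concrete, here is a rough sketch of how cell counts by group could be tallied.  It mirrors the space-delimited group strings used in the call above, but the function name groupedCounts and the [row, value] pair input format are hypothetical; this is not the toolkit's code and it leaves out the percentages and chi-squared values.

```javascript
// pairs:      array of [rowCategory, value], e.g. ["SE", "Black"]
// groups:     space-delimited strings listing the values in each group
// groupNames: one output column name per group
// Returns an Object mapping each row category to an array of group counts.
function groupedCounts(pairs, groups, groupNames) {
  // Map each individual value (e.g. "Black") to the index of its group.
  const groupOf = new Map();
  groups.forEach((g, i) => {
    g.trim().split(/\s+/).forEach(v => groupOf.set(v, i));
  });

  const counts = {};
  pairs.forEach(([rowKey, value]) => {
    if (!groupOf.has(value)) return; // skip values not assigned to a group
    if (!counts[rowKey]) counts[rowKey] = groupNames.map(() => 0);
    counts[rowKey][groupOf.get(value)] += 1;
  });
  return counts;
}
```

Each row of the result lines up with groupNames, so with the two groups above the first entry is the Simple-Color count and the second is the Bold-Color count.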

In contrast, the crossTabulation() method performs a traditional cross-table or contingency table analysis.  Consider an example that can be found online, where city of residence is studied vs. favorite baseball team.  Cell counts might represent responses from a survey, for example.

City         Blue Jays  Red Socks  Yankees
Boston       11         33         7
Montreal     23         14         9
Montpellier  26         60         30

A table is created with column labels “City”, “Blue Jays”, “Red Socks”, “Yankees”.  The remainder of the data is supplied to the table via the fromArray() method.  Since the data is already organized for a full cross-table analysis, the method call is very simple,

output = __table.crossTabulation();

Partial output is:

CrossTabulation of city vs. baseball teams
Table degrees of freedom:  4
Total chi-squared:  19.35140903152151
Q-value:  0.0006703343353674507

along with the table of cell summaries.

From the chi-squared analysis, there is less than a one in one thousand chance that the table results were obtained by chance, which indicates the relationship between city and favorite baseball team warrants further study.
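
For reference, the quantities in that output can be sketched as follows.  This is a standalone illustration and not the toolkit's code: chiSquared computes the Pearson statistic and degrees of freedom from a 2D array of counts, and qValueEvenDf uses a closed form for the upper-tail probability that is valid only when the degrees of freedom are even (a general implementation would need the incomplete gamma function).  Both function names are hypothetical.

```javascript
// Pearson chi-squared statistic and degrees of freedom for a table of counts
// (rows and columns are the two categories; no header row or label column).
function chiSquared(counts) {
  const rowTotals = counts.map(row => row.reduce((a, b) => a + b, 0));
  const colTotals = counts[0].map((_, j) =>
    counts.reduce((a, row) => a + row[j], 0));
  const total = rowTotals.reduce((a, b) => a + b, 0);

  let chi2 = 0;
  counts.forEach((row, i) => row.forEach((observed, j) => {
    // Expected count if the two categories were independent.
    const expected = rowTotals[i] * colTotals[j] / total;
    chi2 += (observed - expected) * (observed - expected) / expected;
  }));
  return { chi2: chi2, df: (counts.length - 1) * (counts[0].length - 1) };
}

// Upper-tail probability Q(chi2 | df); this closed form holds for even df only:
// Q = exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
function qValueEvenDf(chi2, df) {
  if (df <= 0 || df % 2 !== 0) throw new Error("even, positive df only");
  const half = chi2 / 2;
  let term = 1;
  let sum = 1;
  for (let k = 1; k < df / 2; k++) {
    term *= half / k;
    sum += term;
  }
  return Math.exp(-half) * sum;
}
```

With four degrees of freedom the sum has only two terms, so qValueEvenDf(19.35140903152151, 4) reduces to exp(-x/2) * (1 + x/2) and agrees with the Q-value reported above.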

From my perspective, the Table class makes it very easy for me to provide this type of statistical analysis to a development team. With the Javascript Math Toolkit and some side consulting from me, all that should be necessary is to organize the data (check outliers/duplicates) and then format the output into a data grid or other visualization.

Update:  The Table class now includes methods to auto-normalize or z-score columns and split the internal table into 2D arrays for training and test data sets.
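
As a rough idea of what those two operations involve, here is a standalone sketch.  The function names, the option to choose between sample and population standard deviation, and the simple in-order split are all assumptions for illustration; the Table class may do these differently (for example, by shuffling rows before splitting).

```javascript
// Z-scores a numeric column: subtract the mean, divide by the standard
// deviation. Uses the sample standard deviation (n - 1) unless sample is false.
function zScore(values, sample) {
  const n = values.length;
  const mean = values.reduce((a, b) => a + b, 0) / n;
  const sumSq = values.reduce((a, v) => a + (v - mean) * (v - mean), 0);
  const divisor = sample === false ? n : n - 1;
  const std = Math.sqrt(sumSq / divisor);
  return values.map(v => (v - mean) / std);
}

// Splits data rows (header row excluded) into training and test arrays,
// keeping the original order; the first trainFraction of rows go to training.
function splitRows(rows, trainFraction) {
  const cut = Math.round(rows.length * trainFraction);
  return { train: rows.slice(0, cut), test: rows.slice(cut) };
}
```

For example, splitRows(rows, 0.8) on ten rows yields eight training rows and two test rows.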
