Core team:

Zhenyu Lu
University of Vermont
Department of Mathematics and Statistics, Professor
Most recent papers:
Active learning through adaptive heterogeneous ensembling.
Zhenyu Lu, Xindong Wu, Joshua Bongard. IEEE Transactions on Knowledge and Data Engineering, 27: 368-381, 2015. [pdf] [journal page]
Abstract:
An open question in ensemble-based active learning is how to choose one classifier type, or appropriate combinations of multiple classifier types, to construct ensembles for a given task. While existing approaches typically choose one classifier type, this paper presents a method that trains and adapts multiple instances of multiple classifier types toward an appropriate ensemble during active learning. The method is termed adaptive heterogeneous ensembles (henceforth referred to as AHE). Experimental evaluations show that AHE constructs heterogeneous ensembles that outperform homogeneous ensembles composed of any one of the classifier types, as well as bagging, boosting and the random subspace method with random sampling. We also show in this paper that the advantage of AHE over other methods is increased if (1) the overall size of the ensemble also adapts during learning; and (2) the target data set is composed of more than two class labels. Through analysis we show that the AHE outperforms other methods because it automatically discovers complementary classifiers: for each data instance in the data set, instances of the classifier type best suited for that data point vote together, while instances of the other, inappropriate classifier types disagree, thereby producing a correct overall majority vote.
Cite: [bibtex]
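For readers who want a feel for the loop this paper describes, the following is a minimal Python/scikit-learn sketch of an AHE-style procedure. The classifier types (decision trees and naive Bayes), the variance-based disagreement measure, and the one-swap adaptation rule are illustrative simplifications, not the paper's exact algorithm.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
labeled = list(range(10))                                    # small seed of labeled indices
pool = [i for i in range(len(y)) if i not in set(labeled)]

# start with an even mix of two classifier types
ensemble = ([DecisionTreeClassifier(max_depth=3, random_state=k) for k in range(5)]
            + [GaussianNB() for _ in range(5)])

for step in range(30):
    for clf in ensemble:
        clf.fit(X[labeled], y[labeled])
    votes = np.array([clf.predict(X[pool]) for clf in ensemble])   # members x pool points
    query = pool.pop(int(np.argmax(votes.var(axis=0))))            # most contested point
    labeled.append(query)

    # adapt the type ratio: swap one instance of the currently weaker type for the stronger one
    acc = [clf.score(X[labeled], y[labeled]) for clf in ensemble]
    trees = [i for i, c in enumerate(ensemble) if isinstance(c, DecisionTreeClassifier)]
    bayes = [i for i, c in enumerate(ensemble) if isinstance(c, GaussianNB)]
    if trees and bayes:
        if np.mean([acc[i] for i in trees]) > np.mean([acc[i] for i in bayes]):
            ensemble[bayes[0]] = DecisionTreeClassifier(max_depth=3, random_state=100 + step)
        else:
            ensemble[trees[0]] = GaussianNB()

# the final classifier is a simple majority vote over the adapted ensemble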
Crowdsourcing Predictors of Behavioral Outcomes.
Joshua Bongard, Paul Hines, Dylan Conger, Peter Hurd, Zhenyu Lu. 2013. [pdf] [arXiv]
Abstract:
Generating models from large data sets -- and determining which subsets of data to mine -- is becoming increasingly automated. However, choosing what data to collect in the first place requires human intuition or experience, usually supplied by a domain expert. This paper describes a new approach to machine science which demonstrates for the first time that non-domain experts can collectively formulate features, and provide values for those features such that they are predictive of some behavioral outcome of interest. This was accomplished by building a web platform in which human groups interact to both respond to questions likely to help predict a behavioral outcome and pose new questions to their peers. This results in a dynamically-growing online survey, but the result of this cooperative behavior also leads to models that can predict users' outcomes based on their responses to the user-generated survey questions. Here we describe two web-based experiments that instantiate this approach: the first site led to models that can predict users' monthly electric energy consumption; the other led to models that can predict users' body mass index. As exponential increases in content are often observed in successful online collaborative communities, the proposed methodology may, in the future, lead to similar exponential rises in discovery and insight into the causal factors of behavioral outcomes.
Cite: [bibtex]
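The modeling half of this pipeline can be sketched as below, assuming the user-generated survey is stored as a response matrix with many missing answers. The data sizes, the stand-in outcome, and the imputation strategy are hypothetical placeholders, not the platform's actual pipeline.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n_users, n_questions = 200, 40                                   # hypothetical sizes
answers = rng.integers(1, 6, size=(n_users, n_questions)).astype(float)
outcome = 2.0 * answers[:, 0] - answers[:, 1] + rng.normal(size=n_users)   # stand-in outcome
answers[rng.random(answers.shape) < 0.4] = np.nan                # each user answers only some questions

model = make_pipeline(SimpleImputer(strategy="median"),
                      RandomForestRegressor(n_estimators=200, random_state=0))
print("cross-validated R^2:", cross_val_score(model, answers, outcome, cv=5, scoring="r2").mean())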
Use of an artificial neural network to predict head injury outcome.
Anand I. Rughani, Bruce I. Tranmer, Joshua Bongard, Michael A. Horgan, Paul L. Penar, Travis M. Dumont, Zhenyu Lu. Journal of Neurosurgery, 113: 585-590, 2010. [pdf] [journal page]
Abstract:
The authors describe the artificial neural network (ANN) as an innovative and powerful modeling tool that can be increasingly applied to develop predictive models in neurosurgery. They aimed to demonstrate the utility of an ANN in predicting survival following traumatic brain injury and compare its predictive ability with that of regression models and clinicians.
Cite: [bibtex]
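A rough sense of this kind of comparison is given by the sketch below, which pits a small scikit-learn MLP against logistic regression by cross-validated AUC on synthetic tabular data. The paper's clinical features, network architecture, and evaluation protocol are not reproduced here.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic, mildly unbalanced stand-in for an outcome-prediction data set
X, y = make_classification(n_samples=1000, n_features=12, weights=[0.8], random_state=0)

ann = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0))
logit = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

for name, model in [("ANN", ann), ("logistic regression", logit)]:
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: mean AUC = {auc:.3f}")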
Ensemble Pruning via Individual Contribution Ordering.
Joshua Bongard, Xindong Wu, Xingquan Zhu, Zhenyu Lu. Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 871-880, 2010. [pdf] [journal page]
Abstract:
An ensemble is a set of learned models that make decisions collectively. Although an ensemble is usually more accurate than a single learner, existing ensemble methods often tend to construct unnecessarily large ensembles, which increases the memory consumption and computational cost. Ensemble pruning tackles this problem by selecting a subset of ensemble members to form subensembles that are subject to less resource consumption and response time with accuracy that is similar to or better than the original ensemble. In this paper, we analyze the accuracy/diversity trade-off and prove that classifiers that are more accurate and make more predictions in the minority group are more important for subensemble construction. Based on the gained insights, a heuristic metric that considers both accuracy and diversity is proposed to explicitly evaluate each individual classifier’s contribution to the whole ensemble. By incorporating ensemble members in decreasing order of their contributions, subensembles are formed such that users can select the top p percent of ensemble members, depending on their resource availability and tolerable waiting time, for predictions. Experimental results on 26 UCI data sets show that subensembles formed by the proposed EPIC (Ensemble Pruning via Individual Contribution ordering) algorithm outperform the original ensemble and a state-of-the-art ensemble pruning method, Orientation Ordering (OO) [16].
Cite: [bibtex]
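The contribution-ordering idea can be sketched as follows with a simplified score that rewards members for voting correctly, with extra credit for doing so while in the voting minority. This illustrates ordering-based pruning only; it is not the paper's exact IC metric or the full EPIC algorithm.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1200, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

bag = BaggingClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
votes = np.array([est.predict(X_val) for est in bag.estimators_])   # members x instances
majority = (votes.mean(axis=0) >= 0.5).astype(int)

correct = (votes == y_val)                                           # member voted correctly
in_minority = (votes != majority)                                    # member voted against the majority
contribution = (correct * (1 + in_minority)).sum(axis=1)             # correct-while-in-minority counts double

p = 0.2                                                              # keep the top 20 percent of members
order = np.argsort(contribution)[::-1]
subensemble = [bag.estimators_[i] for i in order[: max(1, int(p * len(order)))]]

sub_votes = np.array([est.predict(X_val) for est in subensemble])
sub_pred = (sub_votes.mean(axis=0) >= 0.5).astype(int)
print("full ensemble accuracy:", (majority == y_val).mean())
print("pruned ensemble accuracy:", (sub_pred == y_val).mean())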
Adaptive Informative Sampling for Active Learning.
Joshua Bongard, Xindong Wu, Zhenyu Lu. Proceedings of the 2010 SIAM International Conference on Data Mining, 894-905, 2010. [pdf] [journal page]
Abstract:
Many approaches to active learning involve periodically training one classifier and choosing data points with the lowest confidence, but designing a confidence measure is nontrivial. An alternative approach is to periodically choose data instances that maximize disagreement among the label predictions across an ensemble of classifiers. Many classifiers with different underlying structures could fit this framework, but some ensembles are more suitable for some data sets than others. The question then arises as to how to find the most suitable ensemble for a given data set. In this work we introduce a method that begins with a heterogeneous ensemble composed of multiple instances of different classifier types, which we call adaptive informative sampling. The algorithm periodically adds data points to the training set, adapts the ratio of classifier types in the heterogeneous ensemble in favor of the better classifier type, and optimizes the classifiers in the ensemble using stochastic methods. Experimental results show that the proposed method performs consistently better than homogeneous ensembles. Comparison with random sampling and uncertainty sampling shows that the algorithm effectively draws informative data points for training.
Cite: [bibtex]
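One concrete way to quantify the disagreement used to pick the next data point is vote entropy across ensemble members, sketched below; the paper's own measure and sampling schedule may differ.

import numpy as np

def vote_entropy(votes, n_classes):
    """votes: array of shape (n_members, n_points) holding predicted class labels."""
    n_members = votes.shape[0]
    counts = np.stack([(votes == c).sum(axis=0) for c in range(n_classes)])  # classes x points
    p = counts / n_members
    with np.errstate(divide="ignore", invalid="ignore"):
        return -np.where(p > 0, p * np.log(p), 0.0).sum(axis=0)

votes = np.array([[0, 1, 2],
                  [0, 1, 0],
                  [0, 2, 1]])                       # 3 members, 3 candidate points
print(vote_entropy(votes, n_classes=3))             # point 0: all members agree, entropy 0
# the candidate with the highest entropy would be queried next, e.g. via np.argmax(...)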
Active Learning with Adaptive Heterogeneous Ensembles.
Joshua Bongard, Xindong Wu, Zhenyu Lu. Proceedings of the Ninth IEEE International Conference on Data Mining (ICDM '09), 327-336, 2009. [pdf] [journal page]
Abstract:
One common approach to active learning is to iteratively train a single classifier by choosing data points based on its uncertainty, but it is nontrivial to design uncertainty measures unbiased by the choice of classifier. Query by committee [1] suggests that given an ensemble of diverse but accurate classifiers, the most informative data points are those that cause maximal disagreement among the predictions of the ensemble members. However, the method for finding ensembles appropriate to a given data set remains an open question. In this paper, the random subspace method is combined with active learning to create multiple instances of different classifier types, and an algorithm is introduced that adapts the ratio of different classifier types in the ensemble towards better overall accuracy. Here we show that the proposed algorithm outperforms C4.5 with uncertainty sampling, Naive Bayes with uncertainty sampling, bagging, boosting and the random subspace method with random sampling. To the best of our knowledge, our work is the first to adapt the ratio of classifiers in a heterogeneous ensemble for active learning.
Cite: [bibtex]
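The random subspace ingredient mentioned in the abstract can be sketched in a few lines: each member is trained on its own random subset of the features and the ensemble takes a majority vote. The subspace size, member count, and base learner below are arbitrary choices.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

members = []
for k in range(10):
    feats = rng.choice(X.shape[1], size=10, replace=False)       # random feature subspace
    clf = DecisionTreeClassifier(random_state=k).fit(X[:, feats], y)
    members.append((feats, clf))

# each member predicts from its own subspace; the ensemble takes a majority vote
votes = np.array([clf.predict(X[:, feats]) for feats, clf in members])
pred = (votes.mean(axis=0) >= 0.5).astype(int)
print("training-set majority-vote accuracy:", (pred == y).mean())

For reference, scikit-learn's BaggingClassifier with bootstrap=False and max_features below 1.0 gives the same construction without the manual bookkeeping.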
Exploiting multiple classifier types with active learning.
Zhenyu Lu, Joshua Bongard. Proceedings of the 11th Annual Conference on Genetic and Evolutionary Computation, 1905-1906, 2009. [pdf] [journal page]
Abstract:
Many approaches to active learning involve training one classifier by periodically choosing new data points about which the classifier has the least confidence, but designing a confidence measure without bias is nontrivial. An alternative approach is to train an ensemble of classifiers by periodically choosing data points that cause maximal disagreement among them. Many classifiers with different underlying structures could fit this framework, but some classifiers are more suitable for some data sets than others. The question then arises as to how to find the most suitable classifier for a given data set. In this work, an evolutionary algorithm is proposed to address this problem. The algorithm starts with a combination of artificial neural networks and decision trees, and iteratively adapts the ratio of the classifier types according to a replacement strategy. Experiments with synthetic and real data sets show that when the algorithm considers both fitness and classifier type for replacement, the population becomes saturated with accurate instantiations of the more suitable classifier type. This allows the algorithm to perform consistently well across data sets, without having to determine a priori a suitable classifier type.
Cite: [bibtex]
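A toy version of a type-aware replacement step is sketched below: the lowest-fitness individual is replaced by a fresh instance of the best individual's classifier type, so the better-performing type gradually saturates the population. The Individual class, the random fitness values standing in for retraining, and the exact replacement rule are illustrative only, not the paper's strategy.

import random
from dataclasses import dataclass

@dataclass
class Individual:
    kind: str        # e.g. "ann" or "tree"
    fitness: float

random.seed(0)
population = [Individual(kind=random.choice(["ann", "tree"]),
                         fitness=random.random()) for _ in range(20)]

def replacement_step(pop):
    worst = min(pop, key=lambda ind: ind.fitness)
    best = max(pop, key=lambda ind: ind.fitness)
    pop.remove(worst)
    # the new individual inherits the better type; its fitness (random here) would
    # come from training and evaluating a fresh classifier of that type
    pop.append(Individual(kind=best.kind, fitness=random.random()))
    return pop

for _ in range(10):
    population = replacement_step(population)
print({k: sum(ind.kind == k for ind in population) for k in ("ann", "tree")})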
Informative sampling for large unbalanced data sets.
Zhenyu Lu, Anand I. Rughani, Bruce I. Tranmer, Joshua Bongard. Proceedings of the 10th Annual Conference Companion on Genetic and Evolutionary Computation, 2047-2054, 2008. [pdf] [journal page]
Abstract:
Selective sampling is a form of active learning which can reduce the cost of training by only drawing informative data points into the training set. This selected training set is expected to contain more information for modeling compared to random sampling, thus making modeling faster and more accurate. We introduce a novel approach to selective sampling, which is derived from the Estimation-Exploration Algorithm (EEA). The EEA is a coevolutionary algorithm that uses model disagreement to determine the significance of a training datum, and evolves a set of models only on the selected data. The algorithm in this paper trains a population of Artificial Neural Networks (ANN) on the training set, and uses their disagreement to seek new data for the training set. A medical data set called the National Trauma Data Bank (NTDB) is used to test the algorithm. Experiments show that the algorithm outperforms the equivalent algorithm using randomly-selected data and sampling evenly from each class. Finally, the selected training data reveals which features most affect outcome, allowing for both improved modeling and understanding of the processes that gave rise to the data.
Cite: [bibtex]
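The selection step can be sketched as a small committee of neural networks whose disagreement (variance of predicted probabilities) picks the next training point from an unbalanced pool, rather than sampling at random. The synthetic data, committee size, and network shapes below are placeholders, and this is a simplification rather than the EEA's coevolutionary formulation.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=15, weights=[0.95],   # unbalanced classes
                           random_state=0)
seed = list(np.where(y == 1)[0][:5]) + list(np.where(y == 0)[0][:35])        # both classes in the seed
labeled = list(seed)
pool = [i for i in range(len(y)) if i not in set(labeled)]

for step in range(20):
    committee = [MLPClassifier(hidden_layer_sizes=(8,), max_iter=1000, random_state=k)
                 .fit(X[labeled], y[labeled]) for k in range(4)]
    probs = np.array([m.predict_proba(X[pool])[:, 1] for m in committee])    # members x pool points
    query = pool.pop(int(np.argmax(probs.var(axis=0))))                      # most contested point
    labeled.append(query)

queried = labeled[len(seed):]
print("fraction of the rare class among queried points:", float(np.mean(y[queried] == 1)))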