
Intelligent Algorithms and Feature Design

Posted by Yan Glina   |   January 27, 2015

If you live and breathe at the intersection of Cyber Security and Data Science, you have probably seen Alexandre Pinto’s DefCon22 talk, #SecureBecauseMath (https://www.youtube.com/watch?v=TYVCVzEJhhQ). In this talk, Alex makes great points about some blatantly poor yet commonplace practices. #MathIsAwesome, but overeager marketing departments pushing not-quite-real science, and overly receptive, starry-eyed audiences waiting for pronouncements from the next super-genius, are still problems. Alex also alludes to some things that are of immediate consequence to algorithms and Machine Learning researchers operating in the security space. For example, Feature Design.

Feature Design is the art of creating useful variables (features) -- numerical or otherwise -- that capture the salient details of the patterns contained in data. Feature Design is hard. Machine Learning is not magic, and there is no guarantee that a pattern that is barely there can be represented well enough for an algorithm to latch on to. This remains difficult, no matter how much one begins to believe in Deep Learning and tera-feature classification with on-line algorithms. Any Data Scientist with experience in Machine Learning will tell you that Feature Design is where systems are made (or broken). An algorithm isn’t likely to figure out multi-dimensional correlations without a great deal of well-labeled data -- reliable labels are extremely important for Supervised Learning. However, one or more well-designed features may bring that connection out more easily. So, apply some human ingenuity and a little bit of prototype-level elbow grease, and suddenly performance improves by leaps and bounds.
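To make the idea concrete, here is a minimal sketch of hand-designed features in a security setting. Everything in it is an illustrative assumption rather than something from Alex's talk: the choice of hostnames as raw input, the three features, and the helper names `shannon_entropy` and `design_features` are all hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy, in bits per character, of the characters in text."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def design_features(hostname: str) -> dict:
    """Turn a raw hostname into a few numeric features (hypothetical choices)."""
    label = hostname.split(".")[0]  # leftmost DNS label only
    digits = sum(ch.isdigit() for ch in label)
    return {
        "length": len(label),
        "digit_ratio": digits / len(label) if label else 0.0,
        "entropy": shannon_entropy(label),
    }


if __name__ == "__main__":
    for host in ["mail.example.com", "x7f3k9q2zt1b.example.com"]:
        print(host, design_features(host))
```

A learner handed only raw strings would have to discover "this name looks random" from labeled examples alone. A feature like `entropy` states that hypothesis directly, which is the kind of human ingenuity described above. Whether it actually helps still has to be checked on data the features were not tuned against.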

There is another side of the coin, however -- does the performance jump reflect the actual skill of the learning system and its generalization capability, or is it just overtraining in disguise? Notable specialists, including Trevor Hastie and Robert Tibshirani (both of Stanford University and co-authors of “The Elements of Statistical Learning”) and John Langford (author of Vowpal Wabbit), speak at great length about hidden overtraining (http://hunch.net/?p=22). A simple mistake, such as adjusting the features after evaluating on the Test Set, can yield improved results in evaluation but fail hard in real operation, because the Test Set has quietly become part of the training process. There are some technically simple (but conceptually non-trivial) solutions to this, but they require that the researcher or engineer at the very least recognize that this type of error is occurring.
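The effect is easy to reproduce with a small simulation. The sketch below is an assumption-laden toy, not a method from any of the authors cited above: the labels are coin flips, every candidate "feature design" is pure noise, and the row and candidate counts are arbitrary. It compares choosing a design by its Test Set score with choosing it on a separate validation split and scoring the Test Set once.

```python
import random


def accuracy(feature, labels):
    """Fraction of rows where a binary feature equals the binary label."""
    return sum(f == y for f, y in zip(feature, labels)) / len(labels)


def random_bits(rng, n):
    """Return n independent fair coin flips as 0/1 integers."""
    return [rng.randint(0, 1) for _ in range(n)]


rng = random.Random(0)  # fixed seed so the run is repeatable
n_rows, n_candidates = 50, 200

# Labels are coin flips, so no feature can truly predict them.
val_labels = random_bits(rng, n_rows)
test_labels = random_bits(rng, n_rows)

# Each candidate feature design is just noise on both splits.
candidates = [
    {"val": random_bits(rng, n_rows), "test": random_bits(rng, n_rows)}
    for _ in range(n_candidates)
]

# Mistake: choose the design by its Test Set score, then report that score.
leaky = max(candidates, key=lambda c: accuracy(c["test"], test_labels))
leaky_score = accuracy(leaky["test"], test_labels)

# Safer: choose on the validation split, touch the Test Set exactly once.
chosen = max(candidates, key=lambda c: accuracy(c["val"], val_labels))
honest_score = accuracy(chosen["test"], test_labels)

print(f"selected on test set:       {leaky_score:.2f}")
print(f"selected on validation set: {honest_score:.2f}")
```

Since there is nothing to learn here, any score well above 0.5 is an artifact of selection. The first number is inflated by construction, while the second should hover near chance. The same discipline applies when the "candidates" are successive feature tweaks made by a person rather than a loop.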

So, #math, i.e., the practice of Machine Learning / Statistical Learning, may indeed be the answer, but it doesn’t absolve us of the responsibility to perform due diligence and do science the RIGHT way.

Topics: Cyber Security, data science, feature design
