Suppose you want to detect man-in-the-middle attacks – a typical
case of cybercrime in which the legitimate user of a system is impersonated by a malicious
attacker (either human or computer-based).
Which machine learning approach would best match said problem? The major
issue is that this is a case of anomaly detection – one wants to detect a
condition that is usually very rare. No or very few public datasets describe malicious
transactions of this type, so if one wants to construct a training set for a
machine learning algorithm, one ends up with a large number of negative examples (legitimate
transactions) and very few, if any, positive examples. How can one deal with this issue?
A solution is the one modelled
in my paper “Antifragility =
Elasticity + Resilience + Machine Learning: Models and Algorithms for Open
System Fidelity”: to encode the normal behaviour and detect anomalies
as drifts from the normal behaviour. A way to do this is by means of Markov
chains – a powerful mathematical model that has been used, e.g., to generate
random text in the style of a reference text. The idea is to use a reference Corpus
– for instance, the Plays of Shakespeare – and feed them into a program that
approximates the probabilities of observing a given word, given that that word
is preceded by a number of other words. As an oversimplified example, one would
calculate the probability that « to » and « be » are
followed by « or », the probability that « be » and « or »
are followed by « not », the probability that « or » and
« not » are followed by « to », and so forth:
P { ( « to », « be » ) => « or » } = some value x
P { ( « be », « or » ) => « not » } = some value y
etc.
Those probabilities would capture the peculiar way in which
sentences are constructed in the reference Corpus – its identity, if you like. The random text would then be generated by
taking a “random walk” from an initial sequence to a final one – as if the
probabilities were the orbits of a dynamic system on words (namely, through the
same approach I used in my paper “Permutation Numbers”).
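As a minimal sketch of the idea (the corpus here is a single toy sentence, not the actual Plays of Shakespeare), one can count how often each pair of words is followed by a third, normalise the counts into probabilities, and then take a random walk over those probabilities to generate text:

```python
import random
from collections import defaultdict

def build_model(words, order=2):
    """Estimate P{(w1, w2) => w3} by counting transitions in the corpus."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        counts[state][words[i + order]] += 1
    # Normalise counts into probabilities
    model = {}
    for state, nexts in counts.items():
        total = sum(nexts.values())
        model[state] = {w: n / total for w, n in nexts.items()}
    return model

def random_walk(model, start, length=10):
    """Generate text by walking the chain from an initial sequence."""
    state, out = tuple(start), list(start)
    for _ in range(length):
        if state not in model:
            break  # reached a final sequence
        choices = model[state]
        word = random.choices(list(choices), weights=list(choices.values()))[0]
        out.append(word)
        state = tuple(out[-len(start):])
    return " ".join(out)

corpus = "to be or not to be that is the question".split()
model = build_model(corpus)
print(model[("to", "be")])  # {'or': 0.5, 'that': 0.5}
```

In this toy corpus « to » and « be » are followed once by « or » and once by « that », hence the two probabilities of 0.5 each.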
How can the above mathematical model be of use to solve
our anomaly detection problem? It’s actually quite simple: one has to construct
an analogy between the sentences in a reference Corpus and the behaviour exercised
by the user of a software system. What does a user of a software system
typically do? S/he uses a graphical user interface to specify actions that
s/he requires to be executed. How is this done in practice? By visiting the
widgets of the user interface, pressing keys, and so forth. Think of those actions
as words in a sentence, and the interaction sessions of user U as the sentences
that user U “says”. Then we can use the Markov approach and approximate the
probabilities that a given sequence of user interface actions is followed by
another such action. Those probabilities then encode… the reference behaviour
of user U when using a given interface!
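As a sketch of this analogy (the action names and the sessions below are purely illustrative, not taken from any real log), the same counting used for words can be applied to a log of interface actions, where each interaction session plays the role of one sentence:

```python
from collections import defaultdict

def action_probabilities(sessions, order=2):
    """Estimate P{(a1, a2) => a3} over user-interface actions.
    Each session is the list of actions of one interaction session."""
    counts = defaultdict(lambda: defaultdict(int))
    for session in sessions:
        for i in range(len(session) - order):
            state = tuple(session[i:i + order])
            counts[state][session[i + order]] += 1
    return {state: {a: n / sum(nexts.values()) for a, n in nexts.items()}
            for state, nexts in counts.items()}

# Hypothetical sessions of user U: each action is a widget visit, key press, etc.
sessions = [
    ["login", "open_menu", "visualize_flights", "select_flight", "logout"],
    ["login", "open_menu", "visualize_flights", "logout"],
]
I_U = action_probabilities(sessions)
print(I_U[("login", "open_menu")])  # {'visualize_flights': 1.0}
```

The resulting table I_U is precisely the set of reference probabilities discussed next.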
For each user U then one could create a set of reference
probabilities I(U), representing the stereotypical behaviour of U – namely, its
identity. Every time U logs into the system, I(U) would then be loaded and used
as a reference identity. The new interaction session would then be used to construct
a new set of probabilities, C(U). The distance of I(U) from C(U) would then quantitatively
express the experienced anomaly. When said distance exceeded a
given threshold, the system would declare a case of man-in-the-middle attack.
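The paper does not prescribe one particular distance; as one possible choice, here is a sketch using the mean L1 distance between the two probability tables, with a purely illustrative threshold (states absent from one table contribute the full probability mass of the other):

```python
def distance(I, C):
    """Mean L1 distance between two probability tables I(U) and C(U)."""
    states = set(I) | set(C)
    total = 0.0
    for state in states:
        p, q = I.get(state, {}), C.get(state, {})
        actions = set(p) | set(q)
        total += sum(abs(p.get(a, 0.0) - q.get(a, 0.0)) for a in actions)
    return total / len(states) if states else 0.0

def is_attack(I_U, C_U, threshold=0.5):
    """Declare an anomaly when the distance exceeds the threshold.
    The threshold value is illustrative and would be tuned per user."""
    return distance(I_U, C_U) > threshold
```

Identical tables yield distance 0; two tables that never agree on the next action yield the maximum value of 2 per state.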
Note that the same approach could be used to detect a slow drifting
away of C(U) from I(U). This would represent a monotonic change in the user
stereotype, which could be a sign of the user developing perceptual fatigue or
cognitive disorders.
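One simple way to sketch this slow-drift check (the window size is an assumption, not something fixed by the approach) is to record the distance of each session from I(U) and flag a run of strictly increasing values:

```python
def drifting(history, window=5):
    """Flag a slow, monotonic drift: the last `window` session
    distances from I(U) are strictly increasing."""
    recent = history[-window:]
    return len(recent) == window and all(
        a < b for a, b in zip(recent, recent[1:]))
```

A sudden spike would trip the attack threshold instead; this check only fires on a sustained, gradual change of the stereotype.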
The picture below describes the approach via a collaboration diagram. User U expresses actions such as "visualize flights"; an instrumented user interface receives each action and forwards the information to an "interaction logger" that creates a suitable representation of the "sentence" that U is building. The Markovian analyzer is then notified so as to update the probabilities describing the "sentence". At the end, those probabilities are stored into I(U).
At run-time, a new set of probabilities, C(U), is constructed and compared with I(U). An alarm is issued when the discrepancy between those two becomes too large: