Wednesday 2 January 2019

Anomaly detection as detecting identity drifting

Suppose you want to detect man-in-the-middle attacks – a typical case of cybercrime in which the legit user of a system is taken over by a malicious attacker (either human or computer-based).
Which machine learning would best match said problem? The major issue is that this a case of anomaly detection – one wants to detect a condition that is usually very rare. No or very few public datasets describe malicious transactions of this type, so if one wants to construct a training set for a machine learning algorithm, one has a large number of negative examples (legit transactions) and very few, if any, positive examples. How to deal with this issue?

A solution is the one modelled in my paper “Antifragility = Elasticity + Resilience + Machine Learning: Models and Algorithms for Open System Fidelity”: to encode the normal behaviour and detect anomalies as driftings from the normal behaviour. A way to do this is by means of Markov chains – a powerful mathematical model that has been used, e.g., to generate random text in the style of a reference text. The idea is to use a reference Corpus – for instance, the Plays of Shakespeare – and feed them into a program that approximates the probabilities of observing a given word, given that that word is preceded by a number of other words. As an oversimplified example, one would calculate the probability that « to » and « be » are followed by « or », the probability that « be » and « or » are followed by « not », the probability that « or » and « not » are followed by « to », and so forth: 

P { ( « to », « be » ) => « or » } = some value x
P { ( « be », « or » ) => « not » } = some value y

Those probabilities would capture the peculiar way in which sentences are constructed in the reference Corpus – its identity, if you like. The random text would then be generated by taking a “random walk” from an initial sequence to a final one – as if the probabilities were the orbits of a dynamic system on words (namely, through the same approach I used in my paper “Permutation Numbers”).

How can the above mathematical model be of use to solve of our anomaly detection problem? It’s actually quite simple: one has to construct an analogy between the sentences in a reference Corpus and the behaviour exercised by the user of a software system. What does a user of a software system typically does? S/he uses a graphical user interface to specify actions that s/he requires to be executed. How is this done in practice? By visiting the widgets of the user interface, pressing keys, and so forth. Think of those actions as words in a sentence, and the interaction sessions of user U as the sentences that user U “says”. Then we can use the Markov approach and approximate the probabilities that a given number of user interface actions be followed by another such action. Those probabilities then encode… the reference behaviour of user U when using a given interface!
For each user U then one could create a set of reference probabilities I(U), representing the stereotypical behaviour of U – namely, its identity. Every time U logs into the system, I(U) would then be loaded and used as a reference identity. The new interaction session would then be used to construct a new set of probabilities, C(U). The distance of I(U) from C(U) would then quantitatively express the experienced anomaly. When said distance would become larger than a given threshold, the system would declare a case of man-in-the-middle attack.

Note that the same approach could be used to detect a slow drifting away of C(U) from I(U). This would represent a monotonic change in the user stereotype, which could be a sign the user developing fatigue of perception/ cognitive disorders.

The pictures below describes the approach via a collaboration diagram. User U expresses actions such as "visualize flights"; an instrumented user interface received the action and forwards the information to an "interaction logger" that creates a suitable representation for the "sentence" that U is building. The Markovian analyzer is then notified so as to update the probabilities describing the "sentence". At the end, those probabilities are stored into I(U).

At run-time, a new set of reference probabilities, C(U), is constructed and compared with I(U). An alarm is issued when the discrepancy between those two values becomes too large:

No comments:

Post a Comment