MetaAware Identifying Metamorphic Malware, Hacking and IT E-Book Dump Release

[ Pobierz całość w formacie PDF ]
MetaAware: Identifying Metamorphic Malware
∗
Qinghua Zhang, Douglas S. Reeves
Cyber Defense Laboratory, Computer Science Department
North Carolina State University, Raleigh, NC 27695-8207
{
qzhang2, reeves
}
@ncsu.edu
Abstract
quence of bytes that is completely characteristic of a spe-
ciﬁc malware instance. If a program upon inspection is
found to contain such a byte sequence, it is suspected of be-
ing infected by malware. Deriving signatures of new attacks
is a major function of the virus-checking and intrusion-
detection industries.
However, signature based approaches are essentially
syntactic and lack insight into the programs’ semantics.
Christodorescu and Jha [10] pointed out that such detection
methods can be easily defeated by
metamorphism
, which
uses code obfuscation techniques to transform the repre-
sentation of programs. Metamorphic malware can obfus-
cate their entire code in a variety of ways, such as con-
trol ﬂow transposition, substitution of equivalent instruc-
tions, variable renaming, etc. [9, 26]. This creates an arms
race between the metamorphic malware writers (or obfus-
cation engines such as Mistfall, Win32/Simile, and RPME
as pointed out in [26]), the signature writers, and the own-
ers/administrators of the threatened computers or devices,
which must be protected.
It is also quite common that new malware variants (or
mutants
) rapidly evolve from old malware, to which new
functions have been added, or existing functionalities have
been tweaked [15]. For example, the VX Heavens website
[7] provides access to thousands of malware variants in a
variety of different categories. For each malware variant,
a signature must be identiﬁed, packaged, and downloaded
to the base of users expecting protection from the new at-
tack. The huge range of possible variants, and the speed
with which they appear, makes this a less and less prac-
tical approach. Zmist is an advanced metamorphic virus
that demonstrates a set of polymorphic and metamorphic
code writing skills which include entry-point obscuring,
randomly using an additional polymorphic decryptor, code
permutation and code integration [26].
This paper offers a semantic characterization of pro-
grams, and a method of matching such characterizations,
as a basis for malware detection that is resilient to many
commonly-used obfuscation techniques. Generally the
problem of determining whether a program will exhibit
Detection of malicious software (malware) by the use of
static signatures is often criticized for being overly simplis-
tic. Available methods of obfuscating code (so-called meta-
morphic malware) will invalidate the use of a ﬁxed signa-
ture, without changing the harmful effects of the software.
This paper presents a new approach for recognizing meta-
morphic malware. The method uses fully automated static
analysis of executables to summarize and compare program
semantics, based primarily on the pattern of library or sys-
tem functions which are called.
The proposed method has been prototyped and evaluated
using randomized benchmark programs, instances of known
malware program variants, and utility software available
in multiple releases. The results demonstrate three impor-
tant capabilities of the proposed method: (
a
) it does well
at identifying metamorphic variants of common malware;
(
b
) it distinguishes easily between programs that are not
related; and, (
c
) it can identify and detect program varia-
tions, or code reuse. Such variations can be due to insertion
of malware (such as viruses) into the executable of a host
program. We argue that this method of metamorphic code
detection will be difﬁcult for malware writers to bypass.
1
Introduction
Personal computers and other devices attached to the In-
ternet must be protected from the enormous amount of ma-
licious software which attempts to inﬁltrate such systems
via the network. Examples of such malicious software,
termed
malware
, include viruses, worms, spyware, and tro-
jans. Malware instances have a variety of malicious pur-
poses, effects, and penetration methods [21].
Signature based malware detectors have been popular
and successful. An example of a signature would be a se-
∗
This work is supported by the National Science Foundation (NSF)
under grant CNS-0627505.
2
Related Work
a certain behavior is undecidable. Therefore, this paper
presents an approximation approach to address the problem.
Essentially, this paper proposes a code pattern genera-
tion method which characterizes a code fragment’s seman-
tics, or functionality, based on system calls executed by the
malware. It is based on static analysis of control and data
ﬂow of call traces, which include the calling instruction,
as well as the instructions that prepare parameters for the
system call. This paper also proposes a pattern matching
method, which tells whether two patterns semantically re-
semble each other. Let the characterization of a code frag-
ment be represented as
f
:
code -> pattern
, and the
matching function as
Malware detection has been an important research topic
for quite some time [25, 20]. Some recent work has focused
on the problem of metamorphic and polymorphic malware
that uses code obfuscation techniques to bypass static signa-
ture based approaches. Christodorescu
et al
. [9] presented a
unique view of malicious code detection as an obfuscation-
deobfuscation game, and used control ﬂow graph compar-
ison to detect some simple obfuscation techniques often
used by virus writers. The same authors [12] later proposed
to use instruction semantics to formalize templates of cer-
tain malicious behavior, such as the occurrence of decryp-
tion loops in polymorphic (self-decrypting) malware. They
then applied a template matching algorithm to detect mal-
ware. PolyUnpack [24] can identify unpack-executing mal-
ware based on a combination of static and dynamic anal-
ysis. This approach is based on the observation that se-
quences of packed or hidden code in a malware instance
can be identiﬁed when its execution is checked against its
static code model. Chouchane and Lakhotia [8] proposed
using “engine signatures” to assist in detecting metamor-
phic malware. Basically, this technique evaluates collected
forensic evidence from x86 code segments through a code
scoring function. This score is a measure of how likely it
is that the code has been generated by a known instruction-
substituting metamorphic engine.
There has also been substantial work on the detection
of obfuscated malicious code in network trafﬁc. Zhang
et
al
. [30] proposed a novel approach that can detect in net-
work packets highly obfuscated polymorphic exploit code,
based on static program analysis, and emulated instruction
execution. This work is limited to the detection of malware
that uses polymorphism, or self-decryption. Newsome
et
al
. [23] proposed a signature generation system, which gen-
erates signatures that consist of multiple disjoint content
substrings to match polymorphic or metamorphic worms.
This approach is based on the assumption that there are in-
variant substrings that are present in all variants of a worm
payload, and thus form a signature. Support for this as-
sumption comes from exploits which contain common sub-
strings that are crucial to successfully exploiting vulnera-
ble servers. This approach needs to collect enough training
ﬂows to conﬁdently produce valid, invariant signatures.
Kruegel
et al
. [18] proposed the use of structural anal-
ysis and comparison of binary code, based on its control
ﬂow. They used graph theory (coloring, isomorphism) to
improve the results of their comparison. The difﬁculties
of disassembling obfuscated code located at arbitrary loca-
tions inside network trafﬁc are not fully addressed, however,
and the method is computationally expensive. We also have
found that metamorphism and obfuscation change the con-
trol ﬂow in such a way that a straightforward comparison
M:(
pattern1, pattern2
)
− >
r
, where
r
∈
[0
,
1]
. When two code fragments
k
and
l
have
similar or identical functionality, it is desired that
M(f(k),
f(l))
be as close to 1 as possible; otherwise, it is desired
that
M(f(k), f(l))
be close to 0. A weighted match-
ing is used to compute the degree of similarity between two
code fragments, based on their patterns.
The proposed approach differs from a previous research
contribution [12] with similar assumptions and goals. That
work proposes to use semantic templates of certain mali-
cious behavior (such as the decryption loop in polymorphic
malware) to detect malware. The templates are generated
by studying the common behavior of a set of collected mal-
ware instances, and are generated manually. The method in
this paper, in contrast, automatically generates a pattern that
characterize a program’s semantics, and uses this pattern to
detect either obfuscated, or mutated, malware.
The proposed method has been implemented and eval-
uated on actual malware variants, widely-used benchmark
programs that have been randomized, and different releases
of of the GNU binutils programs [5]. The evaluation results
demonstrate the proposed method can make a clear distinc-
tion between semantically equivalent or related programs,
and those that are not. The measured similarity score of an
original benchmark program and its randomized version in
most cases achieves a value of .95 or greater. The measured
similarity scores of different releases of the GNU binutil
programs can achieve .75 or better. The measured similar-
ity scores for different malware variants vary, but there is
a very clear distinction between malware variants and non-
variants. To apply the proposed approach in practice, a rea-
sonable threshold can be set by the user to determine the
sensitivity of malware detection.
The rest of this paper is organized as follows. Section
2 brieﬂy reviews related work. Section 3 explains the pro-
posed method. Section 4 presents the results of evaluating
this method experimentally. Section 5 compares the pro-
posed method with previous approaches, and points out the
limitations of the proposed method. Section 6 concludes the
paper.
on structural characteristics is likely to fail.
Other recent work [17, 11] has analyzed unique malware
execution behavior to deal with the problems of polymor-
phism and metamorphism.
We now present a new method of detecting obfuscated
variants, or mutants, of malware. Following the description,
the method is evaluated experimentally, and compared with
some of these recent approaches.
identical. To make a decision whether an unknown program
is similar enough to a known malware to require that it be
quarantined, this score must be greater than a user-deﬁned
threshold.
We propose that patterns are based primarily on the sys-
tem calls or library executed by the malware. We propose
to statically analyze the control and data ﬂow of
call traces
,
which are the instructions that prepare the parameters used
by a system call, plus the corresponding call instruction.
System call based modeling has been frequently used to
characterize a program’s behavior for intrusion detection
purpose [25, 27]. It is a reasonable assumption that a com-
promised application cannot cause much harm unless it in-
teracts with the underlying operating system [27].
As an illustration of this behavior, the Sapphire worm ex-
ecutes the following set of system calls [3]:
LoadLibrary
,
GetProcAddress
,
GetTickCount
,
socket
,
sendto
.
Malware which did not make use of such system functions
would likely be harder to write, and result in a much larger
code size. The use of existing code obfuscation techniques
or metamorphic program transformations does not in gen-
eral remove such system calls from the malware.
The proposed pattern generation and matching methods
are now described in detail.
3
The New Method
In this section we present a new method of static analy-
sis of executables. This method disassembles two executa-
bles, and then computes the degree of similarity between
them. The essential characteristics used for this compari-
son are the system or library function calls made by the two
programs. The method is intended to be used for recog-
nition (and subsequent isolation) of metamorphic malware
and malware variants.
3.1
Overview
Static program analysis is used for many purposes, such
as security vulnerabilities checking [28, 14], and program
behavior modeling for intrusion detection [25, 27]. Static
program analysis needs to be done only once, and does not
require run-time monitoring of program execution, which
has substantial overhead. Proving that two programs (for
instance, an instance of a virus and a suspected metamor-
phic variant) are functionally equivalent is an undecidable
problem, unfortunately. The goal for static analysis of this
paper is thus less than a full proof of functional equivalence.
Instead, we propose to to characterize a program in a way
that can combine both structure and function. This charac-
terization is referred to as the
pattern
of a code fragment.
Ideally, the pattern should be markedly different for distinct
malware, and the obfuscation used by metamorphic engines
would not drastically change the pattern derived by static
analysis. The challenge is to compute such patterns quickly,
and to ﬁnd a way to compare patterns that yields insight into
the similarity between program functions. These patterns
then can be used in a way that is similar to the way that sig-
natures are used by conventional virus checkers. The use of
patterns must be substantially more resistant to obfuscation
than the use of ﬁxed signatures, however.
When two programs are analyzed to produce patterns
representing their function, these patterns can be compared
to determine how similar the programs are. The process of
comparing patterns is termed
pattern matching
in this paper.
The output of pattern matching is a
similarity score
between
0 and 1, where a value of 0 is interpreted to mean the pro-
gram functions are very different, and a value of 1 is inter-
preted to mean the program functions are extremely close or
3.2
Pattern Generation Based on Static
Analysis
As mentioned above, system calls and library function
calls are proposed to be used as the basis for generating a
pattern that characterizes a target malware. Control ﬂow
and data ﬂow analysis are used for this purpose.
To generate a pattern for code fragment
is disassem-
bled ﬁrst. There are a number of methods and tools de-
signed to disassemble obfuscated binary code. The method
of Kruegel
et al
[19] was adapted for this purpose.
Once the code is disassembled, the control ﬂow is eas-
ily obtained through static analysis. The result of such an
analysis is a set of basic blocks, and the transfers of control
between those blocks. The call instructions that branch to
system or library functions are then identiﬁed. Let
p
,
p
I
i
de-
note such a library call instruction in block
, and let the
total number of library call instructions in the program be
denoted as
i
N
.
The next task is to identify instructions that affect the pa-
rameters (values in memory or registers) used by the system
functions when they are called by the program. While there
can be many such parameters, the only such parameter used
at present by the proposed method is the
target address
of
the
call
instruction. Finding the instructions that affect
the target address can be accomplished by data ﬂow analy-
sis.
The data ﬂow analysis is initially given a single block
i
,
and includes the system call or library function call
I
i
con-
instructions in each maximal trace are symbolically exe-
cuted in a very limited way. Currently, this symbolic execu-
tion is simply the propagation of constant values. Suppose
the ﬁrst (earliest) instruction in a trace assigns a constant
value
c
to a register or memory location, and this constant
value can be propagated to the target system call
tained in that block. In block
i
, the instructions affecting
the parameters of
I
i
are determined. Essentially, they refer
to the instructions with deﬁnitions reaching
1
this call. For
each of these instructions, the blocks affecting their input
operands are determined by data ﬂow analysis. This pro-
cess of backwards data ﬂow analysis continues until either
(a) the target block
I
i
as the
address to which ﬂow of control will occur. This
call
instruction and the (constant) target address then form an
element
of a subpattern for a single maximal trace ending at
I
i
. Note that this symbolic execution is not sophisticated
enough to recognize all target addresses unambiguously.
For instance, some addresses may be computed from infor-
mation that is not available until runtime. Therefore, when
an instruction cannot be symbolically executed, the propa-
gated constants are bound to it and recorded. All such in-
structions in their intermediate representation are recorded
in order and form an element of the subpattern for a single
maximal trace ending at
is again reached (in which case a cycle
has been discovered), or (b) there are no more instructions
remaining for which backwards data ﬂow analysis must be
performed. The dependency or data ﬂow relationship be-
tween instructions can then be represented as a graph, with
a vertex for each instruction in the program, a directed edge
from
i
af-
fects the operand(s) of the instruction corresponding to ver-
tex
u
to
v
if the instruction corresponding to vertex
u
v
, and the vertex representing
I
i
as the sink of the graph.
A
maximal instruction trace
in this graph is a path from
a vertex having no predecessor in the graph, to the vertex
representing
B
I
i
. An executable instruction in a
maximal trace is not included in the element.
The
subpattern
for
I
i
. The above process is performed for each
system or library function call encountered in the disassem-
bled code.
Following this data ﬂow analysis, the instructions of each
maximal instruction trace are processed to generate a sub-
pattern for the trace. For this purpose, each instruction is
ﬁrst converted into an intermediate representation, based on
the semantics of the instruction. This intermediate represen-
tation is convenient for processing, and allows functionally
equivalent instructions to be represented in the same way.
It also allows the method to be applied regardless of the in-
struction set architecture, although at present only the Intel
x86 architecture [6] is targeted, due to its popularity.
The intermediate representation consists of the
opera-
tion type
,the
operands
, and the
operand addressing modes
(i.e. immediate data, register, or memory addressing). The
operation types for the x86 architecture are classiﬁed into
seven major categories (e.g., data transfer, arithmetic, logi-
cal, control transfer) and within each category multiple sub-
categories may be deﬁned. For instance, the
loop
and
jcc
instructions both transfer control and therefore are as-
signed to the same operation type. Operands are classiﬁed
as being of type source (read only) or destination (write
only or read/write), and the addressing mode and associ-
ated register, if any, are recorded. The conversion to in-
termediate form allows many instructions that are function-
ally equivalent to be identiﬁed to a limited degree. For in-
stance, using intermediate representations, the instructions
sub ecx, ecx
and
xor ecx, ecx
are identiﬁed as
functionally equivalent to
mov ecx, 0
, and the instruc-
tion
push eax
is identiﬁed as being equivalent to
dec
esp, 4
followed by
mov [esp], eax
.
After conversion to the intermediate representation, the
U
i
, is the set of all such
elements for all maximal traces ending at
I
i
, denoted
I
i
.Thesetofall
such subpatterns
U
i
for all the system or function calls in
code fragment
, and is denoted as
P
p
. The intuition behind this deﬁnition of a program pat-
tern is that a malware program will normally make use of
some well-deﬁned system services, whose addresses must
be found (so that they can be accessed) in a well deﬁned
way. Attempts to obfuscate the program function, without
changing the set of system or library function calls, can still
leave this behavior visible to inspection. Even obfuscation
of the target of a system call may leave the true target ex-
posed as one possibility. Since the proposed method uses
all possible traces, this true target will remain part of the
pattern of the code fragment. The use of both symbolic ex-
ecution and control ﬂow analysis for disassembly will also
overcome many known methods of obfuscation, as will be
showninsection4.
Figure 1 shows an example of the the patterns gener-
ated for code fragments of the Sapphire worm, and for a
metamorphic variant of this worm. For purposes of illus-
tration, each sub-pattern is presented in Intel x86 assem-
bly language form, and is a result of data ﬂow analysis and
symbolic execution. For instance, subpattern 1 of pattern
PA
p
is called the pattern of
p
results from symbolic execution of the instruction trace
mov esi, [0x42AE1018]
||
call [esi]
, in which
the second instruction operand depends on the ﬁrst instruc-
tion. Subpattern 3 of pattern
has two elements, which
result from two traces whose target is the same library func-
tion call. In this example, there happens to be multiple sub-
patterns which are identical. This is because some of the
library functions are called in multiple places, with differ-
ent parameters.
The next section explains a method of pattern matching
PA
1
Please refer to a textbook on compiler theory [22] for the explanation
of reaching deﬁnition in data ﬂow analysis.
Pattern(PA)
Sub-Patterns
Pattern(PB)
Sub-Patterns
A
pattern matching
of
k
and
l
is a one-to-one assign-
ment from the set of subpatterns of
k
to the set of subpat-
mov ebp, esp
mov eax, [ebp-40]
call eax
1
call [0x42AE1018]
1
terns of
. A maximum matching is one that includes all
of the subpatterns of
l
2
call [0x42AE1018]
call [0x42AE101C]
.
A maximum weighted matching is one that maximizes the
sum of the similarity scores of the pairs of subpatterns that
are matched. The value or score produced by a maximum
weighted matching
k
, and/or all of the subpatterns of
l
2
3
4
call [0x42AE1010]
call [0x42AE101C]
call [0x42AE1010]
3
4
mov ebx, [0x42AE1010]
mov eax, [ebx]
call eax
call [0x42AE1018]
call [0x42AE101C]
call [0x42AE1010]
5
5
mov ebp, esp
mov eax, [ebp-40]
call eax
is equal to the mean of the similarity
scores of pairs of subpatterns that are present in that match-
ing:
W
call [0x42AE101C]
call [0x42AE1010]
6
mov ebp, esp
mov eax, [ebp-40]
call eax
6
call [0x42AE101C]
call [0x42AE1010]
7
7
mov ebx, [0x42AE1010]
mov eax, [ebx]
call eax
call [0x42AE101C]
call [0x42AE1010]
score
(
U
i
,U
j
)
8
mov ebp, esp
mov eax, [ebp-40]
call eax
8
call [0x42AE1018]
U
i
,U
j
∈W
M
(
P
k
,P
l
)=
(1)
max
(
N
k
,N
l
)
Figure 1.
The patterns of code fragments of Sapphire
worm and its metamorphic version.
A maximum weighted matching is an optimistic ap-
proach to computing the similarity between two code frag-
ments. The process of deriving and matching patterns
should not be greatly affected by small errors in disassem-
bly and data ﬂow analysis, or by current program obfusca-
tion techniques. These claims are evaluated in section 4.
Pattern matching is performed after similarity scores are
computed for all pairs of sub-patterns. For each such pair
of sub-patterns, the similarity score of all pairs of elements
is computed, where one element is taken from the ﬁrst sub-
pattern, and the other element is taken from the second sub-
pattern. From this, a maximum weighted matching of the
elements of the two sub-patterns is computed, in the same
way as mentioned before. The similarity score of this pair
of sub-patterns is then the mean of the similarity scores of
pairs of elements that are matched.
Finally, computing the similarity of two elements in-
volves comparison of the instructions or instruction se-
quences (still in their intermediate form) in the two ele-
ments. This step ﬁnds a maximum weighted mapping be-
tween the instructions in the two elements. To do this, it is
required to compute the similarity between any two instruc-
tions, using as input their intermediate forms. The compu-
tation is only an estimate of the similarity between instruc-
tions. Therefore, a heuristic method is used. This method
ﬁrst computes the similarity between operation types. As
an example,
add
and
subtract
operations are deemed
to be similar, while
add
and
call
are not. The compar-
ison of operands checks for each operand pair whether the
addressing mode, and register or memory addresses or im-
mediate operands (when they can be determined) are the
same, and scores them based on closeness. Closeness of
operands is weighted more heavily than closeness of oper-
ation types when computing a ﬁnal similarity score for two
instructions. This computation is designed to be accurate
enough to capture most obfuscations used in practice.
Figure 2 shows an example of the maximum weighted
matching process for the two patterns shown in Figure 1.
to compute the similarity between two binaries. The input
to this process is the patterns derived from the binaries in
the way just described. The pattern matching algorithm is
intended to overcome the differences between two variants
of the same malware.
3.3
Pattern Matching
The purpose of pattern matching is to determine if two
code fragments are similar enough to exhibit functional
equivalency. The proposed method does not produce a for-
mal proof of equivalence. Not only is that undecidable, but
malware variants may in fact compute somewhat different
results. Rather, we consider similarity in system or function
call behavior to be strong evidence that programs have a
similar purpose. The two requirements for deﬁning patterns
and the resulting pattern matching algorithm are:
1. The pattern derived from one malware program should
be very different from patterns derived from other pro-
grams, whether benign, or malware of another type.
2. Patterns derived from metamorphic variants of a single
malware program should be very similar.
The matching algorithm is deﬁned as follows. Two code
fragments
may be, for instance,
a known instance of malware. The pattern for
k
and
l
are given, where
k
k
has been
P
k
=
{U
1
U
N
k
}
computed and is represented as
, ...,
.The
l
P
l
=
pattern for
has been computed and is represented as
{U
1
U
l
N
l
}
.
Let similarity scores be real values between 0 (minimum
similarity) and 1 (maximum similarity). Suppose similarity
scores between all pairs of subpatterns, where one subpat-
tern is taken from
, ...,
P
k
and one subpattern is taken from
P
l
,
have been computed.
[ Pobierz całość w formacie PDF ]

zanotowane.pl

doc.pisz.pl

pdf.pisz.pl

sylwina.xlx.pl

MetaAware Identifying ...

MetaAware Identifying Metamorphic Malware, Hacking and IT E-Book Dump Release

Tematy