Method And System For Determining Affiliation Of Software To Software Families Patent Application (2025)

U.S. patent application number 17/087775, filed with the patent office on November 3, 2020, was published on 2021-06-10 for method and system for determining affiliation of software to software families. The applicant listed for this patent is Group IB TDS, Ltd. Invention is credited to Ilia Sergeevich POMERANTSEV, Pavel Vladimirovich SLIPENCHUK.

Application Number: 20210173927 17/087775
Family ID: 1000005298907
Publication Date: 2021-06-10
United States Patent Application: 20210173927
Kind Code: A1
SLIPENCHUK; Pavel Vladimirovich; et al. June 10, 2021

METHOD AND SYSTEM FOR DETERMINING AFFILIATION OF SOFTWARE TO SOFTWARE FAMILIES

Abstract

A method and a system for determining an affiliation of a given software with target software are provided. The method comprises: receiving a file including a machine code associated with the given software; determining a file format; identifying, based on the file format, in the machine code, at least one function of a plurality of functions; generating, for each one of the plurality of functions associated with the given software, a respective function identifier; aggregating respective function identifiers, thereby generating an aggregated array of function identifiers associated with the given software; applying at least one classifier to the aggregated array of function identifiers to determine a likelihood parameter indicative of the given software being affiliated to a respective target software; in response to the likelihood parameter being equal to or greater than a predetermined likelihood parameter threshold: identifying the given software as being affiliated to the respective target software.

Inventors: SLIPENCHUK; Pavel Vladimirovich; (Moscow, RU); POMERANTSEV; Ilia Sergeevich; (Mendeleevo, RU)
Applicant: Group IB TDS, Ltd; City: Moscow; Country: RU
Family ID: 1000005298907
Appl. No.: 17/087775
Filed: November 3, 2020
Current U.S. Class: 1/1
Current CPC Class: G06F 21/563 20130101; G06F 21/602 20130101; G06F 21/14 20130101; G06F 21/568 20130101; G06F 21/554 20130101
International Class: G06F 21/56 20060101 G06F021/56; G06F 21/55 20060101 G06F021/55; G06F 21/60 20060101 G06F021/60; G06F 21/14 20060101 G06F021/14

Foreign Application Data

Date: Dec 5, 2019; Code: RU; Application Number: 2019139628

Claims

1. A method for determining an affiliation of a given software to target software, the method being executable by a processor, the method comprising: receiving, by the processor, a file including a machine code associated with the given software; determining a file format of the file associated with the given software, the determining comprising applying a signature analysis to the file; identifying, by the processor, based on the file format, in the machine code of the given software, at least one function of a plurality of functions associated with the given software; parsing, by the processor, the at least one function to identify therein at least one function command; generating, by the processor, for each one of the plurality of functions associated with the given software, a respective function identifier, a given function identifier associated with the at least one function being generated based on each of the at least one function command; aggregating, by the processor, respective function identifiers of the plurality of functions associated with the given software, thereby generating an aggregated array of function identifiers associated with the given software; applying, by the processor, at least one classifier to the aggregated array of function identifiers to determine a likelihood parameter indicative of the given software being affiliated to a respective target software, the at least one classifier having been trained for determining an affiliation to the respective target software; in response to the likelihood parameter being equal to or greater than a predetermined likelihood parameter threshold: identifying the given software as being affiliated to the respective target software; storing data indicative of the given software in a database of affiliated software; and using the data indicative of the given software for further determining affiliation to the respective target software.

2. The method of claim 1, wherein, if the machine code has been processed using one of predetermined processes, the identifying the at least one function further comprises: executing, by the processor, the machine code associated with the given software in an isolated program environment to receive one or more memory dumps associated with the given software; restoring, based on the one or more memory dumps, the machine code for identifying therein the at least one function.

3. The method of claim 2, wherein one of the predetermined processes comprises one of encryption, compression, and obfuscation.

4. The method of claim 2, wherein the identifying the at least one function further comprises disassembling, by the processor, the machine code of the given software.

5. The method of claim 2, wherein the identifying the at least one function further comprises identifying, in the machine code, library functions and deleting the library functions therefrom.

6. The method of claim 2, wherein the identifying the at least one function further comprises identifying, in the machine code, machine code portions inherently non-indicative of the affiliation to the target software and deleting the machine code portions inherently non-indicative of the affiliation to the target software machine code portions from the machine code.

7. The method of claim 2, wherein the at least one function command comprises at least one action and at least one argument associated with the at least one action; and the generating the respective function identifier further comprises: applying a hash function to the at least one action and to each value of the at least one argument to generate respective hash function values, each one of the respective hash function values being a respective number sequence; concatenating the respective hash function values.

8. The method of claim 1, wherein the at least one classifier has been trained to determine the affiliation to the respective target software based on a training set of data, and the method further comprising generating the training set of data, the generating comprising: receiving, by the processor, a plurality of target software files, each target software file including a respective target machine code associated with the respective target software; determining, by the processor, for each one of the plurality of target software files, a respective target file format, the determining comprising applying, by the processor, a signature analysis to each of the plurality of target software files; identifying, by the processor, based on the respective target file format associated with a given one of the plurality of target software files, in a respective target machine code, at least one target function; parsing, by the processor, the at least one target function to identify therein at least one target function command; generating, by the processor, based on each of the at least one target function command, a respective target function identifier associated with the at least one target function, the respective target function identifier comprising an associated number sequence; aggregating, by the processor, associated number sequences associated with respective target functions over the plurality of target software files, thereby generating a number array associated with the respective target software; identifying, by the processor, in the number array associated with the respective target software, at least one pattern, wherein: the at least one pattern comprises a predetermined repetitive number sequence within the number array, and the predetermined repetitive number sequence is indicative of a frequency of occurrence of at least one associated target function command within the respective target software; storing the at least one pattern with a label indicative of an association between the at least one pattern and the respective target software for inclusion thereof into the training set of data.

9. The method of claim 8, wherein, if the respective target machine code has been processed using one of predetermined processes, the identifying the at least one target function further comprises: executing, by the processor, the respective target machine code associated with the respective target software in an isolated program environment to receive one or more memory dumps associated with the respective target software; restoring, based on the one or more memory dumps, the respective target machine code for identifying therein the at least one target function.

10. The method of claim 9, wherein a length of the predetermined repetitive number sequence is determined as a constant number.

11. The method of claim 9, wherein the length of the predetermined repetitive number sequence is further determined iteratively, based on a current number thereof within the number array.

12. The method of claim 9, further comprising determining a frequency of occurrence value associated with the at least one pattern, the determining being according to the following formula: $\lambda = \frac{L}{K}$, where L is a frequency of occurrence of the at least one pattern within the number array associated with the respective target software, and K is a number of machine codes in the plurality of machine codes associated with the respective target software used for generating the training set of data.

13. The method of claim 9, further comprising assigning a weight value to the at least one pattern.

14. The method of claim 9, wherein the weight value is increased if the at least one pattern is indicative of mathematical operations used within the respective target software.

15. The method of claim 14, wherein the weight value is increased if the at least one pattern is indicative of at least two four-byte constants used within the respective target software.

16. The method of claim 13, wherein the weight value is determined based on the frequency of occurrence value associated with the at least one pattern.

17. A system for determining an affiliation of a given software with target software, the system comprising a computing device, the computing device further comprising: a processor; a non-transitory computer-readable medium comprising instructions; the processor, upon executing the instructions, being configured to: receive a file including a machine code associated with the given software; determine a file format of the file associated with the given software, the determining comprising applying a signature analysis to the file; identify, based on the file format, in the machine code of the given software, at least one function of a plurality of functions associated with the given software; parse, the at least one function to identify therein at least one function command; generate, for each one of the plurality of functions associated with the given software, a respective function identifier, a given function identifier associated with the at least one function being generated based on each of the at least one function command; aggregate respective function identifiers of the plurality of functions associated with the given software, thereby generating an aggregated array of function identifiers associated with the given software; apply at least one classifier to the aggregated array of function identifiers to determine a likelihood parameter indicative of the given software being affiliated to a respective target software, in response to the likelihood parameter being equal to or greater than a predetermined likelihood parameter threshold: identify the given software as being affiliated to the respective target software; store data indicative of the given software in a database of affiliated software; and use the data indicative of the given software for further determining affiliation to the respective target software.

18. The system of claim 17, wherein, if the machine code has been processed using one of predetermined processes, the processor is further configured to: execute, by the processor, the machine code associated with the given software in an isolated program environment to receive one or more memory dumps associated with the given software; restore, based on the one or more memory dumps, the machine code for identifying therein the at least one function.

19. The system of claim 18, wherein one of the predetermined processes comprises one of encryption, compression, and obfuscation.

20. The system of claim 18, wherein to identify the at least one function, the processor is further configured to disassemble the machine code of the given software.

21. The system of claim 18, wherein to identify the at least one function, the processor is further configured to identify, in the machine code, library functions and delete the library functions therefrom.

22. The system of claim 18, wherein to identify the at least one function, the processor is further configured to identify, in the machine code, machine code portions inherently non-indicative of the affiliation to the target software and deleting the machine code portions inherently non-indicative of the affiliation to the target software from the machine code.

Description

CROSS-REFERENCE

[0001] The present application claims priority to Russian Patent Application No. 2019139628, entitled "METHOD AND SYSTEM FOR DETERMINING AFFILIATION OF SOFTWARE TO SOFTWARE FAMILIES," and filed on Dec. 5, 2019, the entirety of which is incorporated herein by reference.

TECHNICAL FIELD

[0002] The present technology broadly relates to the field of computer technology; and, in particular, to methods and systems for determining affiliation of software to predetermined software families and/or authorships.

BACKGROUND

[0003] As may be known, professional cybercriminals thoroughly elaborate an attack strategy and change it rarely, using the same malware for a long time with insignificant modifications.

[0004] At the same time, the developers of malicious software (MSW) creating tools for cybercriminals could use the same software solution, for example, a function implementing a cryptographic algorithm, for a long time in different samples of MSW created for different cybercriminal groups and associated with different MSW families.

[0005] Therefore, in the field of cyber security, it may be important to know with which MSW family a given sample of MSW is affiliated and/or who is an author (or a group thereof) of the given sample of MSW.

[0006] Signature analysis is a well-known method of MSW detection. This method is based on searching files that include machine code of MSW for a unique sequence of bytes, i.e. a signature that is indicative of a specific MSW. A respective signature associated with the given sample of MSW may be determined based on analyzing a machine code associated therewith. Further, the respective signature can be stored into a virus signature database, to which an antivirus program may be provided access, thereby allowing for detection of the given sample of MSW.
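As a minimal sketch of the byte-sequence search described above, the following Python fragment scans a file's contents against a small signature database. The database contents, the family names, and the function name `scan` are hypothetical and chosen only for illustration; the byte values correspond to no real MSW.

```python
from typing import Dict, List

# Hypothetical signature database: family name -> unique byte sequence.
# The byte values are invented for illustration only.
SIGNATURES: Dict[str, bytes] = {
    "family_a": bytes.fromhex("deadbeef0badf00d"),
    "family_b": bytes.fromhex("cafebabe8badf00d"),
}


def scan(data: bytes, signatures: Dict[str, bytes] = SIGNATURES) -> List[str]:
    """Return the names of all signatures found anywhere in the data."""
    return [name for name, sig in signatures.items() if sig in data]


# A sample that embeds the first signature is detected.
sample = b"\x00" * 16 + bytes.fromhex("deadbeef0badf00d") + b"\x90" * 4
print(scan(sample))

# Changing a single byte inside the signature region defeats the exact match.
modified = sample.replace(bytes.fromhex("deadbeef"), bytes.fromhex("deadbeee"))
print(scan(modified))
```

The second call shows how fragile an exact byte match is: one altered byte is enough for the sample to go undetected.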

[0007] However, this method is also well-known to cybercriminals. Therefore, nearly all types of modern MSW are constantly modified to change basic functionality thereof. As a result of such modifications, the files of a next version of the given MSW may acquire new properties, which may render the given MSW unrecognizable as malicious for antivirus signature analyzers, which may thus allow cybercriminals to conduct attacks without any obstacles.

[0008] Besides the modifications, various approaches of obfuscation are widely used. Broadly speaking, in the context of the present specification, "obfuscation" refers to a technique of modifying a machine code of the given MSW such that functionality thereof is preserved; however, analyzing it to determine operation algorithms becomes more complicated. The above-mentioned modifications to the given MSW could be performed either by a human or automatically, e.g. by a so-called polymorphic generator, which may be part of the malware.

[0009] At the same time, as a result of the obfuscation, operating functions of the given MSW are not significantly altered. For example, after the modification the given MSW will "look" different only to signature analyzers; its code could be obfuscated and hence cannot be analyzed by a human; however, a set of operating functions of the given MSW performed before the obfuscation is likely to remain unchanged thereafter.

[0010] Certain prior art approaches are directed to determining authorship of different types of texts, such as literary, publicistic, or scientific, based on stylometric analysis thereof.

[0011] An article written by Dauber et al., published by Drexel University, Philadelphia, USA, and entitled "Stylometric Authorship Attribution in Collaborative Documents", discloses applying stylometry to a novel dataset of multi-authored documents collected from Wikia using both relaxed classification with a support vector machine (SVM) and multi-label classification techniques. Five possible scenarios are defined that show that one, the case where labeled and unlabeled collaborative documents by the same authors are available, yields high accuracy on the dataset while the other, more restrictive cases yield lower accuracies. Based on the results of these experiments and knowledge of the multi-label classifiers used, there is proposed a hypothesis to explain this overall poor performance. Additionally, there is performed authorship attribution of pre-segmented text from the Wikia dataset showing that while this performs better than multi-label learning it requires large amounts of data to be successful.

[0012] A PhD thesis written by S. Afroz at Drexel University, Philadelphia, USA, and entitled "Deception In Authorship Attribution" discloses authorship attribution methods in adversarial settings where authors take measures to hide their identity by changing their writing style and by creating multiple identities; using a large feature set to distinguish regular documents from deceptive documents with high accuracy and presenting an analysis of linguistic features that can be modified to hide writing style; adapting regular authorship attribution to difficult datasets such as leaked underground forums; and presenting a method for detecting multiple identities of authors. Further, the thesis demonstrates the utility of the approach with a case study that includes applying the technique to an underground forum and manual analysis to validate the results, enabling the discovery of previously undetected multiple accounts.

[0013] An article written by Alexander Granin, published by the web resource habr.com, and entitled "Text Analyzer" appears to disclose an automatic approach to determining authorship of texts based on a Hamming Neural Network.

SUMMARY

[0014] Developers of the present technology have realized that the stylometric approaches, i.e. those based on analyzing text stylistics, may not be an optimal solution for determining program code authorship. Regardless of the programming language in which the code is written, defining the author style in it would be extremely difficult owing to the specific nature of the art. In cases where the program source code is not available, stylometric approaches to analyzing samples of MSW do not appear to be appropriate.

[0015] Therefore, non-limiting embodiments of the present technology are directed to methods and systems for determining affiliation of given software to a predetermined family of software and/or authorship based on specific features associated therewith that are derived from a machine code thereof. It should be expressly understood that the methods and systems described herein are not limited to MSW and may be used for any software.

[0016] More specifically, according to a first broad aspect of the present technology, there is provided a method for determining an affiliation of a given software to target software. The method is executable by a processor. The method comprises: receiving, by the processor, a file including a machine code associated with the given software; determining a file format of the file associated with the given software, the determining comprising applying a signature analysis to the file; identifying, by the processor, based on the file format, in the machine code of the given software, at least one function of a plurality of functions associated with the given software; parsing, by the processor, the at least one function to identify therein at least one function command; generating, by the processor, for each one of the plurality of functions associated with the given software, a respective function identifier, a given function identifier associated with the at least one function being generated based on each of the at least one function command; aggregating, by the processor, respective function identifiers of the plurality of functions associated with the given software, thereby generating an aggregated array of function identifiers associated with the given software; applying, by the processor, at least one classifier to the aggregated array of function identifiers to determine a likelihood parameter indicative of the given software being affiliated to a respective target software, the at least one classifier having been trained for determining an affiliation to the respective target software; in response to the likelihood parameter being equal to or greater than a predetermined likelihood parameter threshold: identifying the given software as being affiliated to the respective target software; storing data indicative of the given software in a database of affiliated software; and using the data indicative of the given software for further determining affiliation to the respective target software.
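Two steps of the method above lend themselves to a short illustration: determining the file format by a signature analysis, and comparing the likelihood parameter against the threshold. The Python sketch below assumes that the signature analysis checks well-known magic numbers at the start of the file, which is one common way to do it and is not stated in the specification. The names `detect_format` and `is_affiliated` and the default threshold of 0.8 are hypothetical.

```python
from typing import Optional

# Well-known magic numbers found at offset 0 of common executable formats.
MAGIC_NUMBERS = {
    b"MZ": "PE",        # DOS/Windows executables
    b"\x7fELF": "ELF",  # Unix/Linux executables
}


def detect_format(data: bytes) -> Optional[str]:
    """Return a format name if the file starts with a known magic number."""
    for magic, name in MAGIC_NUMBERS.items():
        if data.startswith(magic):
            return name
    return None


def is_affiliated(likelihood: float, threshold: float = 0.8) -> bool:
    """Apply the 'equal to or greater than the threshold' rule.

    The default threshold is an assumption made for this sketch.
    """
    return likelihood >= threshold


print(detect_format(b"MZ\x90\x00"), detect_format(b"\x7fELF\x02"))
print(is_affiliated(0.8), is_affiliated(0.79))
```

Note that the comparison is inclusive, so a likelihood parameter exactly equal to the threshold counts as affiliated.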

[0017] In some implementations of the method, if the machine code has been processed using one of predetermined processes, the identifying the at least one function further comprises: executing, by the processor, the machine code associated with the given software in an isolated program environment to receive one or more memory dumps associated with the given software; restoring, based on the one or more memory dumps, the machine code for identifying therein the at least one function.

[0018] In some implementations of the method, one of the predetermined processes comprises one of encryption, compression, and obfuscation.

[0019] In some implementations of the method, the identifying the at least one function further comprises disassembling, by the processor, the machine code of the given software.

[0020] In some implementations of the method, the identifying the at least one function further comprises identifying, in the machine code, library functions and deleting the library functions therefrom.
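The library-function removal can be pictured as a simple filter. The sketch below assumes that library functions are recognized by comparing each function's identifier against a set of identifiers known to belong to libraries; this section does not say how library functions are recognized, so that lookup, the function name `drop_library_functions`, and the sample identifiers are all hypothetical.

```python
from typing import Dict, Set


def drop_library_functions(
    functions: Dict[str, str], library_ids: Set[str]
) -> Dict[str, str]:
    """Keep only functions whose identifier is not a known library identifier.

    `functions` maps a function name (or address) to its identifier.
    """
    return {
        name: fid for name, fid in functions.items() if fid not in library_ids
    }


# Hypothetical identifiers; real ones would come from the identifier step.
functions = {"sub_401000": "1111", "sub_401200": "2222", "sub_401400": "3333"}
known_library_ids = {"2222"}
print(drop_library_functions(functions, known_library_ids))
```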

[0021] In some implementations of the method, the identifying the at least one function further comprises identifying, in the machine code, machine code portions inherently non-indicative of the affiliation to the target software and deleting the machine code portions inherently non-indicative of the affiliation to the target software from the machine code.

[0022] In some implementations of the method, the at least one function command comprises at least one action and at least one argument associated with the at least one action; and the generating the respective function identifier further comprises: applying a hash function to the at least one action and to each value of the at least one argument to generate respective hash function values, each one of the respective hash function values being a respective number sequence; concatenating the respective hash function values.
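The identifier generation just described can be sketched as follows. The specification does not name a particular hash function; CRC32 from Python's standard `zlib` module is used here purely as a stand-in, and the `(action, arguments)` representation of a command, the fixed 10-digit encoding, and the function names are assumptions made for illustration.

```python
import zlib
from typing import List, Sequence, Tuple

# A command is modeled as (action, arguments), e.g. ("mov", ("ebp", "esp")).
Command = Tuple[str, Sequence[str]]


def hash_token(token: str) -> str:
    """Hash one token into a fixed-width, 10-digit number sequence.

    CRC32 is an assumed stand-in; its maximum value has 10 decimal digits.
    """
    return f"{zlib.crc32(token.encode('utf-8')):010d}"


def function_identifier(commands: Sequence[Command]) -> str:
    """Hash every action and every argument value, then concatenate."""
    parts: List[str] = []
    for action, args in commands:
        parts.append(hash_token(action))
        parts.extend(hash_token(arg) for arg in args)
    return "".join(parts)


function = [("push", ("ebp",)), ("mov", ("ebp", "esp"))]
identifier = function_identifier(function)
print(len(identifier), identifier.isdigit())
```

Because each token contributes a fixed-width number sequence, the concatenated identifier can later be split back into per-token values, which is convenient when looking for repeated sequences.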

[0023] In some implementations of the method, the at least one classifier has been trained to determine the affiliation to the respective target software based on a training set of data, and the method further comprises generating the training set of data, the generating comprising: receiving, by the processor, a plurality of target software files, each target software file including a respective target machine code associated with the respective target software; determining, by the processor, for each one of the plurality of target software files, a respective target file format, the determining comprising applying, by the processor, a signature analysis to each of the plurality of target software files; identifying, by the processor, based on the respective target file format associated with a given one of the plurality of target software files, in a respective target machine code, at least one target function; parsing, by the processor, the at least one target function to identify therein at least one target function command; generating, by the processor, based on each of the at least one target function command, a respective target function identifier associated with the at least one target function, the respective target function identifier comprising an associated number sequence; aggregating, by the processor, associated number sequences associated with respective target functions over the plurality of target software files, thereby generating a number array associated with the respective target software; identifying, by the processor, in the number array associated with the respective target software, at least one pattern, wherein: the at least one pattern comprises a predetermined repetitive number sequence within the number array, and the predetermined repetitive number sequence is indicative of a frequency of occurrence of at least one associated target function command within the respective target software; storing the at least one pattern with a label indicative of an association between the at least one pattern and the respective target software for inclusion thereof into the training set of data.
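The pattern-identification step can be illustrated with a sliding window over the number array. The sketch below treats a "repetitive number sequence" as any fixed-length window that occurs at least twice; that reading, the `min_count` parameter, the record layout, and the function names are assumptions made for illustration rather than details taken from the specification.

```python
from collections import Counter
from typing import Dict, List, Tuple

Pattern = Tuple[int, ...]


def find_patterns(
    number_array: List[int], length: int, min_count: int = 2
) -> Dict[Pattern, int]:
    """Count every window of `length` numbers and keep those that repeat."""
    windows = (
        tuple(number_array[i:i + length])
        for i in range(len(number_array) - length + 1)
    )
    counts = Counter(windows)
    return {p: c for p, c in counts.items() if c >= min_count}


def label_patterns(patterns: Dict[Pattern, int], family: str) -> List[dict]:
    """Attach the target-software label to each pattern for the training set."""
    return [
        {"pattern": p, "count": c, "label": family}
        for p, c in patterns.items()
    ]


number_array = [1, 2, 3, 9, 1, 2, 3, 7, 1, 2]
patterns = find_patterns(number_array, length=3)
print(label_patterns(patterns, "family_a"))
```

In this toy array only the window (1, 2, 3) occurs twice, so it is the single pattern stored with the label.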

[0024] In some implementations of the method, if the respective target machine code has been processed using one of predetermined processes, the identifying the at least one target function further comprises: executing, by the processor, the respective target machine code associated with the respective target software in an isolated program environment to receive one or more memory dumps associated with the respective target software; restoring, based on the one or more memory dumps, the respective target machine code for identifying therein the at least one target function.

[0025] In some implementations of the method, a length of the predetermined repetitive number sequence is determined as a constant number.

[0026] In some implementations of the method, the length of the predetermined repetitive number sequence is further determined iteratively, based on a current number thereof within the number array.
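The specification does not spell out the iterative procedure. One possible reading, sketched below, is to grow the sequence length for as long as the number array still contains at least a minimum number of repeating sequences of the longer length. The stopping rule, the parameters `start` and `min_patterns`, and the function names are assumptions of this sketch, not the documented method.

```python
from collections import Counter
from typing import List


def count_repeating(number_array: List[int], length: int) -> int:
    """Count the distinct windows of `length` that occur at least twice."""
    windows = (
        tuple(number_array[i:i + length])
        for i in range(len(number_array) - length + 1)
    )
    return sum(1 for c in Counter(windows).values() if c >= 2)


def choose_length(
    number_array: List[int], start: int = 2, min_patterns: int = 1
) -> int:
    """Increase the length while enough repeating sequences remain.

    The loop always terminates: once the length exceeds the array size
    there are no windows left, so the count drops to zero.
    """
    length = start
    while count_repeating(number_array, length + 1) >= min_patterns:
        length += 1
    return length


number_array = [1, 2, 3, 9, 1, 2, 3, 7, 1, 2]
print(choose_length(number_array))
```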

[0027] In some implementations of the method, the method further comprises determining a frequency of occurrence value associated with the at least one pattern, the determining being according to the following formula:

$$\lambda = \frac{L}{K},$$

where L is a frequency of occurrence of the at least one pattern within the number array associated with the respective target software, and K is a number of machine codes in the plurality of machine codes associated with the respective target software used for generating the training set of data.
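As a small worked example of the ratio of L to K defined above, the following Python function computes the frequency of occurrence value. The function and parameter names are hypothetical, and the sample numbers are invented.

```python
def occurrence_frequency(pattern_count: int, num_machine_codes: int) -> float:
    """Return lambda = L / K for one pattern.

    pattern_count     -- L, occurrences of the pattern in the number array
    num_machine_codes -- K, machine codes used to build the training set
    """
    if num_machine_codes <= 0:
        raise ValueError("K must be a positive number of machine codes")
    return pattern_count / num_machine_codes


# A pattern seen 30 times across 20 machine codes of one family.
print(occurrence_frequency(30, 20))
```

A value above 1 means the pattern occurs, on average, more than once per machine code in the family.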

[0028] In some implementations of the method, the method further comprises assigning a weight value to the at least one pattern.

[0029] In some implementations of the method, the weight value is increased if the at least one pattern is indicative of mathematical operations used within the respective target software.

[0030] In some implementations of the method, the weight value is increased if the at least one pattern is indicative of at least two four-byte constants used within the respective target software.

[0031] In some implementations of the method, the weight value is determined based on the frequency of occurrence value associated with the at least one pattern.
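The weighting rules above can be combined in a short sketch. The specification states only that the weight is based on the frequency of occurrence value and is increased in the two listed cases; using the frequency itself as the base weight, the additive form of the increase, and the `bonus` size are assumptions of this sketch.

```python
def pattern_weight(
    frequency: float,
    has_math_ops: bool = False,
    four_byte_constants: int = 0,
    bonus: float = 0.5,
) -> float:
    """Start from the frequency value and add a bonus per matching rule.

    The base weight, the additive scheme, and the bonus are hypothetical.
    """
    weight = frequency
    if has_math_ops:
        weight += bonus  # pattern indicates mathematical operations
    if four_byte_constants >= 2:
        weight += bonus  # pattern indicates at least two four-byte constants
    return weight


print(pattern_weight(1.5))
print(pattern_weight(1.5, has_math_ops=True, four_byte_constants=2))
```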

[0032] In accordance with a second broad aspect of the present technology, there is provided a system for determining an affiliation of a given software with target software. The system comprises a computing device. The computing device further comprises: a processor; a non-transitory computer-readable medium comprising instructions. The processor, upon executing the instructions, is configured to: receive a file including a machine code associated with the given software; determine a file format of the file associated with the given software, the determining comprising applying a signature analysis to the file; identify, based on the file format, in the machine code of the given software, at least one function of a plurality of functions associated with the given software; parse the at least one function to identify therein at least one function command; generate, for each one of the plurality of functions associated with the given software, a respective function identifier, a given function identifier associated with the at least one function being generated based on each of the at least one function command; aggregate respective function identifiers of the plurality of functions associated with the given software, thereby generating an aggregated array of function identifiers associated with the given software; apply at least one classifier to the aggregated array of function identifiers to determine a likelihood parameter indicative of the given software being affiliated to a respective target software; in response to the likelihood parameter being equal to or greater than a predetermined likelihood parameter threshold: identify the given software as being affiliated to the respective target software; store data indicative of the given software in a database of affiliated software; and use the data indicative of the given software for further determining affiliation to the respective target software.

[0033] In some implementations of the system, if the machine code has been processed using one of predetermined processes, the processor is further configured to: execute the machine code associated with the given software in an isolated program environment to receive one or more memory dumps associated with the given software; restore, based on the one or more memory dumps, the machine code for identifying therein the at least one function.

[0034] In some implementations of the system, one of the predetermined processes comprises one of encryption, compression, and obfuscation.

[0035] In some implementations of the system, to identify the at least one function, the processor is further configured to disassemble the machine code of the given software.

[0036] In some implementations of the system, to identify the at least one function, the processor is further configured to identify, in the machine code, library functions and delete the library functions therefrom.

[0037] In some implementations of the system, to identify the at least one function, the processor is further configured to identify, in the machine code, machine code portions inherently non-indicative of the affiliation to the target software and delete the machine code portions inherently non-indicative of the affiliation to the target software from the machine code.

[0038] Platform or computer platform is an environment where a given piece of software is executed. The platform includes both hardware (e.g. random-access memory, hard disk) and software (BIOS, operating system, etc.). A non-limiting example of a platform may be a Win32 API platform.

[0039] Obfuscation or code obfuscation is the deliberate modification of an initial machine code of a software program such that it is difficult for humans to understand, while preserving its functionality.

[0040] Logging is the automatic recording of actions performed by a software program in chronological order into a specific file, which may be referred to as a log or a report.

[0041] Machine code associated with a given software denotes a set of instructions associated with the given software that may be developed in any programming language but is further translated into a respective series of numerical symbols to be read and executed directly in a central processing unit (CPU), that is, a machine language or hardware-dependent programming language. The machine code can also be referred to as a lowest-level programming language set of instructions associated with the given software, which generally cannot be read and/or interpreted by a human and is only intended for execution in the CPU.

[0042] Machine code portions inherently non-indicative of affiliation with target software are code snippets that could be found in a variety of programs associated with a specific type of software, namely the target software. Such machine code portions are used not only in software of a specified purpose or by a certain author but practically ubiquitously. For example, such machine code portions may be indicative of, without being limited to, function prologues in a respective machine code or in a respective assembly code.

[0043] Framework is a program platform defining a structure of aprogram system--for example, a Django framework.

[0044] Further, in the context of the present specification, unless expressly provided otherwise, a computer system may refer to, but is not limited to, an "electronic device", an "operating system", a "system", a "computer-based system", a "controller unit", a "control device" and/or any combination thereof appropriate to the relevant task at hand.

[0045] In the context of the present specification, unless expressly provided otherwise, the expressions "computer-readable medium" and "memory" are intended to include media of any nature and kind whatsoever, non-limiting examples of which include RAM, ROM, disks (CD-ROMs, DVDs, floppy disks, hard disk drives, etc.), USB keys, flash memory cards, solid-state drives, and tape drives.

[0046] In the context of the present specification, a "database" is any structured collection of data, irrespective of its particular structure, the database management software, or the computer hardware on which the data is stored, implemented or otherwise rendered available for use. A database may reside on the same hardware as the process that stores or makes use of the information stored in the database or it may reside on separate hardware, such as a dedicated server or plurality of servers.

[0047] In the context of the present specification, unless expressly provided otherwise, the words "first", "second", "third", etc. have been used as adjectives only for the purpose of distinguishing the nouns that they modify from one another, and not for the purpose of describing any particular relationship between those nouns.

BRIEF DESCRIPTION OF THE DRAWINGS

[0048] Non-limiting embodiments of the present technology are described herein with reference to the accompanying drawings; these drawings are only presented to explain the essence of the technology and do not limit the scope thereof in any way, where:

[0049] FIG. 1 depicts a flowchart diagram of a method for determining an affiliation of a given software to a target software based on specific features thereof derived from a machine code associated therewith, in accordance with certain non-limiting embodiments of the present technology;

[0050] FIG. 2 depicts a flowchart diagram of a method for identifying and deleting machine code portions inherently non-indicative of the affiliation to the target software in the machine code associated with given software, in accordance with certain non-limiting embodiments of the present technology;

[0051] FIG. 3 depicts a flowchart diagram of a method for training a classifier used in the method of FIG. 1 for determining the affiliation of the given software with the target software, in accordance with certain non-limiting embodiments of the present technology;

[0052] FIG. 4 depicts a schematic diagram of an example computing environment configurable for execution of one of the methods of FIGS. 1, 2, and 3, in accordance with certain non-limiting embodiments of the present technology.

DETAILED DESCRIPTION OF THE TECHNOLOGY

[0053] The following detailed description is provided to enable anyone skilled in the art to implement and use the non-limiting embodiments of the present technology. Specific details are provided merely for descriptive purposes and to give insights into the present technology, and in no way as a limitation. However, it would be apparent to a person skilled in the art that some of these specific details may not be necessary to implement certain non-limiting embodiments of the present technology. The descriptions of specific implementations are only provided as representative examples. Various modifications of these embodiments may become apparent to the person skilled in the art; the general principles defined in this document may be applied to other non-limiting embodiments and implementations without departing from the scope of the present technology.

[0054] Certain non-limiting embodiments of the present technology are directed to computer-implemented methods and systems for determining affiliation of given software to target software based on analyzing an associated machine code. In some non-limiting embodiments of the present technology, the target software may include software of a predetermined software family and/or of a predetermined authorship.

[0055] According to certain non-limiting embodiments of the present technology, each one of the methods described herein below can be executed by a hardware processor, for example, a processor 401 of a computing device 400, which will be described below with reference to FIG. 4.

Determining Affiliation to Target Software

[0056] With reference to FIG. 1, there is depicted a flowchart diagram of a first method 100 for determining the affiliation of the given software with the target software, in accordance with certain non-limiting embodiments of the present technology.

Step 110: Receiving, by the Processor, a File Including a Machine Code Associated with the Given Software

[0057] The first method 100 commences at step 110 where the processor 401 can be configured to receive a file containing a machine code associated with the given software for further analysis. In some non-limiting embodiments of the present technology, the file containing the machine code can be of various formats, including, without limitation, an executable program file, such as an *.exe file; a dynamic library file, such as a *.dll file; and the like.

[0058] The first method 100 further proceeds to step 120.

Step 120: Determining a File Format of the File Associated with the Given Software

[0059] At step 120, according to certain non-limiting embodiments of the present technology, the processor 401 can be configured to determine a file format of the file containing the machine code received at step 110. It should be expressly understood that how the file format can be determined by the processor 401 is not limited. For example, in some non-limiting embodiments of the present technology, the processor 401 can be configured to use a script specifically preconfigured for comparing a signature of the file containing the machine code to a set of predetermined signatures respectively associated with (and/or indicative of) various file formats. To that end, in response to determining a match between the signature of the file and at least one of the set of predetermined signatures, the processor 401 can be configured to identify the file format as being one associated with the at least one of the set of predetermined signatures (which, in general, could differ from the format that the operating system in use associates with the extension of this file).
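The signature comparison described above can be illustrated with a minimal Python sketch. It is not the script used in the present technology: the signature table and the function name `detect_format` are assumptions introduced here for illustration, and only the leading bytes of the file are compared.

```python
from typing import Optional

# Hypothetical signature table: leading "magic" bytes mapped to a format label.
SIGNATURES = {
    b"MZ": "PE (Windows executable or DLL)",
    b"\x7fELF": "ELF",
}

def detect_format(data: bytes) -> Optional[str]:
    """Return the format whose signature matches the start of the data."""
    # Longer signatures are checked first so a short one cannot shadow them.
    for magic in sorted(SIGNATURES, key=len, reverse=True):
        if data.startswith(magic):
            return SIGNATURES[magic]
    return None  # no known signature matched
```

Because the decision relies on the file contents only, a file renamed to a misleading extension would still be classified by its actual signature.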

[0060] Further, in some non-limiting embodiments of the present technology, the processor 401 can be configured to determine if the machine code has been processed. For example, the processor 401 can be configured to determine if the machine code associated with the given software has been processed by one of the following predetermined processes: encryption, compression, and obfuscation. How the processor 401 is configured to determine if the machine code has been processed is not limited. In specific non-limiting embodiments of the present technology, to that end, the processor 401 can be configured to calculate a partial entropy of the machine code. Further, in response to the partial entropy, within a given portion of the machine code, exceeding a predetermined entropy threshold value (e.g., 6), the processor 401 can be configured to identify the machine code as being processed.
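As an illustration of the entropy check, the following Python sketch computes the Shannon entropy of fixed-size windows of the machine code. Several points are assumptions made here rather than details given in the specification: "partial entropy" is read as the entropy of a window, the entropy is measured in bits per byte (so it ranges from 0 to 8), and the window size of 256 bytes and the names `shannon_entropy` and `looks_processed` are hypothetical.

```python
import math
from collections import Counter

def shannon_entropy(chunk: bytes) -> float:
    """Shannon entropy of a byte string in bits per byte (0.0 to 8.0)."""
    if not chunk:
        return 0.0
    total = len(chunk)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(chunk).values())

def looks_processed(code: bytes, window: int = 256,
                    threshold: float = 6.0) -> bool:
    """True if any window of the code has entropy above the threshold."""
    for start in range(0, len(code), window):
        if shannon_entropy(code[start:start + window]) > threshold:
            return True
    return False
```

Encrypted or compressed data tends toward a uniform byte distribution and therefore toward high entropy, whereas ordinary machine code is more repetitive.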

[0061] Further, in some non-limiting embodiments of the present technology, in response to determining that the machine code has been processed, the processor 401 can be configured to execute the file containing it in an isolated program environment.

[0062] Accordingly, executing the file in the isolated program environment, the processor 401 can be configured to receive one or more memory dumps generated in response to the executing the file. Broadly speaking, a given memory dump is a "footprint" of the file associated with the given software, executed by the processor 401, on a runtime memory. Thus, the given memory dump can include machine code of a plurality of function commands associated with the given software. In some non-limiting embodiments of the present technology, such memory dumps could be generated with a relatively high frequency, for example, one per each clock cycle of the processor 401 of the computing device 400, which may make it possible to obtain more detailed data on the executed file and the machine code contained therein.

[0063] In some non-limiting embodiments of the present technology, to receive the one or more memory dumps, the processor 401 can be configured to use a specifically pre-configured application therefor. For example, and not as a limitation, the processor 401 can be configured to apply a ProcDump™ utility.

[0064] As each memory dump is representative of a respective portion of the machine code located in the runtime memory at a moment of its generation, the processor 401 can thus be configured to restore the machine code based on the one or more memory dumps generated in response to the executing the file in the isolated program environment. Therefore, using the runtime memory as a data source, the processor 401 can be configured to obtain a "clean" version of the machine code, as it was before being processed by one of the encryption, the compression, the obfuscation, and the like. Further analysis of the machine code may include disassembling and/or parsing, the operation algorithm of which is based on the specification of the computing architecture used (such as the x86 architecture), as will be described below.

[0065] The first method 100 thus advances to step 130.

Step 130: Identifying, by the Processor, Based on the File Format, in the Machine Code of the Given Software, at Least One Function of a Plurality of Functions Associated with the Given Software

[0066] At step 130, according to certain non-limiting embodiments of the present technology, the processor 401 can be configured to identify, in the machine code associated with the given software, a plurality of functions associated therewith. In the context of the present specification, the term "function" denotes a portion of the machine code, which could be accessed within the machine code by reference thereto. In most cases, an identifier can be linked to a given function; however, many languages allow for anonymous functions. The address of the first (operator) instruction, being part of the given function, to which the control is passed when referring to the function, is inseparably associated with a name of the given function. Having executed the given function, the control is returned back to a return address, i.e., to that portion of the machine code, from where the given function was called.

[0067] In some non-limiting embodiments of the present technology, to identify the plurality of functions associated with the given software, the processor 401 can be configured to disassemble the machine code, thereby restoring an assembly code associated with the given software. It should be expressly understood that how the disassembling is executed is not limited; and in some non-limiting embodiments of the present technology, the processor 401 can be configured to utilize a disassembler configured to translate the machine code into the assembly language or IL language set of instructions. In specific non-limiting embodiments of the present technology, the disassembler may include, without limitation, at least one of: an IDA™ Pro disassembler, a Sourcer™ disassembler, and the like. As a result of applying the disassembler, the machine code becomes marked: boundaries of each one of the plurality of functions associated with the given software can thus be explicitly marked in it. Further, in some non-limiting embodiments of the present technology, portions of the assembly code within respective boundaries are saved in a dedicated text format file. The remainder of the assembly code (beyond the boundaries) can be discarded from further processing, as it is considered to include insignificant information on the affiliation of the given software to the target software.

[0068] Thus, in some non-limiting embodiments of the present technology, the plurality of functions associated with the given software can be represented in one of the machine code and the assembly code.

[0069] Further, in some non-limiting embodiments of the present technology, the processor 401 can be configured to analyze the plurality of functions associated with the given software to detect therein and delete therefrom certain standard functions not indicative of the affiliation of the given software to the target software. For example, the processor 401 can be configured to identify, within the plurality of functions, library functions. The library functions are widely used by a variety of software and programs; therefore, their presence in the machine code and/or the assembly code associated with the given software is not specific to certain software families and/or authorship. According to certain non-limiting embodiments of the present technology, excluding the library functions may significantly simplify the analysis and, at the same time, yield better training results due to the fact that the decision rules are trained based on commands uniquely associated with the given software under analysis.

[0070] According to certain non-limiting embodiments of the present technology, the signature analysis and the deleting of the library functions can be performed by the processor 401 executing an auxiliary script. An algorithm of the auxiliary script could represent, for example, a sequential comparison of each function within the plurality of functions associated with the given software with a prearranged set of signatures (regular expressions). Each of these signatures corresponds to a specific library function preliminarily described as a signature; when a function corresponding to any signature is detected, the whole portion of the machine code and/or the assembly code composing the body and the header of the function is deleted. Upon completion of processing the plurality of functions by the auxiliary script, in some non-limiting embodiments of the present technology, the processor 401 may be configured to update the plurality of functions associated with the given software and save the so updated plurality of functions for further processing, as will be described below.
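The sequential comparison against a prearranged set of regular-expression signatures can be sketched in Python as follows. This is only an illustration under stated assumptions: functions are assumed to be available as text (header plus body), and the two signatures, the dictionary layout, and the name `drop_library_functions` are hypothetical.

```python
import re
from typing import Dict

# Hypothetical signatures: each regular expression describes one library function.
LIBRARY_SIGNATURES = [
    re.compile(r"^_?memcpy\b"),
    re.compile(r"^_?strlen\b"),
]

def drop_library_functions(functions: Dict[str, str]) -> Dict[str, str]:
    """Return only the functions that match none of the signatures."""
    kept = {}
    for header, body in functions.items():
        text = header + "\n" + body  # header and body are checked together
        if any(sig.search(text) for sig in LIBRARY_SIGNATURES):
            continue  # the whole function, header and body, is dropped
        kept[header] = body
    return kept
```

The returned dictionary plays the role of the updated plurality of functions that is saved for further processing.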

[0071] In some non-limiting embodiments of the present technology, the processor 401 may further be configured to identify, within the plurality of functions, and delete therefrom portions of the machine code and/or the assembly code inherently non-indicative of the affiliation to the target software. In the context of the present specification, machine code portions can be inherently non-indicative of the affiliation to the target software by virtue of not being specific enough for further analysis of the given software for the affiliation thereof to the target software (for example, due to the fact that they often occur in different software and, thus, are not indicative enough of affiliation to any given software); and thus these machine code portions can be omitted from the machine code without affecting the accuracy of such analysis. For example, inherently non-indicative machine code portions may include those indicative of function prologues of functions within the assembly code.

[0072] To that end, the processor 401, according to certain non-limiting embodiments of the present technology, after executing step 130 of the first method 100, can be configured to execute a second method 200, a flowchart diagram of which is depicted in FIG. 2, and to which reference is currently being made.

Step 210: Receiving a Plurality of Machine Code Samples

[0073] The second method 200 commences at step 210 with the processor 401 being configured to receive a plurality of machine code samples associated with various software. In certain non-limiting embodiments of the present technology, the plurality of machine code samples may include, for example, hundreds, thousands, or even hundreds of thousands of machine code samples different in functionality and having been developed by different teams of developers.

[0074] The second method 200 further proceeds to step 220.

Step 220: Identifying a List of Inherently Non-Indicative Machine Code Portions and Determining a Frequency of Occurrence Thereof within the Plurality of Machine Code Samples

[0075] At step 220, according to certain non-limiting embodiments of the present technology, the processor 401 can be configured to identify, within the machine code of the plurality of machine code samples, inherently non-indicative machine code portions that occur repeatedly therein. In some non-limiting embodiments of the present technology, the processor 401 can further be configured to determine a frequency of occurrence of each inherently non-indicative machine code portion within the plurality of machine code samples. For example, in some non-limiting embodiments of the present technology, the processor 401 can be preliminarily provided with a minimum sequence length value, e.g., 20 symbols, for identifying the inherently non-indicative machine code portions within the plurality of machine code samples. In some non-limiting embodiments of the present technology, a maximum sequence length value may not be preset. In alternative non-limiting embodiments of the present technology, the maximum sequence length can be predetermined to be from 15 to 250 symbols, as an example. Thus, the processor 401 can be configured to generate a list of inherently non-indicative machine code portions with associated respective frequencies of occurrences thereof.

[0076] The second method 200 thus proceeds to step 230.

Step 230: Selecting a Sub-Plurality of Most Frequent Inherently Non-Indicative Machine Code Portions within the List of Inherently Non-Indicative Machine Code Portions

[0077] Further, at step 230, the processor 401 can be configured to select, from the list of inherently non-indicative machine code portions generated at the previous steps, based on the associated respective frequencies of occurrences, a sub-plurality of most frequent inherently non-indicative machine code portions. For example, a given inherently non-indicative machine code portion may have occurred once in each one of the plurality of machine code samples including, for example, 100 machine code samples, which amounts to 100 occurrences of the given inherently non-indicative machine code portion therewithin.

[0078] In some non-limiting embodiments of the present technology, the processor 401 can be configured to select the sub-plurality of most frequent inherently non-indicative machine code portions based on a predetermined frequency threshold value, which can be determined, for example, based on a number of machine code samples within the plurality of machine code samples including the given inherently non-indicative machine code portion.
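Steps 220 and 230 can be sketched together in Python. The sketch simplifies the description in ways that are assumptions made here: only sequences of one fixed length are considered (rather than a range between a minimum and a maximum length), each sequence is counted once per sample, and the threshold is expressed as a share of the samples. The names `frequent_portions`, `length`, and `min_share` are hypothetical.

```python
from collections import Counter
from typing import Iterable, Set

def frequent_portions(samples: Iterable[bytes], length: int = 20,
                      min_share: float = 0.9) -> Set[bytes]:
    """Select byte sequences that occur in at least min_share of the samples."""
    samples = list(samples)
    counts = Counter()
    for sample in samples:
        # A set ensures that each sequence is counted once per sample.
        seen = {sample[i:i + length]
                for i in range(len(sample) - length + 1)}
        counts.update(seen)
    threshold = min_share * len(samples)
    return {portion for portion, n in counts.items() if n >= threshold}
```

With 100 samples and `min_share=1.0`, a sequence is selected only if it reaches 100 occurrences, which mirrors the example of a portion occurring once in each sample.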

[0079] The second method 200 thus advances to step 240.

Step 240: Generating, Based on the Sub-Plurality of Most Frequent Inherently Non-Indicative Machine Code Portions, a Script Configured to Identify and Delete Inherently Non-Indicative Machine Code Portions from a Given Software

[0080] At step 240, according to certain non-limiting embodiments of the present technology, the processor 401 may be configured to identify, in the machine code associated with the given software, based on the sub-plurality of most frequent inherently non-indicative machine code portions, inherently non-indicative machine code portions and delete them therefrom.

[0081] In some non-limiting embodiments of the present technology, as an example, the processor 401 can be configured, based on the sub-plurality of inherently non-indicative machine code portions selected at step 230, to generate a specific program script, which can further be used for identifying and deleting inherently non-indicative machine code portions from various software.
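A script of the kind described above could, in its simplest form, remove every occurrence of the selected portions from the machine code. The following Python sketch assumes that the portions are exact byte sequences; the name `strip_portions` and the longest-first ordering are choices made here for illustration only.

```python
from typing import Iterable

def strip_portions(code: bytes, portions: Iterable[bytes]) -> bytes:
    """Delete every occurrence of the given portions from the code."""
    # Longer portions are removed first so that a short portion contained
    # in a longer one does not break the longer match.
    for portion in sorted(portions, key=lambda p: (-len(p), p)):
        code = code.replace(portion, b"")
    return code
```

For instance, passing the byte sequence `55 8B EC` (a common x86 function prologue, `push ebp; mov ebp, esp`) as a portion removes it wherever it appears.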

[0082] Thus, by executing the second method 200, the processor 401 can be configured to delete the inherently non-indicative machine code portions from the machine code associated with the given software at step 130 of the first method 100. Accordingly, in some non-limiting embodiments of the present technology, upon completion of the second method 200, the processor 401 can be configured to proceed with executing the first method 100.

[0083] According to certain non-limiting embodiments of the present technology, such code portions can be identified and deleted from the assembly code associated with the given software, as well, without departing from the scope of the present technology, to which the above definition and the second method 200 can apply mutatis mutandis.

[0084] The second method 200 thus terminates.

[0085] Thus, at step 130, the plurality of functions associated with the given software does not include the library functions and inherently non-indicative machine code portions.

[0086] The first method 100 hence advances to step 140.

Step 140: Parsing, by the Processor, the at Least One Function to Identify Therein at Least One Function Command

[0087] Referring back to FIG. 1, at step 140, according to certain non-limiting embodiments of the present technology, the processor 401 can be configured to identify within a given one of the plurality of functions associated with the given software at least one function command.

[0088] According to certain non-limiting embodiments of the present technology, the at least one function command can comprise a given pair "action-argument" including an action and at least one argument associated with the action.

[0089] As alluded to above, each of the functions represented in the machine code can be disassembled using the disassembler. For example, when disassembling the following portion of the machine code associated with the given one of the plurality of functions:

[0090] . . . D6 00 C7 05 3C 3F 42 00 00 00 01 00 FF FF 00 . . .

the IDA™ Pro disassembler can be configured to identify the at least one function command (that is, "mov"):

[0091] mov dword_423F3C, 10000h

which has the following representation in the machine code:

[0092] C7 05 3C 3F 42 00 00 00 01 00.

[0093] Further, according to certain non-limiting embodiments of the present technology, within the at least one function command, the processor 401 can be configured to identify the at least one argument associated with the at least one function command, which is, in the example above, the value 10000h following the comma according to the assembler syntax:

[0094] mov dword_423F3C, 10000h

[0095] In alternative non-limiting embodiments of the present technology, the processor 401 can be configured to skip the procedure of identifying the at least one argument associated with the at least one function command.

[0096] In accordance with certain non-limiting embodiments of the present technology, the at least one function command ("mov", in the example above) is not used for further analysis as the function commands, per se, may not be indicative of the machine code they are derived from; for example, around 15-17 substantially different associated machine code portions may correspond to the "mov" function command.

[0097] Therefore, in some non-limiting embodiments of the present technology, the processor 401 can be configured to: (1) select portions from the machine code corresponding to function commands; (2) save them, for example, each on a separate line; and (3) analyze the so generated list of the function commands to detect associated actions (since respective arguments have already been identified). To that end, in some non-limiting embodiments of the present technology, the processor 401 can be configured to apply a script specifically configured for this purpose. An algorithm of this script can be configured to review the portions of the machine code respectively associated with the function commands based on the specification of the used architecture; in the present example, it is the x86 architecture.

[0098] Thus, in the example above, the script may be configured to execute the following verifications:

[0099] whether a first byte of a machine code portion associated with the at least one function command is one of the prefixes specified for the x86 architecture, and

[0100] whether the first byte is an indicator of a two-byte operation.

[0101] In the example above, both verifications return negative results, and the script thus proceeds with reviewing the machine code portion associated with the at least one function command. The script, in accordance with the x86 architecture specification, can be configured to interpret the first byte in this machine code portion, C7h, as an operation code, the complete form of which is specified depending on the contents of the next, i.e., second, byte, 05h. The script, in accordance with the specification of the x86 architecture, thus extracts the contents of the (reg) field of the second byte:

[0102] 05h = (mod)00 (reg)000 (r/m)101

and adds it to the operation code. Thus, the operation code acquires the following form:

[0103] C7h 000b,

which is further saved, for example, in association with the at least one argument 10000h identified and stored before.

[0104] Further, in accordance with the x86 architecture specification, for this operation code the contents of the (mod) and (r/m) fields of the second byte indicate that the following four bytes of the machine code portion under analysis, which are 3C 3F 42 00 in the present example, are used to define a command address as a DS register offset. In some non-limiting embodiments of the present technology, the processor 401 can be configured to ignore the command address, and therefore, in the present example, the four bytes of the machine code portion indicative of the command address can thus be discarded from further analysis.

[0105] Finally, in accordance with certain non-limiting embodiments of the present technology, the last four bytes of the machine code portion of the present example, that is, 00 00 01 00, are representative of the at least one argument 10000h, which has been already extracted from the disassembling results. Therefore, further analysis of this machine code portion is not performed.

[0106] In alternative non-limiting embodiments of the present technology, where the processor 401 is configured to skip the procedure of identifying the at least one argument, the last four bytes of the machine code portion (00 00 01 00) can further be used for identifying the at least one argument. For example, given the fact that, in the x86 architecture, the little-endian notation is used for recording the machine code, the at least one argument can be identified by the following conversion:

[0107] 00 00 01 00 → 10000h.

[0108] Thus, in the example above, from the machine code portion of the at least one function command under analysis, C7 05 3C 3F 42 00 00 00 01 00, the given "action-argument" pair can thus be obtained and stored in a dedicated file:

[0109] C7h 000b 10000h.
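The byte-level walk-through above can be reproduced with a short Python sketch. It handles only the one instruction form used in the example (operation code C7h followed by a ModR/M byte with mod = 00 and r/m = 101, a four-byte address, and a four-byte immediate value) and omits the prefix and two-byte-operation checks; the function name `action_argument` is a hypothetical helper, not part of the described script.

```python
from typing import Tuple

def action_argument(insn: bytes) -> Tuple[int, int, int]:
    """Decode one C7-form instruction into (operation code, reg field, argument)."""
    opcode, modrm = insn[0], insn[1]
    mod = modrm >> 6            # top two bits of the second byte
    reg = (modrm >> 3) & 0b111  # middle three bits, appended to the operation code
    rm = modrm & 0b111          # low three bits
    if opcode != 0xC7 or mod != 0b00 or rm != 0b101:
        raise ValueError("only the C7 form with a four-byte address is handled")
    # Bytes 2..5 hold the address (3C 3F 42 00 in the example) and are ignored.
    argument = int.from_bytes(insn[6:10], "little")  # 00 00 01 00 -> 10000h
    return opcode, reg, argument
```

Applied to the bytes C7 05 3C 3F 42 00 00 00 01 00, the sketch yields the operation code C7h, the reg field 000b, and the argument 10000h, matching the "action-argument" pair given above.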

[0110] The first method 100 thus proceeds to step 150.

Step 150: Generating, by the Processor, for Each One of the Plurality of Functions Associated with the Given Software, a Respective Function Identifier

[0111] According to certain non-limiting embodiments of the present technology, at step 150, the processor 401 can be configured, based on the given pair "action-argument", to generate a respective function identifier. In the context of the present specification, the term "function identifier" denotes a number sequence generated by the processor 401 for a given pair "action-argument" and associated with a respective one of the plurality of functions of the given software. In various non-limiting embodiments of the present technology, the number sequence associated with the given pair "action-argument" can be represented as a decimal number, a hexadecimal number, a binary number, and the like.

[0112] In some non-limiting embodiments of the present technology, the generating the respective function identifier can include applying, by the processor 401, one or more hash functions to the given pair "action-argument". In other non-limiting embodiments of the present technology, the processor 401 can be configured to convert the number sequence associated with the given pair "action-argument" into a decimal record format and concatenate the so generated decimal numbers, thereby generating a single decimal number being the respective function identifier:

[0113] C7h 000b 10000h → 199 0 65536 → 199065536
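Both ways of producing a function identifier can be sketched in Python. The decimal concatenation follows the conversion shown above; the hash-based variant is only one possible reading, since the specification does not name a hash function: SHA-256, the colon-separated input string, and the truncation to 8 bytes are assumptions made here, as are the function names.

```python
import hashlib

def concat_identifier(opcode: int, reg: int, argument: int) -> int:
    """Concatenate the decimal forms of the three parts into one number."""
    return int(f"{opcode}{reg}{argument}")

def hashed_identifier(opcode: int, reg: int, argument: int) -> int:
    """Hypothetical alternative: derive the identifier from a SHA-256 digest."""
    digest = hashlib.sha256(f"{opcode}:{reg}:{argument}".encode()).digest()
    return int.from_bytes(digest[:8], "big")  # keep the first 8 bytes
```

Calling `concat_identifier(0xC7, 0b000, 0x10000)` returns 199065536, the identifier derived in the example. Note that plain concatenation is not guaranteed to be unique across different pairs, which may be one reason to prefer a hash in practice.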

[0114] Thus, in some non-limiting embodiments of the present technology, the processor 401 can be configured to generate, for pairs "action-argument" associated with respective ones of the plurality of functions of the given software, a plurality of function identifiers being respective number sequences as described above.

[0115] The first method 100 hence advances to step 160.

Step 160: Aggregating, by the Processor, Respective Function Identifiers of the Plurality of Functions Associated with the Given Software, Thereby Generating an Aggregated Array of Function Identifiers Associated with the Given Software

[0116] At step 160, in some non-limiting embodiments of the present technology, the processor 401 can be configured to aggregate the plurality of function identifiers into an aggregated array of function identifiers. To that end, each one of the plurality of function identifiers may be represented as Pij, where i indicates a sequential number of the respective one of the plurality of functions, in which the given "action-argument" pair has been identified, and j indicates a sequential number of the given "action-argument" pair within the respective function, in which the given pair has been detected.

[0117] As it may become apparent, machine codes of the majority of modern software programs can include a considerable number of functions. Therefore, in certain non-limiting embodiments of the present technology, the machine code of the given software can be transformed into an aggregated array of function identifiers including n lines, wherein n is indicative of a total number of non-library functions having been identified during the above analysis:

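$$
\begin{aligned}
F_1 &= P_{11}, P_{12}, P_{13}, \ldots, P_{1j}, \ldots, P_{1a} \\
&\;\;\vdots \\
F_i &= P_{i1}, P_{i2}, P_{i3}, \ldots, P_{ij}, \ldots, P_{ib} \\
&\;\;\vdots \\
F_n &= P_{n1}, P_{n2}, P_{n3}, \ldots, P_{nj}, \ldots, P_{nc}
\end{aligned}
\tag{1}
$$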

[0118] As it can be appreciated from the above, indices a, b, and c in Equation (1) are indicative of different numbers of "action-argument" pairs within respective functions, in a general case.
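The aggregated array of Equation (1) can be pictured as a list of rows of unequal length, one row per non-library function. The Python sketch below assumes that each function has already been parsed into "action-argument" triples and that identifiers are formed by decimal concatenation; the name `aggregate` and the input layout are hypothetical.

```python
from typing import List, Sequence, Tuple

Pair = Tuple[int, int, int]  # (operation code, reg field, argument)

def aggregate(functions: Sequence[Sequence[Pair]]) -> List[List[int]]:
    """Build the array in which row i holds identifiers P_i1, P_i2, ... of function i."""
    array = []
    for pairs in functions:  # one entry per non-library function
        row = [int(f"{op}{reg}{arg}") for op, reg, arg in pairs]
        array.append(row)
    return array
```

Rows may differ in length, which corresponds to the indices a, b, and c in Equation (1) generally taking different values.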

[0119] The first method 100 thus proceeds to step 170.

Step 170: Applying, by the Processor, at Least One Classifier tothe Aggregated Array of Function Identifiers to Determine aLikelihood Parameter Indicative of the Given Software beingAffiliated to a Respective Target Software

[0120] At step 170, according to certain non-limiting embodimentsof the present technology, the processor 401 can be configured toanalyze the aggregated array of function identifiers expressed byEquation (1) to determine the affiliation of the given software tothe target software. To that end, the processor 401 can beconfigured to feed the aggregated array of function identifiers toone or more classifiers having been trained to determine theaffiliation with the target software. How the one or moreclassifiers can be trained, in accordance with certain non-limitingembodiments of the present technology, will be described below withreference to FIG. 3.

[0121] Further, in accordance with certain non-limiting embodiments of the present technology, the one or more classifiers, when applied to the aggregated array of function identifiers, may be configured to generate a likelihood parameter, which may be expressed, for example, as a numerical estimate of the probability that the given software is affiliated to the target software. The likelihood parameter can be continuously updated, i.e. re-evaluated as each portion of the machine code represented by Equation (1) is input.

[0122] The first method 100 hence proceeds to step 180.

Step 180: In Response to the Likelihood Parameter being Equal to or Greater than a Predetermined Likelihood Parameter Threshold: Identifying the Given Software as being Affiliated to the Respective Target Software; Storing Data Indicative of the Given Software in a Database of Affiliated Software; and Using the Data Indicative of the Given Software for Further Determining Affiliation to the Respective Target Software

[0123] At step 180, in response to the likelihood parameter generated by the one or more classifiers being equal to or greater than a predetermined likelihood parameter threshold value, the processor 401 can be configured to identify the given software as being affiliated to the target software. As noted hereinabove, in some non-limiting embodiments of the present technology, the target software may include software of a predetermined software family and/or of a predetermined authorship.

[0124] Further, in certain non-limiting embodiments of the present technology, the processor 401 may be configured to store data indicative of the given software in a dedicated database for further use. For example, the processor 401 can be configured to use the data indicative of the given software to train the one or more classifiers to determine affiliation of other software to the given software.

[0125] In other non-limiting embodiments of the present technology, where the likelihood parameter is below the predetermined likelihood parameter threshold value, the processor 401 can be configured to determine that the given software is not affiliated to the target software; and thus, the processor 401 would not proceed to store the data indicative of the given software for further use.
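The decision rule of step 180 can be sketched in Python as follows. This is only an illustration under stated assumptions: the threshold value 0.8, the function name `decide_affiliation`, and the use of a plain list in place of the dedicated database are all hypothetical.

```python
from typing import List


def decide_affiliation(likelihood: float, threshold: float,
                       database: List[str], sample_id: str) -> bool:
    """Return True if the likelihood parameter is >= the threshold.

    Data indicative of the sample is stored only for affiliated software;
    a list stands in for the dedicated database (assumption).
    """
    affiliated = likelihood >= threshold
    if affiliated:
        database.append(sample_id)
    return affiliated


db: List[str] = []
first = decide_affiliation(0.8, 0.8, db, "sample-a")    # equal to the threshold
second = decide_affiliation(0.35, 0.8, db, "sample-b")  # below the threshold
```

A likelihood parameter exactly equal to the threshold counts as affiliated, matching the "equal to or greater than" wording of step 180.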

[0126] The first method 100 thus terminates.

Training Classifier

[0127] As alluded to above, according to certain non-limiting embodiments of the present technology, the processor 401 can be configured to train the one or more classifiers to determine the affiliation with respective target software, which may further be used in the first method 100. With reference now to FIG. 3, there is depicted a flowchart diagram of a third method 300 for training a classifier to determine the affiliation with a given target software, in accordance with certain non-limiting embodiments of the present technology.

Step 310: Receiving, by the Processor, a Plurality of Target Software Files, Each Target Software File Including a Respective Target Machine Code Associated with the Respective Target Software

[0128] The third method 300 commences at step 310 with the processor 401 being configured to receive a plurality of target software files including target machine codes associated with the given target software. According to certain non-limiting embodiments of the present technology, a total number of target machine codes in the plurality of the received target software files can be predetermined and comprise, for example, without limitation, around 30-70 target machine codes of a predetermined software family to which the given target software belongs. Alternatively, the processor 401 can be configured to receive around 20-30 target machine codes of a predetermined authorship associated with the given target software. Further, the processor 401 can be configured to analyze each one of the plurality of target software files, for example, sequentially.

[0129] The third method 300 hence advances to step 320.

Step 320: Determining, by the Processor, for Each One of the Plurality of Target Software Files, a Respective Target File Format

[0130] At step 320, according to certain non-limiting embodiments of the present technology, the processor 401 can be configured to determine a respective file format of each one of the plurality of target software files. In these embodiments, the processor 401 can be configured to execute step 320 similarly to step 120 of the first method 100 described above.

[0131] Further, as described above in respect of step 120 of the first method 100, in some non-limiting embodiments of the present technology, the processor 401 can be configured to determine if a given one of the plurality of target software files associated with the given target software has been processed by one of the following predetermined processes: encryption, compression, and obfuscation. In response to determining that the given one of the plurality of target software files has been so processed, the processor 401 can be configured to execute the given one of the plurality of target software files in the isolated environment to restore an associated target machine code using one or more memory dumps generated therein in a runtime memory available to the processor 401.

[0132] The third method 300 thus proceeds to step 330.

Step 330: Identifying, by the Processor, Based on the Respective Target File Format Associated with a Given One of the Plurality of Target Software Files, in a Respective Target Machine Code, at Least One Target Function

[0133] At step 330, the processor 401 can be configured to identify, in a given one of a plurality of target machine codes respectively associated with the plurality of target software files of the given target software, a respective plurality of target functions. This step can be executed substantially similarly to step 130 of the first method 100 described above.

[0134] Further, as described above, the processor 401 can be configured to identify and delete from the respective plurality of target functions associated library functions and machine code portions inherently non-indicative of the affiliation of the given one of the plurality of target software files to the given target software. In some non-limiting embodiments of the present technology, the processor 401 can be configured to identify and delete the latter from the given one of the plurality of target software files by executing the second method 200 described above with reference to FIG. 2.

[0135] Thus, the so refined respective plurality of target functions can further be processed.

[0136] The third method 300 thus proceeds to step 340.

Step 340: Parsing, by the Processor, the at Least One Target Function to Identify Therein at Least One Target Function Command

[0137] Akin to executing step 140 of the first method 100 described above, at step 340, the processor 401 can be configured to parse the at least one target function to identify therein at least one target function command. Accordingly, as described above, the at least one target function command may further comprise at least one target pair "action-argument" including a given target action and a target argument associated therewith.

[0138] The third method 300 thus proceeds to step 350.

Step 350: Generating, by the Processor, Based on Each of the at Least One Target Function Command, a Respective Target Function Identifier Associated with the at Least One Target Function, the Respective Target Function Identifier Comprising an Associated Number Sequence

[0139] At step 350, according to certain non-limiting embodiments of the present technology, the processor 401 can be configured to generate for the at least one target pair "action-argument" a respective target function identifier. In some non-limiting embodiments of the present technology, the processor 401 can be configured to generate the respective target function identifier as a respective number sequence, similar to generating the respective function identifier as described above in respect of step 150 of the first method 100.

[0140] Further, the processor 401 can be configured to save the respective target function identifier associated with the at least one target pair "action-argument" for further use.

[0141] The third method 300 further advances to step 360.

Step 360: Aggregating, by the Processor, Number Sequences Associated with Respective Target Functions Over the Plurality of Target Software Files, Thereby Generating a Number Array Associated with the Respective Target Software

[0142] At step 360, according to certain non-limiting embodiments of the present technology, the processor 401 can be configured to aggregate target function identifiers over the plurality of target software files to generate a target number array associated with the given target software. For example, the processor 401 can be configured to aggregate the target function identifiers in an order of occurrence of functions associated therewith as described above with respect to step 160 of the first method 100.

[0143] Thus, the target number array is associated with the plurality of target software files associated with one of the predetermined software family and the predetermined authorship.

[0144] The third method 300 thus proceeds to step 370.

Step 370: Identifying, by the Processor, in the Number Array Associated with the Respective Target Software, at Least One Pattern

[0145] At step 370, according to certain non-limiting embodiments of the present technology, the processor 401 can be configured to identify, in the target number array, at least one pattern associated with the given target software. In some non-limiting embodiments of the present technology, the at least one pattern comprises a predetermined repetitive number sequence within the target number array. Thus, in these embodiments, the predetermined repetitive number sequence can be said to be indicative of a frequency of occurrence of the at least one target pair "action-argument" within the given target software.

[0146] According to certain non-limiting embodiments of the present technology, a length of the predetermined repetitive number sequence, i.e. a number of symbols therein, can be predetermined. Thus, in some non-limiting embodiments of the present technology, the length of the predetermined repetitive number sequence could be based on an interval, for example, from 4 to 10 symbols within the target number array or, alternatively, for example, from 60 to 80 symbols within the target number array. In other non-limiting embodiments of the present technology, the length of the predetermined repetitive number sequence could be predetermined as a constant number, e.g. 40 symbols within the target number array associated with the given target software.
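Searching for repetitive number sequences of a fixed, predetermined length can be sketched in Python as follows. The sketch makes several assumptions not stated in the description: the target number array is treated as a flat sequence of integers, each integer counts as one symbol, overlapping windows are counted, and a sequence is called repetitive when it occurs at least `min_count` times. The function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def count_sequences(array: Sequence[int], length: int) -> Counter:
    """Count every contiguous window of `length` symbols in `array`."""
    return Counter(
        tuple(array[k:k + length]) for k in range(len(array) - length + 1)
    )


def repetitive_sequences(array: Sequence[int], length: int,
                         min_count: int = 2) -> Dict[Tuple[int, ...], int]:
    """Keep only the windows that occur at least `min_count` times."""
    counts = count_sequences(array, length)
    return {seq: c for seq, c in counts.items() if c >= min_count}


# Hypothetical target number array; the sequence 1, 2, 3, 4 occurs three times.
target_array = [1, 2, 3, 4, 9, 1, 2, 3, 4, 7, 1, 2, 3, 4]
patterns = repetitive_sequences(target_array, 4)
```

With a length of 4 symbols, the only window that repeats in this sample is `(1, 2, 3, 4)`.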

[0147] In yet other non-limiting embodiments of the present technology, the length of the predetermined repetitive number sequence could be determined iteratively, based on a current number of such repetitive number sequences within the aggregated array of target function identifiers. In these embodiments, a search begins, for example, at an initial length of 8 symbols. Once a number of identified number sequences of the initial length exceeds a predetermined pattern threshold value (100, as an example), the processor 401 can be configured to increase the length by one, and the search starts over, omitting the shorter number sequences detected before. This cycle is repeated until, at the greatest length reached, the number of identified patterns is less than the predetermined pattern threshold value. Thus, the at least one pattern may further be part of a training set of data for training the classifier.
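The iterative length selection described above can be sketched in Python as follows. This is one possible reading, not a definitive implementation: it assumes a flat integer array, overlapping windows, a minimum of two occurrences for a sequence to count as repetitive, and a stopping condition of the count no longer exceeding the threshold. The name `find_patterns` and its parameters are hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def find_patterns(array: Sequence[int], initial_length: int = 8,
                  pattern_threshold: int = 100,
                  min_count: int = 2) -> Tuple[int, Dict[Tuple[int, ...], int]]:
    """Grow the sequence length until few enough repetitive sequences remain.

    Returns the final length and the repetitive sequences of that length;
    shorter sequences found in earlier passes are discarded.
    """
    length = initial_length
    while True:
        counts = Counter(
            tuple(array[k:k + length]) for k in range(len(array) - length + 1)
        )
        repeated = {s: c for s, c in counts.items() if c >= min_count}
        # Stop once the number of patterns no longer exceeds the threshold.
        # The loop always ends: no windows exist once length > len(array).
        if len(repeated) <= pattern_threshold:
            return length, repeated
        length += 1


# Small hypothetical run: start at 2 symbols, allow at most 1 pattern.
final_length, found = find_patterns([1, 2, 3, 1, 2, 3, 9, 9],
                                    initial_length=2, pattern_threshold=1)
```

In this run, two sequences of length 2 repeat, which exceeds the threshold of 1, so the search restarts at length 3, where only `(1, 2, 3)` repeats.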

[0148] In some non-limiting embodiments of the present technology, the processor 401 can be configured to assign to the at least one pattern a respective weight value.

[0149] In some non-limiting embodiments of the present technology, the respective weight value can be determined based on types of commands and operations associated with the at least one pattern. For example, the respective weight value can be twice the weight values of other patterns if the at least one pattern is indicative of commands associated with at least one math operation; or, in other implementations, if around 80% or more of the commands associated with the at least one pattern include math operations. In another example, the respective weight value can be, e.g., three times the weight values of other patterns if the at least one pattern is indicative of at least two four-byte constants.

[0150] By contrast, the respective weight value can be decreased, e.g. can be 0.3 of the weight values of other patterns, if the at least one pattern includes symbols indicative of neither commands with math operations nor four-byte constants.
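The weighting examples above can be sketched in Python as follows. The description does not say how the rules combine, so the sketch assumes that the four-byte-constant rule takes precedence over the math-operation rule, uses the 80% variant of the math-operation rule, and takes 1.0 as the weight of other patterns. The `PatternStats` fields and the function name are hypothetical.

```python
from dataclasses import dataclass

BASE_WEIGHT = 1.0  # assumed weight of "other patterns"


@dataclass
class PatternStats:
    total_commands: int       # commands covered by the pattern
    math_commands: int        # of those, commands with math operations
    four_byte_constants: int  # four-byte constants in the pattern


def pattern_weight(stats: PatternStats, base: float = BASE_WEIGHT,
                   math_share: float = 0.8) -> float:
    """Return a weight relative to `base` using the example multipliers."""
    if stats.four_byte_constants >= 2:
        return 3.0 * base  # at least two four-byte constants
    if (stats.total_commands
            and stats.math_commands / stats.total_commands >= math_share):
        return 2.0 * base  # around 80% or more math operations
    if stats.math_commands == 0 and stats.four_byte_constants == 0:
        return 0.3 * base  # neither math operations nor constants
    return base            # everything else keeps the base weight


w_math = pattern_weight(PatternStats(10, 9, 0))
w_const = pattern_weight(PatternStats(10, 1, 2))
w_plain = pattern_weight(PatternStats(10, 0, 0))
w_other = pattern_weight(PatternStats(10, 3, 1))
```

A pattern that matches none of the three examples keeps the base weight, which is itself an assumption of this sketch.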

[0151] Further, in some non-limiting embodiments of the present technology, the processor 401 can be configured to determine a frequency of occurrence of the at least one pattern within the target number array. Broadly speaking, the frequency of occurrence of the at least one pattern can be a numeric value indicating how often the at least one pattern occurs in the plurality of target software files associated with the given target software, i.e. how often an associated set of commands occurs within the given target software.

[0152] In some non-limiting embodiments of the present technology, the frequency of occurrence of the at least one pattern can be determined according to the following equation:

$$\lambda = \frac{L}{K}, \qquad (2)$$

where L is a number of occurrences of the at least one pattern within the target number array of target function identifiers associated with the given target software, and K is a number of target software files in the plurality of target software files including associated machine codes of the given target software.

[0153] As can be appreciated, the frequency of occurrence of the at least one pattern can be less than 1 if the at least one pattern does not occur in each and every one of the plurality of target software files; and can be greater than 1 if there are several occurrences of the at least one pattern in each one of the plurality of target software files, as an example.
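Equation (2) can be illustrated with a short Python sketch. It treats L as a count of occurrences of the pattern and K as the number of target software files; the function name and the sample numbers are hypothetical.

```python
def occurrence_frequency(occurrences: int, file_count: int) -> float:
    """Return lambda = L / K from Equation (2)."""
    if file_count <= 0:
        raise ValueError("file_count (K) must be positive")
    return occurrences / file_count


# Hypothetical counts over K = 30 target software files.
rare = occurrence_frequency(15, 30)    # pattern absent from some files
common = occurrence_frequency(90, 30)  # several occurrences per file
```

With these numbers, the first pattern yields a frequency below 1 and the second a frequency above 1, matching the two cases described above.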

[0154] In some non-limiting embodiments of the present technology, the respective weight value to be assigned to the at least one pattern may be based on the frequency of occurrence thereof determined in accordance with Equation (2).

[0155] The third method 300 thus proceeds to step 380.

Step 380: Storing the at Least One Pattern with a Label Indicative of an Association Between the at Least One Pattern and the Respective Target Software for Inclusion Thereof into the Training Set of Data

[0156] Further, at step 380, in some non-limiting embodiments of the present technology, the processor 401 can be configured to assign the at least one pattern with a label indicative of an association between the at least one pattern and the given target software. Thus, the processor 401 can be configured to store the at least one pattern associated with the label and the respective weight value in the training set of data used for training the classifier.

[0157] The third method 300 finally advances to step 390.

Step 390: Training the Classifier, Based on the Training Set of Data, to Determine the Affiliation of a Given Software to the Respective Target Software

[0158] At step 390, the processor 401 can be configured to train the classifier, based on the so generated training set of data, to determine the affiliation to the given target software. It should be expressly understood that the manner in which the classifier is implemented is not limited, and in various non-limiting embodiments of the present technology, the classifier can be implemented, for example, as one of a probabilistic graph model (Random Forest) and an SVM classifier.

[0159] In specific non-limiting embodiments of the present technology, the processor 401 can be configured to train the classifier using one or more machine-learning techniques.

[0160] The third method 300 hence terminates.

Computing Environment

[0161] With reference to FIG. 4, there is depicted an example functional diagram of the computing device 400 configurable to implement certain non-limiting embodiments of the present technology including the first method 100, the second method 200, and the third method 300 described above.

[0162] In some non-limiting embodiments of the present technology, the computing device 400 may include: the processor 401 comprising one or more central processing units (CPUs), at least one non-transitory computer-readable memory 402 (RAM), a storage 403, input/output interfaces 404, input/output means 405, and data communication means 406.

[0163] According to some non-limiting embodiments of the present technology, the processor 401 may be configured to execute specific program instructions and the computations required for the computing device 400 to function properly or to ensure the functioning of one or more of its components. The processor 401 may further be configured to execute specific machine-readable instructions stored in the at least one non-transitory computer-readable memory 402, for example, those causing the computing device 400 to execute one of the first method 100, the second method 200, and the third method 300.

[0164] In some non-limiting embodiments of the present technology, the machine-readable instructions representative of software components of disclosed systems may be implemented using any programming language or scripts, such as C, C++, C#, Java, JavaScript, VBScript, Macromedia Cold Fusion, COBOL, Microsoft Active Server Pages, Assembly, Perl, PHP, AWK, Python, Visual Basic, SQL Stored Procedures, PL/SQL, any UNIX shell scripts or XML. Various algorithms are implemented with any combination of the data structures, objects, processes, procedures and other software elements.

[0165] The at least one non-transitory computer-readable memory 402 may be implemented as RAM and contains the necessary program logic to provide the requisite functionality.

[0166] The storage 403 may be implemented as at least one of an HDD drive, an SSD drive, a RAID array, a network storage, a flash memory, an optical drive (such as CD, DVD, MD, Blu-ray), etc. The storage 403 may be configured for long-term storage of various data, e.g., the aforementioned documents with user data sets, databases with the time intervals measured for each user, user IDs, etc.

[0167] The input/output interfaces 404 may comprise various interfaces, such as at least one of USB, RS232, RJ45, LPT, COM, HDMI, PS/2, Lightning, FireWire, etc.

[0168] The input/output means 405 may include at least one of a keyboard, a joystick, a (touchscreen) display, a projector, a touchpad, a mouse, a trackball, a stylus, speakers, a microphone, and the like. A communication link between each one of the input/output means 405 can be wired (for example, connecting the keyboard via a PS/2 or USB port on the chassis of the desktop PC) or wireless (for example, via a wireless link, e.g., radio link, to the base station which is directly connected to the PC, e.g., to a USB port).

[0169] The data communication means 406 may be selected based on a particular implementation of a network, to which the computing device 400 can have access, and may comprise at least one of: an Ethernet card, a WLAN/Wi-Fi adapter, a Bluetooth adapter, a BLE adapter, an NFC adapter, an IrDA adapter, an RFID adapter, a GSM modem, and the like. As such, the data communication means 406 may be configured for wired and wireless data transmission, via one of a WAN, a PAN, a LAN, an Intranet, the Internet, a WLAN, a WMAN, or a GSM network, as an example.

[0170] These and other components of the computing device 400 may be linked together using a common data bus 410.

[0171] It should be expressly understood that not all technical effects mentioned herein need to be enjoyed in each and every embodiment of the present technology.

[0172] Modifications and improvements to the above-described implementations of the present technology may become apparent to those skilled in the art. The foregoing description is intended to be exemplary rather than limiting. The scope of the present technology is therefore intended to be limited solely by the scope of the appended claims.

* * * * *
