ONTOLOGICAL APPROACH TO BIG DATA ANALYTICS IN CYBERSECURITY

Information security is a dynamic field in which threats, and the methods and means of protection against them, change and improve rapidly, which poses a challenge for organizations and society as a whole. Information systems related to cybersecurity therefore require a constant flow of knowledge from internal and external sources, the volume of which is constantly growing. The introduction of big data sets in the field of cybersecurity opens opportunities for analyzing collections that contain both structured and unstructured data. Applying semantic technologies to the search for and selection of external big data, and to the description of knowledge about the cybersecurity domain, requires new approaches, methods, and algorithms of big data analysis. To select relevant data, we propose a semantic analysis of the metadata that accompanies big data and the construction of ontologies that formalize knowledge about metadata, cybersecurity, and the problem to be solved. We propose to create a thesaurus of the problem based on the domain ontology, which should provide a terminological basis for the integration of ontologies of different levels. The cybersecurity domain has a hierarchical structure, so presenting formalized knowledge about it requires developing a hierarchy of ontologies from top to bottom. To build the thesaurus of the problem, we propose an algorithm that combines information from information security standards, open natural language information resources, dictionaries, and encyclopedias. Semantically marked-up Wiki resources, external thesauri, and ontologies are suggested as sources for supplementing the semantic models of the cybersecurity domain.
Problem Statement. Today, the issue of information security

data, the greater their usefulness. The value also depends on the data processing time, as analytical results have a limited shelf life. The authors of [11] also identify the main problems that exist today in big data technology and need to be addressed. An analysis of scientific publications [12] shows that the question of the relevance of metadata used in big data is more acute than ever, so new strategies and approaches are being developed.
The purpose of the article is to develop a task thesaurus for finding big data appropriate to the problem being solved. Its new terms supplement the terminological set of the domain ontology, and semantic connections are established between them. An algorithm is also proposed that generates such a thesaurus from information security standards, open dictionaries and encyclopedias on information security, and descriptions of the competencies of IS specialists. An IS thesaurus captures the knowledge of the domain that is most important for the task, while the time needed to build and process it is much less than the time needed to compare ontologies with unstructured natural language text. Such a thesaurus will significantly speed up the analysis of descriptions of learning outcomes.
Main research material. The introduction of big data sets in the field of IS opens up opportunities for the analysis of very large sets containing both structured and unstructured data. The availability of big data sets has created difficulties not only in terms of semantics and analytics but also in terms of data management, storage, and distribution. However, an ontological approach based on analytical data provides a practical basis for addressing the semantic challenges these data sets present. The life cycle of big data analytics (see Fig. 1) consists of the following stages.
Analysis of the problem. The IS big data analytics life cycle begins with the rationale, motivation, and purpose of the analysis. This analysis determines the type of big data to be used (batch, transactional, internal, external).
Data identification. The data identification stage defines the data sets required for analytical calculations and their sources. Using a wider range of data sources can increase the likelihood of finding hidden patterns and correlations.
Collecting and filtering data. In this step, data is collected from all data sources that were identified in the previous step. The data is then filtered to remove corrupted data or data that is not relevant for analysis purposes.
Metadata (see Fig. 2) can be added to large internal data or external data to improve knowledge about them, their classification, and queries. Examples of added metadata include the size and structure of the dataset, source information, creation or collection date and time, and language-specific information. It is very important that the metadata is machine-readable and passed on to subsequent stages of analysis.
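A minimal sketch of what a machine-readable metadata record might look like is given here. The class name `DatasetMetadata`, its field names, and the sample values are assumptions made for illustration; the article does not prescribe a schema. The record is serialized to JSON so that it can be passed unchanged to subsequent stages of analysis.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class DatasetMetadata:
    """Hypothetical metadata record attached to one dataset."""
    name: str
    source: str          # where the data came from (internal or external)
    size_bytes: int      # size of the dataset
    structure: str       # e.g. "csv", "json", "free text"
    collected_at: str    # creation or collection time, ISO 8601
    language: str        # language of textual content, ISO 639-1 code


def to_json(meta: DatasetMetadata) -> str:
    """Serialize the record so that later stages can read it."""
    return json.dumps(asdict(meta), sort_keys=True)


def from_json(text: str) -> DatasetMetadata:
    """Restore the record from its JSON form."""
    return DatasetMetadata(**json.loads(text))


# Illustrative values only.
example = DatasetMetadata(
    name="firewall-logs",
    source="internal",
    size_bytes=1_048_576,
    structure="csv",
    collected_at="2021-03-01T12:00:00+00:00",
    language="en",
)
```

Because the record survives a round trip through JSON, any stage of the life cycle can read the same fields without knowing how the dataset was collected.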
Data transformation (data extraction). This step is designed to transform big data into a format that is used by the underlying analytics software.
Data validation and cleaning. Incorrect data can distort and falsify analysis results. This step is designed to create complex validation rules and remove any known invalid data. Big data solutions often get redundant data across different datasets.
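The validation and cleaning step can be sketched as a list of rule functions applied to every record, followed by removal of exact duplicates. The record fields (`src_ip`, `port`) and the two rules are hypothetical examples, not rules taken from the article.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]
Rule = Callable[[Record], bool]


def has_source_ip(record: Record) -> bool:
    """Hypothetical rule: the record must name a source address."""
    return bool(record.get("src_ip"))


def has_valid_port(record: Record) -> bool:
    """Hypothetical rule: the port must be an integer in 0..65535."""
    port = record.get("port")
    return isinstance(port, int) and 0 <= port <= 65535


RULES: List[Rule] = [has_source_ip, has_valid_port]


def clean(records: Iterable[Record], rules: List[Rule] = RULES) -> List[Record]:
    """Drop records that break any rule, then drop exact duplicates."""
    seen = set()
    result = []
    for record in records:
        if not all(rule(record) for rule in rules):
            continue  # known invalid data is removed
        key = tuple(sorted(record.items()))  # assumes hashable field values
        if key in seen:
            continue  # redundant copy coming from another dataset
        seen.add(key)
        result.append(record)
    return result
```

Keeping the rules as separate functions lets more complex validation rules be added without changing the cleaning loop.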

Aggregation and presentation of data. This stage integrates multiple datasets to achieve a unified view. Data can be spread across multiple datasets, requiring datasets to be combined through common fields, such as date or ID. This step can be complicated by differences in data structure (although the data format may be the same, the data structure model may be different) and in semantics (a value labeled differently in two different datasets can mean the same thing, for example, "surname" and "last name").
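The semantic mismatch described above can be illustrated with a small sketch that maps differently labeled fields to one canonical name before joining two datasets on a common field. The alias table `FIELD_ALIASES` and the sample field names are assumptions for illustration; in practice such a mapping could be derived from a domain ontology.

```python
from typing import Dict, List

# Hypothetical alias table: labels that mean the same thing map to one name.
FIELD_ALIASES = {
    "last name": "surname",
    "last_name": "surname",
    "id": "user_id",
}


def normalize(record: Dict[str, object]) -> Dict[str, object]:
    """Rename fields to their canonical labels (case-insensitive)."""
    renamed = {}
    for label, value in record.items():
        key = label.strip().lower()
        renamed[FIELD_ALIASES.get(key, key)] = value
    return renamed


def join_on(left: List[dict], right: List[dict], key: str) -> List[dict]:
    """Combine records from two datasets that share a value of `key`."""
    index: Dict[object, List[dict]] = {}
    for record in map(normalize, right):
        if key in record:
            index.setdefault(record[key], []).append(record)
    joined = []
    for record in map(normalize, left):
        for match in index.get(record.get(key), []):
            joined.append({**record, **match})
    return joined
```

With this mapping, a record labeled "Surname" in one dataset and "Last name" in another is joined on the single canonical field `surname`.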
Data analysis. The data analysis phase performs the actual analysis task, which usually includes one or more types of analytics. This stage can be iterative: the analysis is repeated until a matching pattern or correlation is found.
Data visualization. The ability to analyze huge amounts of data and come up with useful insights doesn't matter if the analysts are the only ones who can interpret the results.
The semantic approach is used at all stages of the big data life cycle. Ontologies are now widely used in distributed intelligent applications to explicitly describe the knowledge system of a domain or an information resource. Domain ontologies and task thesauri are the main semantic elements of metadata analysis. In the general case, an ontology is an agreement on the shared use of concepts that provides a means of representing domain knowledge and an agreement about how that knowledge is understood. IS ontologies are now becoming an active provider of data element relationships that can use machine learning and artificial intelligence algorithms to adapt to changes in the environment [13].
To create an ontology of the entire IS domain, it is necessary to integrate existing ontologies and improve them.
The Unified IS Ontology (UCO) [7] is designed to support the integration of knowledge in IS systems and should unify the most widely used information security standards. The ontology incorporates and integrates disparate data and knowledge schemes from different IS subsystems and from the most commonly used IS standards for information sharing and exchange. The UCO can serve as a knowledge core for the IS domain.
A detailed description of the knowledge about the IS domain requires the development of a hierarchy of ontologies, starting from the top level to the bottom. The top-level ontology includes the basic concepts of the domain, which have previously been defined in ontologies on this topic. Below are mid-level ontologies that focus on the user, events, network operations, and geospatial data related to IS. Lower-level ontologies describe specific IS domains that require an industry-specific solution.
In the field of IS, a large number of ontologies have already been created that reflect various individual aspects of this subject area. For example, researchers have developed application ontologies to identify and classify network attacks: an ontology for distinguishing network security status [14]; ontology of intrusion detection [4]; ontology for automated classification of network attacks [15]; ontology for predicting potential network attacks [16].
Other ontologies can provide an adaptive vocabulary that can improve behavioral analysis and help stop the spread of threats. Terms for such IS ontologies can be obtained from open sources, such as a dictionary of IS terms [17] and the standards of this subject area.
This information, provided in the Web Ontology Language (OWL), can be reused and integrated into a variety of applications. Fig. 3 shows a fragment of such an upper-level IS ontology. It is easier to extract information from information resources (IRs) that contain semantic markup than from unstructured natural language (NL) documents. Examples of such IRs are semantized Wiki resources. Links between Wiki pages whose content is explicitly defined can be used to build the ontology. For example, on the portal of the Great Ukrainian Encyclopedia [19], the pages of the category "Information security systems" can be used for this purpose.
An ontology is a knowledge base that describes facts that are always assumed to be true within a particular community, based on the generally accepted meaning of the thesaurus. A thesaurus is a special case of an ontology that represents concepts in a form suitable for machine processing. It can be considered a model of the logical-semantic structure of domain terminology. In [20], a thesaurus approach is proposed to formalize the terminology of the IS subject area. An IS thesaurus reflects a wide range of the essential properties, features, and relationships inherent in this specific type of security.
The task thesaurus is a special case of the subject area ontology that contains only ontological terms (classes and instances) and does not describe the semantics of the relationships between them, or describes it only to a limited extent; it is intended for the analysis of natural language texts. It can be generated automatically from the ontology of the subject area and a natural language description of the problem [21]. A simple task thesaurus is based on the terms of one ontology of the subject area. A compiled task thesaurus is based on the terms of two or more ontologies of the subject area.
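The difference between a simple and a compiled task thesaurus can be sketched as follows. This is a simplified illustration under the assumption that a term belongs to the thesaurus when it occurs literally in the natural language description of the problem; the function names and the two toy term sets are hypothetical, and the generation method of [21] is not reproduced here.

```python
import re
from typing import Iterable, Set


def normalize(text: str) -> str:
    """Lowercase the text and keep only words, padded with spaces."""
    return " " + " ".join(re.findall(r"[a-z0-9]+", text.lower())) + " "


def simple_thesaurus(ontology_terms: Iterable[str], task_description: str) -> Set[str]:
    """Terms of ONE ontology that occur in the task description."""
    text = normalize(task_description)
    return {term for term in ontology_terms if normalize(term) in text}


def compiled_thesaurus(ontologies: Iterable[Iterable[str]], task_description: str) -> Set[str]:
    """Union of the simple thesauri built from two or more ontologies."""
    result: Set[str] = set()
    for terms in ontologies:
        result |= simple_thesaurus(terms, task_description)
    return result


# Toy term sets standing in for two domain ontologies (illustrative only).
NETWORK_TERMS = {"firewall", "intrusion detection", "port scan"}
CRYPTO_TERMS = {"encryption", "digital signature"}
TASK = "Find data on intrusion detection and encryption of firewall logs."
```

Here `simple_thesaurus(NETWORK_TERMS, TASK)` keeps only terms of the network ontology, while `compiled_thesaurus([NETWORK_TERMS, CRYPTO_TERMS], TASK)` also picks up "encryption" from the second one.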
Formal models of both ontologies and thesauri take terms and the connections between these terms as their basic concepts. A collection of domain terms with the semantic relations between them indicated is a domain thesaurus. A formal model of a thesaurus is based on the formal model of an ontology.
The user has to formalize the task if personalized processing of information is needed. The domain of the task is formally characterized by a domain ontology, and the task itself can be characterized formally, by a task thesaurus, or informally, by its NL description, keywords, or example documents. The task thesaurus can be either built manually by the user or generated automatically by analyzing available NL documents and other IRs. For construction of the task thesaurus, every IR is described by a non-empty set of textual documents connected with this IR: the text of its content, meta descriptions, results of indexing, etc. If an IR contains multimedia content, this content can be transformed into text (by speech recognition, text recognition, and similar methods).
The algorithm of IR thesaurus generation has the following steps:
1. Formation of the initial non-empty set $A$ of textual documents $a_i$ connected with this IR, as the input data for the algorithm.
3. Generation of the IR thesaurus (see Fig. 4). Using the domain ontology $O$, the IR thesaurus $T_{IR}$ is created as a projection of the set of ontological concepts $X$ onto the set $D_{IR}$ (the IR dictionary): $T_{IR} \subseteq X$. This step removes stop-words and terms from other domains that are not of interest to the user. The main problem is the semantic connection of NL fragments (words) from $T_{IR}$ with concepts from the set $X$ of the domain ontology $O$. This problem can be solved by linguistic methods that use lexical knowledge bases for every NL and is beyond the scope of this article. Each word of the thesaurus has to be linked with one of the ontological terms. If there is no such link, the word is considered a stop-word or a markup element (for example, an HTML tag) and should be rejected.
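The generation steps above can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: the link between words and ontological terms, which the article leaves to linguistic methods, is replaced here by a hypothetical lookup table `LEXICON`, and the word splitting is a plain regular expression.

```python
import re
from typing import Dict, Iterable, Optional, Set


def build_dictionary(documents: Iterable[str]) -> Set[str]:
    """Collect every word that occurs in the documents of one IR."""
    docs = list(documents)
    if not docs:
        raise ValueError("the set of documents must be non-empty")
    words: Set[str] = set()
    for text in docs:
        # Sequences of letters only; digits and punctuation are dropped.
        words.update(re.findall(r"[^\W\d_]+", text.lower()))
    return words


def generate_thesaurus(documents: Iterable[str],
                       lexicon: Optional[Dict[str, str]] = None) -> Set[str]:
    """Keep only the words that can be linked to an ontological term.

    `lexicon` maps a word to the ontological term it denotes (assumed to be
    given). Words without a link are treated as stop-words or markup and
    rejected. Without a lexicon nothing is removed.
    """
    dictionary = build_dictionary(documents)
    if lexicon is None:
        return dictionary
    return {word for word in dictionary if word in lexicon}


# Hypothetical word-to-concept links and sample documents.
LEXICON = {"firewall": "Firewall", "firewalls": "Firewall", "malware": "Malware"}
DOCS = ["<p>The firewalls blocked the malware</p>", "Firewall rules"]
```

In this example the tag residue "p" and the words "the", "blocked", and "rules" have no link to an ontological term and are therefore rejected.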
The group of IR thesaurus words connected with one ontological term is called a semantic bunch $R_j$, $j = \overline{1, n}$, and is treated as a single unit. This makes it possible to integrate the processing of the semantics of documents written in various languages and thus to support multilingual analysis of Internet IRs.
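Grouping thesaurus words into semantic bunches can be sketched as follows. The word-to-term table `LEXICON`, with English, Ukrainian, and German entries, is a hypothetical stand-in for the lexical knowledge bases mentioned in the text.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set

# Hypothetical links from words in several languages to ontological terms.
LEXICON = {
    "firewall": "Firewall",
    "firewalls": "Firewall",
    "брандмауер": "Firewall",
    "malware": "Malware",
    "schadsoftware": "Malware",
}


def semantic_bunches(words: Iterable[str],
                     lexicon: Dict[str, str]) -> Dict[str, Set[str]]:
    """Group words by the ontological term they are linked with."""
    bunches: Dict[str, Set[str]] = defaultdict(set)
    for word in words:
        term = lexicon.get(word.lower())
        if term is not None:          # words without a link are skipped
            bunches[term].add(word.lower())
    return dict(bunches)
```

Each key of the result is one ontological term, and the words under it, whatever their language, are processed as a single unit.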
If the user does not define a domain ontology $O$, we consider that the user's domain of interests has no restrictions and therefore do not remove any elements from the IR dictionary $D_{IR}$: $T_{IR} = D_{IR}$.
The theoretical basis of ontology-based thesaurus generation is the estimation of semantic similarity. Semantically similar concepts (SSC) are a subset of the domain concepts that can be joined by some relations or properties. If the domain is modeled by an ontology, then the SSC is a subset of the domain ontology concepts. There are several ways to build an SSC, which can be used separately or together. The user can define the SSC directly (manually, by choosing from the set of ontology concepts) or automatically, by any mechanism that compares the ontology with a description of the user's current interests and uses linguistic or statistical properties of this description. An SSC can join concepts linked with the initial set of concepts by some subset of the ontological relations (directly or through other concepts of the ontology). Each SSC concept has a weight (positive or negative) that determines the degree of semantic similarity of the concept with the initial set of concepts. The works [22], [23] classify methods of measuring semantic similarity and their software implementations. The methods are grouped by the parameters used in the estimations and differ within the groups in how these parameters are calculated.
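One possible way to build a weighted SSC automatically is sketched here. It rests on assumptions that are not taken from the article: the weight of a concept is halved with every relation that separates it from the initial set (`decay=0.5`), the expansion stops after `max_steps` relations, and only positive weights are produced. The typed edges in `EDGES` are a toy ontology fragment.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple

# Toy ontology fragment: (source concept, relation, target concept).
EDGES: List[Tuple[str, str, str]] = [
    ("Worm", "is-a", "Malware"),
    ("Trojan", "is-a", "Malware"),
    ("Malware", "exploits", "Vulnerability"),
    ("Patch", "fixes", "Vulnerability"),
]


def build_ssc(initial: Iterable[str], edges: List[Tuple[str, str, str]],
              relations: Set[str], decay: float = 0.5,
              max_steps: int = 2) -> Dict[str, float]:
    """Expand the initial concepts along the chosen subset of relations."""
    neighbours: Dict[str, Set[str]] = {}
    for source, relation, target in edges:
        if relation in relations:              # only the selected relations
            neighbours.setdefault(source, set()).add(target)
            neighbours.setdefault(target, set()).add(source)
    weights = {concept: 1.0 for concept in initial}
    queue = deque((concept, 0) for concept in weights)
    while queue:
        concept, steps = queue.popleft()
        if steps == max_steps:
            continue
        for other in neighbours.get(concept, ()):
            if other not in weights:           # first visit is the shortest
                weights[other] = decay ** (steps + 1)
                queue.append((other, steps + 1))
    return weights
```

Choosing a different subset of relations changes which concepts are reached: with only "is-a" the expansion from "Worm" stays inside the malware taxonomy, while adding "exploits" also brings in "Vulnerability".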
For example, an ontology can be considered a directed graph in which concepts are interconnected by universal and domain-specific relations, mainly taxonomic ("is-a"). The simplest way to estimate the semantic similarity (SS) between concepts is to calculate the minimum length of the path that connects the corresponding ontological nodes through "is-a" relations. A longer path between concepts means a greater semantic distance between them. If we define a path $path(c_1, c_2) = \{l_1, \ldots, l_k\}$ as a set of links connecting the concepts $c_1$ and $c_2$, and $|path(c_1, c_2)|$ as the length of this path, then the semantic distance between $c_1$ and $c_2$ is
$dis(c_1, c_2) = \min_i |path_i(c_1, c_2)|$ (2)
Despite the simplicity of this estimation, it assumes that all edges of the ontological graph reflect the same semantic distance, which does not always hold for a real domain and causes many problems.
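The path-based estimation can be sketched as a breadth-first search over "is-a" links, treating each link as traversable in both directions. The toy taxonomy `IS_A` (child to parent) is an assumption for illustration.

```python
from collections import deque
from typing import Dict, Optional, Set

# Toy taxonomy: each concept points to its parent via "is-a".
IS_A = {
    "Worm": "Malware",
    "Trojan": "Malware",
    "Malware": "Threat",
    "Phishing": "Threat",
}


def path_length(c1: str, c2: str, is_a: Dict[str, str]) -> Optional[int]:
    """Minimum number of is-a links between two concepts, or None."""
    graph: Dict[str, Set[str]] = {}
    for child, parent in is_a.items():
        graph.setdefault(child, set()).add(parent)
        graph.setdefault(parent, set()).add(child)
    seen = {c1}
    queue = deque([(c1, 0)])
    while queue:
        node, distance = queue.popleft()
        if node == c2:
            return distance
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, distance + 1))
    return None  # the concepts are not connected
```

Every link counts as one unit here, which is exactly the uniform-distance assumption criticized in the text.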
Other estimations are based on the analysis of the path between concepts and their depth in the hierarchy. For example, Wu and Palmer [24], [25] define the SS estimation between the concepts $c_1$ and $c_2$ as follows:
$sim(c_1, c_2) = \frac{2H}{N_1 + N_2 + 2H}$ (3)
where $N_1$ and $N_2$ are the numbers of "is-a" connections from $c_1$ and $c_2$, respectively, to the lowest common generic object $c$, and $H$ is the number of "is-a" connections between $c$ and the taxonomy root.
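The Wu and Palmer estimation can be sketched as follows, assuming its usual form $2H / (N_1 + N_2 + 2H)$ with all three parameters counted in "is-a" links. The toy taxonomy `PARENT` is hypothetical, and returning 1.0 for the root compared with itself (where the ratio is 0/0) is a convention chosen for this sketch.

```python
from typing import Dict, List, Optional

# Toy taxonomy: concept -> parent, the root has no parent.
PARENT: Dict[str, Optional[str]] = {
    "Worm": "Malware",
    "Trojan": "Malware",
    "Malware": "Threat",
    "Phishing": "Threat",
    "Threat": None,
}


def path_to_root(concept: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """The concept followed by all of its ancestors up to the root."""
    path = [concept]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def wu_palmer(c1: str, c2: str, parent: Dict[str, Optional[str]]) -> float:
    """Similarity based on link counts to the lowest common ancestor."""
    path1 = path_to_root(c1, parent)
    path2 = path_to_root(c2, parent)
    on_path2 = set(path2)
    common = next(c for c in path1 if c in on_path2)  # lowest common ancestor
    n1 = path1.index(common)                  # links from c1 to the ancestor
    n2 = path2.index(common)                  # links from c2 to the ancestor
    h = len(path_to_root(common, parent)) - 1  # links from ancestor to root
    denominator = n1 + n2 + 2 * h
    return 1.0 if denominator == 0 else 2 * h / denominator
```

Since $H$ counts links rather than nodes, two concepts whose only common ancestor is the root receive a similarity of 0 in this sketch.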
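Measures of similarity based on information content [26], [27] define the similarity of two concepts as the information content of their lowest common generic object:
$sim(c_1, c_2) = IC(LCS(c_1, c_2))$ (4)
where $LCS(c_1, c_2)$ is the lowest common generic object of $c_1$ and $c_2$, and the information content of a concept $c$ is commonly computed as $IC(c) = -\log p(c)$ from the probability $p(c)$ of encountering the concept in a corpus.
SS estimation parameters from various approaches (for example, from (2)-(4)) can be used for generation of the task thesaurus. We can consider such a thesaurus as the set of concepts whose semantic similarity to some initial set of concepts is greater than some constant.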
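An information-content estimation and its use for selecting thesaurus concepts can be sketched as follows. The sketch assumes that the information content is $IC(c) = \log(\text{total} / \text{count}(c))$, where the count of a concept includes its descendants, and it reads the thesaurus as the set of concepts whose similarity to the initial set exceeds a threshold. The taxonomy `PARENT` and the occurrence counts `COUNTS` are invented for illustration.

```python
import math
from typing import Dict, Iterable, List, Optional, Set

PARENT: Dict[str, Optional[str]] = {
    "Worm": "Malware", "Trojan": "Malware",
    "Malware": "Threat", "Phishing": "Threat", "Threat": None,
}
# Hypothetical number of corpus occurrences of each concept itself.
COUNTS = {"Worm": 4, "Trojan": 4, "Phishing": 8, "Malware": 0, "Threat": 0}


def ancestors(concept: str) -> List[str]:
    """The concept followed by its ancestors up to the root."""
    path = [concept]
    while PARENT[path[-1]] is not None:
        path.append(PARENT[path[-1]])
    return path


def information_content(concept: str) -> float:
    """log(total / occurrences of the concept and its descendants)."""
    total = sum(COUNTS.values())
    subsumed = sum(n for c, n in COUNTS.items() if concept in ancestors(c))
    return math.inf if subsumed == 0 else math.log(total / subsumed)


def ic_similarity(c1: str, c2: str) -> float:
    """Information content of the lowest common ancestor."""
    on_path2 = set(ancestors(c2))
    common = next(c for c in ancestors(c1) if c in on_path2)
    return information_content(common)


def task_thesaurus(initial: Iterable[str], threshold: float) -> Set[str]:
    """Concepts whose similarity to some initial concept exceeds the threshold."""
    seeds = list(initial)
    return {c for c in PARENT
            if any(ic_similarity(c, s) > threshold for s in seeds)}
```

With these counts, `task_thesaurus({"Worm"}, 0.5)` keeps "Worm", "Trojan", and "Malware" and leaves out "Phishing" and "Threat", whose only common ancestor with "Worm" is the root and therefore carries no information.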
Conclusions. The large volume and unstructured nature of information resources, together with the complex hierarchical structure of knowledge in the IS domain, make it necessary to apply ontological analysis to the processing of big data related to information security. Therefore, the application of big data analysis methods to the construction of ontologies in the IS domain is justified and appropriate. The task thesaurus was proposed as a dynamic element of the model; it is based on the domain ontology, which represents the more stable aspects of user interests. The simple structure of the task thesaurus allows it to be processed quickly and efficiently, and the use of domain ontologies for its generation helps to avoid the loss of important information. Semantic similarity estimations provide the theoretical basis for generating the task thesaurus as a set of concepts similar to the user's current task. Similarity is an important and fundamental concept in many fields.
The prospects of automated generation of ontology-based task thesauri depend on the accessibility of pertinent domain ontologies and of well-structured, trusted, and up-to-date IRs that characterize the user's information needs and interests. Therefore, we look for information resources in which such parameters are defined explicitly and can be processed without additional pre-processing. Semantic Wikis, in which the relationships between concepts and their characteristics are defined through semantic properties, meet these conditions.
АНАТОЛІЙ ГЛАДУН, КАТЕРИНА ХАЛА, ІГОР СУБАЧ