DOI: https://doi.org/10.20535/2411-1031.2018.6.2.153486

Creation of language networks based on texts with using visibility graphs algorithms

Dmytro Lande, Oleh Dmytrenko

Abstract


A method for constructing language networks is proposed. Key words and concepts are extracted from a set of documents that describe a subject domain. Each word is assigned a numeric value using the TF-IDF metric, which is intended to reflect how important a word is to a document in a collection or corpus. The result is a time series. The visibility graph algorithm, a tool from time series analysis, is then used to construct the graph of the subject domain. Two topical subject domains, “Space” and “Computer graphic”, are considered as examples, and the proposed method is applied to sets of documents related to them. A network of connections between the terms and concepts that occur in the text documents is built. Building networks of words whose nodes are elements of the text makes it possible to reveal the key components of the text. At the same time, determining the structural elements of a text that are also informationally important remains a relevant task. The research found that words such as “uranium”, “nuclear”, “waste”, “Jupiter”, “Mercury”, “Moon”, “Earth”, “comet”, “space” and others are key for the subject area “Space”. The article shows that, when the set of documents describes a single subject domain, applying the TF metric alone is more expedient than applying the TF-IDF metric. The results of the visibility graph algorithm and the compactified horizontal visibility graph algorithm are also compared. It was found that in some cases the compactified horizontal visibility graph algorithm yields a network of words with more connections between concepts than the visibility graph algorithm. Gephi, an open-source visualization and exploration software for all kinds of graphs and networks, and an original package of specially developed Python modules are used as additional tools for simulation and visualization.
The proposed method can be used to visualize a subject domain and in information decision support systems, where it helps to reveal the key components of a subject domain. The results of this article can also be used to build user interfaces for information retrieval systems that make searching for relevant information easier.
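
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' package: the function names are hypothetical, the series is assumed to be the TF weight of the word at each text position, and the compactification step is assumed to merge all positions that carry the same word into a single node.

```python
from collections import Counter
from itertools import combinations


def tf_series(tokens):
    """Assumed weighting: the value at each position is the TF of its word."""
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[t] / total for t in tokens]


def visibility_edges(series):
    """Natural visibility graph: positions i < j are linked when every point
    between them lies strictly below the line joining (i, y_i) and (j, y_j)."""
    edges = set()
    for i, j in combinations(range(len(series)), 2):
        yi, yj = series[i], series[j]
        if all(series[k] < yj + (yi - yj) * (j - k) / (j - i)
               for k in range(i + 1, j)):
            edges.add((i, j))
    return edges


def horizontal_visibility_edges(series):
    """Horizontal visibility graph: positions i < j are linked when every
    point between them is strictly lower than both endpoints."""
    edges = set()
    for i, j in combinations(range(len(series)), 2):
        lower = min(series[i], series[j])
        if all(series[k] < lower for k in range(i + 1, j)):
            edges.add((i, j))
    return edges


def compactify(edges, tokens):
    """Assumed compactification: merge positions with the same word into one
    node, dropping self-loops and duplicate links."""
    merged = set()
    for i, j in edges:
        a, b = tokens[i], tokens[j]
        if a != b:
            merged.add(tuple(sorted((a, b))))
    return merged
```

Calling `compactify(horizontal_visibility_edges(tf_series(tokens)), tokens)` on a tokenized text gives a set of word pairs that can be exported as an edge list for inspection in a tool such as Gephi. Both graph builders check every pair of positions, so the sketch is only suitable for short texts.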


Keywords


Set of documents; domain; time series; network of words; statistical weight of word; visibility graph; compactified horizontal visibility graph.


ISSN 2411-1031 (Print), ISSN 2518-1033 (Online)