The Language Networks
The Language Networks book provides insights into the principles of modeling and analyzing the structural properties of language, mainly in its written form, that is, text. The book outlines the basic principles of text preprocessing, covering the initial steps needed for any natural language processing task. It then examines the possibilities of representing text in a complex networks framework. The second part gives an overview of the applications of language networks as a data science discipline. It covers important data science topics in the processing of big textual data, from extracting the most salient structural parts of documents, through differentiating between text genres, to predicting the spreading of information through social media. Finally, the last part of the book deals with the formal modeling of linguistic subsystems in a multilayer complex networks formalism, which allows the systematic study of language across all of its subsystems.
The first part of the book studies the general principles of language network construction and analysis. It covers the types of language network construction. Specifically, it analyzes the effects of constructing directed vs. undirected and weighted vs. unweighted networks from lemmatized (stemmed) or non-lemmatized texts, with stopwords included or excluded. The effects of text randomization are also studied, which gives better insight into the characteristics of language networks compared to their shuffled counterparts. Preliminary experiments show that structural properties can differentiate networks constructed from different text types and in different languages, such as Croatian, English, and Italian. Next, some initial insights into the characterization of syllabic networks are presented. The analysis of motifs in linguistic networks reveals the typical building blocks of the structure of networks built from literature in the Croatian language. Finally, the first part of the book concludes with LaNCoA, a Python toolkit for the construction and analysis of language networks, which implements the majority of the findings presented in this part of the book.
The second part of the book is dedicated to the applications of language networks. Language networks enable the extraction of the most salient words in texts (keywords) and the extraction of domain knowledge (context), studied on the content of Wikipedia entries. The applied part also includes the differentiation between text types and the polarization of tweets. Finally, the possibilities of predicting the future content of tweets solely from the structural properties of complex language networks are presented.
The third part of the book presents the formal model of language networks. The multilayer language network is a comprehensive framework, based on multilayer graphs, that can model various aspects of language, such as subsystems at different levels of the hierarchy, construction principles, language types, and others. The multilayer language model serves as a unified formal model for the representation of language within complex networks theory.
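The construction choices named in the first part (directed vs. undirected, weighted vs. unweighted, stopwords included or excluded, and comparison with a shuffled text) can be illustrated with a minimal sketch. This is not the LaNCoA API; the function names, the toy stopword list, and the choice of linking adjacent words within a window are assumptions made here purely for illustration, using only the Python standard library.

```python
import random
import re
from collections import Counter

# Toy stopword list: an assumption for illustration, not a list from the book.
TOY_STOPWORDS = {"the", "a", "an", "and", "of", "in", "to", "is"}


def tokenize(text):
    """Lowercase the text and keep runs of letters (Unicode-aware)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def build_cooccurrence_network(tokens, directed=True, weighted=True,
                               window=1, stopwords=None):
    """Link each word to the words that follow it within `window` positions.

    Returns a dict mapping (source, target) pairs to edge weights.
    For undirected networks the pair is stored in sorted order.
    """
    if stopwords:
        tokens = [t for t in tokens if t not in stopwords]
    edges = Counter()
    for i, source in enumerate(tokens):
        for target in tokens[i + 1:i + 1 + window]:
            if source == target:
                continue  # skip self-loops
            key = (source, target) if directed else tuple(sorted((source, target)))
            edges[key] += 1
    if not weighted:
        return {edge: 1 for edge in edges}
    return dict(edges)


def shuffled_counterpart(tokens, seed=0):
    """Return the same tokens in random order (word frequencies preserved)."""
    shuffled = list(tokens)
    random.Random(seed).shuffle(shuffled)
    return shuffled


if __name__ == "__main__":
    tokens = tokenize("The cat chased the mouse and the mouse chased the cat")
    original = build_cooccurrence_network(tokens)
    no_stopwords = build_cooccurrence_network(
        tokens, directed=False, stopwords=TOY_STOPWORDS)
    randomized = build_cooccurrence_network(shuffled_counterpart(tokens))
    print("directed, weighted:", original)
    print("undirected, stopwords removed:", no_stopwords)
    print("edges in shuffled text:", len(randomized))
```

Shuffling keeps the word frequencies but destroys the word order, so any structural difference between the original and the shuffled network can be attributed to ordering rather than vocabulary.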
The LangNet project has been going from strength to strength since its inception, and today it is a comprehensive source of new knowledge concerning language networks. The LangNet book by Sanda Martinčić-Ipšić and Ana Meštrović fully delivers on the promise of the project to become a benchmark reference for language networks, to be used for the quantitative study of the structure of a language at its various sublevels, ranging from the phonological through the morphological and syntactic to the semantic. Although language networks are easily defined as vertices and links, where the former are words and the latter are the linguistic interactions between them, this book demonstrates convincingly that this powerful formalism enables us to study and understand the structural complexity of a language, its evolution, and its acquisition, as well as to model mental lexicons, assess the quality of written texts, and attribute authorship, to name only the most prominent examples. Language networks thus emerge as bridges between complexity science and linguistics, and as the reader goes through the book, these bridges are revealed one by one, with a clarity and detail that make the book hard to put down. On the applied level, the book is a treasure trove of data and research on the Croatian language, which has thus acquired a monument of its own at the forefront of quantitative linguistics, something that only a few other world languages can claim. I warmly recommend this book to anyone interested in natural language processing, complex networks, data analysis, corpora development, or data mining, and I compliment the authors on an excellent work.
The Language Networks book by Sanda Martinčić-Ipšić and Ana Meštrović delivers an ambitious study of language from the perspective of complex networks, covering three broad domains: language network construction, language network applications, and a multilayer modelling approach. Sanda Martinčić-Ipšić started the LangNet project four years ago with the goal of investigating the properties of the Croatian language by means of complex networks. Beyond the Croatian language, the primary objective of the project was to discover the properties of language in general. She and her team work enthusiastically in this challenging field at the intersection of computer science, complex systems, and linguistics. The results of the five years of intensive research effort are summarized in this scientific monograph, which captures three critical aspects of language networks: (1) language network construction, the structural properties of language, and the emergence of universal characteristics of languages, (2) possible applications of network-based representations of language in developing practical systems, such as keyword extraction, link prediction, and extraction of domain knowledge, and (3) the definition of a multilayer network-based formal system for language representation. The findings reported in The Language Networks book are in line with the results of other similar studies, mostly reported for well-studied languages like English. However, this study represents a significant contribution to the field of language networks in three respects. Firstly, it examines various construction principles and how they are reflected in language network properties. Secondly, it proposes applications of language networks in the NLP domain, more specifically for keyword identification and domain knowledge extraction tasks and for inferring the new content of microblogs. Thirdly, it offers a novel formal model for an integral representation of language based on the multilayer network model.
The presented work addresses fundamental questions from the perspective of complex networks, such as: What is the general structure of language? Is it possible to distinguish between languages by using the properties of language networks? What are the potential applications of language networks? This scientific monograph offers well-elaborated answers. Language networks, arising from the overlap between the fields of complex systems and linguistics, are a particular, narrow scientific domain tackled by a limited number of researchers. Thus, there is a lack of research and scientific publications in the field. This monograph is one of the few existing comprehensive studies, and it successfully fills this gap. I enjoyed reading this book. The presented research combines the points of view of linguistics and complex systems, and it is a valuable study of language networks, pertinent to both the linguistics community and the complex networks community. I warmly recommend this book to both research communities.
Complex networks pervade modern science, as more and more natural and social phenomena are elegantly modeled through this powerful framework. Linguistics is no exception. Over the past decade, networks have indeed shown potential as a new computational framework for analyzing languages and a whole range of linguistic phenomena. In their new book "The Language Networks", Sanda Martinčić-Ipšić and Ana Meštrović offer readers a state-of-the-art platform for both ongoing research and the new cutting-edge frontiers in the field. Relying on the Croatian language, the authors provide an overview of various ways to construct language networks, including the latest software toolkits. They continue with a comprehensive list of applications of language networks, including keyword extraction and link prediction on Twitter, which indeed shows the full potential of this recent approach. Finally, they conclude the book by presenting the multilayer language model as the most recent development in this vibrant field. This work is theoretically sound, as it clearly identifies its niche in the existing literature. Its added scientific value consists of both a deep theoretical understanding and a systematic methodological toolbox, which open new questions for future work. I find this work to be highly relevant and timely, as it summarizes the ongoing work as well as future directions. In particular, I noticed that the authors found a way to integrate and present both their own results and the recent results of the wider research community.