RESOURCES
SYLLABIFICATION FOR CROATIAN BASED ON MAXIMAL ONSET PRINCIPLE
Croatian lemmatized dictionary syllabified according to the maximal onset principle (62 387 lexemes).
Croatian dictionary with all inflected forms, syllabified according to the maximal onset principle (377 143 lexemes).
A list of Croatian words with "JAT" (adjusted for syllabification).
Algorithm source code in Python.
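As a rough illustration of the maximal onset principle named above, the sketch below assigns each consonant cluster between two vowels to the following syllable for as long as the result is a permitted onset. It is not the published source code. The function name syllabify, the VOWELS set and the tiny LEGAL_ONSETS inventory are hypothetical and chosen only for the examples; the sketch also ignores syllabic "r", digraphs such as "lj" and "nj", and the "JAT" adjustments mentioned in the list.

```python
# Toy illustration of the maximal onset principle; NOT the published algorithm.
VOWELS = set("aeiou")

# Hypothetical, deliberately tiny onset inventory (an assumption for this sketch).
# The empty string lets a syllable start directly with a vowel.
LEGAL_ONSETS = {"", "b", "d", "g", "k", "l", "m", "n", "p", "r", "s", "t", "v",
                "st", "pr", "tr", "kr", "str"}


def syllabify(word, legal_onsets=LEGAL_ONSETS):
    """Split a lowercase word so that each vowel gets the longest legal onset."""
    nuclei = [i for i, ch in enumerate(word) if ch in VOWELS]
    if not nuclei:
        return [word]

    boundaries = []
    for prev, nxt in zip(nuclei, nuclei[1:]):
        cluster = word[prev + 1:nxt]  # consonants between two vowels
        # Try the longest candidate onset first, then shorten it from the left.
        for split in range(len(cluster) + 1):
            if cluster[split:] in legal_onsets:
                boundaries.append(prev + 1 + split)
                break

    starts = [0] + boundaries
    ends = boundaries + [len(word)]
    return [word[s:e] for s, e in zip(starts, ends)]


print(syllabify("sestra"))  # ['se', 'stra']
print(syllabify("marka"))   # ['mar', 'ka']
```

In "sestra" the whole cluster "str" is in the toy inventory, so it moves to the second syllable; in "marka" the cluster "rk" is not, so only "k" does. A realistic onset inventory would have to be derived from Croatian data rather than listed by hand.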
The resources are published under the CC BY-NC-SA 4.0 license.
Please cite the following paper:
A. Meštrović, S. Martinčić-Ipšić, M. Matešić. "Syllabification based on maximal onset principle for Croatian / Postupak automatskoga slogovanja temeljem načela najvećega pristupa i statistika slogova za hrvatski jezik". Govor/Speech, vol. 32, no. 1, pp. 3-35, 2015.
TWITTER EMO-NET DATASET
The Twitter emo-net dataset contains four sets of English-language tweets collected according to the following search criteria: a) tweets associated with immigrant- and war-related events (e.g. terrorist, terrorism, ISIS); b) tweets containing negatively polarized words (e.g. anger, fear, hate); c) tweets associated with pets (e.g. puppy, kitty); and d) tweets containing positively polarized words (e.g. joy, happiness, happy). Please consult the readme file for details.
The resources are published under the CC BY-NC-SA 4.0 license.
Please cite the following paper:
S. Martinčić-Ipšić, E. Močibob, M. Perc. "Link Prediction on Twitter". PLOS ONE, 12(7): e0181079, 2017. https://doi.org/10.1371/journal.pone.0181079
KEYWORD EXTRACTION DATASET
Bilingual-Serbian-English-KE-Dataset
The bilingual Serbian-English dataset for the keyword extraction task consists of 50 parallel Serbian-English abstracts from the scientific journal "Underground Mining Engineering", from the domain of geology and mining, published by the University of Belgrade, Faculty of Mining and Geology. All documents are supplied with metadata and keywords annotated by human experts, namely the authors of the articles.
This is the first bilingual Serbian-English dataset suitable for keyword extraction and similar NLP tasks. Benchmark results for bilingual keyword extraction on the Serbian-English parallel text, obtained with the unsupervised graph-based method called Selectivity-Based Keyword Extraction (SBKE), are also presented. For details, please consult the readme file and the paper.
The resources are published under the CC BY-NC-SA 4.0 license.
Please cite the following paper:
S. Beliga, O. Kitanović, R. Stanković, S. Martinčić-Ipšić. "Keyword Extraction from Parallel Abstracts of Scientific Publications". Semantic Keyword-Based Search on Structured Data Sources, LNCS 10546, J. Szymański, Y. Velegrakis (eds.). Cham: Springer International Publishing, COST Action IC1302 Third International KEYSTONE Conference, IKC 2017, Gdańsk, Poland, pp. 44-45, 2018.
SEAGOING SHIP SENSOR DATASET
The seagoing ship sensor dataset contains four merged data sources from a liquefied petroleum carrier, similar to a recent series from a South Korean shipbuilder, with a capacity of 54,340 DWT, a length of 225 m, and a width of 37 m. The following sources were used for training and testing: (1) measurement data from the ship automation system as the primary source, (2) data taken from the electronic chart display and information system (ECDIS), (3) data from noon reports, and (4) available meteorological and oceanographic data matched to the geographical position and sailing time.
The resources are published under the CC BY-NC-SA 4.0 license.
Please cite the following paper:
A. Vorkapić, R. Radonja, S. Martinčić-Ipšić. "Predicting Seagoing Ship Energy Efficiency from the Operational Data". Sensors, vol. 21, 2832, 2021. https://doi.org/10.3390/s21082832