langnet

RESOURCES

SYLLABIFICATION FOR CROATIAN BASED ON MAXIMAL ONSET PRINCIPLE


txt Croatian lemmatized dictionary, syllabified according to the maximal onset principle (62 387 lexemes).

txt Croatian dictionary with all inflected forms, syllabified according to the maximal onset principle (377 143 lexemes).

txt The list of Croatian words with "JAT" (adjusted for syllabification).

sw Algorithm source code in Python.


The resources are published under the CC BY-NC-SA 4.0 license.


Please cite the following paper:

pdf A. Meštrović, S. Martinčić-Ipšić, M. Matešić. "Syllabification based on maximal onset principle for Croatian / Postupak automatskoga slogovanja temeljem načela najvećega pristupa i statistika slogova za hrvatski jezik". Govor/Speech, vol. 32, No. 1, pp. 3-35, 2015.


TWITTER EMO-NET DATASET


txt The Twitter emo-net dataset contains four sets of English-language tweets collected according to the following search criteria: a) tweets associated with immigration- and war-related events (e.g. terrorist, terrorism, ISIS, etc.); b) tweets containing negatively polarized words (e.g. anger, fear, hate, etc.); c) tweets associated with pets (e.g. puppy, kitty, etc.); and d) tweets containing positively polarized words (e.g. joy, happiness, happy, etc.). Please consult the readme file for details.


Please cite the following paper:

pdf S. Martinčić-Ipšić, E. Močibob, M. Perc. Link Prediction on Twitter, PLOS ONE, 12(7): e0181079. 2017. https://doi.org/10.1371/journal.pone.0181079


KEYWORD EXTRACTION DATASET


txt Bilingual-Serbian-English-KE-Dataset

The bilingual Serbian-English dataset for the keyword extraction task consists of 50 parallel Serbian-English abstracts from the scientific journal "Underground Mining Engineering", from the domain of geology and mining, published by the University of Belgrade, Faculty of Mining and Geology. All documents are supplied with metadata and keywords annotated by human experts, namely the authors of the articles.

This is the first bilingual Serbian-English dataset suitable for keyword extraction and similar NLP tasks. In addition, benchmark results are presented for bilingual keyword extraction on the Serbian-English parallel text using an unsupervised graph-based method called Selectivity-Based Keyword Extraction (SBKE). For details, please consult the readme file and the paper.


Please cite the following paper:

pdf Beliga, S., Kitanović, O., Stanković, R., & Martinčić-Ipšić, S. (in press). Keyword Extraction from Parallel Abstracts of Scientific Publications. IKC 2017.