RESOURCES
BERTmosphere
A collection of four pretrained language models for the climate change research domain: CliReBERT, CliSciBERT, SciClimateBERT, and CliReDistilRoBERTa.
CliReBERT (Climate Research BERT) is a domain-specific BERT model pretrained from scratch on a curated corpus of peer-reviewed climate change research papers. It is built to support natural language processing tasks in climate science and environmental studies.
CliSciBERT is a domain-adapted version of SciBERT, further pretrained on a curated corpus of peer-reviewed research papers in the climate change domain. It is designed to enhance performance on climate-focused scientific NLP tasks by adapting the general scientific knowledge of SciBERT to the specialized subdomain of climate research.
SciClimateBERT is a domain-adapted version of ClimateBERT, further pretrained on peer-reviewed scientific papers focused on climate change. While ClimateBERT is tuned for general climate-related text, SciClimateBERT narrows the focus to high-quality academic content, improving performance in scientific NLP applications.
CliReDistilRoBERTa is a DistilRoBERTa model for the climate change domain, trained from scratch.
The models are available on Hugging Face.
The source code for training is available on GitHub.
The resources are published under the CC BY-NC-SA 4.0 license.
Please cite the following paper:
Poleksić, A., Martinčić-Ipšić, S. Pretraining and evaluation of BERT models for climate research. Discover Applied Sciences, 7, 1278, 2025. https://doi.org/10.1007/s42452-025-07740-5.
SYLLABIFICATION FOR CROATIAN BASED ON MAXIMAL ONSET PRINCIPLE
Croatian lemmatized dictionary syllabified according to the maximal onset principle (62 387 lexemes).
Croatian dictionary with all inflected forms, syllabified according to the maximal onset principle (377 143 lexemes).
The list of Croatian words with "JAT" (adjusted for syllabification).
Algorithm source code in Python.
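The maximal onset principle assigns as many intervocalic consonants as possible to the onset of the following syllable, provided that they form a permissible onset. The following is a minimal sketch of that idea, not the published algorithm. It assumes that only a, e, i, o, u act as syllable nuclei (syllabic "r" is ignored), that the digraphs lj, nj, and dž can be handled as two-letter onsets, and that the small onset inventory `ILLUSTRATIVE_ONSETS` is a hypothetical stand-in for a complete list of Croatian onsets. The names `syllabify` and `ILLUSTRATIVE_ONSETS` are illustrative only.

```python
# Simplification: only a, e, i, o, u count as nuclei; syllabic "r" is ignored.
VOWELS = set("aeiou")

# Hypothetical onset inventory: every single consonant, the digraphs lj, nj, dž,
# and a handful of clusters. A realistic inventory would be much larger.
ILLUSTRATIVE_ONSETS = frozenset("bcčćdđfghjklmnprsštvzž") | {
    "lj", "nj", "dž", "st", "str", "sm", "pr", "tr", "kr",
}


def syllabify(word, legal_onsets=ILLUSTRATIVE_ONSETS):
    """Split a lowercase word into syllables using the maximal onset principle."""
    # Positions of the syllable nuclei.
    nuclei = [i for i, ch in enumerate(word) if ch in VOWELS]
    if len(nuclei) < 2:
        return [word]  # zero or one nucleus: nothing to split

    boundaries = []
    for left, right in zip(nuclei, nuclei[1:]):
        cluster = word[left + 1:right]  # consonants between two nuclei
        # Try the longest suffix of the cluster first; if no suffix is a
        # legal onset, the whole cluster stays in the coda (k == len(cluster)).
        for k in range(len(cluster) + 1):
            if k == len(cluster) or cluster[k:] in legal_onsets:
                boundaries.append(left + 1 + k)
                break

    # Cut the word at the collected boundaries.
    syllables, start = [], 0
    for b in boundaries:
        syllables.append(word[start:b])
        start = b
    syllables.append(word[start:])
    return syllables


print("-".join(syllabify("sestra")))  # se-stra
print("-".join(syllabify("majka")))   # maj-ka
```

In "sestra" the whole cluster "str" is a legal onset, so it moves to the second syllable. In "majka" the cluster "jk" is not in the inventory, so only "k" becomes the onset and "j" closes the first syllable.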
The resources are published under the CC BY-NC-SA 4.0 license.
Please cite the following paper:
A. Meštrović, S. Martinčić-Ipšić, M. Matešić.
"Syllabification based on maximal onset principle for Croatian / Postupak automatskoga slogovanja temeljem načela najvećega pristupa i statistika slogova za hrvatski jezik". Govor/Speech, vol. 32, No. 1, pp. 3-35, 2015.
TWITTER EMO-NET DATASET
The Twitter emo-net dataset contains four sets of English-language tweets collected according to the following search criteria: a) tweets associated with immigrant- and war-related events (e.g. terrorist, terrorism, ISIS); b) tweets containing negatively polarized words (e.g. anger, fear, hate); c) tweets associated with pets (e.g. puppy, kitty); and d) tweets containing positively polarized words (e.g. joy, happiness, happy). Please consult the readme file for details.
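The four search criteria can be illustrated with a simple keyword match. This is only a sketch: it uses just the example terms quoted in the description (the complete term lists are in the readme), and the set labels and the function name `matching_sets` are hypothetical.

```python
import re

# Only the example terms quoted in the description; the full lists are in the readme.
# The set labels are hypothetical and used here for illustration only.
SEARCH_TERMS = {
    "a_immigrant_war": {"terrorist", "terrorism", "isis"},
    "b_negative": {"anger", "fear", "hate"},
    "c_pets": {"puppy", "kitty"},
    "d_positive": {"joy", "happiness", "happy"},
}


def matching_sets(tweet):
    """Return the labels of all sets whose search terms occur in the tweet."""
    # Lowercase and keep alphabetic tokens so that "ISIS" and "Happy!" still match.
    tokens = set(re.findall(r"[a-z]+", tweet.lower()))
    return [label for label, terms in SEARCH_TERMS.items() if tokens & terms]


print(matching_sets("My puppy makes me so happy!"))  # ['c_pets', 'd_positive']
```

A tweet can satisfy more than one criterion, which is why the function returns a list rather than a single label.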
The resources are published under the CC BY-NC-SA 4.0 license.
Please cite the following paper:
S. Martinčić-Ipšić, E. Močibob, M. Perc. Link Prediction on Twitter. PLOS ONE, 12(7): e0181079, 2017. https://doi.org/10.1371/journal.pone.0181079
KEYWORD EXTRACTION DATASET
Bilingual-Serbian-English-KE-Dataset
The bilingual Serbian-English dataset for the keyword extraction task consists of 50 parallel Serbian-English abstracts from the scientific journal "Underground Mining Engineering", covering the domain of geology and mining and published by the University of Belgrade, Faculty of Mining and Geology. All documents are supplied with metadata and keywords annotated by human experts, namely the authors of the articles.
This is the first bilingual Serbian-English dataset suitable for keyword extraction and similar NLP tasks. Benchmark results for bilingual keyword extraction on the Serbian-English parallel text, obtained with the unsupervised graph-based Selectivity-Based Keyword Extraction (SBKE) method, are also presented. For details, please consult the readme file and the paper.
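SBKE ranks words by node selectivity in a co-occurrence network, where selectivity is the average weight of a node's links (node strength divided by node degree). The following is a minimal sketch of that measure only, not the authors' implementation: it assumes an undirected network built from adjacent tokens, skips preprocessing, and omits the later SBKE step that expands keyword candidates into longer sequences. The function name `selectivity` is illustrative.

```python
from collections import defaultdict


def selectivity(tokens):
    """Return {word: strength / degree} for an undirected co-occurrence network."""
    # Edge weight = number of times two different words appear next to each other.
    weights = defaultdict(int)
    for a, b in zip(tokens, tokens[1:]):
        if a != b:
            weights[frozenset((a, b))] += 1

    strength = defaultdict(int)  # sum of the weights on a node's links
    degree = defaultdict(int)    # number of distinct neighbours
    for edge, weight in weights.items():
        for node in edge:
            strength[node] += weight
            degree[node] += 1

    return {node: strength[node] / degree[node] for node in strength}


tokens = "mine ventilation mine ventilation mine safety".split()
scores = selectivity(tokens)
for word in sorted(scores, key=scores.get, reverse=True):
    print(word, scores[word])
```

In this toy input "ventilation" has a single neighbour with link weight 4, so its selectivity (4.0) is higher than that of "mine" (strength 5 over 2 neighbours, 2.5), even though "mine" occurs more often.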
The resources are published under the CC BY-NC-SA 4.0 license.
Please cite the following paper:
S. Beliga, O. Kitanović, R. Stanković, S. Martinčić-Ipšić. "Keyword Extraction from Parallel Abstracts of Scientific Publications". Semantic Keyword-Based Search on Structured Data Sources, LNCS 10546, J. Szymański, Y. Velegrakis (eds.). Cham: Springer International Publishing, COST Action IC1302 Third International KEYSTONE Conference, IKC 2017, Gdańsk, Poland, pp. 44-45, 2018.
SEAGOING SHIP SENSOR DATASET
The seagoing ship sensor dataset merges four data sources from a liquefied petroleum gas carrier, similar to the recent series of the South Korean shipbuilder, with a capacity of 54,340 DWT, a length of 225 m, and a width of 37 m. The data were used for training and testing and comprise: (1) measurement data from the ship automation system as the primary source, (2) data taken from the electronic chart display and information system (ECDIS), (3) data from noon reports, and (4) available meteorological and oceanographic data matched to the geographical position and sailing time.
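One common way to combine a high-frequency primary source with lower-frequency sources is an as-of join on time: each primary record receives the most recent secondary record at or before its timestamp. The sketch below shows this under assumptions that are not taken from the dataset itself: the field names (`time`, `shaft_power_kw`, `draft_m`), the use of plain numeric timestamps, and the function name `asof_merge` are all hypothetical, and the actual merging procedure is described in the paper.

```python
from bisect import bisect_right


def asof_merge(primary, secondary, key="time"):
    """Attach to each primary record the latest secondary record at or before its time."""
    secondary = sorted(secondary, key=lambda record: record[key])
    times = [record[key] for record in secondary]

    merged = []
    for row in primary:
        # Index of the first secondary record strictly later than this row.
        i = bisect_right(times, row[key])
        extra = secondary[i - 1] if i > 0 else {}
        # Primary fields win on conflicts, so the primary timestamp is kept.
        merged.append({**extra, **row})
    return merged


# Hypothetical records: automation samples (primary) and noon reports (secondary).
automation = [
    {"time": 10, "shaft_power_kw": 5000},
    {"time": 70, "shaft_power_kw": 5200},
]
noon_reports = [
    {"time": 0, "draft_m": 9.5},
    {"time": 60, "draft_m": 9.7},
]

for record in asof_merge(automation, noon_reports):
    print(record)
```

Primary records that precede every secondary record are kept without the extra fields, which makes the missing values easy to detect afterwards.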
The resources are published under the CC BY-NC-SA 4.0 license.
Please cite the following paper:
A. Vorkapić, R. Radonja, S. Martinčić-Ipšić. Predicting Seagoing Ship Energy Efficiency from the Operational Data. Sensors, vol. 21, 2832, 2021. https://doi.org/10.3390/s21082832