This page lists some useful resources for students and researchers interested in text-as-data.
Corpora
Norwegian Colossal Corpus
- Collection of multiple Norwegian corpora suitable for training large language models or conducting independent research. The corpus contains government reports, parliamentary proceedings (Stortingsforhandlingene), evaluation reports, laws and Official Norwegian Reports (NOUs), online newspapers, Wikipedia, and out-of-copyright books from the Norwegian National Library.
Norwegian Parliamentary Debates Dataset 1945–2024
with Jon Fiva and Henning Øien, Accepted, Nature Scientific Data (2024)
- Dataset of all Norwegian parliamentary speeches from 1945 to 2024. We also include speaker and speech metadata (e.g., committee membership, district, minister, elected, deputy…). It can be merged with Fiva and Smith (2022) for comprehensive background data on national-level politicians.
Norwegian NLP resources
- List of useful Norwegian NLP resources, covering both data (corpora) and methods that support NLP in Norwegian.
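The note above about merging the parliamentary speeches with politician background data can be sketched roughly as follows. This is a minimal illustration in plain Python, not the datasets' documented workflow: the records are toy data, and the field names (`speaker_id`, `party`, `birth_year`) and the `left_join` helper are hypothetical, so check the actual codebooks for the real identifier column.

```python
# Hypothetical toy records; the field names below are assumptions,
# not the actual column names of either dataset.
speeches = [
    {"speaker_id": "A1", "date": "1950-03-01", "text": "First speech"},
    {"speaker_id": "B2", "date": "1950-03-01", "text": "Second speech"},
    {"speaker_id": "A1", "date": "1951-05-12", "text": "Third speech"},
    {"speaker_id": "C3", "date": "1952-01-20", "text": "Fourth speech"},
]
politicians = [
    {"speaker_id": "A1", "party": "Party X", "birth_year": 1901},
    {"speaker_id": "B2", "party": "Party Y", "birth_year": 1910},
]


def left_join(left, right, key):
    """Attach the matching right-hand record to every left-hand record."""
    lookup = {}
    for row in right:
        # The background table should have one row per politician.
        if row[key] in lookup:
            raise ValueError(f"duplicate key in right table: {row[key]!r}")
        lookup[row[key]] = row

    merged = []
    for row in left:
        match = lookup.get(row[key])
        out = dict(row)
        out["matched"] = match is not None  # flag speeches without background data
        if match is not None:
            out.update({k: v for k, v in match.items() if k != key})
        merged.append(out)
    return merged


merged = left_join(speeches, politicians, "speaker_id")
unmatched = [r for r in merged if not r["matched"]]
print(len(merged), "speeches,", len(unmatched), "without background data")
```

A left join keeps every speech, so speakers missing from the background data show up as unmatched rows rather than silently disappearing. The same join can be done with a data-frame library if you prefer.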
Methods
Intro to Quanteda
- Well-documented and intuitive introduction to quantitative text analysis in R using the quanteda package
Text Algorithms in Economics (Ash and Hansen, 2023)
- Excellent overview of text algorithms in economics
Text as Data (Gentzkow, Kelly, and Taddy, 2019)
- One of the classic overviews of text-as-data in economic research
Multilanguage Word Embeddings for Social Scientists (Wirsching et al., 2024)
- Multi-language “a la carte” word embeddings (Norwegian is also covered)
Other useful resources
Oslo-Bergen Tagger
- Morphological and syntactic tagger for Norwegian. Useful for identifying grammatical morphemes and for part-of-speech tagging
Friends Don’t Let Friends Make Bad Graphs
- Some data visualization pitfalls to avoid