Federated Web Search

We have released several datasets to support research on federated web search. The datasets contain samples from real search engines.

Movember motivations

Motivations provided in Movember profiles, annotated according to the Social Identity Model of Collective Action (van Zomeren et al., 2008).
Download: [zip file]

D. Nguyen, T. van den Broek, C. Hauff, D. Hiemstra and M. Ehrenhard: #SupportTheCause: Identifying Motivations to Participate in Online Health Campaigns at EMNLP 2015. [pdf]

NL-TR word level language identification

Posts from a Turkish/Dutch online forum with word-level language annotations.
[Download it here]

D. Nguyen, A.S. Doğruöz: Word Level Language Identification in Online Multilingual Communication at EMNLP 2013. [pdf]

Age prediction

Annotations used in the ICWSM 2013 paper on age prediction. Download: [zip file]

D. Nguyen, R. Gravel, D. Trieschnigg and T. Meder: "How Old Do You Think I Am?": A Study of Language and Age in Twitter at ICWSM 2013. [pdf]

Kernel independence testing

Synthetic datasets: [zip file]
Code: [Github]

D. Nguyen and J. Eisenstein. A Kernel Independence Test for Geographical Language Variation. Computational Linguistics, Volume 43, Issue 3, 2017. arXiv: [pdf]

Evaluating local explanations for text classification

Data: [zip file] (75.1 MB)
Code: [Github]

D. Nguyen. Comparing automatic and human evaluation of local explanations for text classification. NAACL 2018.

Urban Dictionary

Annotations: [Github]

D. Nguyen, B. McGillivray, T. Yasseri. Emo, love, and god: Making sense of Urban Dictionary, a crowd-sourced online dictionary. To appear in Royal Society Open Science. [Arxiv preprint]