A Tutorial at WebSci 2025
Location: Academic Building East 2400, New Brunswick, NJ, USA
Tutorial time: Tuesday, May 20, 2025, 9:00 AM - 12:00 PM EDT
Conference Dates: May 20-24, 2025
The Internet serves as a vital platform for information access and global connectivity. Widespread online engagement offers unprecedented opportunities to study human behavior at scale, yet researchers face significant ethical and technical barriers when attempting to collect data for academic studies.
In particular, major social media platforms like Facebook and Twitter have progressively restricted access to their official Application Programming Interfaces (APIs), which previously served as primary tools for researchers to create customized datasets from specific platforms for studying online content production and engagement behavior.
This tutorial introduces the National Internet Observatory (NIO), an alternative data collection framework and infrastructure designed to help researchers study online behavior, with a particular focus on content viewing—the predominant form of online activity. The tutorial presents NIO's informed data donation process, participant demographics and behavioral traces, and secure computing infrastructure.
The tutorial incorporates interactive activities and hands-on sessions that demonstrate NIO's capabilities for enabling novel cross-disciplinary and cross-platform research across web and social media environments.
There are no formal prerequisites for participation. We encourage anyone interested in online activity data collection and research to join. While a background in digital behavioral research helps participants better understand the challenges and opportunities that NIO presents, it is not required to learn about this new data collection infrastructure.
The tutorial is 3 hours (180 minutes) long, designed as one continuous half-day session.
Duration | Activity | Description |
---|---|---|
25m | NIO Overview | Overview presentation on NIO infrastructure, data collection basics, and research vision. Discussion of existing frameworks for online activity data collection across disciplines and their pros/cons. |
20m | Interactive data analysis: requests gathering | Gathering participant requests for data analyses on platform-specific content, trace data vs survey responses, cross-platform behavior, etc. |
10m | Break | |
20m | Data collection (in depth) | Presentation on Mobile, Desktop, and Survey Data. Includes mechanisms of data collection, organized data collections and data products for research, and sample size estimates. |
20m | Research findings | Presentation on some existing analyses and research that uses NIO data. Highlights concrete research possibilities. |
30m | Hands-on dataset exploration session | Participants access and explore an aggregated dataset built using NIO on their own machines. |
10m | Break | |
30m | Interactive data analysis: results presentation | Demonstration of the data analyses that participants requested during the earlier requests-gathering session. |
15m | Q&A + Future datasets feedback | Open-ended Q&A session with all participants, and space for feedback about what participants would like NIO to collect. |
By the end of the tutorial, participants will:
Primary Contact: p.goel@northeastern.edu | pranav-goel.github.io
Dr. Goel is a Postdoctoral Research Associate at the Network Science Institute, Northeastern University, USA. He obtained his PhD in Computer Science from the University of Maryland, College Park. His research interests span computational social science and natural language processing, including text-as-data applications in computational political science, analyzing framing in news and social media, understanding misinformation narratives, and improving the evaluation of topic models and their ability to assist practitioners.
Currently, he is exploring cross-platform news consumption patterns, assessing the digital local news landscape, and studying the interaction between self-reported survey data and behaviors drawn from observed trace data. His research has been published at top conferences and won the 'Outstanding Study Design' award at ICWSM 2023.
Contact: yang3kc@gmail.com | www.kaichengyang.me
Dr. Yang is a Postdoctoral Research Associate at the Network Science Institute, Northeastern University, USA. He obtained his PhD in Informatics from the Luddy School of Informatics, Computing, and Engineering at Indiana University Bloomington. His research aims to create safe, fair, and trustworthy online information platforms by identifying how malicious actors and flawed systems distort information flow and developing effective countermeasures.
His work spans social bots, misinformation, and algorithmic biases. Currently, he is exploring how generative AI is being misused in these contexts and how to harness this technology to protect against these threats.
Contact: s.cambo@northeastern.edu | www.scottallencambo.com
Dr. Cambo is a Senior Data Scientist and Head of Data Science at the National Internet Observatory at Northeastern University, USA. He earned his PhD in Technology and Social Behavior from Northwestern University. His research explores and validates computational methods for analyzing subjective differences between manual labeling approaches that use crowdsourced labor and automated labeling approaches that use machine learning.
In his private sector work, Scott has held a variety of critical data science roles at civic tech and responsible AI startups, where he developed products for Human-AI Collaboration and designed an algorithm auditing process for NYC LL 144. More recently, he served as General Manager for the AI Incident Database, where he worked with both the Center for Advancing Safety in Machine Intelligence (CASMI) and Underwriters Laboratory Research Institute (ULRI) to improve the way we collect, annotate, and share data regarding AI harm.