Beyond APIs: Collecting Web Data for Research using the National Internet Observatory

A Tutorial at WebSci 2025

Location: Academic Building East 2400, New Brunswick, NJ, USA

Tutorial time: Tuesday, May 20, 2025, 9:00 AM - 12:00 PM EDT

Conference Dates: May 20-24, 2025

Abstract

The Internet serves as a vital platform for information access and global connectivity. Widespread online engagement offers unprecedented opportunities to study human behavior at scale, yet researchers face significant ethical and technical barriers when attempting to collect data for academic studies.

In particular, major social media platforms like Facebook and Twitter have progressively restricted access to their official Application Programming Interfaces (APIs), which previously served as primary tools for researchers to create customized datasets from specific platforms for studying online content production and engagement behavior.

This tutorial introduces the National Internet Observatory (NIO), an alternative data collection framework and infrastructure designed to help researchers study online behavior, with a particular focus on content viewing—the predominant form of online activity. The tutorial presents NIO's informed data donation process, participant demographics and behavioral traces, and secure computing infrastructure.

The tutorial incorporates interactive activities and hands-on sessions that demonstrate NIO's capabilities for enabling novel cross-disciplinary and cross-platform research across web and social media environments.

Prerequisites

There are no formal prerequisites for participation. We encourage anyone interested in online activity data collection and research to join. While a background in digital behavioral research helps participants better understand the challenges and opportunities that NIO presents, it is not required to learn about this new data collection infrastructure.

Duration

The tutorial is 3 hours (180 minutes) long, designed as one continuous half-day session.

Schedule

Duration | Activity | Description
25m | NIO Overview | Overview presentation on NIO infrastructure, data collection basics, and research vision. Discussion of existing frameworks for online activity data collection across disciplines and their pros/cons.
20m | Interactive data analysis: requests gathering | Gathering participant requests for data analyses on platform-specific content, trace data vs. survey responses, cross-platform behavior, etc.
10m | Break
20m | Data collection (in depth) | Presentation on mobile, desktop, and survey data. Includes mechanisms of data collection, organized data collections and data products for research, and sample size estimates.
20m | Research findings | Presentation on existing analyses and research that use NIO data. Highlights concrete research possibilities.
30m | Hands-on dataset exploration session | Participants access and explore an aggregated dataset built using NIO on their own machines.
10m | Break
30m | Interactive data analysis: results presentation | Presentation of the data analyses requested by participants in the earlier requests-gathering session.
15m | Q&A + Future datasets feedback | Open-ended Q&A session with all participants, and space for feedback about what participants would like NIO to collect.

Learning Outcomes

By the end of the tutorial, participants will:

  1. Learn about various methods of online activity data collection along with their pros and cons.
  2. Learn about a new infrastructure and framework for data collection in the post-API age.
  3. Learn about the research and analytical possibilities enabled by NIO and data donations, including the cross-platform potential of working with participants' browsing activity, the kinds of research enabled by linking trace data with survey data, and the interdisciplinary uses of these alternative data collection methods.
  4. Understand the data collected by NIO and how it could inform their own ongoing or future research.

Organizers

Dr. Pranav Goel

Primary Contact: p.goel@northeastern.edu | pranav-goel.github.io

Dr. Goel is a Postdoctoral Research Associate at the Network Science Institute, Northeastern University, USA. He obtained his PhD in Computer Science from the University of Maryland, College Park. His research interests span computational social science and natural language processing, including text-as-data applications in computational political science, analysis of framing in news and social media, understanding misinformation narratives, and improving the evaluation of topic models and their ability to assist practitioners.

Currently, he is exploring cross-platform news consumption patterns, assessing the digital local news landscape, and studying the interaction between self-reported survey data and behaviors drawn from observed trace data. His research has been published at top conferences and received the 'Outstanding Study Design' award at ICWSM 2023.

Dr. Kai-Cheng Yang

Contact: yang3kc@gmail.com | www.kaichengyang.me

Dr. Yang is a Postdoctoral Research Associate at the Network Science Institute, Northeastern University, USA. He obtained his PhD in Informatics from the Luddy School of Informatics, Computing, and Engineering at Indiana University Bloomington. His research aims to create safe, fair, and trustworthy online information platforms by identifying how malicious actors and flawed systems distort information flow and developing effective countermeasures.

His work spans social bots, misinformation, and algorithmic biases. Currently, he is exploring how generative AI is being misused in these contexts and how to harness this technology to protect against these threats.

Dr. Scott Allen Cambo

Contact: s.cambo@northeastern.edu | www.scottallencambo.com

Dr. Cambo is a Senior Data Scientist and Head of Data Science at the National Internet Observatory at Northeastern University, USA. He earned his PhD in Technology and Social Behavior from Northwestern University. His research explores and validates computational methods for analyzing subjective differences between manual labeling approaches that use crowdsourced labor and automated labeling approaches that use machine learning.

In his private sector work, Scott has held a variety of critical data science roles at civic tech and responsible AI startups, where he developed products for Human-AI Collaboration and designed an algorithm auditing process for NYC LL 144. More recently, he served as General Manager for the AI Incident Database, where he worked with both the Center for Advancing Safety in Machine Intelligence (CASMI) and Underwriters Laboratory Research Institute (ULRI) to improve the way we collect, annotate, and share data regarding AI harm.