Beyond APIs: Collecting Web Data for Research using the National Internet Observatory

Abstract

The Internet serves as a vital platform for information access and global connectivity. Widespread online engagement offers unprecedented opportunities to study human behavior at scale, yet researchers face significant ethical and technical barriers when attempting to collect data for academic studies.

In particular, major social media platforms like Facebook and Twitter have progressively restricted access to their official Application Programming Interfaces (APIs), which previously served as primary tools for researchers to create customized datasets from specific platforms for studying online content production and engagement behavior.

This tutorial introduces the National Internet Observatory (NIO), an alternative data collection framework and infrastructure designed to help researchers study online behavior, with a particular focus on content viewing—the predominant form of online activity. The tutorial presents NIO's informed data donation process, participant demographics and behavioral traces, and secure computing infrastructure.

The tutorial incorporates interactive activities and hands-on sessions that demonstrate NIO's capabilities for enabling novel cross-disciplinary and cross-platform research across web and social media environments.

Prerequisites

There are no formal prerequisites for participation. We encourage anyone interested in online activity data collection and research to join. While a background in digital behavioral research helps participants better understand the challenges and opportunities that NIO presents, it is not required to learn about this new data collection infrastructure.

Duration

The tutorial is 3 hours (180 minutes) long, designed as one continuous half-day session.

Schedule

Time	Activity	Description
9:00 am - 9:30 am	NIO Overview and Q&A	Overview presentation on NIO infrastructure, data collection basics, and research vision. Discussion of existing frameworks for online activity data collection across disciplines and their pros/cons.
9:30 am - 10:00 am	Data collection (in depth) & Q&A	Presentation on Mobile, Desktop, and Survey Data. Includes mechanisms of data collection, organized data collections and data products for research, and sample size estimates.
10:00 am - 10:30 am	Real dataset exploration session	Demonstration and exploration of an aggregated dataset built using NIO.
10:30 am - 11:00 am	Coffee Break and Chat
11:00 am - 11:40 am	Examples of research and data analyses using NIO data	Presentation on existing analyses and research that use NIO data. Highlights concrete research possibilities.
11:40 am - 12:00 pm	Q&A + Future datasets feedback	Open-ended Q&A session with all participants, and space for feedback about what participants would like NIO to add and share publicly in the future.

Learning Outcomes

By the end of the tutorial, participants will:

Learn about various methods of online activity data collection along with their pros and cons.
Learn about a new infrastructure and framework for data collection in the post-API age.
Learn about the research and analytical possibilities enabled by NIO and data donations, including the cross-platform potential of working with participants' browsing activity, the kinds of research enabled by linking trace data with survey data, and the interdisciplinary uses of these alternative data collection methods.
Understand the data collected by NIO and how it could inform their own ongoing or future research.

Organizers

Dr. Pranav Goel

Primary Contact: p.goel@northeastern.edu | pranav-goel.github.io

Dr. Goel is a Postdoctoral Research Associate at the Network Science Institute, Northeastern University, USA. He obtained his PhD in Computer Science from the University of Maryland, College Park. His research interests span computational social science and natural language processing, including text-as-data applications in computational political science, analyzing framing in news and social media, understanding misinformation narratives, and improving the evaluation of topic models and their ability to assist practitioners.

Currently, he is exploring cross-platform news consumption patterns, assessing the digital local news landscape, and studying the interaction between self-reported survey data and behaviors drawn from observed trace data. His research has been published at top conferences and received the 'Outstanding Study Design' award at ICWSM 2023.

Dr. Kai-Cheng Yang

Contact: yang3kc@gmail.com | www.kaichengyang.me

Dr. Yang is a Postdoctoral Research Associate at the Network Science Institute, Northeastern University, USA. He obtained his PhD in Informatics from the Luddy School of Informatics, Computing, and Engineering at Indiana University Bloomington. His research aims to create safe, fair, and trustworthy online information platforms by identifying how malicious actors and flawed systems distort information flow and developing effective countermeasures.

His work spans social bots, misinformation, and algorithmic biases. Currently, he is exploring how generative AI is being misused in these contexts and how to harness this technology to protect against these threats.

Dr. Scott Allen Cambo

Contact: s.cambo@northeastern.edu | www.scottallencambo.com

Dr. Cambo is a Senior Data Scientist and Head of Data Science at the National Internet Observatory at Northeastern University, USA. He earned his PhD in Technology and Social Behavior from Northwestern University. His research explores and validates computational methods for analyzing subjective differences between manual labeling approaches that use crowdsourced labor and automated labeling approaches that use machine learning.

In his private sector work, Scott has held a variety of critical data science roles at civic tech and responsible AI startups, where he developed products for Human-AI Collaboration and designed an algorithm auditing process for New York City Local Law 144 (NYC LL 144). More recently, he served as General Manager for the AI Incident Database, where he worked with both the Center for Advancing Safety in Machine Intelligence (CASMI) and Underwriters Laboratory Research Institute (ULRI) to improve the way we collect, annotate, and share data regarding AI harm.