Beyond APIs: Collecting Online Activity Data for Research using the National Internet Observatory

A Satellite at NetSci 2026

Location: Remington room, Hyatt Regency Boston / Cambridge (575 Memorial Dr., Cambridge, MA), USA

Satellite time: Monday, June 1, 2026, 9:00 AM - 12:30 PM EDT

Conference Dates: June 1-5, 2026

Abstract

The Internet serves as a vital platform for information access and global connectivity. Widespread online engagement offers unprecedented opportunities to study human behavior at scale, yet researchers face significant ethical and technical barriers when attempting to collect data for academic studies.

In particular, major social media platforms like Facebook and Twitter have progressively restricted access to their official Application Programming Interfaces (APIs), which previously served as primary tools for researchers to create customized datasets from specific platforms for studying online content production and engagement behavior. Data on content exposure and app usage on mobile phones are even rarer, despite a large share of online activity taking place on these devices.

This satellite introduces the National Internet Observatory (NIO), an alternative data collection framework and infrastructure designed to help researchers study online behavior, with a particular focus on content viewing, the predominant form of online activity.

The satellite will present NIO's informed data donation process, participant demographics and behavioral traces, secure computing infrastructure and pathways for data access, and examples of analyses and innovative research with this new source of data for the network science community.

The satellite will incorporate interactive activities and hands-on sessions with real network datasets that demonstrate NIO's capabilities for enabling novel cross-disciplinary and cross-platform research across web and social network environments.

Prerequisites

The satellite is designed for academic researchers from all disciplines represented at the NetSci conference. There are no formal prerequisites for participation. We encourage anyone interested in online activity data collection and research to join.

While a background in digital behavioral research helps participants better understand the challenges and opportunities that NIO presents, it is not required to learn about this new data collection infrastructure. Introductory data science skills in Python/R will help with the hands-on session where participants explore an aggregated network dataset on their own devices.

Duration

The satellite is 3.5 hours long and runs as one continuous half-day session. A formal 30-minute coffee break at 10:30 AM keeps it in sync with the schedule for the other satellites and conference programming that day.

Schedule

Time | Activity | Description
9:00 am - 9:30 am | NIO Overview and Q&A | Overview presentation on the context of Internet research today, NIO infrastructure, data collection basics, research vision, and the advantages of the observatory model of data collection.
9:30 am - 10:00 am | Data Collection (in depth) & Q&A | Details on the different types of data that NIO collects (mobile, desktop, and survey data) and on the NIO sample and participants. Includes mechanisms of data collection as well as examples of what gets captured.
10:00 am - 10:30 am | Parsed Datasets/Data Products & Q&A | Brief explanation of how the raw, scraped data gets processed and structured into organized data collections, or data products, that researchers then use to conduct their research. Followed by detailed overviews of certain data products that are available to the research community, along with example data analyses.
10:30 am - 11:00 am | Coffee Break and Chat |
11:00 am - 11:10 am | Aggregated Network Dataset Overview & Q&A | Overview of an aggregated network dataset (such as a network of domains visited by NIO participants, with edges weighted by similarity in visits or time spent across participants), and some example analyses to motivate further analysis by the satellite participants.
11:10 am - 11:30 am | Hands-on Dataset Exploration Session & Q&A | Participants access and explore the above aggregated dataset, built using NIO, on their own machines.
11:30 am - 12:00 pm | Research Examples & Q&A | Examples of research conducted using NIO data, to provide concrete ideas on how NIO supports research on the online information ecosystem.
12:00 pm - 12:15 pm | Onboarding Process and Researcher Experience & Q&A | An overview of what happens after a researcher submits a proposal requesting access to NIO data, and what the infrastructure for conducting research with this sensitive data source looks like.
12:15 pm - 12:30 pm | General Q&A + Future Datasets Feedback | Open-ended Q&A session with all participants, speakers, and organizers, and space for feedback about what participants would like NIO to collect that is not currently being collected.
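
For readers who want a feel for the kind of aggregated network mentioned in the 11:00 am slot, the following is a minimal Python sketch, not NIO's actual pipeline or data format. It assumes hypothetical toy records of the form (participant, domain, visit count) and weights each domain pair by the cosine similarity of their per-participant visit counts; the record layout, the domain names, and the function name `domain_similarity_edges` are all illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

# Hypothetical toy records (NOT real NIO data): (participant_id, domain, visit_count).
visits = [
    ("p1", "news.example", 3),
    ("p1", "search.example", 5),
    ("p2", "news.example", 1),
    ("p2", "video.example", 4),
    ("p3", "search.example", 2),
    ("p3", "video.example", 2),
]


def domain_similarity_edges(records):
    """Map each domain pair to the cosine similarity of their visit-count vectors."""
    # One sparse vector per domain, indexed by participant.
    vectors = defaultdict(dict)
    for participant, domain, count in records:
        vectors[domain][participant] = vectors[domain].get(participant, 0) + count

    edges = {}
    for a, b in combinations(sorted(vectors), 2):
        shared = set(vectors[a]) & set(vectors[b])
        dot = sum(vectors[a][p] * vectors[b][p] for p in shared)
        if dot == 0:
            continue  # no shared visitors, so no edge between these domains
        norm_a = sqrt(sum(v * v for v in vectors[a].values()))
        norm_b = sqrt(sum(v * v for v in vectors[b].values()))
        edges[(a, b)] = dot / (norm_a * norm_b)
    return edges


edges = domain_similarity_edges(visits)
for (a, b), weight in sorted(edges.items()):
    print(f"{a} -- {b}: {weight:.3f}")
```

The same idea carries over to time spent instead of visit counts by swapping the third field of each record. The resulting weighted edge list can be loaded into any network analysis library for the usual analyses, such as community detection or centrality.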

Learning Outcomes

By the end of the satellite, participants will:

  1. Learn about various methods of online activity data collection along with their pros and cons.
  2. Learn about a new infrastructure and framework for data collection in the post-API age.
  3. Learn about the research and analytical possibilities enabled by NIO and data donations, including the cross-platform potential of working with participants' browsing activity, the kinds of research enabled by linking trace data with survey data, the research potential of trace data from mobile devices, and the interdisciplinary uses of these alternative data collection methods.
  4. Understand the data collected by NIO and how it could inform their own ongoing or future network science research.

Participant Capacity

Due to space constraints, this satellite can accommodate up to 30 participants on a first-come, first-served basis. Signing up to express interest in attending this satellite and learning about NIO also helps us assess capacity constraints and adjust.

Stay tuned!

If you are interested in attending this satellite, please sign up here.

Many datasets from the NIO are already open to applications for access from researchers! These include Google and Bing search data, data on the time users spend on various webpages, and conversations that web users are having with ChatGPT and Gemini. More datasets (such as mobile app usage, social media data, and link transitions on web browsers) will continue to be released to facilitate exciting network science and computational social science research. Stay tuned and learn about applying for access here.

If you have any questions, feel free to reach out to the primary contact organizer (Pranav Goel) listed below.

Organizers and Confirmed Speakers

Dr. Pranav Goel

Primary Contact: p.goel@northeastern.edu | pranav-goel.github.io

Dr. Goel is a Postdoctoral Research Associate at the Network Science Institute, Northeastern University, USA. He obtained his PhD in Computer Science from the University of Maryland, College Park. His research interests broadly span computational social science, particularly using web and text data as a potent digital trace of societal dynamics.

He is currently interested in building a cross-platform understanding of online information consumption, investigating the impact of generative AI on information-seeking behavior, and the sociopolitical phenomenon of framing in news and social media.

His work has been published in major computer science conferences such as NeurIPS, EMNLP, and ICWSM, as well as interdisciplinary journals such as Nature Human Behaviour.

Dr. David Lazer

Contact: d.lazer@northeastern.edu | cssh.northeastern.edu/faculty/david-lazer/ | www.lazerlab.net/people/david-lazer

Dr. Lazer is a University Distinguished Professor of Political Science and Computer Sciences at the Network Science Institute, Northeastern University and an Associate at the Institute for Quantitative Social Science at Harvard University.

In 2019, he was elected a fellow of the National Academy of Public Administration. He has published prominent work on misinformation, democratic deliberation, collective intelligence, computational social science, and algorithmic auditing in a wide range of leading journals such as Science, Nature, Proceedings of the National Academy of Sciences, the American Political Science Review, Organization Science, and the Administrative Science Quarterly.

Dr. Lazer has served in multiple leadership and editorial positions, including as a board member for the International Society for Computational Social Science (ISCSS) and the International Network for Social Network Analysis (INSNA), reviewing editor for Science, associate editor of Social Networks and Network Science, and member of numerous other editorial boards and program committees.

His extensive research expertise will guide the context around the Observatory model as a necessary tool to conduct ethical, replicable academic research in network science that does not rely on the generosity of platforms for data access and use.

Dr. Scott Allen Cambo

Contact: s.cambo@northeastern.edu | www.scottallencambo.com

Dr. Cambo is a Senior Data Scientist and the Director of Data Science at the National Internet Observatory (NIO) at Northeastern University, USA. He earned his PhD in Technology and Social Behavior from Northwestern University. His research explores and validates computational methods for analyzing subjective differences between both manual labeling approaches using crowdsourced labor and automated labeling approaches using machine learning.

In his private sector work, Scott has held a variety of critical data science roles at civic tech and responsible AI startups where he developed products for Human-AI Collaboration and designed an algorithm auditing process for NYC LL 144. More recently, he served as General Manager for the AI Incident Database where he worked with both the Center for Advancing Safety in Machine Intelligence (CASMI) and Underwriters Laboratory Research Institute (ULRI) to improve the way we collect, annotate, and share data regarding AI harm.