Beyond APIs: Collecting Web Data for Research using the National Internet Observatory

Abstract

Widespread Internet use offers unprecedented opportunities to study human behavior at scale, yet researchers face significant ethical and technical barriers when attempting to collect data for academic studies.

In particular, major social media platforms like Facebook and Twitter/X have progressively restricted access to their official Application Programming Interfaces (APIs), which previously served as primary tools for researchers to create customized datasets from specific platforms for studying online content production and engagement behavior.

Data on content exposure and app usage on mobile phones are even rarer, even though a large share of online activity takes place on these devices.

This workshop introduces the National Internet Observatory (NIO), an alternative data collection framework and infrastructure designed to help researchers study online behavior, with a particular focus on content viewing, the predominant form of online activity.

This workshop will present NIO's informed data donation process, participant demographics and behavioral traces across desktop and mobile devices, pathways for data access, and examples of analyses and innovative CSS research enabled by this new source of data.

Interactive activities and hands-on sessions exploring real aggregated datasets will demonstrate NIO's capabilities for enabling novel cross-disciplinary and cross-platform research across web, mobile, and social network environments.

Prerequisites

The workshop is designed for academic researchers from all disciplines who are interested in collecting online activity data and in research based on trace and survey data. There are no formal prerequisites for participation, and the contents will be relevant and accessible to all conference participants.

A background in digital behavioral research can help participants better understand the challenges and opportunities that NIO presents, but it is not required. Introductory data science skills in Python/R will help with the hands-on session.

Duration

The workshop is 3 hours long, designed as one continuous half-day session. The break times noted below will be adjusted to stay in sync with the rest of the conference programming that day.

Schedule

Duration	Activity	Description
30m	NIO Overview and Q&A	Overview of the context of Internet research today, NIO infrastructure, data collection basics, research vision, and the pros and cons of the observatory model of data collection.
30m	Data collection (in depth) & Q&A	Details on the different types of data that NIO collects (mobile, desktop, and survey data) and on the NIO sample/participants. Includes mechanisms of data collection as well as examples of what gets captured.
20m	Parsed Datasets/Data Products & Q&A	Brief explanation of how the raw, scraped data is processed and structured into organized data collections, or data products, that researchers then work with to conduct their research. Followed by detailed overviews of selected data products available to the research community, example data analyses, and (potentially) an active exploration session in which participants use dashboards to visually explore trends and patterns in the collected data.
10m	Break
10m	Aggregated Dataset Overview & Q&A	Overview of an aggregated dataset (such as a network of domains visited by NIO participants with edges weighted by similarity in visits or time spent across participants) and some example analyses to motivate further analysis by the workshop participants.
20m	Hands-on Dataset Exploration Session & Q&A	Participants access and explore the above aggregated dataset built using NIO on their own machines.
30m	Research Examples & Q&A	Examples of research conducted using NIO data, to provide concrete ideas on how NIO supports research trying to understand the online information ecosystem.
15m	Onboarding Process and Researcher Experience & Q&A	An overview of what happens after a researcher submits their proposal requesting access to NIO data and the infrastructure for conducting research on social networks and computational social science.
15m	General Q&A + Future datasets feedback	Open-ended Q&A session with all participants, speakers, and organizers, and space for feedback about what participants would like NIO to collect that is not being currently collected.
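
As a rough illustration of the kind of aggregated dataset described in the schedule (a network of domains with edges weighted by similarity in visits), the following is a minimal sketch in Python. It is not NIO's actual pipeline or data format: the participant IDs, domain names, and helper functions are hypothetical, and Jaccard overlap of visitor sets is assumed here as one possible similarity measure.

```python
from itertools import combinations

# Hypothetical toy data: participant ID -> set of domains visited.
# These IDs and domains are made up for illustration; they are not NIO data.
visits = {
    "p1": {"news.example", "search.example", "video.example"},
    "p2": {"news.example", "search.example"},
    "p3": {"video.example", "shop.example"},
}


def domain_audiences(visits):
    """Invert participant -> domains into domain -> set of participants."""
    audiences = {}
    for participant, domains in visits.items():
        for domain in domains:
            audiences.setdefault(domain, set()).add(participant)
    return audiences


def jaccard_edges(audiences):
    """Return {(domain_a, domain_b): weight} for pairs sharing at least one visitor.

    The weight is the Jaccard similarity of the two domains' visitor sets
    (an assumed similarity measure, chosen for simplicity).
    """
    edges = {}
    for a, b in combinations(sorted(audiences), 2):
        shared = audiences[a] & audiences[b]
        if shared:
            edges[(a, b)] = len(shared) / len(audiences[a] | audiences[b])
    return edges


audiences = domain_audiences(visits)
edges = jaccard_edges(audiences)
for (a, b), weight in sorted(edges.items()):
    print(f"{a} -- {b}: {weight:.2f}")
```

Because only domain-level overlaps are kept, an edge list like this carries no individual browsing histories, which is one reason aggregated networks of this kind are convenient for hands-on exploration.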

Learning Outcomes

By the end of the workshop, participants will:

Learn about various methods of online activity data collection along with their pros and cons.
Learn about a new infrastructure and framework for data collection in the post-API age.
Learn about the research and analytical possibilities enabled by NIO and data donations, including the cross-platform potential of working with participants' browsing activity, the kinds of research enabled by linking trace data with survey data, the research potential of trace data from mobile devices, and the interdisciplinary uses of these alternative data collection methods.
Understand the data collected by NIO and how it could inform their own research.

Stay tuned!

If you are interested in attending this workshop, please sign up here.

Many datasets from the NIO are already open to applications for access from researchers! These include Google and Bing search data, data on time spent on various webpages by users, and conversations that web users are having with ChatGPT and Gemini. More datasets will continue to be released (such as mobile app usage, social media data, and link transitions on web browsers) to facilitate exciting computational social science research. Stay tuned and learn about applying for access here.

If you have any questions, feel free to reach out to the primary contact organizer (Pranav Goel) listed below.

Organizers

Dr. Pranav Goel

Primary Contact: p.goel@northeastern.edu | pranav-goel.github.io

Dr. Goel is a Postdoctoral Research Associate at the Network Science Institute, Northeastern University, USA. He obtained his PhD in Computer Science from the University of Maryland, College Park. His research interests broadly span computational social science, particularly using web and text data as a potent digital trace of societal dynamics.

He is currently interested in building a cross-platform understanding of online information consumption, investigating the impact of generative AI on information-seeking behavior, and the sociopolitical phenomenon of framing in news and social media.

His work has been published in major computer science conferences such as NeurIPS, EMNLP, and ICWSM, as well as interdisciplinary journals such as Nature Human Behaviour.

Given his research interests and expertise, he is well-placed to lead discussions on alternative data collection structures and their use for social networks research.

Dr. David Lazer

Contact: d.lazer@northeastern.edu | cssh.northeastern.edu/faculty/david-lazer/ | www.lazerlab.net/people/david-lazer

Dr. Lazer is a University Distinguished Professor of Political Science and Computer Sciences at the Network Science Institute, Northeastern University and an Associate at the Institute for Quantitative Social Science at Harvard University.

In 2019, he was elected a fellow of the National Academy of Public Administration. He has published prominent work on misinformation, democratic deliberation, collective intelligence, computational social science, and algorithmic auditing in a wide range of leading journals such as Science, Nature, Proceedings of the National Academy of Sciences, the American Political Science Review, Organization Science, and the Administrative Science Quarterly.

Dr. Lazer has served in multiple leadership and editorial positions, including as a board member for the International Society for Computational Social Science (ISCSS) and the International Network for Social Network Analysis (INSNA), reviewing editor for Science, associate editor of Social Networks and Network Science, and member of numerous other editorial boards and program committees.

His extensive research expertise will guide the context around the Observatory model as a necessary tool to conduct ethical, replicable academic research in networks that does not rely on the generosity of platforms for data access and use.

Dr. Scott Allen Cambo

Contact: s.cambo@northeastern.edu | www.scottallencambo.com

Dr. Cambo is a Senior Data Scientist and the Director of Data Science at the National Internet Observatory (NIO) at Northeastern University, USA. He earned his PhD in Technology and Social Behavior from Northwestern University. His research explores and validates computational methods for analyzing subjective differences between both manual labeling approaches using crowdsourced labor and automated labeling approaches using machine learning.

In his private sector work, Scott has held a variety of critical data science roles at civic tech and responsible AI startups where he developed products for Human-AI Collaboration and designed an algorithm auditing process for NYC LL 144. More recently, he served as General Manager for the AI Incident Database where he worked with both the Center for Advancing Safety in Machine Intelligence (CASMI) and Underwriters Laboratory Research Institute (ULRI) to improve the way we collect, annotate, and share data regarding AI harm.