Twitter modeling attempts to demystify self-reporting bias
Because Twitter is a self-reporting social medium, its data carries a significant self-reporting bias that limits accurate analysis, new research (.pdf) finds.
The year-long analysis by Emre Kiciman, a researcher in the internet services research center at Microsoft Research, looked at the correlation between the characteristics of real-world weather events and the biases of their representation on Twitter.
For each weather event Kiciman calculated:
- how extreme, or infrequent, it was given the previous 1 month, 3 months, 6 months and 12 months of data,
- how expected it was given the time of year and location, and
- how much it changed compared to 1 day, 3 days and 7 days prior.
The goal was to determine the "extent to which these factors can explain the daily variations in tweet rates about weather events," says the paper.
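The three kinds of per-event measurements listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method used in the paper: the function names are hypothetical, a month is approximated as 30 days, "how extreme" is taken to be a percentile rank within the trailing window, and "how expected" is taken to be the gap from a seasonal norm that the caller supplies.

```python
def trailing_percentile(series, i, window):
    """Share of the `window` days before day i whose value is <= series[i].

    Assumption: extremeness is modeled as a simple percentile rank.
    Returns None when there is no history to compare against.
    """
    past = series[max(0, i - window):i]
    if not past:
        return None
    return sum(1 for v in past if v <= series[i]) / len(past)


def change_from(series, i, days_back):
    """Difference between day i and the value `days_back` days earlier."""
    if i - days_back < 0:
        return None
    return series[i] - series[i - days_back]


def weather_features(series, i, seasonal_norm):
    """Collect the hypothetical features for day i of one daily series.

    `seasonal_norm` is an assumed input: the typical value for this
    location and time of year, however the analyst chooses to derive it.
    """
    features = {}
    # Extremeness over roughly 1, 3, 6 and 12 months (30-day months assumed).
    for label, window in (("1m", 30), ("3m", 90), ("6m", 180), ("12m", 365)):
        features[f"pct_{label}"] = trailing_percentile(series, i, window)
    # Expectedness: gap between the observation and the seasonal norm.
    features["vs_norm"] = series[i] - seasonal_norm
    # Change compared to 1, 3 and 7 days prior.
    for k in (1, 3, 7):
        features[f"delta_{k}d"] = change_from(series, i, k)
    return features


# Example: eight days of made-up daily high temperatures.
temps = [20, 21, 19, 22, 25, 24, 23, 30]
today = weather_features(temps, i=7, seasonal_norm=22)
```

Features like these could then serve as inputs to a model of daily tweet rates, which is the relationship the paper sets out to measure.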
Kiciman drew on three sources: National Oceanic and Atmospheric Administration weather data; location data, both user-provided in the Twitter profile location field and explicitly geo-coded coordinates; and weather-related tweets, isolated by filtering a superset of tweets for weather-related words. From these he built global models that show how weather information relates to variability in tweet rates.
For example, over a 45-day period, weather-related tweets in San Diego, Calif., peaked on the hottest day of the period and during the first thunderstorm of the season.
From these models, Kiciman can predict tweet rates given the weather. However, the correlation between the model and an individual location's weather tweet rates varies by region: the same locations consistently score lower or higher than average, Kiciman finds.
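One way to picture the per-location comparison is to compute a correlation score between predicted and observed tweet rates for each location. The sketch below assumes Pearson correlation and made-up data. The article does not say which correlation measure the paper uses, and the function and variable names here are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return float("nan")  # undefined when either series is constant
    return cov / (spread_x * spread_y)


def correlation_by_location(predicted, observed):
    """Map each location to the correlation of its predicted vs. observed rates."""
    return {
        loc: pearson(predicted[loc], observed[loc])
        for loc in predicted
        if loc in observed
    }


# Made-up daily rates for two hypothetical locations.
predicted = {"A": [1, 2, 3], "B": [1, 2, 3]}
observed = {"A": [2, 4, 6], "B": [3, 2, 1]}
scores = correlation_by_location(predicted, observed)
```

Comparing each location's score with the average across locations would show which places the model fits consistently better or worse.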
While the study only presents models and does not investigate the reasoning behind "tweetable" events, Kiciman says his research is a first step.
It's "the beginning of a broader investigation into the properties of real-world events and trends that make them more or less likely to be discussed in social media," writes Kiciman.
Kiciman plans to gather ground-truth data around sports events and concerts, which will allow him to "test the influence of additional factors, such as sentiment, as well as investigate to what degree, if any at all, our findings may be applicable across domains."
- Download the paper, "OMG, I Have to Tweet That! A Study of Factors that Influence Tweet Rates" (.pdf)