Wikiversity:Fellow-Programm Freies Wissen/Einreichungen/Systematic modelling and correction of biases in user generated content on social media


Systematic modelling and correction of biases in user generated content on social media



User-generated content (UGC) on social media now plays a prominent role in many studies. Its usefulness has been demonstrated across numerous now-casting and forecasting applications, ranging from disruptive event detection [1] and understanding city dynamics [2] to election result prediction [5]. Furthermore, UGC has streamlined analyses that would otherwise be infeasible or time-consuming. However, these analyses have led to wrong conclusions in some cases [3]. Besides problems in methodology, a major source of these errors is the strong biases [4] present in UGC. A considerable amount of research has focused on identifying these biases. For example, Mislove et al. [6] calculated the demographics of Twitter users, and Hecht and Stephens [7] studied urban biases. However, limited research has been done on understanding the effect of these biases and, further, on correcting for them. Two notable studies in this regard are [8] and [9].

The primary goal of this project is to bridge this gap and develop a large-scale, computationally accurate model of the biases in social media data, thereby enabling more culturally aware algorithms and systems. In the future, modelling these biases will allow now-casting and forecasting techniques to be applied to data sets that are more varied in language and culture. However, the impact of this work depends on the availability of open-access data and methods.

Role of open science in my work

In general, reproducibility is a core pillar of science. Unless research is reproducible, its credibility and importance are significantly diminished. However, reproducibility is also one of the great challenges facing science today. The problem of reproducibility varies greatly across fields, and so do the solutions. In any case, providing data and methods as open access is a fundamental step towards credible and reliable science. Since this work can be broadly seen as a correction step on top of various forecasting and now-casting applications, its usefulness will be judged strictly by the impact it has on other research, and its success and potential will depend on the availability of open-access data and methods.


  1. Alsaedi, N., Burnap, P., & Rana, O. (2017). Can we predict a riot? Disruptive event detection using Twitter. ACM Transactions on Internet Technology (TOIT), 17(2), 18.
  2. Cranshaw, J., Schwartz, R., Hong, J. I., & Sadeh, N. (2012). The livehoods project: Utilizing social media to understand the dynamics of a city.
  3. Gayo-Avello, D. (2012). No, you cannot predict elections with Twitter. IEEE Internet Computing, 16(6), 91-94.
  4. Ruths, D., & Pfeffer, J. (2014). Social media for large studies of behavior. Science, 346(6213), 1063-1064.
  5. Tumasjan, A., Sprenger, T. O., Sandner, P. G., & Welpe, I. M. (2010). Predicting elections with Twitter: What 140 characters reveal about political sentiment. ICWSM, 10(1), 178-185.
  6. Mislove, A., Lehmann, S., Ahn, Y. Y., Onnela, J. P., & Rosenquist, J. N. (2011). Understanding the Demographics of Twitter Users. ICWSM, 11, 5th.
  7. Hecht, B. J., & Stephens, M. (2014). A Tale of Cities: Urban Biases in Volunteered Geographic Information. ICWSM, 14, 197-205.
  8. Johnson, I., McMahon, C., Schöning, J., & Hecht, B. (2017, May). The Effect of Population and. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems (pp. 1167-1178). ACM.
  9. Johnson, I. L., Sengupta, S., Schöning, J., & Hecht, B. (2016, May). The geography and importance of localness in geotagged social media. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems (pp. 515-526). ACM.