Abstract: Complex networks underlie an enormous variety of social, biological, physical, and virtual systems. A profound complication for the science of complex networks is that in most cases, observing all nodes and all network interactions is impossible. Previous work addressing the impacts of partial network data is surprisingly limited, focuses primarily on missing nodes, and suggests that network statistics derived from subsampled data are not suitable estimators for the same network statistics describing the overall network topology. We generate scaling methods to predict true network statistics, including the degree distribution, from only partial knowledge of nodes, links, or weights. Our methods are transparent and do not assume a known generating process for the network, thus enabling prediction of network statistics for a wide variety of applications. We validate analytical results on four simulated network classes and empirical data sets of various sizes. We perform subsampling experiments by varying proportions of sampled data and demonstrate that our scaling methods can provide very good estimates of true network statistics while acknowledging limits. Lastly, we apply our techniques to a set of rich and evolving large-scale social networks, Twitter reply networks. Based on 100 million tweets, we use our scaling techniques to propose a statistical characterization of the Twitter Interactome from September 2008 to November 2008. Our treatment allows us to find support for Dunbar's hypothesis in detecting an upper threshold for the number of active social contacts that individuals maintain over the course of one week.
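Illustrative sketch (an assumption for exposition, not necessarily the estimators derived in the paper): if each node is retained independently with probability q and only links between retained nodes are observed, then a node survives with probability q and a link survives with probability q^2. The observed counts can therefore be rescaled by 1/q and 1/q^2. The function name `rescale_node_sample` and its parameters are hypothetical.

```python
def rescale_node_sample(n_obs, m_obs, q):
    """Rescale counts from a uniform node sample (induced subgraph).

    n_obs: number of nodes observed in the sample
    m_obs: number of links observed among sampled nodes
    q:     assumed known fraction of nodes sampled, 0 < q <= 1
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    n_est = n_obs / q          # each node is kept with probability q
    m_est = m_obs / q ** 2     # a link needs both endpoints kept: q * q
    k_est = 2 * m_est / n_est  # mean degree implied by the two estimates
    return n_est, m_est, k_est


# Example: half of the nodes sampled, 500 nodes and 1000 links observed.
n_est, m_est, k_est = rescale_node_sample(500, 1000, 0.5)
```

Note that the observed mean degree (2 * 1000 / 500 = 4) is about q times the estimated one (8), which is one way to see why subsampled statistics cannot be used directly as estimators.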
Abstract: Many real-world, complex phenomena have underlying structures of evolving networks where nodes and links are added and removed over time. A central scientific challenge is the description and explanation of network dynamics, with a key test being the prediction of short- and long-term changes. For the problem of short-term link prediction, existing methods attempt to determine neighborhood metrics that correlate with the appearance of a link in the next observation period. Recent work has suggested that the incorporation of user-specific metadata and usage patterns can improve link prediction; however, methodologies for doing so in a systematic way are largely unexplored in the literature. Here, we provide an approach to predicting future links by applying an evolutionary algorithm to weights which are used in a linear combination of sixteen neighborhood and node similarity indices. We examine Twitter reciprocal reply networks constructed at the time scale of weeks, both as a test of our general method and as a problem of scientific interest in itself. Our evolved predictors exhibit a thousand-fold improvement over random link prediction with high levels of precision for the top twenty predicted links, to our knowledge strongly outperforming all extant methods. Based on our findings, we suggest possible factors which may be driving the evolution of Twitter reciprocal reply networks.
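Illustrative sketch of scoring candidate links with a weighted linear combination of similarity indices. Only two standard indices (common neighbors and the Jaccard coefficient) stand in for the sixteen mentioned above, and the weights are fixed by hand rather than evolved. The function names, the toy graph, and the weight values are all hypothetical.

```python
def common_neighbors(adj, u, v):
    """Number of neighbors shared by u and v."""
    return len(adj[u] & adj[v])


def jaccard(adj, u, v):
    """Shared neighbors divided by the size of the combined neighborhood."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0


def rank_candidate_links(adj, weights, top_k=3):
    """Rank currently unlinked node pairs by a weighted sum of indices."""
    indices = [common_neighbors, jaccard]
    nodes = sorted(adj)
    scored = []
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            if v in adj[u]:
                continue  # already linked, not a candidate
            score = sum(w * f(adj, u, v) for w, f in zip(weights, indices))
            scored.append((score, u, v))
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda t: (-t[0], t[1], t[2]))
    return [(u, v) for _, u, v in scored[:top_k]]


# Toy undirected graph stored as adjacency sets.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}
top = rank_candidate_links(adj, weights=[1.0, 0.5], top_k=1)
```

In an evolutionary setting, the `weights` vector would be the genome: candidate vectors are scored by how well their top-ranked pairs match the links that actually appear in the next observation period.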
Abstract: The advent of social media has provided an extraordinary, if imperfect, 'big data' window into the form and evolution of social networks. Based on nearly 40 million message pairs posted to Twitter between September 2008 and February 2009, we construct and examine the revealed social network structure and dynamics over the time scales of days, weeks, and months. At the level of user behavior, we employ our recently developed hedonometric analysis methods to investigate patterns of sentiment expression. We find users' average happiness scores to be positively and significantly correlated with those of users one, two, and three links away. We strengthen our analysis by proposing and using a null model to test the effect of network topology on the assortativity of happiness. We also find evidence that more well connected users write happier status updates, with a transition occurring around Dunbar's number. More generally, our work provides evidence of a social sub-network structure within Twitter and raises several methodological points of interest with regard to social network reconstructions.
Abstract: Within the last million years, human language has emerged and evolved as a fundamental instrument of social communication and semiotic representation. People use language in part to convey emotional information, leading to the central and contingent questions: (1) What is the emotional spectrum of natural language? and (2) Are natural languages neutrally, positively, or negatively biased? Previous findings are mixed: suggestive evidence of a positive bias has been found in small samples of English words [1-3], framed as the Pollyanna Hypothesis and Linguistic Positivity Bias, while the experimental elicitation of emotional words has instead found a strong negative bias. Here, we report that the human-perceived positivity of over 10,000 of the most frequently used English words exhibits a clear positive bias. More deeply, we characterize and quantify distributions of word positivity for four large and distinct corpora, demonstrating that their form is surprisingly invariant with respect to frequency of word use.
Abstract: Individual happiness is a fundamental societal metric. Normally measured through self-report, happiness has often been indirectly characterized and overshadowed by more readily quantifiable economic indicators, such as gross domestic product. Here, we use a real-time, remote-sensing, non-invasive, text-based approach (a kind of hedonometer) to uncover collective dynamical patterns of happiness levels expressed by over 50 million users in the online, global social network Twitter. With a data set comprising nearly 2.8 billion expressions involving more than 28 billion words, we explore temporal variations in happiness, as well as information levels, over time scales of hours, days, and months. Among many observations, we find a steady global happiness level, evidence of universal weekly and daily patterns of happiness and information, and that happiness and information levels are generally uncorrelated. We also extract and analyse a collection of happiness and information trends based on keywords, showing them to be both sensible and informative, and in effect generating opinion polls without asking questions. Finally, we develop and employ a graphical method that reveals how individual words contribute to changes in average happiness between any two texts.
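Illustrative sketch of a text-based happiness measure and of per-word contributions to the difference between two texts. It assumes that a text's happiness is the frequency-weighted average of per-word scores, and that a word's contribution to a change is its score times its change in relative frequency. The tiny lexicon, its scores, the whitespace tokenization, and the function names are hypothetical and are not taken from the paper.

```python
from collections import Counter


def word_frequencies(text, lexicon):
    """Relative frequencies of the lexicon words that occur in the text."""
    counts = Counter(w for w in text.lower().split() if w in lexicon)
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()} if total else {}


def average_happiness(text, lexicon):
    """Frequency-weighted mean word score; None if no scored word occurs."""
    freqs = word_frequencies(text, lexicon)
    if not freqs:
        return None
    return sum(lexicon[w] * p for w, p in freqs.items())


def word_contributions(ref_text, comp_text, lexicon):
    """Per-word terms that sum to the happiness difference (comp - ref)."""
    p_ref = word_frequencies(ref_text, lexicon)
    p_comp = word_frequencies(comp_text, lexicon)
    words = set(p_ref) | set(p_comp)
    return {
        w: lexicon[w] * (p_comp.get(w, 0.0) - p_ref.get(w, 0.0))
        for w in words
    }


# Made-up scores on a 1 (sad) to 9 (happy) scale, for illustration only.
lexicon = {"happy": 8.0, "sad": 2.0, "rain": 5.0}
shift = word_contributions("happy sad", "happy happy sad rain", lexicon)
```

Because the contributions sum exactly to the difference in average happiness, sorting them by magnitude shows which words drive the change between the two texts.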