Abstract: Many research fields codify their findings in standard formats, often by reporting correlations between quantities of interest. But the space of all testable correlates is far larger than scientific resources can currently address, so the ability to accurately predict correlations would be useful to plan research and allocate resources. Using a dataset of approximately 170,000 correlational findings extracted from leading social science journals, we show that a trained neural network can accurately predict the reported correlations using only the text descriptions of the correlates. Accurate predictive models such as these can guide scientists towards promising untested correlates, better quantify the information gained from new findings, and have implications for moving artificial intelligence systems from predicting structures to predicting relationships in the real world.
We used the metaBUS data release v2.08 for our corpus of correlate pairs. These data are available for
download at http://www.frankbosco.com/data/CorrelationalEffectSizeBenchmarks.html. More up-to-date
data are searchable using the metaBUS web interface: http://metabus.org. We also used the 300-dimension
(English) word vector representations released by the ConceptNet project called ConceptNet Numberbatch,
specifically version 17.06: https://github.com/commonsense/conceptnet-numberbatch.
The correlate texts have already been curated by the metaBUS team but we performed some further
processing to connect the individual words (tokens) to terms in Numberbatch. Nonalphanumeric characters
were removed, text was lowercased, and tokens were split on whitespace. Tokens were then mapped to corresponding word vector indices in Numberbatch. A vector "index" of 0 was reserved for tokens in
metaBUS not present in Numberbatch. The neural network is able to handle tokens outside the vocabulary
of Numberbatch, although predictive performance is likely worse when many tokens are missing than when
few or no tokens are missing.
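The preprocessing steps above can be sketched as follows. This is a minimal illustration, not the authors' released code; the toy vocabulary mapping Numberbatch terms to row indices is hypothetical, and index 0 stands in for the reserved out-of-vocabulary slot described above.

```python
import re


def tokenize(text):
    """Remove nonalphanumeric characters, lowercase, and split on whitespace."""
    cleaned = re.sub(r"[^0-9a-zA-Z\s]", " ", text)
    return cleaned.lower().split()


def tokens_to_indices(tokens, vocab):
    """Map tokens to word-vector row indices; 0 is reserved for
    tokens not present in the Numberbatch vocabulary."""
    return [vocab.get(tok, 0) for tok in tokens]


# Hypothetical vocabulary fragment: Numberbatch term -> 1-based row index
vocab = {"job": 1, "satisfaction": 2, "organizational": 3, "commitment": 4}

tokens = tokenize("Job satisfaction (self-rated)")
print(tokens)                             # ['job', 'satisfaction', 'self', 'rated']
print(tokens_to_indices(tokens, vocab))   # [1, 2, 0, 0]
```

In practice the vocabulary would be built from the full Numberbatch 17.06 term list, with the embedding matrix padded so that row 0 can hold a dedicated out-of-vocabulary vector.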