SABINE readme

SABINE (SociAl Business INtelligence bEnchmark) is a multi-purpose dataset for Social Business Intelligence (SBI) in the domain of European politics. This page provides documentation and usage instructions for SABINE's packages. For details on the SABINE project, please refer to purl.org/sabine.

Overview

SABINE is packaged for modular download so that a wide variety of social business intelligence research tasks can be evaluated, either separately or in combination. These range from tasks focused on content analysis, to those related to semantic analysis, up to more comprehensive social business analytics.

Clips package

Format: JSON
Download (6.06 GB; unzipped 36.5 GB): link

The Clips package contains the user-generated content (UGC) collected by the crawling service. Clips can either be messages posted on social media or articles taken from on-line newspapers. Clips are also available in sub-packages.

Schema

Each clip contains the following data:

  • id: a numeric identifier;
  • language: a two-letter string: either "en" or "it";
  • title: a string containing either the title of the UGC (if present) or the first excerpt of the content;
  • content: a string with the content of the UGC;
  • clippingQuality (optional): human-based score (either "OK" or "KO") denoting the amount of non-relevant text present in the clip due to an inadequate template used by the crawler when clipping;
  • textComplexity (optional): human-based score (either "STANDARD" or "HARD") denoting the effort of a human expert in assigning the sentiment (e.g., due to irony, incorrect syntax, abbreviations);
  • expertSentiment (optional): human-based score (either -1, 0 or 1) denoting the sentiment (respectively negative, neutral or positive) labeled by domain experts.
  • occurrences: an array of objects representing the entities detected by the text analysis module of SyN; each entity contains the following information:
    • entity.name: the name of the entity
    • entity.pos: the part-of-speech (i.e., "N" for nouns, "R" for proper nouns, "G" for adjectives, "V" for verbs, "A" for adverbs, "U" for not recognized words)
    • count: the number of times that the entity appears in the clip
  • semanticOccurrences: an array of objects representing the relationship between two entities by means of either a functional relation (e.g., agent or qualifier) or a predicate corresponding to an entity; each semantic occurrence contains the following information:
    • firstMember: an object representing the first entity in the relationship
    • secondMember: an object representing the second entity in the relationship
    • functionalRelation: a string describing the relationship; it can contain the following values:
      • ISA: the firstMember is a hyponym of the secondMember
      • PARTOF: the firstMember is part of the secondMember
      • MEMBEROF: the firstMember is a member of the secondMember
      • SYNONYM: the firstMember is a synonym of the secondMember
      • ANTONYM: the firstMember is an antonym of the secondMember
      • TRANSLATION: the firstMember is a translation of the secondMember
      • ABSTRACT: the firstMember semantically belongs to the secondMember
      • NEARTO: the firstMember contextualizes the secondMember
      • AGENT: the secondMember is an action accomplished by the firstMember
      • OBJ: the firstMember is an action accomplished on the secondMember
      • IOBJ: the firstMember is an action whose indirect object is the secondMember
      • QUAL: the firstMember is qualified by the secondMember
      • HOW: the firstMember is an action and the secondMember indicates how it is accomplished
      • WHEN: the firstMember is an action and the secondMember indicates when it is accomplished
      • WHERE: the firstMember is an action and the secondMember indicates where it is accomplished
      • COMP: the firstMember is an action and the secondMember is its complement
      • ENTITY: the relationship between the firstMember and the secondMember is described by the thirdMember
    • thirdMember (optional): an object representing the entity that describes the relationship (present only if the functionalRelation is "ENTITY")
    • count: the number of times that the semantic occurrence appears in the clip
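
As a minimal sketch of how the schema above could be traversed in Python: the sample clip below is invented for illustration, and both the nesting of entity.name and entity.pos under an "entity" key and the presence of a "name" field in the member objects are assumptions that should be checked against the actual files.

```python
import json

# Hypothetical clip, hand-written to follow the schema described above.
# Assumption: "entity.name"/"entity.pos" are nested under an "entity" key and
# member objects expose a "name" field; verify this against the real files.
SAMPLE_CLIP = json.loads("""
{
  "id": 1,
  "language": "en",
  "title": "Example title",
  "content": "Example content",
  "expertSentiment": -1,
  "occurrences": [
    {"entity": {"name": "government", "pos": "N"}, "count": 2},
    {"entity": {"name": "approve", "pos": "V"}, "count": 1}
  ],
  "semanticOccurrences": [
    {"firstMember": {"name": "government"},
     "secondMember": {"name": "approve"},
     "functionalRelation": "AGENT",
     "count": 1}
  ]
}
""")


def summarize_clip(clip):
    """Return noun counts, relation tuples, and the optional expert sentiment."""
    nouns = {
        occ["entity"]["name"]: occ["count"]
        for occ in clip.get("occurrences", [])
        if occ["entity"]["pos"] in ("N", "R")  # nouns and proper nouns
    }
    relations = []
    for sem in clip.get("semanticOccurrences", []):
        # thirdMember is only present when functionalRelation is "ENTITY".
        third = sem.get("thirdMember")
        relations.append((
            sem["firstMember"]["name"],
            sem["functionalRelation"],
            sem["secondMember"]["name"],
            third["name"] if third else None,
            sem["count"],
        ))
    # Optional fields may be missing, so use .get() rather than indexing.
    return nouns, relations, clip.get("expertSentiment")


nouns, relations, sentiment = summarize_clip(SAMPLE_CLIP)
print(nouns, relations, sentiment)
```

Since clippingQuality, textComplexity and expertSentiment are optional, reading them with .get() avoids errors on clips that lack human annotations.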

Sub-packages

Instead of downloading the whole Clips package, the download can be limited to a specific subset of clips.

  • Italian Clips (2.05 GB; unzipped 12.4 GB) : link
  • Italian Clips with validated sentiment (166 KB; unzipped 1 MB) : link
  • English Clips (4.01 GB; unzipped 24.1 GB) : link
  • English Clips with validated sentiment (168 KB; unzipped 1 MB) : link

Crawler Annotations package

Format: CSV
Download (189 MB; unzipped 2.52 GB): link

The ZIP file contains two CSV files: Crawler_annotations_Eng.csv and Crawler_annotations_Ita.csv. Each of them provides 40 metadata attributes that describe the clips distributed through the Clips package. Some of these metadata have been returned by the commercial crawling service Brandwatch (e.g., title, date, source MozRank, author information, and geolocalization), others have been manually annotated by the domain experts (e.g., source type). When the value "Unclassified" is used, it means that no data was available. The metadata are the following:

  • ID: the ID of the clip
  • LANGUAGE: the language of the clip
  • TW_VERIFIED: if the clip's source is Twitter, indicates whether the author has a verified Twitter account ("true") or not ("false")
  • FB_ROLE: if the clip's source is Facebook, indicates if the author is the owner of the post ("owner") or someone responding to someone else's post ("audience")
  • FB_SUBTYPE: if the clip's source is Facebook, indicates the type of post ("other", "photo", "status", "video")
  • CITY: the city from which the clip has originated
  • CITY_CODE: the code of the city from which the clip has originated
  • CONTINENT: the continent from which the clip has originated
  • CONTINENT_CODE: the code of the continent from which the clip has originated
  • COUNTRY: the country from which the clip has originated
  • COUNTRY_CODE: the code of the country from which the clip has originated
  • COUNTY: the county from which the clip has originated
  • COUNTY_CODE: the code of the county from which the clip has originated
  • STATE: the state from which the clip has originated
  • STATE_CODE: the code of the state from which the clip has originated
  • LATITUDE: the latitude from which the clip has originated
  • LONGITUDE: the longitude from which the clip has originated
  • CONTENT_LENGTH: the length of the clip's content
  • IMPRESSIONS: the sum of the followers of all users who tweeted or retweeted the clip (i.e., the potential number of users reached)
  • SOURCECHANNEL: the combination of the clip's source and channel type
  • CHANNEL_TYPE: the type of channel of the clip
  • SOURCE: the source of the clip
  • AUTHOR: the name of the author of the clip
  • AUTHOR_GENDER: the gender of the author
  • AUTHOR_N_TW_FOLLOWING: if the clip's source is Twitter, indicates the number of people followed by the author
  • AUTHOR_N_TW_FOLLOWERS: if the clip's source is Twitter, indicates the number of people following the author
  • AUTHOR_N_TW_POSTS: if the clip's source is Twitter, indicates the number of tweets by the author
  • AUTHOR_PROFESSION: the profession of the author
  • AUTHOR_CITY: the city of the author
  • AUTHOR_CITY_CODE: the code of the city of the author
  • AUTHOR_CONTINENT: the continent of the author
  • AUTHOR_CONTINENT_CODE: the code of the continent of the author
  • AUTHOR_COUNTRY: the country of the author
  • AUTHOR_COUNTRY_CODE: the code of the country of the author
  • AUTHOR_COUNTY: the county of the author
  • AUTHOR_COUNTY_CODE: the code of the county of the author
  • AUTHOR_STATE: the state of the author
  • AUTHOR_STATE_CODE: the code of the state of the author
  • SOURCE_RELEVANCE: the relevance of the source as defined by the domain experts ("L" for low, "M" for medium, "H" for high)
  • SOURCE_TYPE: the type of source as defined by the domain experts
  • SOURCE_SUPERTYPE: the supertype of source as defined by the domain experts
  • CHANNEL_SUPERTYPE: the supertype of channel as defined by the domain experts

Sentiment package

Format: CSV
Download (15 MB; unzipped 83 MB): link

The ZIP file contains two CSV files: Sentiment_Eng.csv and Sentiment_Ita.csv. Each of them provides the available sentiments for every clip distributed through the Clips package. In particular, each CSV contains:

  • ID: the ID of the clip
  • Crawler sentiment: provided for every clip, it is the sentiment obtained from the commercial crawling service Brandwatch
  • NLP sentiment: provided for every clip, it is the sentiment obtained from the commercial engine SyN-Semantic Center
  • Crowd sentiment: provided for 1188 clips per language, it is the sentiment obtained from the crowdsourcing process
  • Expert sentiment: provided for 594 clips per language, it is the sentiment obtained from the domain experts
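
As an illustration of how the different sentiments could be compared, the sketch below computes the share of clips on which two sentiment columns agree. The sample rows are invented. It assumes that the header names match the list above, that cells are empty where a sentiment is unavailable, and that all columns use the values -1, 0 and 1; none of this is confirmed by this page.

```python
import csv
import io

# Hypothetical excerpt. Assumptions: comma-separated, header names exactly as
# listed above, empty cells where a sentiment is unavailable, and values
# encoded as -1, 0 or 1 in every sentiment column.
SAMPLE = """ID,Crawler sentiment,NLP sentiment,Crowd sentiment,Expert sentiment
1,0,1,1,1
2,-1,-1,0,-1
3,0,0,,
4,1,1,-1,-1
"""


def agreement(handle, col_a, col_b):
    """Share of clips where both columns are present and carry the same value."""
    compared = matching = 0
    for row in csv.DictReader(handle):
        a, b = row[col_a].strip(), row[col_b].strip()
        if not a or not b:
            continue  # skip clips lacking either label
        compared += 1
        matching += a == b
    return matching / compared if compared else None


rate = agreement(io.StringIO(SAMPLE), "Crowd sentiment", "Expert sentiment")
print(rate)
```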

Topic Occurrences package

Format: CSV
Download (102 MB; unzipped 331 MB): link

The ZIP file contains two CSV files: Topic_occurrences_Eng.csv and Topic_occurrences_Ita.csv. Each of them provides the occurrences of each topic within the clips distributed through the Clips package. Each CSV contains the following columns:

  • CLIP_ID: the ID of the clip
  • TOPIC_ID: the ID of the topic
  • OCC: the number of occurrences of the topic within the clip

With the respective IDs, clips can be retrieved from the Clips package, while topics can be retrieved from the Topic Ontology package.
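
A minimal sketch of how the three columns could be aggregated per topic is given below. The rows are invented, and the comma delimiter, the header row, and the presence of at most one row per (clip, topic) pair are assumptions.

```python
import csv
import io
from collections import Counter

# Hypothetical excerpt; assumes comma-separated values and a header row.
SAMPLE = """CLIP_ID,TOPIC_ID,OCC
10,379,2
10,12,1
11,379,3
"""


def topic_totals(handle):
    """Sum OCC per TOPIC_ID and count the rows (clips) mentioning each topic."""
    totals, clips = Counter(), Counter()
    for row in csv.DictReader(handle):
        topic = int(row["TOPIC_ID"])
        totals[topic] += int(row["OCC"])
        clips[topic] += 1  # assumes one row per (clip, topic) pair
    return totals, clips


totals, clips = topic_totals(io.StringIO(SAMPLE))
print(totals.most_common(1))
```

The resulting topic IDs can then be resolved to topic names through the hasTopicID property of the Topic Ontology.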

Topics and Mappings package

Format: OWL & TSV
Download (608 KB): link

The Topics and Mappings package is centered around the Topic Ontology, which organizes about 400 relevant topics and was built by domain experts (i.e., a team of five socio-political researchers). It is structured into three sub-packages: Topic Ontology, Linked DBpedia Resources and Inter Language Mappings.

Topic Ontology package

Format: OWL
Download (56 KB): link
Italian ontology: http://big.csr.unibo.it/sabine_ita.owl
English ontology: http://big.csr.unibo.it/sabine_eng.owl

The topic ontology represents the set of concepts and relationships that, in the domain experts' judgement, are relevant to the subject area. Its role in the SBI process is twofold: on the one hand, it acts as a starting point for designing effective crawling queries; on the other, it supports analyses based on relevant concepts (e.g., how often the public debt policy is mentioned) and on their aggregations (e.g., how often the sector of economics and its policies are discussed). The following image represents the classes in the ontology and the respective relationships.

Also, each topic is associated with a numerical ID, which is used by the Topic Occurrences package to denote the occurrences of these topics in each clip.

Example

<owl:NamedIndividual rdf:about="sabine-eng:David_William_Donald_Cameron">
   <rdf:type rdf:resource="sabine:Politician"/>
   <sabine:belongsTo rdf:resource="sabine-eng:Conservative"/>
   <sabine:hasNation rdf:resource="sabine-eng:United_Kingdom"/>
   <sabine:holds rdf:resource="sabine-eng:Prime_Minister"/>
   <sabine:plays rdf:resource="sabine-eng:Party_Leader"/>
   <sabine:hasTopicID rdf:datatype="xsd:int">379</sabine:hasTopicID>
</owl:NamedIndividual>
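
The topic IDs can be extracted from individuals like the one above with a standard XML parser, as in the sketch below. The rdf and owl namespace URIs are the standard W3C ones, whereas the URI bound to the sabine prefix is a placeholder invented for this example; the actual URI has to be read from the header of the downloaded OWL file.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
OWL = "http://www.w3.org/2002/07/owl#"
# Hypothetical namespace: the real URI bound to the "sabine" prefix must be
# read from the header of the downloaded .owl file.
SABINE = "http://example.org/sabine#"

# Shortened copy of the individual above, wrapped in an rdf:RDF root so that
# the prefixes are declared.
SAMPLE = f"""
<rdf:RDF xmlns:rdf="{RDF}" xmlns:owl="{OWL}" xmlns:sabine="{SABINE}">
  <owl:NamedIndividual rdf:about="sabine-eng:David_William_Donald_Cameron">
    <rdf:type rdf:resource="sabine:Politician"/>
    <sabine:hasTopicID rdf:datatype="xsd:int">379</sabine:hasTopicID>
  </owl:NamedIndividual>
</rdf:RDF>
"""


def topic_ids(xml_text):
    """Map each topic ID to the rdf:about value of its individual."""
    root = ET.fromstring(xml_text)
    result = {}
    for ind in root.iter(f"{{{OWL}}}NamedIndividual"):
        id_node = ind.find(f"{{{SABINE}}}hasTopicID")
        if id_node is not None and id_node.text:
            result[int(id_node.text)] = ind.get(f"{{{RDF}}}about")
    return result


ids = topic_ids(SAMPLE)
print(ids)
```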

Linked DBpedia Resources package

Format: OWL & TSV
Download (546 KB): link

This package links the topics defined by domain experts in the Topic Ontology to the Linked Data Cloud, in particular to their corresponding resources in DBpedia. These links have been obtained by coupling automated techniques with manual validation and revision by the domain experts.

The package provides two files, topic_dbpedia_links.tsv and dbpedia_res.owl. The first file contains the links for both Italian and English topics in a tab-separated file, which contains the following data:

  • The URI of the topic in the Topic Ontology
  • The URI of the related resource in DBpedia
  • The degree of similarity between the two resources, in the range [0,1]
  • The semantics of the link, which can be one of the following:
    • owl:sameAs : the subject resource and the object resource have exactly the same meaning
    • sabine:narrower : the meaning of the subject resource is more specific than the one of the object resource
    • sabine:broader : the meaning of the subject resource is more generic than the one of the object resource
    • sabine:related : there is a positive association between the meanings of the subject resource and of the object resource
  • The language (either "it" or "en")

The second file contains an extract from DBpedia of the resources reached by the links. This can be useful to avoid establishing a connection to DBpedia to gather information on such resources. Nonetheless, the extract is not actively maintained, therefore it may contain outdated data.

Example

An excerpt of the topic_dbpedia_links.tsv file.

sabine-eng:David_William_Donald_Cameron dbr:David_Cameron 0.669 owl:sameAs en
sabine-eng:Volunteer dbr:Volunteering 0.613 sabine:narrower en
sabine-eng:Migrant dbr:Migrant_worker 0.559 sabine:broader en
sabine-eng:Neoliberalism dbr:Thatcherism 0.930 sabine:related en
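
A minimal sketch for loading and filtering these links is shown below. It reuses the four rows of the excerpt with explicit tab separators; the absence of a header row and the column names given in FIELDS are assumptions made for this example.

```python
import csv
import io

# The four rows from the excerpt above, written with explicit tab separators.
# Assumption: the file has no header row and columns follow the listed order.
SAMPLE = (
    "sabine-eng:David_William_Donald_Cameron\tdbr:David_Cameron\t0.669\towl:sameAs\ten\n"
    "sabine-eng:Volunteer\tdbr:Volunteering\t0.613\tsabine:narrower\ten\n"
    "sabine-eng:Migrant\tdbr:Migrant_worker\t0.559\tsabine:broader\ten\n"
    "sabine-eng:Neoliberalism\tdbr:Thatcherism\t0.930\tsabine:related\ten\n"
)

# Column names chosen for this example; they do not appear in the file.
FIELDS = ["topic", "resource", "similarity", "semantics", "language"]


def load_links(handle, min_similarity=0.0, semantics=None):
    """Return links at or above a similarity threshold, optionally of one kind."""
    links = []
    for row in csv.DictReader(handle, fieldnames=FIELDS, delimiter="\t"):
        row["similarity"] = float(row["similarity"])
        if row["similarity"] < min_similarity:
            continue
        if semantics is not None and row["semantics"] != semantics:
            continue
        links.append(row)
    return links


strong = load_links(io.StringIO(SAMPLE), min_similarity=0.6)
same = load_links(io.StringIO(SAMPLE), semantics="owl:sameAs")
print(len(strong), same[0]["resource"])
```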

Inter Language Mappings package

Format: OWL
Download (6 KB): link

This package provides a single file, eng_to_ita.rdf, which provides mappings between the Italian and the English topics defined in the Topic Ontology. Mappings have been manually created by the domain experts and they are provided as RDFAlignment maps. The semantics of the mapping can be one of the following:

  • owl:sameAs : the subject topic is an exact translation of the object resource
  • sabine:related : the subject topic has a weaker semantic relationship to the object resource

Example

<map>
   <Cell>
      <entity1 rdf:resource="sabine-eng:Volunteer"/>
      <entity2 rdf:resource="sabine-ita:Volontario"/>
      <relation>owl:sameAs</relation>
      <measure rdf:datatype="xsd:float">1.0</measure>
   </Cell>
</map>
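
Cells like the one above can be read with a standard XML parser, as sketched below. The Alignment root element wrapping the cell is an assumption about the file layout, and element names are matched by local name so that the code does not depend on the namespace declared in the actual file.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# The cell from the example above, wrapped in a root element so that the
# "rdf" prefix is declared. The wrapper is an assumption about the file layout.
SAMPLE = f"""
<Alignment xmlns:rdf="{RDF}">
  <map>
    <Cell>
      <entity1 rdf:resource="sabine-eng:Volunteer"/>
      <entity2 rdf:resource="sabine-ita:Volontario"/>
      <relation>owl:sameAs</relation>
      <measure rdf:datatype="xsd:float">1.0</measure>
    </Cell>
  </map>
</Alignment>
"""


def local(tag):
    """Strip any "{namespace}" prefix so the code works with or without one."""
    return tag.rsplit("}", 1)[-1]


def read_mappings(xml_text):
    """Return (entity1, entity2, relation, measure) tuples, one per Cell."""
    mappings = []
    for cell in ET.fromstring(xml_text).iter():
        if local(cell.tag) != "Cell":
            continue
        fields = {local(child.tag): child for child in cell}
        mappings.append((
            fields["entity1"].get(f"{{{RDF}}}resource"),
            fields["entity2"].get(f"{{{RDF}}}resource"),
            fields["relation"].text.strip(),
            float(fields["measure"].text),
        ))
    return mappings


mappings = read_mappings(SAMPLE)
print(mappings)
```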

MD Cubes package

Format: CSV
Download (1.18 GB; unzipped 8.14 GB): link

This package contains two ROLAP (Relational OnLine Analytical Processing) cubes, namely the Sentiment cube and the Semantic Occurrences cube, which provide an easy-to-query representation of the clip content and of the outcome of the clip enrichment process. Essentially, the cubes provide a multi-dimensional representation of the data contained in the Clips, Crawler Annotations, Sentiment, Topic Occurrences, Topic Ontology and Inter Language Mappings packages. Each cube is made of several CSV tables (different languages have different tables), which have been exported from an Oracle 11g database using the export utility of SQL Developer. Additionally, the package provides a sub-package Inquiries, which enables an end-to-end assessment of a whole SBI process.

Sentiment cube

Format: CSV
Download (315 MB; unzipped 3.25 GB): link

The Sentiment cube is centered on clips, and it represents the set of topics appearing in each clip as well as the sentiment values computed for that clip. The schema of the cube is depicted in the following figure using the DFM notation, where cube measures are listed inside the box, dimensions are circles directly attached to the box, and hierarchies are shown as DAGs of dimension levels.

For each language, the following tables are provided:

  • FT_RAW_CLIP: the fact table; it contains foreign keys to the main dimension tables (i.e., DT_CLIP and DT_DATE) and measures of the clips, including sentiment values and metrics from Facebook and Twitter. Although the date functionally depends on the clips, it has been pushed down to the fact table to enable faster aggregations
  • DT_CLIP: the dimension table for the clip dimension; it contains the clips' metadata
  • DT_DATE: the dimension table for the date dimension
  • BT: the bridge table linking the clips in DT_CLIP to the topics in DT_TOPIC that appear in such clips
  • DT_TOPIC: the dimension table for the topic dimension; it is organized in accordance with the Topic Ontology
  • BT_POLICY_SECTOR: the bridge table linking the topics of class "policy" in DT_TOPIC to the respective sectors in DT_SECTOR
  • DT_SECTOR: the dimension table for the sector attribute
  • TOPIC_RELATION: expresses the roll-up relationships between the topics in DT_TOPIC
  • TOPIC_MAPPING: expresses the links between the DT_TOPICs for the two languages
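
To illustrate how the fact, bridge and dimension tables fit together, the sketch below builds a tiny in-memory SQLite database and joins FT_RAW_CLIP to DT_TOPIC through BT. Only the table names come from the list above: every column name and every row is hypothetical and must be replaced with the headers found in the exported CSV files.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Table names follow the list above; every column name and value is
# hypothetical and stands in for the contents of the exported CSV files.
conn.executescript("""
CREATE TABLE DT_TOPIC (ID_TOPIC INTEGER PRIMARY KEY, TOPIC TEXT);
CREATE TABLE BT (ID_CLIP INTEGER, ID_TOPIC INTEGER);
CREATE TABLE FT_RAW_CLIP (ID_CLIP INTEGER, ID_DATE INTEGER, EXPERT_SENTIMENT REAL);
INSERT INTO DT_TOPIC VALUES (1, 'Public debt'), (2, 'Migration');
INSERT INTO BT VALUES (10, 1), (11, 1), (11, 2);
INSERT INTO FT_RAW_CLIP VALUES (10, 100, -1), (11, 100, 1);
""")

# Clips reach topics only through the bridge table, because one clip can
# mention several topics and one topic can appear in several clips.
QUERY = """
SELECT t.TOPIC, COUNT(*) AS N_CLIPS, AVG(f.EXPERT_SENTIMENT) AS AVG_SENTIMENT
FROM FT_RAW_CLIP f
JOIN BT b ON b.ID_CLIP = f.ID_CLIP
JOIN DT_TOPIC t ON t.ID_TOPIC = b.ID_TOPIC
GROUP BY t.TOPIC
ORDER BY t.TOPIC
"""
rows = conn.execute(QUERY).fetchall()
print(rows)
```

Because of the many-to-many bridge, a clip that mentions several topics contributes to each of them, so per-topic counts should not be summed to obtain the total number of clips.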

Semantic Occurrences cube

Format: CSV
Download (896 MB; unzipped 4.89 GB): link

The Semantic Occurrences cube is centered on the semantic occurrence of POS entities within clips and explicitly models couples of entities in the same sentence together with an optional predicate. The schema of the cube is depicted in the following figure using the DFM notation, where cube measures are listed inside the box, dimensions are circles directly attached to the box, and hierarchies are shown as DAGs of dimension levels.

For each language, the following tables are provided:

  • FT_RELCONTAIN: the fact table; it contains foreign keys to the main dimension tables (i.e., DT_CLIP and DT_DATE, as well as DT_ENTITY for the first, second and third entity in the semantic occurrence) and the number of occurrences of the relationship between the specified entities in the specified clip. Although the date functionally depends on the clips, it has been pushed down to the fact table to enable faster aggregations
  • DT_CLIP: the dimension table for the clip dimension; it contains the clips' metadata
  • DT_DATE: the dimension table for the date dimension
  • DT_ENTITY: the dimension table for the entity dimension; when entities correspond to topics, they are linked to the DT_TOPIC dimension
  • DT_TOPIC: the dimension table for the topic dimension; it is organized in accordance with the Topic Ontology
  • BT_POLICY_SECTOR: the bridge table linking the topics of class "policy" in DT_TOPIC to the respective sectors in DT_SECTOR
  • DT_SECTOR: the dimension table for the sector attribute
  • TOPIC_RELATION: expresses the roll-up relationships between the topics in DT_TOPIC
  • TOPIC_MAPPING: expresses the links between the DT_TOPICs for the two languages

Inquiries

Format: PDF (questions) & XLSX (answers)
Download (372 KB): link

This package provides a set of 10 inquiries proposed by our domain experts to enable an end-to-end assessment of a whole SBI process (starting from clips and possibly enriched with different combinations of the benchmark components). The package provides two files: Inquiries.pdf lists the inquiries in natural language (also shown below) and provides, for each inquiry, the SQL query required to obtain the answer, either on the Sentiment cube or on the Semantic Occurrences cube; Inquiries.xlsx provides the answers for each inquiry. The inquiries in natural language are the following:

  1. Which is the most discussed sector in relationship with each political party?
  2. Which are the words most frequently associated to each politician on Twitter?
  3. Which are the most frequently discussed topics for each source type?
  4. How frequently are foreign politicians mentioned in each channel supertype?
  5. How is the discussion of policies differentiated amongst male and female authors?
  6. Are there any topics whose volume of discussion significantly changes from UK to Italy?
  7. What is the average sentiment expressed on EPGs and related Parties across different source types?
  8. How are policies discussed in domestic, European, and foreign perspective?
  9. How does the sentiment about each politician and technocrat change along time?
  10. What is the percentage of agreement between crowd and expert sentiment for each channel type?

License

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Acknowledgments

When citing please use:
Castano, S., Ferrara, A., Gallinucci, E., Golfarelli, M., Montanelli, S., Mosca, L., Rizzi, S., Vaccari, C.: Sabine: A multi-purpose dataset of semantically-annotated social content (2018), big.csr.unibo.it/sabine.

Contacts

For any question regarding SABINE, please contact the authors of the paper, who are listed on the right-hand side of this page.