Within the framework of the “Bando Unico 2012” funding call

SenTaClAus

Sentiment Tagging & Clustering Analysis on web & social contents


SenTaClAus: the Industrial Research and Experimental Development project

SenTaClAus is an Industrial Research and Experimental Development project submitted by Net7, together with the companies Studio Flu Srl and SpazioDati Srl, to the Tuscany Region within the framework of the “Bando Unico 2012” funding call.

The project ranked eleventh among the roughly 90 proposals evaluated in line A and was therefore admitted to funding.

SenTaClAus is an acronym for “Sentiment Tagging & Clustering Analysis on web & social contents”, a name that sums up the project’s main themes.

On the one hand, SenTaClAus will focus on the analysis and extraction of semantics from text documents (Text Analysis); on the other, it will concentrate on identifying users’ trends, behaviours and opinions by analysing the content they publish on Social Networks (Trend Analysis).

SenTaClAus also involves the collaboration of the Advanced Algorithms and Applications (A3) group of the Department of Computer Science of the University of Pisa, headed by Prof. Ferragina: its research on Entity Extraction and Semantic Tagging, which has received numerous academic awards, will be the foundation of the systems experimented with in the project.



The Project

Text and Trend Analysis

The ability of a computer to handle large quantities of text documents, which make up the vast majority of digital content on the web and elsewhere, depends on the possibility of creating machine-readable “summaries” of them.

There are two approaches to this problem: the first, coming from the world of search engines, consists of indexing a set of words without trying to grasp their meaning. The second, more recent, relies on the ability to understand the semantics of a text by identifying the relevant terms it contains and the relationships that link them.

The rise of Social Networks and their growing importance in observing social and market dynamics have led to a dramatic increase in the demand for Social Media Analysis tools.

Such tools are mostly based on syntactic text analysis which, although effective for search engines, is not sufficient to capture the complexity required in these contexts.

Building effective Social Media Analysis tools therefore depends on the quality of the underlying text analysis technologies.

Drawing on the existing know-how of the three SMEs involved and of the A3 research group of the University of Pisa, directed by Prof. Paolo Ferragina, the SenTaClAus project carries out research on software systems for:

– the analysis and extraction of semantics from text documents (Text Analysis)
– the identification of users’ trends, behaviours and opinions by analysing the content published on Social Networks (Trend Analysis).

On the basis of the experimental results, prototypes will be built to test the delivery of these capabilities as Cloud Computing services, in a software-as-a-service model which, by significantly improving on the state of the art, will give SMEs an important competitive advantage.

The Partners

Netseven


Net7 was founded in 2001 and immediately specialised as a system integrator in the Open Source field.

Spazio Dati

Exploiting Big Data and Semantic Web technologies, they build multidimensional dataspaces that aggregate hundreds of Open and proprietary data sources.

Lab of Advanced Algorithms and Applications

Research group of the University of Pisa, directed by Prof. Paolo Ferragina

Demo

SenTaClAus Endpoints

tag

Named Entity Extraction & Linking – With this API you will be able to automatically tag your texts, extracting Wikipedia entities and enriching your data.

http://devsentaclaus.netseven.it/tag

We support both GET and POST methods to query the API.

Examples
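
For instance, a minimal sketch in Python (with the requests library) of a call to this endpoint could look as follows; it assumes the endpoint is publicly reachable as documented here and that no extra authentication parameters are needed:

import requests

# Tag a plain-text snippet: extract Wikipedia entities together with extra information.
response = requests.post(
    "http://devsentaclaus.netseven.it/tag",
    data={
        "text": "Leonardo da Vinci painted the Mona Lisa in Florence.",
        "lang": "en",                 # omit to let the service auto-detect the language
        "min_confidence": 0.6,        # entities below this confidence are discarded
        "include": "types,abstract",  # optional extra information for each entity
    },
)
response.raise_for_status()

for annotation in response.json().get("annotations", []):
    print(annotation["spot"], "->", annotation["title"], annotation["confidence"])
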
parameters
text|url|html
[required]
These parameters define how you send text to the NEX API. Only one of them can be used in each request, following these guidelines:

  • use text when you have plain text that doesn’t need any pre-processing;
  • use url when you have a URL and you want the service to work on its main content; the API will fetch the URL for you, and use an AI algorithm to extract the relevant part of the document to work on; in this case, the main content will also be returned by the API to allow you to properly use the annotation offsets;
  • use html when you have an HTML document and you want the service to work on its main content, similarly to what the “url” parameter does.

Type: string

lang
[optional]
The language of the text to be annotated; currently Italian, English, French, German and Portuguese are supported. Leave this parameter out to let the service automatically detect the language for you.
Type: string
Default value: auto
Accepted values: it | en | fr | de | pt | auto
min_confidence
[optional]
The threshold for the confidence value; entities with a confidence value below this threshold will be discarded. Confidence is a numeric estimation of the quality of the annotation, ranging between 0 and 1. A higher threshold means you will get fewer but more precise annotations; a lower value means more annotations but also more erroneous ones.
Type: float
Default value: 0.6
Accepted values: 0.0 .. 1.0
min_length
[optional]
With this parameter you can remove those entities having a spot shorter than a minimum length.
Type: integer
Default value: 2
Accepted values: 2 .. +inf
parse_hashtag
[optional]
With this parameter you enable special hashtag parsing to correctly analyze tweets and Facebook posts.
Type: boolean
Default value: false
Accepted values: true | false
include
[optional]
Returns more information on annotated entities:

  • types – adds type information from DBpedia. Types are extracted from the DBpedia of the language specified by the lang parameter. Please notice that different DBpedia instances may contain different types for the same resource;
  • categories – adds category information from DBpedia/Wikipedia;
  • abstract – adds the text of the Wikipedia abstract;
  • image – adds a link to an image depicting the tagged entity, as well as a link to the image thumbnail, served by Wikimedia. Please check the licensing terms of each image on Wikimedia before using it in your app;
  • sameas – adds links to equivalent (sameAs) entities in Linked Open Data repositories or other websites. It currently only supports DBpedia and Wikipedia.

Type: comma-separated list
Default value: <empty string>
Accepted values: types, categories, abstract, image, sameas
Example: include=types,sameas

Response

The response is structured in JSON as follows:

{
  "timestamp": "Date and time of the response generation process",
  "time": "Time elapsed for generating the response (milliseconds)",
  "lang": "The language used to tag the input text",
  "langConfidence": "Accuracy of the language detection, from 0.0 to 1.0. Present only if auto-detection is on",
  "text": "The annotated text. Present only if the 'url' or 'html' parameters have been used",
  "url": "The actual URL from which the text has been extracted. Present only if the 'url' parameter has been used",
  "annotations": [
    {
      "id": "ID of the linked Wikipedia resource",
      "title": "Title of the linked Wikipedia resource",
      "uri": "URL of the entity on Wikipedia",
      "label": "Most common name used to represent the resource",
      "confidence": "Value of confidence for this annotation",
      "spot": "Annotated string, as it is in the input text",
      "start": "Character position in the input text where the annotation begins",
      "end": "Character position in the input text where the annotation ends",
      "types": ["List of types of the linked DBpedia resource","Only if 'include' parameter contains 'types'"],
      "categories": [
        "List of the category of the linked DBpedia resource",
        "Only if 'include' parameter contains 'categories'"
      ],
      "abstract": "Abstract of the linked Wikipedia resource. Only if 'include' parameter contains ­'abstract'",
      "lod": {
        "wikipedia": "URL of the Wikipedia article that represents the resource",
        "dbpedia": "URI of the resource on DBpedia"
      },
      "image": {
        "full": "URL of a depiction of the resource on Wikimedia. Only if 'include' parameter contains 'image'",
        "thumbnail": "URL of the thumbnail of the depiction. Only if 'include' parameter contains 'image'",
      }
    }
  ]
}

similarity

Semantic sentence similarity API optimized on short sentences. With this API you will be able to compare two sentences and get a score of their semantic similarity. It works even if the two sentences don’t have any word in common.

http://devsentaclaus.netseven.it/similarity

We support both GET and POST methods to query the API.

Examples
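
As an illustration, a minimal Python sketch of a similarity request (under the same assumptions as the tag example: public endpoint, no additional authentication):

import requests

# Compare two sentences that share no words but express related concepts.
response = requests.get(
    "http://devsentaclaus.netseven.it/similarity",
    params={
        "text1": "The stock market fell sharply today.",
        "text2": "Shares plummeted on Wall Street.",
        "lang": "en",
    },
)
response.raise_for_status()
print(response.json()["similarity"])  # between 0.0 and 1.0, higher means more similar
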
parameters
text1|url1|html1
[required]
These parameters define how you send to the API the first text you want to compare. Only one of them can be used in each request, following these guidelines:

  • use text when you have plain text that doesn’t need any pre-processing;
  • use url when you have a URL and you want the API to work on its main content; the API will fetch the URL for you, and use an AI algorithm to extract the relevant part of the document to work on; in this case, the main content will also be returned by the API to allow you to properly use the annotation offsets;
  • use html when you have an HTML document and you want the API to work on its main content, similarly to what the “url” parameter does.

Type: string

text2|url2|html2
[required]
These parameters define how you send to the API the second text you want to compare, in the same way as the text1|url1|html1 parameters.
Type: string
lang
[optional]
The language of the text to be compared; currently Italian, English, French, German and Portuguese are supported. Leave this parameter out to let the service automatically detect the language for you.
Type: string
Default value: auto
Accepted values: it | en | fr | de | pt | auto
include_annotations
[optional]
Enables a detailed report with the annotation sets used to calculate the similarity.
Type: boolean
Default value: false
Accepted values: true | false
Response

The response is structured in JSON as follows:

{
  "timestamp": "Date and time of the response generation process",
  "time": "Time elapsed for generating the response (milliseconds)",
  "lang": "The language used to compare the given texts",
  "langConfidence": "Accuracy of the language detection, from 0.0 to 1.0. Present only if auto-detection is on",
  "similarity": "Similarity of the two given texts, from 0.0 to 1.0. Higher is better"
}

classify

Classifies short documents into a set of user-defined classes. It’s a very powerful and customizable tool for text classification. To define your own models, please refer to User-defined models.

http://devsentaclaus.netseven.it/classify

We support both GET and POST methods to query the API.

parameters
text1|url1|html1
[required]
These parameters define how you send to the API the text you want to classify. Only one of them can be used in each request, following these guidelines:

  • use text when you have plain text that doesn’t need any pre-processing;
  • use url when you have a URL and you want the API to work on its main content; the API will fetch the URL for you, and use an AI algorithm to extract the relevant part of the document to work on; in this case, the main content will also be returned by the API to allow you to properly use the annotation offsets;
  • use html when you have an HTML document and you want the API to work on its main content, similarly to what the “url” parameter does.

Type: string

model
[required]
The “unique ID” of the model you want to use. If you want to learn how to manage your custom models, please refer to User-defined models.
min_score
[optional]
Returns only those categories whose score is above this threshold. There is no golden value for this parameter that works for every model; it strongly depends on your use case. Start experimenting with 0.25 and increase or decrease it depending on the results.
Default value: 0.0
Accepted values: 0.0 .. 1.0
Advanced parameters
max_annotations
[optional]
The Classifier uses the Annotator (tag) under the hood. With this parameter you can limit the number of annotations used to classify the text, keeping only the top entities ranked by confidence.
Default value: +inf
Accepted values: 1 .. +inf
include
[optional]
Returns more information about the classification process:
“score_details” – we added this parameter for debugging purposes: it outputs, for each entity in the model categories, a weight value that represents how much it influenced the overall score of its category. For each category, the weights sum up to 1.
Please note: this parameter can be used only by the model owner.
Default value: <empty string>
Accepted values: score_details
Example: include=score_details
Response

The response is structured in JSON as follows:

{
  "timestamp": "Date and time of the response generation process",
  "time": "Time elapsed for generating the response (milliseconds)",
  "lang": "The language used to classify the input text (defined in the model)",
  "categories": [
    {
      "name": "The name of the category",
      "score": "The score of the category",
      "scoreDetails": {
        "entity": "URI of the entity. Only if 'include' parameter contains 'score_details'",
        "weight": "Weight of the entity. Only if 'include' parameter contains 'score_details'",
      }
    }
  ]
}
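
A minimal Python sketch of a classify call; "my-model-id" is a placeholder for the unique ID obtained when creating a model (see the models endpoint below), and the endpoint is assumed to be reachable without additional authentication:

import requests

response = requests.post(
    "http://devsentaclaus.netseven.it/classify",
    data={
        "text1": "The government announced new elections for next spring.",
        "model": "my-model-id",  # hypothetical ID: replace with the one returned at model creation
        "min_score": 0.25,       # suggested starting threshold, to be tuned per model
    },
)
response.raise_for_status()

# Print the categories that scored above the threshold.
for category in response.json().get("categories", []):
    print(category["name"], category["score"])
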

models

Manages all your classification models, following the CRUD(L) paradigm. With this endpoint you will be able to integrate classification into your own applications. If you have already created your model and you want to try it out, please refer to the classify API.

Create a new model

POST http://devsentaclaus.netseven.it/cl-model
parameters
data
[required]
The model you want to create, structured as described below.

Returns the submitted data together with the unique id assigned to the model.

Read a specific model

GET http://devsentaclaus.netseven.it/cl-model
parameters
id
[required]
The id of the model you want to fetch.

Returns the requested model.

Update an existing model

PUT http://devsentaclaus.netseven.it/cl-model
parameters
id
[required]
The id of the model you want to update.
data
[required]
The updated model, structured as described below.

Returns the updated model.

Delete a model

DELETE  http://devsentaclaus.netseven.it/cl-model
parameters
id
[required]
The id of the model you want to delete.

List all your models

GET  http://devsentaclaus.netseven.it/cl-model
parameters
none

Returns the list of all your models.

Model structure

A Model is simply composed of a list of categories, each defined as a set of entities represented as (weighted) Wikipedia pages, which “describe” the category itself. Writing your own model is quite simple! Need a Sport category? You could represent it as:

http://en.wikipedia.org/wiki/Baseball
http://en.wikipedia.org/wiki/Basketball
http://en.wikipedia.org/wiki/Football

Around 10 entities per category usually do the trick.

In general, a model is defined following this structure:

{
  "lang": "The language the model will work on",
  "description": "A human-readable string you can use to describe this model",
  "categories": [
    {
      "name": "The category name",
      "topics": {
        "topic1": "weight",
        "topic2": "weight",
        "...": "...",
      }
    }
  ]
}

Topics are represented as Wikipedia pages. You can refer to each topic by its URI http://en.wikipedia.org/wiki/Baseball, by its title Baseball, or by its Wikipedia page ID 3850. In the last two cases, the lang attribute will be used to select the Wikipedia edition to match against.

Model example

A very simple example to give an idea of what it means to define a custom model.

{
  "description": "My first model for classifying news",
  "lang": "en",
  "categories": [
    {
      "name": "Sport",
      "topics": {
        "http://en.wikipedia.org/wiki/Sport": 2.0,
        "http://en.wikipedia.org/wiki/Baseball": 1.0,
        "http://en.wikipedia.org/wiki/Basketball": 1.0,
        "http://en.wikipedia.org/wiki/Football": 1.0
      }
    },
    {
      "name": "Politics",
      "topics": {
        "Politics": 2.0,
        "Politician": 1.5,
        "Brack Obama": 1.0,
        "David Cameron": 1.0,
        "Angela Merkel": 1.0
      }
    }
  ]
}
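
As a sketch, a model like the one above could be created and read back through the cl-model endpoint as follows (Python with requests). The exact encoding of the data parameter and the name of the ID field in the response are assumptions based on the description above, not a verified contract:

import json
import requests

model = {
    "description": "My first model for classifying news",
    "lang": "en",
    "categories": [
        {"name": "Sport", "topics": {"http://en.wikipedia.org/wiki/Sport": 2.0,
                                     "http://en.wikipedia.org/wiki/Baseball": 1.0}},
        {"name": "Politics", "topics": {"Politics": 2.0, "Barack Obama": 1.0}},
    ],
}

# Create the model: the 'data' parameter is assumed to carry the model as a JSON string.
created = requests.post("http://devsentaclaus.netseven.it/cl-model",
                        data={"data": json.dumps(model)}).json()
model_id = created["id"]  # field name assumed from "the unique id designating the model"

# Read the model back by id.
fetched = requests.get("http://devsentaclaus.netseven.it/cl-model",
                       params={"id": model_id}).json()
print(fetched["description"])
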

utilities

Useful services for internal debugging and support.

Tag debug

Basic UI to evaluate the work of the annotator/disambiguator:
devsentaclaus.netseven.it/_debug.html


Spotify

Identifies all the spots in the text:

http://devsentaclaus.netseven.it/_spotify
Examples
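
A minimal Python sketch of a spotify request (the structure of the returned spot list is not documented here, so the example simply prints the raw JSON):

import requests

# Identify all the spots (candidate entity mentions) in a short text.
response = requests.get(
    "http://devsentaclaus.netseven.it/_spotify",
    params={"text": "Rome is the capital of Italy", "lang": "en"},
)
response.raise_for_status()
print(response.json())
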
parameters
text
[required]
The plain text to be analyzed
Type: string
lang
[required]
The language of the text; currently Italian, English, French, German and Portuguese are supported.
Type: string
Accepted values: it | en | fr | de | pt

Topic info

Returns information about a topic:

http://devsentaclaus.netseven.it/_topic
Examples
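
A minimal Python sketch of a topic lookup by title (an ID or a Wikipedia URL would also work, as described below); the shape of the returned record is not documented here, so the raw JSON is printed:

import requests

response = requests.get(
    "http://devsentaclaus.netseven.it/_topic",
    params={"topic": "Baseball", "lang": "en"},
)
response.raise_for_status()
print(response.json())
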
parameters
topic
[required]
The topic to be searched. It can be specified either by title, by ID, or by URL.
Type: string
lang
[required]
The language of the text; currently Italian, English, French, German and Portuguese are supported.
Type: string
Accepted values: it | en | fr | de | pt

Documents

Project documentation

Below we publish a selection of the project deliverables.

  • Presentation of the Text Analysis APIs developed in SenTaClAus. 2nd Project Open Day, Navacchio (Pisa), 02/10/2014

  • Presentation of the Trend Analysis service developed in SenTaClAus

SenTaClAus events and conferences

  • 1st Project Open Day – Net7, Pisa, 14/02/2014
  • 2nd Project Open Day – Auditorium of the Polo di Navacchio (Pisa), 02/10/2014
  • Project conference, Department of Computer Science of the University of Pisa, 23/03/2015


Scientific publications produced within SenTaClAus

  • Ugo Scaiella, Michele Barbera, Stefano Parmesan, Gaetano Prestia, Emilio Del Tessandoro and Mario Verì. DataTXT at #Microposts 2014. In Proceedings of the 4th Workshop on Making Sense of Microposts (#Microposts2014) at the International World Wide Web Conference (WWW ’14). [ref, pdf]
  • Marco Cornolti, Paolo Ferragina, and Massimiliano Ciaramita. 2013. A framework for benchmarking entity-annotation systems. In Proceedings of the 22nd international conference on World Wide Web (WWW ’13). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 249-260. [ref, pdf]

Seminars and events

Prof. Paolo Ferragina, as director of the A³ laboratory of the Department of Computer Science of the University of Pisa and partner of the SenTaClAus project, has presented part of the project’s results at the following events and seminars:

  • Invited talk on “Reti Sociali: algoritmi, analisi del linguaggio e applicazioni” (Social networks: algorithms, language analysis and applications) at the Social Banking meeting organised by KPMG in Milan. May 2013.
  • Invited speaker at the PhD School on “Computational Social Science: Big Data”, Lipari, July 2013, where he taught a course on “Beyond the Bag-Of-Words Paradigm”.
  • Workshop on “ICT and Knowledge Acceleration” within the research valorisation activities of the University of Pisa and of the Tuscan ICT cluster, with the participation of several stakeholders of the regional and national Technology Transfer ecosystem. September 2013.
  • Paolo Ferragina, “Algorithmic challenges in data storage and indexing”, workshop on Next Generation Data Center in the context of the European Conference on Networks and Communications, Bologna, June 23, 2014.
  • Paolo Ferragina and Raffaele Perego, “Motori di Ricerca” (Search Engines), a T-Tour event at the Internet Festival, October 2014, Pisa.
  • StartUp Saturday, Florence, 13 December 2014, Paolo Ferragina (University of Pisa) and Gabriele Antonelli (founder of SpazioDati), “Big Data e Motori di Ricerca” (Big Data and Search Engines). (http://www.startupsaturday.it/events/come-utilizzare-i-big-data-per-il-business/)

Other scientific publications related to the project

  • Daniele Vitale, Paolo Ferragina, and Ugo Scaiella. 2012. Classification of short texts by deploying topical annotations. In Proceedings of the 34th European conference on Advances in Information Retrieval (ECIR’12), Ricardo Baeza-Yates, Arjen P. Vries, Hugo Zaragoza, B. Barla Cambazoglu, and Vanessa Murdock (Eds.). Springer-Verlag, Berlin, Heidelberg, 376-387. [ref]
  • Ugo Scaiella, Paolo Ferragina, Andrea Marino, and Massimiliano Ciaramita. 2012. Topical clustering of search results. In Proceedings of the fifth ACM international conference on Web search and data mining (WSDM ’12). ACM, New York, NY, USA, 223-232. [ref]
  • Paolo Ferragina and Ugo Scaiella. 2012. Fast and Accurate Annotation of Short Texts with Wikipedia Pages. IEEE Softw. 29, 1 (January 2012), 70-75. [ref]
  • Paolo Ferragina and Ugo Scaiella. 2010. TAGME: on-the-fly annotation of short text fragments (by wikipedia entities). In Proceedings of the 19th ACM international conference on Information and knowledge management (CIKM ’10). ACM, New York, NY, USA, 1625-1628. [ref, tech. rep.]

Other seminars and events related to the project

  • Invited speaker at the Industrial Track of the European Conference on Information Retrieval (ECIR), Barcelona (ES), with a talk entitled “Topic-based annotation of short texts, with applications”. April 2012.
  • An 8-hour seminar entitled “I motori di ricerca: passato, presente e futuro prossimo” (Search engines: past, present and near future), held as part of the scouting and marketing activities carried out by Lucense (Lucca) as managing body of the INNOPAPER Innovation Cluster. October 2012.