
Event data workshop

Tool developed within the POLIS-EYE project

ENEA has developed and fine-tuned a tool to detect events (concerts, exhibitions, festivals, meetings, major sporting events, etc.) taking place in Emilia-Romagna and to collect them in a database (DB).

The software is an online tool that supports the large-scale detection of events from a heterogeneous set of online sources. It operates in three main phases:
1. Automatic querying of more than 30 websites to find scheduled events, which are stored in working databases and then refined and normalized (ASP component);
2. Processing, filtering and integration of the acquired information, with the addition of events that were not detected automatically (e.g. "context" events such as holidays, events provided by project stakeholders, etc.) (Java component);
3. Storage of the result in a relational DB (Postgres).
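
The integration step of phase 2 can be pictured with a short sketch. It is written in Python for brevity, whereas the actual component is in Java, and it is only an illustration under stated assumptions: the `RawEvent` record, the `dedup_key` rule (same title, day and place) and the "first record wins" policy are hypothetical and are not taken from the tool.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class RawEvent:
    """Hypothetical record produced by phase 1 or supplied as a context event."""
    title: str
    day: date
    place: str
    source: str  # label of the originating site or feed (illustrative)


def dedup_key(event: RawEvent) -> tuple:
    """Assumed rule: two records with the same title, day and place are duplicates."""
    title = " ".join(event.title.lower().split())
    return (title, event.day, event.place.lower().strip())


def integrate(scraped: list[RawEvent], context: list[RawEvent]) -> list[RawEvent]:
    """Merge scraped and context events, keeping the first record seen for each key."""
    merged: dict[tuple, RawEvent] = {}
    for event in scraped + context:
        merged.setdefault(dedup_key(event), event)
    return list(merged.values())
```

With this rule, the same concert reported by two sites collapses into one record, while a holiday supplied as a context event is kept alongside it.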

The database was built following an abstract model and a categorization of events that were created specifically for this purpose but are strongly based on the E015 specifications, so as to take into account the different characteristics of an event (date, location, theme, categorization, capacity of the venue, etc.).


Details about the abstract data model:
The events follow the abstract model developed by ENEA in the POLIS-EYE project, which catalogs events according to type, place and time of occurrence, organizer, target audience and other items.
The abstract model was built ad hoc, but it takes into account the specifications created as part of the E015 initiative of the Lombardy Region (launched on the occasion of Expo 2015 to collect data on events from heterogeneous sources). The events were also divided, according to type, into:

a. Events: single events (e.g. concerts), periodic events (e.g. weekly markets), internal events (project partners, e.g. FICO) and event series (e.g. film festivals held over several days);
b. Anniversaries (holidays, patron saints);
c. Circumstances, i.e. "contextual" events (lockdown, school calendar, etc.).
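
As an illustration of the categories and attributes listed above, the model could be sketched as follows. This is not the POLIS-EYE schema: the class names, field names and the `duration_days` helper are assumptions chosen only to mirror the characteristics mentioned in the text (type, place, time, organizer, target, theme, capacity).

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class EventKind(Enum):
    """Event types named in the text; the identifiers are illustrative."""
    SINGLE = "single event"
    PERIODIC = "periodic event"
    INTERNAL = "internal event"
    SERIES = "event series"          # multi-day reviews such as film festivals
    ANNIVERSARY = "anniversary"      # holidays, patron saints
    CIRCUMSTANCE = "circumstance"    # contextual events such as lockdown


@dataclass
class Event:
    """Hypothetical event record; field names are assumptions."""
    title: str
    kind: EventKind
    start: date
    end: date
    place: str
    theme: Optional[str] = None
    organizer: Optional[str] = None
    target_audience: Optional[str] = None
    venue_capacity: Optional[int] = None

    def duration_days(self) -> int:
        """Number of calendar days covered, counting both start and end."""
        return (self.end - self.start).days + 1
```

A film festival running from 1 to 5 June would then be an `Event` of kind `SERIES` whose `duration_days()` is 5, while a patron saint's day would be an `ANNIVERSARY` lasting one day.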

Web sources for events.
The Event Dataworkshop software automatically and repeatably analyzes a list of event websites covering cultural events, sporting events (football matches, basketball, etc.), music, food and wine, theatre, etc. The sites are maintained both by institutional bodies (Emilia-Romagna, Municipality of Bologna, etc.) and by private entities (BolognaToday, etc.). More than 30 websites were selected from an initial panel of 80. Some are localizations of the same application (for example the www.XXXtoday.it sites: www.bolognatoday.it, www.modenatoday.it, etc.), so the list of sites can be summarized as follows:

- ABC xx (RSS)
- Basketball Serie A
- Basketball series A2
- Basketball Super Cup series 2019-20
- Bologna welcome events
- Football League Serie A
- Metropolitan city - festivals
- Emilia-Romagna welcome
- EVENTA (BO, FA, RN)
- Events and festivals - Itineraries in taste
- SagreinRomagna
- SoloSagre Emilia-Romagna
- xToday Events (BO, PR, FA2, MO, RN)


Processing mode
The data was processed partly in ASP (mainly the automatic search on websites and the identification of the geographical location and venues of events) and partly in Java (integration of the different event sources, elimination of duplicates, completion of information and data, writing to the DB, etc.). The data was gathered by querying the sites, including multi-page ones, and extracting information either from the metadata (for example JSON fragments embedded in the non-visible part of the page) or from the HTML structure of the page (identifying elements that stay constant over time for each piece of information, for example the city in the second column of a table with class 'location'). This approach allows structured data to be recognized, but it is exposed to changes over time in the web applications that generate each event page from an internal DB. It should also be noted that metadata, even when used, is not a reliable or complete source of information.
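
The two extraction routes described above (embedded metadata first, page structure as a fallback) can be sketched as follows. This is a minimal Python illustration, not the ASP code of the tool. It assumes that the embedded JSON is a schema.org-style block inside a `script` tag of type `application/ld+json`, and that the 'location' table has a single row; both are assumptions, as are the names `EventPageParser` and `extract_city`.

```python
import json
from html.parser import HTMLParser
from typing import Optional


class EventPageParser(HTMLParser):
    """Collects embedded JSON blocks and the cells of the table with class 'location'."""

    def __init__(self):
        super().__init__()
        self.json_blocks = []     # raw text of each embedded JSON script
        self.location_cells = []  # text of each cell in the 'location' table
        self._in_json = False
        self._in_location = False
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script" and attrs.get("type") == "application/ld+json":
            self._in_json = True
            self.json_blocks.append("")
        elif tag == "table" and "location" in (attrs.get("class") or "").split():
            self._in_location = True
        elif tag == "td" and self._in_location:
            self._in_cell = True
            self.location_cells.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_json = False
        elif tag == "table":
            self._in_location = False
        elif tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_json:
            self.json_blocks[-1] += data
        elif self._in_cell:
            self.location_cells[-1] += data


def extract_city(html: str) -> Optional[str]:
    """Return the city from the metadata if usable, else from the second table cell."""
    parser = EventPageParser()
    parser.feed(html)
    parser.close()
    for block in parser.json_blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # metadata may be malformed; fall through to the HTML rule
        location = data.get("location") if isinstance(data, dict) else None
        address = location.get("address") if isinstance(location, dict) else None
        city = address.get("addressLocality") if isinstance(address, dict) else None
        if city:
            return city.strip()
    if len(parser.location_cells) >= 2:
        return parser.location_cells[1].strip() or None
    return None
```

Trying the metadata first and falling back to the page structure reflects the caveat in the text: metadata is convenient when present but cannot be relied on alone, and the structural rule must be revised whenever the site changes its layout.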

Identification of places.
An important component is the identification and normalization of event venues. Initially, around 4900 places in Emilia-Romagna where events take place (stadiums, theatres, villas, bookshops, etc.), each with a precise address, were identified and classified. This set was obtained by eliminating incomplete addresses and/or equivalent addresses for the same venue. Subsequently, approximately 1400 more generic places were added (e.g. only the Municipality or Province where the event was held is indicated, either because no other information was found or because the event concerns an entire territory, such as the celebration of a Municipality's Patron Saint), for a total of approximately 6270 places. Of these, 2985 places (48%) have precise geographical coordinates (longitude, latitude), obtained mostly by automatically querying OpenCage.
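
To make the georeferencing step concrete, the sketch below builds a request for the public OpenCage geocoding endpoint and reads the first result of a response. It is an illustration only: how the tool actually queried OpenCage is not documented here, the helper names are hypothetical, and the `countrycode` and `limit` parameters are choices made for this example. No network call is made; the response is passed in as an already-decoded dictionary.

```python
from typing import Optional
from urllib.parse import urlencode

OPENCAGE_ENDPOINT = "https://api.opencagedata.com/geocode/v1/json"


def build_geocode_url(address: str, api_key: str) -> str:
    """Build the query URL for one address (restricted to Italy, one result)."""
    params = {"q": address, "key": api_key, "countrycode": "it", "limit": 1}
    return OPENCAGE_ENDPOINT + "?" + urlencode(params)


def first_coordinates(response: dict) -> Optional[tuple]:
    """Return (longitude, latitude) of the first result, or None if there is none."""
    results = response.get("results") or []
    if not results:
        return None
    geometry = results[0].get("geometry", {})
    if "lat" not in geometry or "lng" not in geometry:
        return None
    return (geometry["lng"], geometry["lat"])
```

Places for which `first_coordinates` returns `None` would stay without coordinates, which is consistent with only part of the catalogued places being georeferenced.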

Operational approach of phase 1 based on ASP libraries:
- 0. Choice of the entities on which to activate the tool (events, places, ...).
- 1. Analysis or "Ingestion Workshop": indication of the sources to be analysed and development of the search and extraction parameters. For each source, this phase involves defining specific rules for interpreting and extracting the web page (HTML, JSON, XML).
- 2. Mining: mass extraction and primary data storage.
- 3. Fusion: data normalization, event recognition and categorization, for structured data storage according to the adopted description and classification scheme.
- 4. Letter of request: automated search on external databases for information about the places mentioned, in order to georeference the places and eliminate synonyms. This phase requires manual intervention to indicate the synonyms, which are then mapped onto a unique place name.
- 5. Delivery: production of a structured file to be passed to the Java component of phase 2.
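
Steps 4 and 5 can be illustrated with a small sketch in Python (the actual libraries are in ASP). The synonym table, the function names and the choice of JSON as the structured delivery format are all assumptions made for this example; the source does not specify the file format or how synonyms are stored.

```python
import json

# Hypothetical synonym table, filled in manually as step 4 requires.
# Keys are lower-cased, whitespace-normalized variants; values are the unique names.
SYNONYMS = {
    "stadio dall'ara": "Stadio Renato Dall'Ara",
    "stadio r. dall'ara": "Stadio Renato Dall'Ara",
}


def canonical_place(name: str, synonyms: dict) -> str:
    """Map a place name onto its unique name, or return it trimmed if unknown."""
    key = " ".join(name.lower().split())
    return synonyms.get(key, name.strip())


def deliver(events: list, synonyms: dict) -> str:
    """Normalize the 'place' field of each event and serialize the list as JSON."""
    normalized = [dict(e, place=canonical_place(e["place"], synonyms)) for e in events]
    return json.dumps(normalized, ensure_ascii=False, indent=2)
```

Keeping the synonym table as plain data separates the manual part of step 4 (deciding which names are equivalent) from the automated mapping applied before delivery.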


The ASP libraries are available 'as is' in the downloadable zip file (they require an IIS server based operating environment which is not made available here).



Interface documentation: US213-001




Figure: statistics about analysed sources

Figure: outcome of the analysis, information about one event


Resources

DOC - DATAWORKSHOP-download.zip
The Laboratory has carried out projects funded by the European Funds of the Emilia-Romagna Region and by the Fondo per lo sviluppo e la coesione (Development and Cohesion Fund).