Revision as of 05:01, 12 February 2010
Okay, let me try and explain what this great idea is about.
First, let's go over some basic politics as an introduction to what I will lay out later.
Politics
First, you should know I'm a fucking social anarchist. And as such, I don't like the inherent hierarchy of the powers in place and their ugly scheming to get to the top. I just can't stand politics and corruption. I just can't grasp the concept of lust for power and money. And I can't even begin to understand why someone who has enough money to buy a small country just needs even more.
Let me be clear on these thoughts: I don't want to blow everything up, shoot everyone and make a revolution. Our capitalist system is obviously far from perfect but I believe it can be "mended" in many ways so that we achieve more equality in revenues and a huge part of the world isn't left aside like junk. I would have liked not to quote the obvious here, like the richest 2% owning half the wealth of the planet, or that the cost of the war in Iraq alone would have been enough to buy all the weapons we're so afraid of, or even that it would cost $40 billion annually to feed the hungry (while the G8 summit where "important" people discuss this matter costs $600 million on its own), but I write these small facts here as a memento for some other time.
Keeping that in mind, after spending many years being angry at everything, we need to focus on finding ways to change things and make the system more equal.
Politicians, whether they are left-wing liberal democrats or right-wing conservative republicans, all want the same thing: power. They usually have a short-term vision of things, essentially because of their equally short-term mandates. Laws and amendments often get voted in only to be overruled one or two presidential mandates later, yielding a Brownian-motion-like status quo. Also, these politicians almost always come from the bourgeoisie and aristocracy; they are trained in elite schools whose diplomas all but guarantee a successful career. These people are NOT your friends: they are not of your class and don't know the cost, the difficulties and the precarity of life, yet here they are, trying to solve problems of yours they don't have a clue about. They are only theoreticians of life.
Also, it is my intimate conviction that politicians have no real power anymore and don't rule their countries as they used to: multinational corporations do, through lobbying and economic pressure. Politicians can only limit the damage caused to their countries by these corporations (when they are willing to do so) by applying mere patches and solving neighborhood-range crises, when they are not altogether at the mercy of such corporations through either economic blackmail or plain corruption.
Politicians have become the CEOs of their countries, which they now run like mere corporations. We're the employees. Revolving doors between government positions and private sector companies spin 24/7. Conflicts of interest now show blatantly in the open and are part of the system.
Another one of my convictions is that the economic system in place is anti-human in all its forms. I mean it's not in the interest of the market, ever, that people are happy and in good health! If people were all well fed, all had shelter, all were in good health and were all happy with the simple facts of life instead of pursuing "happiness" through consumption, then the market would collapse.
What we need is a way to make the people in place take their job seriously. We need a way to monitor what they are doing, to understand what their agendas really are and what possible conflicts of interest they are in: we need to find a way to make them do the work they were elected for. The "affairs" newspapers sometimes leak are mere accidents. I'm sure there are hundreds of these affairs we never hear of, and that's a shame. If we had a way to somewhat automatically find relations between people, trace their lives and monitor their quotes and achievements, then we would have a tool to actually "measure" the honesty and value of these people.
They are, after all, public persons elected by the public. It's only fair to assume they should be accountable to the public!
What's the Relation with Semantic Analysis?
What I'm proposing here is a tool to help people monitor public persons.
The idea is quite simple really: we need to create a program that automatically analyses all possible documents (newspapers mainly, proceedings, reports, bulletins) that pertain to the public life of public persons and builds a huge facts database or FDB. This is not spying on these people but merely collecting data on them through quotations of existing documents.
In the end, the FDB should contain a pretty amazing summary of the career of public people. Also, it should contain very useful information on the relationships and collaboration between people. And when I say people, I also mean corporation CEOs and their companies as well (which are now accountable as legal persons according to the law).
Using a simple system of scoring for relationships and public affairs, it should be fairly easy to give "grades" to the public persons, companies or to the facts themselves, ranging from "truthful" to "very doubtful". As an example, if we somehow managed to find that a scientific report about the usefulness of GMOs was written by someone who used to work for a company that was at some point commissioned by Monsanto for a project, it would be quite difficult to give a "truthful" grade to that report. There would be a clear conflict of interest here, but it would only be made clear by the program really. An investigative journalist could do that too, but that would be a lot of work, and journalists are not always free of interests either.
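To give an idea of what such a scoring could look like, here is a minimal Python sketch. Everything in it is an assumption made for illustration: only the two extreme grades come from the text above, while the intermediate grades, the `Link` record, the 0.6 penalty and the "halve the penalty per extra hop" rule are made up and would need real tuning.

```python
from dataclasses import dataclass

# Hypothetical grade scale; only the two extremes come from the text.
GRADES = ["very doubtful", "doubtful", "plausible", "truthful"]


@dataclass(frozen=True)
class Link:
    """A known relation between a document's author and an interested party."""
    source: str  # e.g. the author of a report
    target: str  # e.g. a company with a stake in the report's conclusion
    hops: int    # 1 = direct relation, 2 = via one intermediary, ...


def trust_score(links: list[Link]) -> float:
    """Start from full trust; subtract a penalty per link, halved per extra hop."""
    score = 1.0
    for link in links:
        score -= 0.6 / (2 ** (link.hops - 1))  # 0.6 is an arbitrary weight
    return max(score, 0.0)


def grade(score: float) -> str:
    """Map a score in [0, 1] onto the grade scale."""
    index = min(int(score * len(GRADES)), len(GRADES) - 1)
    return GRADES[index]


# The GMO report example: the author worked for a company commissioned by Monsanto.
report_links = [Link("report author", "Monsanto", hops=2)]
report_grade = grade(trust_score(report_links))
```

With these made-up weights, a report with no known link keeps the "truthful" grade, while the two-hop link of the example is already enough to lose it.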
Now you're starting to understand where I'm going.
Program Description
To achieve this, we need to separate the program into several stages:
- Bots that will be used to collect and update data from known sources, mainly online newspaper archives and "trusted" sources
- A Lexical Analyser that will be used to verify the lexical validity of a text prior to feeding it to translation and semantic analysis
- A Translator Module that will be used to format the text in a universally readable format so the semantic analyser can be independent of the source language
- The Semantic Analyser that will perform the semantic analysis of the language-independent text and that will basically tie facts to names
- The Facts Database that will store facts and their relation to people, brands or companies
- The Query Engine that will be able to answer user queries and display usable information
Bots
The bots will need to be written specifically for the target site to harvest the data as they are provided by the target site. The main code that grabs the text will be the same for all sites but the part that posts requests to the site will have to be specific to the site itself.
Also, if the site changes presentation or access permissions, the bot should handle failures gracefully and warn us that the code needs to be changed to fit the new site requirements.
The bots should also be able to determine if a text is part of a group of texts, as newspapers often choose to write several articles on the same topic, and these texts should then be marked as dealing with the same subject.
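The split between shared and site-specific code could be sketched like this in Python. This is only an illustration under assumptions: the class names, the `fetch` and `warn` callables and the imaginary "DOSSIER:" page layout are all hypothetical, and the actual downloading is deliberately left to whatever function gets passed in as `fetch`.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Callable, Optional


class SiteLayoutChanged(Exception):
    """Raised when a page no longer looks the way the bot expects."""


@dataclass
class Article:
    url: str
    text: str
    group: Optional[str] = None  # shared key for articles on the same topic


class Bot(ABC):
    """Common harvesting code; subclasses hold everything site-specific."""

    def __init__(self, fetch: Callable[[str], str],
                 warn: Callable[[str], None] = print):
        self.fetch = fetch  # injected so the transport (HTTP, cache...) can vary
        self.warn = warn

    @abstractmethod
    def extract(self, url: str, raw: str) -> Article:
        """Site-specific parsing; must raise SiteLayoutChanged on surprises."""

    def harvest(self, urls: list[str]) -> list[Article]:
        articles = []
        for url in urls:
            try:
                articles.append(self.extract(url, self.fetch(url)))
            except SiteLayoutChanged as exc:
                # Fail gracefully: skip the page and tell a human about it.
                self.warn(f"{type(self).__name__} needs updating for {url}: {exc}")
        return articles


class ExampleGazetteBot(Bot):
    """Bot for an imaginary site whose pages start with a 'DOSSIER: <name>' line."""

    def extract(self, url: str, raw: str) -> Article:
        header, sep, body = raw.partition("\n")
        if not sep or not header.startswith("DOSSIER: "):
            raise SiteLayoutChanged("missing 'DOSSIER:' header")
        return Article(url=url, text=body.strip(), group=header[len("DOSSIER: "):])
```

The `group` field is where a bot would record that several articles belong to the same series; how a real site exposes that information is of course specific to the site.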
Lexical Analyser
This part is language-specific and should be used to verify the validity of the text. It should do some basic checks like syntax and spelling, punctuation and pre-formatting so the text is ready for translation.
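As a rough idea of the kind of checks meant here, a minimal Python sketch for English text follows. The function names and the tiny dictionary are assumptions for illustration; a real analyser would rely on a proper dictionary, and the punctuation rule is English-only (French, for instance, puts a space before some punctuation marks).

```python
import re


def preformat(text: str) -> str:
    """Collapse whitespace and remove spaces before punctuation marks."""
    text = re.sub(r"\s+", " ", text).strip()
    return re.sub(r"\s+([,.;:!?])", r"\1", text)


def unknown_words(text: str, dictionary: set[str]) -> list[str]:
    """Return words missing from the dictionary, skipping capitalised words
    since those are likely proper names (or sentence starts)."""
    words = re.findall(r"[A-Za-z']+", text)
    return [w for w in words
            if not w[0].isupper() and w.lower() not in dictionary]


# Hypothetical toy dictionary.
DICTIONARY = {"the", "cat", "ate", "mouse"}
cleaned = preformat("The  cat aet\n the mouse .")
suspects = unknown_words(cleaned, DICTIONARY)
```

A text with too many suspect words could then be rejected or flagged before it ever reaches the translator.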
Translator Module
This module is language-specific. It's one of the most important parts of the program as it's responsible for translating any sentence of a given language into a language-independent symbolic form.
For example, the sentence "The cat ate the mouse" contains:
- "the cat", a definite subject
- "ate", a verb in the past tense
- "the mouse", a definite object or target
Let's say A is the symbol for "cat", B is the symbol for "eat" and C is the symbol for "mouse".
Let's also define the "d" subscript for "definite" ("the", as opposed to "a" or "some").
Finally, let's define the "p" subscript for "past" or "preterit".
We could then write the sentence as:
<math>\mathbf{A_d} \to \mathbf{B_p} \to \mathbf{C_d}</math>
That kind of symbolic representation of sentences is valid in any language. All you need is a symbol database for nouns, verbs, adjectives, adverbs, etc. Automatic contextual translation of texts has made incredible leaps forward these last few years, as shown by Google in their excellent presentation video of Wave, where their automatic translation tool (http://andrewhitchcock.org/?post=322) performed quite impressively.
The translator module is the first part of the semantic analysis and needs to be carefully written for each language, but I believe it's possible to make the code quite reusable so that only minor changes need to be made for each language. The symbolic representation of sentences thus obtained could also serve as a base for automatic translation of text.
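The "The cat ate the mouse" example can be sketched in a few lines of Python. This is a toy under loud assumptions: the `Token` class, the `LEXICON` table and the word-by-word `translate` function are hypothetical, and a real translator would need actual grammatical analysis instead of a lookup table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    symbol: str                     # language-independent identifier, e.g. "A" for cat
    flags: frozenset = frozenset()  # e.g. {"d"} for definite, {"p"} for past

    def __str__(self) -> str:
        return self.symbol + ("_" + "".join(sorted(self.flags)) if self.flags else "")


# Toy English lexicon: surface form -> (symbol, flags carried by the form itself).
LEXICON = {
    "cat": ("A", set()),
    "eat": ("B", set()),
    "ate": ("B", {"p"}),
    "mouse": ("C", set()),
}


def translate(sentence: str) -> list[Token]:
    """Turn a very simple English sentence into a list of symbolic tokens."""
    tokens, pending = [], set()
    for word in sentence.lower().rstrip(".").split():
        if word == "the":
            pending.add("d")  # the definite flag applies to the next word
            continue
        if word in ("a", "an", "some"):
            pending.clear()   # indefinite: no flag in this toy scheme
            continue
        symbol, flags = LEXICON[word]
        tokens.append(Token(symbol, frozenset(flags | pending)))
        pending = set()
    return tokens


symbolic = " -> ".join(str(t) for t in translate("The cat ate the mouse"))
```

Only `LEXICON` and the article handling are English-specific here; the `Token` form itself is what the semantic analyser would consume, whatever the source language.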
Semantic Analyser
This module is generic and feeds on the symbolic text representation.
The purpose of this module is to tie actual facts to people (or names or brands). It also needs to understand the context in which the names and facts are quoted.
Topic
It will make heavy use of synonym and lexical field databases to unroll the style of some reporters into plain understandable prose. For example, if a text or group of texts deals with the trial of some guy, the reporter may have used the entire lexical field pertaining to court trials like "sued", "affair", "jury", "tribunal", "accusation", "witness", "prosecuted" and so on.
The semantic analyser should be able to statistically deduce the topic of an article, or of a paragraph of an article, from the number of words belonging to the same lexical field.
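A minimal Python sketch of that statistical deduction could be a plain word count per lexical field. The "justice" vocabulary is the one quoted above; the "finance" field, the `min_hits` cut-off and the function names are assumptions added for illustration. For simplicity it works on raw English words rather than on the symbolic form.

```python
import re
from collections import Counter

LEXICAL_FIELDS = {
    "justice": {"sued", "affair", "jury", "tribunal",
                "accusation", "witness", "prosecuted"},
    # Hypothetical second field, only there to have something to compare with.
    "finance": {"shares", "profit", "merger", "dividend", "bankruptcy"},
}


def topic_scores(text: str) -> Counter:
    """Count, for each lexical field, how many words of the text belong to it."""
    words = re.findall(r"[a-z]+", text.lower())
    scores = Counter()
    for field, vocabulary in LEXICAL_FIELDS.items():
        scores[field] = sum(1 for w in words if w in vocabulary)
    return scores


def guess_topic(text: str, min_hits: int = 2):
    """Return the best-scoring field, or None when the evidence is too thin."""
    field, hits = topic_scores(text).most_common(1)[0]
    return field if hits >= min_hits else None


article = ("The witness told the jury that the accusation was false, "
           "while the company's profit kept falling.")
topic = guess_topic(article)
```

A real version would probably normalise the counts by text length and use the synonym database to map words onto their field.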
Subject
To determine the targets or subjects of a text, that is the names of the people/brands/companies involved, the analyser could exploit the generally accepted convention that proper names start with a capital letter. Also, some conventional prefixes or suffixes on the names can give additional information. For example, Pr. for professor, PhD for a science doctor, MD for a medical doctor, Mrs. for a married woman and so on.
The analyser should rely on a names database that it would either use or update depending on whether the names already exist or have just been encountered for the first time.
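Here is a small Python sketch of that idea, with the names database reduced to a plain set. It is an illustration under assumptions: the regular expression, the `Subject` record and the rule for ignoring sentence-initial capitals are all made up for the example, and "Mr." is added to the titles listed above. It will obviously miss many real-world name shapes.

```python
import re
from dataclasses import dataclass

# Titles from the text, plus "Mr." (an addition for this example).
TITLES = {"Pr.": "professor", "Mrs.": "married woman", "Mr.": "man",
          "PhD": "science doctor", "MD": "medical doctor"}

# A capitalised word that is not itself a title prefix.
NAME_WORD = r"(?!(?:Pr|Mrs|Mr)\.)[A-Z][a-z]+\b"
PATTERN = re.compile(
    r"\b(?:(?P<prefix>Pr\.|Mrs\.|Mr\.) )?"
    r"(?P<name>" + NAME_WORD + r"(?: " + NAME_WORD + r")*)"
    r"(?:,? (?P<suffix>PhD|MD)\b)?"
)


@dataclass
class Subject:
    name: str
    titles: list[str]
    is_new: bool  # True when the name was not yet in the names database


def find_subjects(text: str, names_db: set[str]) -> list[Subject]:
    """Find capitalised names, read their titles and update the names database."""
    subjects = []
    for m in PATTERN.finditer(text):
        name = m["name"]
        titles = [TITLES[t] for t in (m["prefix"], m["suffix"]) if t]
        before = text[:m.start()].rstrip()
        sentence_start = not before or before[-1] in ".!?"
        if sentence_start and not titles and name not in names_db:
            continue  # the capital is probably just the start of a sentence
        subjects.append(Subject(name, titles, is_new=name not in names_db))
        names_db.add(name)
    return subjects


names_db = {"Bell"}
found = find_subjects(
    "Yesterday Mrs. Smith met John Doe, PhD at Bell. The report was late.",
    names_db,
)
```

Names that come back with `is_new` set are good candidates for the guided learning step, since a capital letter alone is weak evidence.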
Places & Context
By relying on a places database, the analyser should be able to determine the place where a given event occurred.
As for the when, the analyser can first isolate a time frame by using the date of the article, but also a reference time the text could mention. Time stamps themselves are usually quite easy to retrieve; the difficult part is attaching a time stamp to a specific event, which depends on the structure of the sentence.
By examining the possessive forms in sentences, it would be possible to attach parts of a sentence to others as "belonging to".
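The easy half of the "when" problem, retrieving a time frame, could be sketched in Python as follows. The two patterns ("for N years" and "in YYYY") and the `time_frame` function are assumptions picked for illustration; the hard half, deciding which event the time frame belongs to, is not addressed here.

```python
import re
from datetime import date


def time_frame(text: str, article_date: date) -> tuple[int, int]:
    """Return a (start_year, end_year) estimate for what the text describes."""
    # A duration is counted back from the date of the article.
    m = re.search(r"\bfor (\d+) years\b", text)
    if m:
        return (article_date.year - int(m.group(1)), article_date.year)
    # An explicit year mentioned in the text acts as the reference time.
    m = re.search(r"\bin ((?:19|20)\d{2})\b", text)
    if m:
        return (int(m.group(1)), int(m.group(1)))
    # Otherwise fall back on the date of the article itself.
    return (article_date.year, article_date.year)


published = date(2010, 2, 12)
frame = time_frame("a research director at Bell for 15 years", published)
```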
Aggregation
Data about the same subject is aggregated over time by means of alarms.
For example, the sentence "Mr. Harrison, a research director at Bell (Connecticut) for 15 years, told us that (...)" is the typical kind of sentence we would like to store in the database as it ties a subject (Mr. Harrison) to a company (Bell in Connecticut). It also gives the man's position (research director) and an approximate time frame (15 years counted back from the date of the article) during which he has occupied that position.
Another mention of Harrison's name in another article about Bell would trigger an alarm warning us that another part of Mr. Harrison's life may be about to be unveiled.
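The Harrison example could be stored and watched with something like the Python sketch below. The `Fact` fields, the in-memory `FactsDatabase` and the wording of the alarm are assumptions for illustration; the real facts database would be persistent and far richer.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    subject: str       # e.g. "Mr. Harrison"
    organisation: str  # e.g. "Bell (Connecticut)"
    role: str          # e.g. "research director"
    source: str        # identifier of the article the fact was quoted from


class FactsDatabase:
    """In-memory stand-in for the facts database."""

    def __init__(self):
        self._by_pair = defaultdict(list)

    def add(self, fact: Fact) -> list[str]:
        """Store the fact and return alarms for earlier mentions of the same pair."""
        key = (fact.subject, fact.organisation)
        alarms = [
            f"{fact.subject} / {fact.organisation} also appears in {old.source}"
            for old in self._by_pair[key]
            if old.source != fact.source
        ]
        self._by_pair[key].append(fact)
        return alarms


fdb = FactsDatabase()
first = fdb.add(Fact("Mr. Harrison", "Bell (Connecticut)",
                     "research director", "article-1"))
second = fdb.add(Fact("Mr. Harrison", "Bell (Connecticut)",
                      "witness", "article-2"))
```

The first article raises no alarm; the second one does, pointing back at the first, which is exactly the moment a human may want to have a look.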
Association
People are associated with one another by analysing how they relate within a text.
Guided Learning
The analyser should ask us for information whenever its confidence falls below a given "certainty threshold", especially when it looks names up in the databases, so as to avoid confusing homonyms and the like, and to guide its learning.
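A minimal Python sketch of that behaviour follows. All of it is assumed for illustration: the `GuidedResolver` class, the 0.8 default threshold, and the idea that candidates arrive with a certainty between 0 and 1. Remembering answers by mention alone is also a simplification, since the same name in a different context may well be a different person.

```python
from typing import Callable


class GuidedResolver:
    """Pick a database entry for a name, asking a human when unsure."""

    def __init__(self, ask: Callable[[str, list[str]], str],
                 threshold: float = 0.8):
        self.ask = ask              # callback that questions a human
        self.threshold = threshold  # the "certainty threshold"
        self.learned = {}           # mention -> entry confirmed by a human

    def resolve(self, mention: str, candidates: dict[str, float]) -> str:
        if mention in self.learned:
            return self.learned[mention]
        best = max(candidates, key=candidates.get)
        if candidates[best] >= self.threshold:
            return best
        # Below the threshold: ask, offering the candidates best first.
        ranked = sorted(candidates, key=candidates.get, reverse=True)
        answer = self.ask(mention, ranked)
        self.learned[mention] = answer  # remember it so we only ask once
        return answer


questions = []


def ask_human(mention: str, options: list[str]) -> str:
    """Stand-in for a real prompt: logs the question, picks the second option."""
    questions.append(mention)
    return options[1]


resolver = GuidedResolver(ask_human)
sure = resolver.resolve("Harrison", {"John Harrison (Bell)": 0.9,
                                     "Paul Harrison": 0.1})
unsure = resolver.resolve("Smith", {"Ann Smith": 0.5, "Bob Smith": 0.45})
again = resolver.resolve("Smith", {"Ann Smith": 0.5, "Bob Smith": 0.45})
```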