Monday, September 23, 2013

The importance of being Semantic

Aristotelian Triangle

“Aristotle teaches us what names and words designate and that from one side there is mental representations (Noemata) and from the other side the process of naming and designation is being realized by the means of a designator (Subject) and an object and that one should not add any kind of intermediate element between the thought and the object.”

  In other words, it is inherent in people to conceptualize the meaning of a symbol and the object that it refers to. This process relies, to some extent, on subjective reasoning and thus may not be universal, but it is nevertheless common ground among all thinking animals. On a higher level, logic-thinking animals (like humans) may not only understand the meaning of a term but may also "rank" it when comparing it with other terms. This demonstrates the capability of logic-thinking animals to reason and deduce facts implicitly. Computers, on the other hand, may have evolved at an exponential pace and may have strong computing capabilities, but they still lack certain qualities that would classify them as intelligent thinkers. Since computers can only follow rules religiously, the Aristotelian Triangle is of no use in this case.

  Semantics do not mean anything

  Lazy but big steps towards a semantic universe have led us to a series of techniques and theories that could make it possible to build an actual thinking machine: computational linguistics and natural language processing, collaborative filtering, models developed using data mining and machine learning algorithms, AI, processes for filtering information or patterns, methods for making automatic predictions and decisions, and so on. All of these techniques are trying, directly or indirectly, to resolve the problem of semantics.

Terms, Words, Phrases, Texts, CORPUS..

  The ability to correlate terms, words, phrases, or even whole texts is something that can be taught to computers in a (more or less) efficient way. Given a collection of documents from the same semantic context (a corpus), a practical way to do this is to assign a semantic "weight" to each term of every single document. By measuring the "weights" of the terms and comparing them, we obtain a classification that can be further exploited to measure the semantic "distance" between terms. The weighting factor is a numerical statistic that reflects how important a word is to a document in a collection or corpus. It is called term frequency–inverse document frequency (tf–idf): a value that increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus. This may look like a paradox, but it actually helps to control for the fact that some words are generally more common than others, which means they are less significant and should carry less weight.

  • Term Frequency: Suppose a document contains 5000 words overall and a given word occurs 5 times. Then, mathematically, its Term Frequency is TF = 5/5000 = 0.001.
  • Inverse Document Frequency: Suppose one bought a whole library of 10,000 books, the complete 1Q84 series included, and the word "sensei" appears in 100 of those books. Then, mathematically, its Inverse Document Frequency is IDF = log(10,000/100) = log(100) = 2 (using base-10 logarithms). Note that IDF counts the documents that contain the word, not the total number of times the word occurs.
         And finally, TF-IDF = TF * IDF.
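
The weighting above can be sketched in a few lines of Python. This is only a minimal illustration, assuming the common definition in which IDF is the base-10 logarithm of the number of documents divided by the number of documents containing the term. The function names, the toy corpus, and the choice to return 0.0 for a term that appears nowhere are assumptions made for this example; real libraries often use smoothed variants.

```python
import math
from collections import Counter


def term_frequency(term, document_tokens):
    """Fraction of the document's tokens that are equal to `term`."""
    if not document_tokens:
        return 0.0
    return Counter(document_tokens)[term] / len(document_tokens)


def inverse_document_frequency(term, corpus):
    """log10(total documents / documents containing `term`).

    Returns 0.0 when no document contains the term; this is a
    convention chosen for the sketch, not a universal rule.
    """
    containing = sum(1 for document in corpus if term in document)
    if containing == 0:
        return 0.0
    return math.log10(len(corpus) / containing)


def tf_idf(term, document_tokens, corpus):
    """Weight of `term` in one document, relative to the whole corpus."""
    return term_frequency(term, document_tokens) * inverse_document_frequency(term, corpus)


# Hypothetical toy corpus: each document is a list of lowercase tokens.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the moon has two faces".split(),
]

for word in ("the", "cat", "mat"):
    print(word, tf_idf(word, corpus[0], corpus))
```

In this toy corpus "the" appears in every document, so its IDF is log(3/3) = 0 and its weight vanishes, while "mat", which appears in only one document, receives the largest weight of the three.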

"To get to know each other.."

  Semantic proximity is evaluated using cosine similarity, a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them. Each term is notionally assigned a different dimension, and a document is characterized by a vector where the value of each dimension corresponds to the tf–idf of the term. Cosine similarity then gives a useful measure of how similar two documents are likely to be in terms of their subject matter. The formula is cos(θ) = (A · B) / (‖A‖ ‖B‖), where in our case A and B each represent a document as the vector of the tf–idf values of its terms.
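
As a rough sketch of the measure described above, the following Python function computes the cosine of the angle between two document vectors. The vectors here are made-up tf–idf values over three hypothetical terms, and returning 0.0 for an all-zero vector is a convention assumed for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot_product = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # An empty document has no direction; treat it as dissimilar.
        return 0.0
    return dot_product / (norm_a * norm_b)


# Hypothetical tf-idf vectors; each position corresponds to one term.
doc_a = [0.2, 0.0, 0.5]
doc_b = [0.4, 0.0, 1.0]  # same direction as doc_a, twice the magnitude
doc_c = [0.0, 0.7, 0.0]  # shares no terms with doc_a

print(cosine_similarity(doc_a, doc_b))
print(cosine_similarity(doc_a, doc_c))
```

Because the measure depends only on the angle, doc_a and doc_b come out as (almost exactly) identical in subject matter even though one vector is twice as long, while doc_a and doc_c, which share no terms, score zero.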

A more mathematical explanation of the Cosine similarity and an example written in Java and Python can be found here.

-Just imagine a future in which machines will be able to conceptualize the meaning of death.
