Google search, math and latent semantic analysis
By Murray Bourne, 13 Jul 2008
Google has become the dominant search engine because of its relevance and efficiency. Relevance is achieved through its proprietary PageRank algorithm, which determines which pages are the most likely to satisfy your search query. Efficiency is achieved by using thousands of PCs rather than big servers to hold all the indexing, document and media information.
I wrote about this a while back in Math that made Google rich.
Now let's move on to an aspect of matrices that search engines use, called latent semantic analysis.
Here's what Wikipedia has to say on the subject:
Latent semantic analysis (LSA) is a technique in natural language processing, in particular in vectorial semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA can use a term-document matrix which describes the occurrences of terms in documents; it is a sparse matrix whose rows correspond to terms and whose columns correspond to documents, typically stemmed words that appear in the documents.
Let's put this in everyday language. Simply put, latent semantic indexing is something the search engines do when they analyze the content of a Web site in order to figure out what the site is about.
It's actually what we humans do every day of our lives — try to figure out the meaning in what we see, hear and feel.
That Wikipedia article delves into the matrix operations that are involved in latent semantic analysis.
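To give a feel for those matrix operations, here is a minimal sketch in Python. It is only an illustration under some assumptions of mine: the four toy documents are made up, there is no stemming, NumPy is used for the linear algebra, and the number of concepts kept (k = 2) is an arbitrary choice. It builds a small term-document matrix, takes its singular value decomposition, and compares documents in the reduced "concept" space. It is not meant to show how any real search engine does this.

```python
import numpy as np

# Toy corpus: hypothetical documents, chosen only for illustration.
docs = [
    "cat kitten pet",
    "cat pet food",
    "stock market shares",
    "market shares trading",
]

# Vocabulary: one row per distinct term (no stemming in this sketch).
terms = sorted({word for doc in docs for word in doc.split()})

# Term-document matrix: entry (i, j) counts how often term i occurs in document j.
A = np.array(
    [[doc.split().count(term) for doc in docs] for term in terms],
    dtype=float,
)

# Singular value decomposition: A = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values (k = 2 is an assumption for this toy case).
# Each document is then described by a k-dimensional "concept" vector.
k = 2
doc_vectors = (np.diag(s[:k]) @ Vt[:k, :]).T  # shape: (number of documents, k)


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Compare two documents about pets, then a pet document with a finance document.
print("pets vs pets:   ", cosine(doc_vectors[0], doc_vectors[1]))
print("pets vs finance:", cosine(doc_vectors[0], doc_vectors[2]))
```

In this toy case the two pet documents should score far higher than the pet and finance pair, because the reduced space groups documents by the terms they tend to share.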
(If you are a bit rusty, see an Introduction to Matrices.)
See the 4 Comments below.
21 Sep 2014 at 5:26 am
Many thanks to you all.
I am a researcher interested in measuring text similarity, currently working with LSA, and I hope to benefit from your experience.
12 Dec 2016 at 5:15 am
Hello,
Do you consider that the Google patent "phrase based indexing in an information retrieval system" (https://www.google.com/patents/US7536408) describes an LSI method? Thank you for your post.
PS: I built a PHP-based application called SEO Hero that queries Google with a given keyword and extracts the first 100 documents that rank for that query. The application then parses every document to extract words and phrases, and stores these terms in a database with data such as term frequency and document frequency for each entry. The main aim is to understand how strongly words are correlated with a given query. Is it too much to describe this tool as a "latent semantic explorer"? Thank you for any advice.
12 Dec 2016 at 10:06 pm
Thanks for sharing your application. I'll try to look at it more closely when I have time.
I believe that Google patent would be a latent semantic indexing method, and it looks like it would be legitimate to call your app a "latent semantic explorer".
26 Dec 2016 at 11:52 pm
Hello Murray,
Some people in the SEO industry didn't like the name "latent semantic explorer". They said it could mislead users because it is too reminiscent of LSA/LSI.
Anyway, what we call it is not so important, so I have decided to describe SEO Hero as a Topic Explorer.
I don't know if you had time to take a look, but thank you very much for your kindness.