Rosebrock's Index-Match Algorithm is used for matching a piece of text to a pre-existing table of text data and returning a score representing the relation of the table to the original text. The higher the score, the higher the relation between the text and the table. This proves useful in websites and web applications when a list of "recommended" or "alike" data needs to be generated.
Rosebrock's Index-Match Algorithm (RIMA) is used to compare a piece of text to a pre-existing table of text data and return a score representing the relation of the table to the original text. The higher the score, the higher the relation between the text and table. The algorithm is self-dependent - the data to populate the table comes from the indexing method of the algorithm, while the score comes from the matching method. The table must be generated first before any matching can be done. Principally, the actual equation for indexing and matching is the same except for a small change in the D value. (See section 2.1 - Equation)
RIMA proves useful when generating a list of "recommended" data for users. For example, if a user were to visit the New York Times website daily and save the text from the stories he/she liked, RIMA could be applied to each story text to index the saved stores (indexing method), and then again applied to any new stories that appeared on the New York Times website (matching method). The matching method would return a list of stories from the Times website that was relevant to the saved stories, which is essentially a "recommended list" of stories.
The algorithm has three methods: indexing, matching, and retrieving. Indexing and matching use the same equation, with the only change being the value of D. Retrieving is a simple statistical equation using a population/sample mean and standard deviation to generate a "threshold" score. Any data set w is considered to be significant if it's score is greater than the threshold value, χ.
F: sets of data
Fj: single data set, Fj = w
w: set of words
wi: single word from word set w
O(wi): occurrence of word wi
D: divisor expressed as ratio
S: derived score to represent w
for each w in F
S = (ΣO(wi)) / D
χ = μF + (α * σF)
The most complicated of the three methods is indexing - it is the key to all other methods. Without indexing, there is nothing to compare text to, therefore no results can be returned.
The first step in indexing is to calculate the size of F. How many sets of data are needed? These data sets will be used to generate the table the matching method uses. Continuing with the NYT example, one may ask "how many stories are needed to ensure meaningful statistical results?" This data set would be considered the list of stories the user has saved from the NYT website. Of course, the rules of statistics are applied - the more data sets the better, but in many cases this is either infeasible or impossible. If that is the case, then try to gather as many data sets as possible. The relevancy of the algorithm results increases greatly with the number of indexable data sets.
After the size of F is calculated, retrieve the data in whatever fashion necessary, split the data into words, and then sanitize it by removing stop words and punctuation (Punctuation may or may not be removed - in most cases it should, for the Porter-Stemmer algorithm is later applied). A character is considered to be "punctuation" if it's ASCII value does not fall within the decimal range of 48 to 57 (0-9), 65 to 90 (A-Z), or 97 to 122 (a-z). Stop words are words that have no descriptive meaning to natural language processors. Examples include "I", "and", "the", "because", "therefore", etc. mySQL implements a very effective list of stop words which serves as a basis for removing needless words.[1]
The next step is to apply the Porter-Stemmer algorithm - an algorithm used for suffix stripping.[2] Removing suffixes ensures that only the base of the word is left. For example, the words "connect", "connected", "connecting", "connection" and "connections" have very similar meanings. Applying the stemmer algorithm to all of these words returns the single term "connect". Not only does this ensure that more accurate results are returned, but it lowers the size and complexity of the generated table, which in turn speeds up the algorithm.
Any word that the PorterStemmer algorithm returns with a length greater than three should be added to the table. The table is only two columns, with the first column being a word value and the second being the word's occurrence. If the word is not in table, add it to the table and set the occurrence to 1. If the word is already in the table, simply increment the occurrence by 1.
For every F, and every wi in w, apply the steps above and generate the table.
Assume the following sentences are the F data sets.
F1: When I find myself in times of trouble, mother Mary comes to me,
F2: speaking words of wisdom, let it be.
F3: And in my hour of darkness she is standing right in front of me,
F4: speaking words of wisdom, let it be.
After applying the above steps the generated table would look like this:
| word | occurrence |
|---|---|
| speak | 2 |
| word | 2 |
| wisdom | 2 |
| find | 1 |
| time | 1 |
| troubl | 1 |
| mother | 1 |
| mari | 1 |
| hour | 1 |
| dark | 1 |
| stand | 1 |
| front | 1 |
The last step is to calculate the mean and standard deviation for χ and the mean for the size of words in Fj. The sample/population must come from the same F used for indexing. Either use the entire F or a sample.
To start, once again split the data into words, sanitize the data, and apply the Porter-Stemmer algorithm; but instead of generating a table, the RIMA equation should to be used instead. Since this is the indexing and not the matching portion of the algorithm, D should be set to 1 over 1. An example of one iteration of the summation portion of the equation would look like this:
S = (Σ1 + 1 + ...) / 1
Calculating the score for all of F returns the following values:
F1: 5
F2: 6
F3: 4
F4: 6
Counting the size values (word count of Fj) returns:
F1: 13
F2: 7
F3: 14
F4: 7
After finding the mean and standard deviation of the score, and the the mean of the word count the data table for χ can be created:
| occurrence μ | occurrence σ | word count μ |
|---|---|---|
| 5.25 | 0.96 | 10.25 |
A summary of the indexing method:
Matching, while very similar to indexing, requires a different F and a change to the D value. In the NYT example, the F set would be the current stories on the NYT websites - the score for each of these stories is going to be calculated by the RIMA equation.
Follow the same steps as the indexing algorithm, only instead of incrementing table values, look up the stemmed word in the table. If the word is in the table, use the occurrence value; if the word is not in the table use 0 in the summation process.
The D value is simply the number of words in Fj divided by the word count μ. Rationally, this makes sense when compared to the indexing method. When indexing, the ratio has to be 1 - each data set, Fj, cannot have any weight associated with it: they are the original sets of data used to generate a word table and statistical values. Since there is no data prior to them, trying to associate any type of weight other than a divisor of 1 would be frivolous and the equation would lose it's mathematical relevancy. However, during matching, the ratio can be changed to accommodate for the number of words in each Fj. If a set of text only contains 5 words, but each and every one of the 5 words is present in the table with a high occurrence, the text should be weighted heavier than a story that has 100 words, but only 10 found in the table. Dividing by the word mean μ ensures a weight is given so larger pieces of text data are represented more fairly, while smaller pieces of text data are not automatically dismissed.
Given F1: I speak words of wisdom everday.
The equation would look like:
S = (ΣO(speak) + O(word) + O(wisdom) + O(everydai)) / (6 / 10.25)
S = (Σ2 + 2 + 2 + 0) / (0.58)
S = 10.34
Therefore, the score of the text "I speak words of wisdom everyday" when compared to the previously generated table is 10.34.
A summary of the matching method:
The last, and easiest step is to determine if the score from data set Fj is significant or not. To determine this, a value of χ must be calculated first. Recall that χ is:
χ = μF + (α * σF)
The α value is the number of standard deviations to multiply by, assuming the data is normally distributed. α will vary from application to application, but following the law of Chebyshev's theorem, at least 75% of the data will fall in the interval &mu - 2σ to μ + 2σ. Considering the words at least are used, a relatively low value of α should be used such as 1 or lower. Remember, this is effectively a right-tailed test, so only the positive interval is used.
To finish up the example, the "threshold" value of χ would be:
χ = 5.25 + (1 * 0.96)
χ = 6.21
Since the threshold is 6.21 and the score of the text "I speak words of wisdom everyday" is 10.34, the text would be considered significant and thus would qualify as a "recommended piece of data".
As noted above, the primary purpose of Rosebrock's Index-Match Algorithm is for text comparison and generating a list of "recommended" data; however, there are many more applications. The algorithm could be implemented for a system to detect plagiarism, where the program is not necessarily concerned about direct, blatant copies, but rather the key concepts described in the document.
In an online world progressively moving towards a semantic web, all new websites are adopting Web 2.0 concepts. This means users are now generating content, networking with other users that share their unique tastes, and overall, more and more people are becoming connected; though this comes with a price - as the amount of data grows on a website, the less and less likely that a user will be able to find content that interests him/her. Using RIMA, web programmers can now provide recommended lists of content to users.
Status quo should never stifle attempts at innovation that push the web forward, especially methods that bridge the gap between human and machine. Web 2.0 was once considered a buzzword, a fad of design and user democracy, but from it the semantic web has evolved. To push this evolution further, use RIMA in any type of application where the user experience is dependent on quality of content - the easier it is for a user to find relevant data, the more likely the user will frequent the website.
1. "Full-Text Stopwords." mySQL :: mySQL 5.0 Reference Manual. mySQL. 10 May 2008
2. M.F. Porter, 1980, An algorithm for suffix stripping, Program, 14(3) pp 130-137.