October, 2007.

Python File Indexer



pyIndex is a Python script and a database configuration for indexing files. It was originally developed for a source code search engine for the SAMATE Reference Dataset.
It uses a MySQL database, described below, to store the words and their references. The script can do lazy indexing (index only a given directory) or index the full directory tree.
When I say 'directory' I mean an ID-based directory. For example:
ID = 4242
the directory is: ./000/004/242/*.*
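
The ID-to-directory mapping can be sketched in a few lines of Python. This is only an illustration inferred from the single example above: it assumes the ID is zero-padded to 9 digits and split into three groups of three. The function names id_to_dir and dir_to_id and their parameters are hypothetical and are not taken from pyIndex itself.

```python
def id_to_dir(file_id, root=".", width=9, group=3):
    """Map a numeric ID to its directory, e.g. 4242 -> ./000/004/242.

    Assumption: IDs are zero-padded to `width` digits and cut into
    chunks of `group` digits (inferred from the 4242 example).
    """
    if file_id < 0:
        raise ValueError("ID must be non-negative")
    digits = str(file_id).zfill(width)
    if len(digits) > width:
        raise ValueError("ID does not fit in %d digits" % width)
    parts = [digits[i:i + group] for i in range(0, width, group)]
    return "/".join([root] + parts)


def dir_to_id(path, levels=3):
    """Inverse mapping: ./000/004/242 -> 4242 (uses the last `levels` components)."""
    parts = path.rstrip("/").split("/")[-levels:]
    return int("".join(parts))


print(id_to_dir(4242))
print(dir_to_id("./000/004/242"))
```

A full index would then amount to walking every such directory under the root and converting each path back to its ID with dir_to_id.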

Installing pyIndex

To install the script, you only need MySQLdb and a database set up as follows:
-- Words storage database
CREATE TABLE `words` (
  `WordID` int(10) NOT NULL auto_increment,
  `Word` text collate latin1_general_ci NOT NULL,
  PRIMARY KEY  (`WordID`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

-- Relation storage database
CREATE TABLE `words2id` (
  `WordsCrossID` int(10) NOT NULL auto_increment,
  `WordsID` int(10) NOT NULL,
  `ID` int(10) NOT NULL,
  PRIMARY KEY  (`WordsCrossID`),
  UNIQUE KEY `WordsID` (`WordsID`,`ID`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

How do I use pyIndex?

python pyIndex.py <ID/build/rebuild>

This command adds the values (words, etc.) to the database. To use the results, you only have to run a simple SQL query such as:
SELECT t.ID FROM words2id AS t, words AS w WHERE w.Word LIKE '%SEARCH_WORD%' AND t.WordsID = w.WordID GROUP BY t.ID

You can also change the test applied to the word, for example:
 w.Word SOUNDS LIKE 'SEARCH_WORD'
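
To see how the two tables work together, here is a small self-contained sketch. It is not the pyIndex code: it swaps MySQL for an in-memory SQLite database so it runs without a server, the table definitions are simplified equivalents of the ones above, and the helpers index_text and search (including the word-splitting rule) are hypothetical. The join links words2id.WordsID to words.WordID, as the schema suggests.

```python
import re
import sqlite3

# Assumption: SQLite stands in for MySQL here, only to keep the demo
# self-contained. The tables mirror `words` and `words2id`.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE words (
    WordID INTEGER PRIMARY KEY AUTOINCREMENT,
    Word   TEXT NOT NULL
);
CREATE TABLE words2id (
    WordsCrossID INTEGER PRIMARY KEY AUTOINCREMENT,
    WordsID INTEGER NOT NULL,
    ID      INTEGER NOT NULL,
    UNIQUE (WordsID, ID)
);
""")


def index_text(conn, file_id, text):
    """Hypothetical indexing step: store each distinct word once, link it to file_id."""
    # Assumption: a "word" is an identifier-like token, lowercased.
    for word in set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text.lower())):
        row = conn.execute(
            "SELECT WordID FROM words WHERE Word = ?", (word,)
        ).fetchone()
        if row is None:
            word_id = conn.execute(
                "INSERT INTO words (Word) VALUES (?)", (word,)
            ).lastrowid
        else:
            word_id = row[0]
        # The UNIQUE (WordsID, ID) key makes re-indexing the same ID harmless.
        conn.execute(
            "INSERT OR IGNORE INTO words2id (WordsID, ID) VALUES (?, ?)",
            (word_id, file_id),
        )


def search(conn, needle):
    """Return the IDs whose indexed words contain `needle` (substring match)."""
    rows = conn.execute(
        "SELECT t.ID FROM words2id AS t, words AS w "
        "WHERE w.Word LIKE ? AND t.WordsID = w.WordID "
        "GROUP BY t.ID ORDER BY t.ID",
        ("%" + needle + "%",),
    )
    return [r[0] for r in rows]


index_text(conn, 4242, "char buf[10]; strcpy(buf, input);")
index_text(conn, 4243, "int main(void) { return 0; }")
print(search(conn, "strcpy"))
```

The search word is passed as a query parameter rather than pasted into the SQL string, which avoids quoting problems. Note that % and _ in the search word still act as LIKE wildcards in this sketch.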

Download pyIndex


Romain Gaucher - r@rgaucher.info