DLESE Tools
v1.6.0

org.dlese.dpc.index.writer
Class FileIndexingServiceWriter

java.lang.Object
  extended by org.dlese.dpc.index.writer.FileIndexingServiceWriter
All Implemented Interfaces:
DocWriter
Direct Known Subclasses:
ErrorFileIndexingWriter, XMLFileIndexingWriter

public abstract class FileIndexingServiceWriter
extends Object
implements DocWriter

Abstract class for creating customized Lucene Documents for different file formats such as DLESE-IMS, ADN-item, ADN-collection, etc. Concrete sub-classes may be used with a FileIndexingService to enable automatic updating of the index whenever changes in the source file are made. This class, along with the FileIndexingService, may be used with a SimpleLuceneIndex to provide simple search support over files.

Note: after creating a new concrete FileIndexingServiceWriter, add a switch in RepositoryManager, method putDirInIndex(DirInfo, String) to select it for indexing.


The Lucene fields that are created by this class are:

Author:
John Weatherley

Constructor Summary
FileIndexingServiceWriter()
           
 
Method Summary
protected  void abortIndexing()
          Aborts the indexing process by returning a null index document.
protected abstract  void addCustomFields(org.apache.lucene.document.Document newDoc, org.apache.lucene.document.Document previousRecordDoc, File sourceFile)
          Adds additional custom fields that are unique to the document format being indexed.
protected  void addDocToRemove(String field, String value)
          Removes a matching item from the index during the FileIndexingService update.
protected  void addToAdminDefaultField(String value)
          Adds the given String to a text field referenced in the index by the field name 'admindefault'.
protected  void addToDefaultField(String value)
          Adds the given String to the 'default' and 'stems' fields as text and stemmed text, respectively.
 FileIndexingServiceData create(File sourceFile, org.apache.lucene.document.Document existingLuceneDoc, FileIndexingPlugin plugin, HashMap sessionAttr)
          Creates the Lucene Document for the given resource or returns null if unable to create.
protected abstract  void destroy()
          This method is called at the conclusion of processing and may be used for tear-down.
 HashMap getConfigAttributes()
          Gets the configuration attributes that were set when the writer was created.
 org.apache.lucene.document.Document getDeletedDoc(org.apache.lucene.document.Document previousRecordDoc)
          Creates a Lucene Document equal to the existing FileIndexingService Document except the field "deleted" is set to "true" and the field "modtime" is set to the current time.
abstract  String getDocGroup()
          Gets the specifier associated with this group of files or null if no group association exists.
 String getDocsource()
          Gets the absolute path to the file, which is indexed under the 'docsource' field.
abstract  String getDocType()
          Gets a unique document type key for this kind of record, corresponding to the format type.
 String getFileContent()
          Gets the full content of the file as a String.
 FileIndexingPlugin getFileIndexingPlugin()
          Gets the FileIndexingPlugin that has been set for use during indexing, or null if none.
 FileIndexingService getFileIndexingService()
          Gets the fileIndexingService attribute of the FileIndexingServiceWriter object
 org.apache.lucene.document.Document getLuceneDoc()
          Gets the Lucene Document that this Writer is building.
 org.apache.lucene.document.Document getPreviousRecordDoc()
          Gets the previous Document that currently resides in the index for the given resource, or null if none was previously present.
abstract  String getReaderClass()
          Gets the fully qualified name of the concrete DocReader class that is used to read this type of Document, for example "org.dlese.dpc.index.reader.ItemDocReader".
 HashMap getSessionAttributes()
          Gets a Map of attributes used in a single indexing session.
 File getSourceDir()
          Gets the sourceDir that holds the file being indexed.
 File getSourceFile()
          Gets the sourceFile that is being indexed.
protected  String getValidationReport()
          Gets a report detailing any errors found in the validation of the file, or null if no error was found.
abstract  void init(File source, org.apache.lucene.document.Document previousRecordDoc)
          This method is called prior to processing and may be used for any necessary set-up.
protected  boolean isMakingDeletedDoc()
          True if a deleted doc is being created in the current execution.
 boolean isValidationEnabled()
          Returns true if the files being indexed should be validated, otherwise false.
protected  void prtln(String s)
          Output a line of text to standard out, with datestamp, if debug is set to true.
protected  void prtlnErr(String s)
          Output a line of text to error out, with datestamp.
 void setConfigAttributes(HashMap attributes)
          Sets the configuration attributes - called by the factory method that creates the FileIndexingServiceWriter.
static void setDebug(boolean db)
          Sets the debug attribute of the FileIndexingServiceWriter object
 void setFileIndexingPlugin(FileIndexingPlugin plugin)
          Sets the FileIndexingPlugin that will be used during the indexing process to index additional fields.
 void setFileIndexingService(FileIndexingService fileIndexingService)
          Sets the fileIndexingService attribute of the FileIndexingServiceWriter object
protected  void setIsMakingDeletedDoc(boolean isMakingDeletedDoc)
          Sets whether this DocWriter is making a deleted document.
 void setValidationEnabled(boolean validateFiles)
          Sets whether or not to validate the files being indexed and create a validation report, which is indexed.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

FileIndexingServiceWriter

public FileIndexingServiceWriter()

Method Detail

getDocType

public abstract String getDocType()
                           throws Exception
Gets a unique document type key for this kind of record, corresponding to the format type. In the DLESE metadata repository, this corresponds to the XML format, for example "oai_dc", "adn", "dlese_ims", or "dlese_anno". The string is parsed using the Lucene StandardAnalyzer, so it must be lowercase and should not contain any stop words.

Specified by:
getDocType in interface DocWriter
Returns:
The docType String
Throws:
Exception - This method should throw an Exception with an appropriate error message if an error occurs.
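
The lowercase and stop-word constraints above can be checked before a key is returned from getDocType(). The following is a minimal sketch, not part of DLESE Tools: the class DocTypeKeys, the method isUsable, and the short stop word sample are all hypothetical, and the real stop word set depends on the Lucene version and analyzer configuration in use.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper, not part of DLESE Tools: sanity-checks a docType key. */
final class DocTypeKeys {

    // Assumption: a small sample of common English stop words. The actual
    // set used by StandardAnalyzer depends on the Lucene version and setup.
    private static final Set<String> SAMPLE_STOP_WORDS = new HashSet<>(
            Arrays.asList("a", "an", "and", "the", "of", "to", "in", "is", "it", "or"));

    private DocTypeKeys() { }

    /** Returns true if the key is non-empty, all lowercase, and not a sample stop word. */
    static boolean isUsable(String key) {
        if (key == null || key.isEmpty()) {
            return false;
        }
        boolean lowercase = key.equals(key.toLowerCase(Locale.ROOT));
        return lowercase && !SAMPLE_STOP_WORDS.contains(key);
    }

    public static void main(String[] args) {
        // Print the check result for two keys from the text and two bad keys.
        for (String key : new String[] {"adn", "dlese_ims", "ADN", "the"}) {
            System.out.println(key + " usable: " + isUsable(key));
        }
    }
}
```

A concrete writer could run such a check in a unit test for its getDocType() value rather than at indexing time.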

getDocGroup

public abstract String getDocGroup()
                            throws Exception
Gets the specifier associated with this group of files or null if no group association exists. In the DLESE metadata repository, this corresponds to the collection key, for example 'dcc', 'comet'.

Returns:
The docGroup specifier
Throws:
Exception - If an error occurred

getReaderClass

public abstract String getReaderClass()
Gets the fully qualified name of the concrete DocReader class that is used to read this type of Document, for example "org.dlese.dpc.index.reader.ItemDocReader".

Specified by:
getReaderClass in interface DocWriter
Returns:
The name of the DocReader.

init

public abstract void init(File source,
                          org.apache.lucene.document.Document previousRecordDoc)
                   throws Exception
This method is called prior to processing and may be used for any necessary set-up. This method should throw an exception with an appropriate message if an error occurs. The config attributes are set using the FileIndexingService.addDirectory(java.lang.String, java.lang.Class, java.util.HashMap, org.dlese.dpc.index.writer.FileIndexingPlugin, int) method.

Parameters:
source - The source file being indexed
previousRecordDoc - An existing Document that currently resides in the index for the given resource, or null if none was previously present
Throws:
Exception - If an error occurred during set-up.

destroy

protected abstract void destroy()
This method is called at the conclusion of processing and may be used for tear-down.


addCustomFields

protected abstract void addCustomFields(org.apache.lucene.document.Document newDoc,
                                        org.apache.lucene.document.Document previousRecordDoc,
                                        File sourceFile)
                                 throws Exception
Adds additional custom fields that are unique to the document format being indexed. When implementing this method, use the add method of the Document class to add a Field.

The following Lucene Field types are available for indexing with the Document:
Field.Text(String name, String value) -- tokenized, indexed, stored
Field.UnStored(String name, String value) -- tokenized, indexed, not stored
Field.Keyword(String name, String value) -- not tokenized, indexed, stored
Field.UnIndexed(String name, String value) -- not tokenized, not indexed, stored
Field(String name, String string, boolean store, boolean index, boolean tokenize) -- gives full control over whether the value is stored, indexed, and tokenized

Example code:
protected void addCustomFields(Document newDoc, Document previousRecordDoc, File sourceFile) throws Exception {
  // Add a tokenized, indexed, and stored field to the new Document.
  String customContent = "Some content";
  newDoc.add(Field.Text("mycustomfield", customContent));
}

Parameters:
newDoc - The new Document that is being created for this resource
previousRecordDoc - An existing Document that currently resides in the index for the given resource, or null if none was previously present
sourceFile - The sourceFile that is being indexed
Throws:
Exception - This method should throw an Exception with an appropriate error message if an error occurs.

getFileContent

public String getFileContent()
                      throws IOException
Gets the full content of the file as a String. If the file does not exist or the writer is processing a deleted doc, the content is pulled from the existing Lucene Document rather than the file.

Returns:
The full content of the file
Throws:
IOException - If an error occurs
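
The fallback rule described above (use the stored content when the file is missing or a deleted doc is being made) can be sketched as follows. This is an illustration only, not the library's implementation: the class FileContentFallback, its read method, the storedContent parameter, and the UTF-8 encoding are all assumptions.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

/** Hypothetical helper, not part of DLESE Tools: illustrates the content fallback rule. */
final class FileContentFallback {

    private FileContentFallback() { }

    /**
     * Returns storedContent (standing in for content held in the existing
     * Lucene Document) when a deleted doc is being made or the file is missing;
     * otherwise reads the file from disk.
     */
    static String read(File sourceFile, boolean makingDeletedDoc, String storedContent) {
        if (makingDeletedDoc || !sourceFile.exists()) {
            return storedContent;
        }
        try {
            // Assumption: files are UTF-8 encoded.
            return new String(Files.readAllBytes(sourceFile.toPath()), StandardCharsets.UTF_8);
        } catch (IOException e) {
            // Wrapped only to keep this sketch free of checked exceptions.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The file does not exist, so the stored copy is returned.
        System.out.println(read(new File("no-such-file-for-sketch.xml"), false, "stored copy"));
    }
}
```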

getConfigAttributes

public HashMap getConfigAttributes()
Gets the configuration attributes that were set when the writer was created.

Returns:
The configuration attributes, or null if none were configured

setConfigAttributes

public void setConfigAttributes(HashMap attributes)
Sets the configuration attributes - called by the factory method that creates the FileIndexingServiceWriter.

Parameters:
attributes - The configuration attributes

getSessionAttributes

public HashMap getSessionAttributes()
Gets a Map of attributes used in a single indexing session. A session is a portion of indexing for a given directory of records that will be added to the index as a block update. Since records are added to the index at the end of the session, the index cannot be used to query information from those records during the session. Thus, these attributes can be used to communicate information across records being indexed within a given session, such as the record IDs found so far in the session. The attributes are cleared at the end of each session.

Returns:
A Map of record ID keys, or null
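
One use named above is tracking the record IDs found so far in a session. A minimal sketch of that idea, assuming the session map is keyed by record ID: the class SessionIdTracker, the method registerId, and the choice to store the source path as the value are hypothetical and not part of DLESE Tools.

```java
import java.util.HashMap;

/** Hypothetical helper, not part of DLESE Tools: tracks record IDs seen in one session. */
final class SessionIdTracker {

    private SessionIdTracker() { }

    /**
     * Records the ID in the session map and reports whether it was new.
     * The map stands in for the value returned by getSessionAttributes(),
     * which may be null; in that case nothing can be tracked.
     */
    static boolean registerId(HashMap<String, Object> sessionAttr, String recordId, String sourcePath) {
        if (sessionAttr == null) {
            return true; // No session map available, so treat the ID as new.
        }
        if (sessionAttr.containsKey(recordId)) {
            return false; // Duplicate ID within this session.
        }
        sessionAttr.put(recordId, sourcePath);
        return true;
    }

    public static void main(String[] args) {
        HashMap<String, Object> session = new HashMap<>();
        System.out.println(registerId(session, "ID-1", "/records/a.xml"));
        System.out.println(registerId(session, "ID-1", "/records/b.xml"));
    }
}
```

A writer could call such a helper from addCustomFields and, on a duplicate, decide whether to throw an Exception or call abortIndexing().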

getSourceFile

public File getSourceFile()
Gets the sourceFile that is being indexed. Only available after create() has been called.

Returns:
The sourceFile value

getDocsource

public String getDocsource()
Gets the absolute path to the file, which is indexed under the 'docsource' field.

Returns:
The absolute path to the file

getSourceDir

public File getSourceDir()
Gets the sourceDir that holds the file being indexed. Only available after create() has been called.

Returns:
The sourceDir value

getLuceneDoc

public org.apache.lucene.document.Document getLuceneDoc()
Gets the Lucene Document that this Writer is building.

Returns:
The Lucene Document

getPreviousRecordDoc

public org.apache.lucene.document.Document getPreviousRecordDoc()
Gets the previous Document that currently resides in the index for the given resource, or null if none was previously present.

Returns:
The previousRecordDoc value

setFileIndexingService

public void setFileIndexingService(FileIndexingService fileIndexingService)
Sets the fileIndexingService attribute of the FileIndexingServiceWriter object

Parameters:
fileIndexingService - The new fileIndexingService.

getFileIndexingService

public FileIndexingService getFileIndexingService()
Gets the fileIndexingService attribute of the FileIndexingServiceWriter object

Returns:
The fileIndexingService.

isValidationEnabled

public boolean isValidationEnabled()
Returns true if the files being indexed should be validated, otherwise false. This method may be ignored by concrete classes if not needed.

Returns:
true if validation is enabled.

setValidationEnabled

public void setValidationEnabled(boolean validateFiles)
Sets whether or not to validate the files being indexed and create a validation report, which is indexed. This value is set by the FileIndexingService prior to indexing. If true, the method getValidationReport() will be called, otherwise it will not.

Parameters:
validateFiles - True to validate, else false.
See Also:
getValidationReport(), FileIndexingService.setValidationEnabled(boolean validateFiles)

getValidationReport

protected String getValidationReport()
                              throws Exception
Gets a report detailing any errors found in the validation of the file, or null if no error was found. This method should be overridden by concrete classes that need to validate the underlying file before indexing. Otherwise, this default method will simply return null. This method is called after all other method calls.

Returns:
Null if no file validation errors were found, otherwise a String that details the nature of the error.
Throws:
Exception - If an error occurs.
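
The contract above (null when no error is found, otherwise a descriptive String) can be sketched with a plain XML well-formedness check. This is only an illustration: the class WellFormednessCheck and its report method are hypothetical, and a real override might validate against a schema instead of only checking well-formedness.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper, not part of DLESE Tools: null-or-message validation report. */
final class WellFormednessCheck {

    private WellFormednessCheck() { }

    /** Returns null if the XML is well-formed, otherwise a message describing the error. */
    static String report(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors and avoids default error logging.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return null;
        } catch (Exception e) {
            return "Not well-formed XML: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(report("<record><id>1</id></record>"));
        System.out.println(report("<record><id>1</record>"));
    }
}
```

An override of getValidationReport() could pass the result of getFileContent() to a check like this and return the result unchanged.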

addToDefaultField

protected void addToDefaultField(String value)
Adds the given String to the 'default' and 'stems' fields as text and stemmed text, respectively. The default and stems fields may be used in queries to quickly search for text across fields. This method should be called from the addCustomFields of implementing classes.

Parameters:
value - A text string to be added to the indexed fields named 'default' and 'stems'

addToAdminDefaultField

protected void addToAdminDefaultField(String value)
Adds the given String to a text field referenced in the index by the field name 'admindefault'. The admindefault field may be used in queries to quickly search for text across fields. This method should be called from the addCustomFields of implementing classes.

Parameters:
value - A text string to be added to the indexed field named 'admindefault.'

getDeletedDoc

public org.apache.lucene.document.Document getDeletedDoc(org.apache.lucene.document.Document previousRecordDoc)
                                                  throws Throwable
Creates a Lucene Document equal to the existing FileIndexingService Document except the field "deleted" is set to "true" and the field "modtime" is set to the current time.

Design note: This method should be overridden by subclasses that require more involved logic for deletes. This super method should be called first, and subclasses should then check isMakingDeletedDoc() to execute as appropriate.

Parameters:
previousRecordDoc - An existing FileIndexingService Document that currently resides in the index for the given file
Returns:
A Lucene FileIndexingService Document with appropriate fields updated
Throws:
Throwable - Thrown if error occurs
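
The copy-and-update behavior described above can be sketched with a plain Map standing in for a Lucene Document. Everything here is an assumption for illustration: the class DeletedDocSketch, the method markDeleted, and the millisecond String used for "modtime" (the actual stored format is not specified on this page).

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch, not part of DLESE Tools: a Map stands in for a Lucene Document. */
final class DeletedDocSketch {

    private DeletedDocSketch() { }

    /**
     * Returns a copy of the previous fields with "deleted" set to "true" and
     * "modtime" set to the given time. The input map is left unchanged.
     */
    static Map<String, String> markDeleted(Map<String, String> previous, long nowMillis) {
        Map<String, String> copy = new LinkedHashMap<>(previous);
        copy.put("deleted", "true");
        // Assumption: modtime is stored as a millisecond timestamp String.
        copy.put("modtime", Long.toString(nowMillis));
        return copy;
    }

    public static void main(String[] args) {
        Map<String, String> previous = new LinkedHashMap<>();
        previous.put("docsource", "/records/a.xml");
        previous.put("modtime", "1000");
        System.out.println(markDeleted(previous, System.currentTimeMillis()));
    }
}
```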

setIsMakingDeletedDoc

protected void setIsMakingDeletedDoc(boolean isMakingDeletedDoc)
Sets whether this DocWriter is making a deleted document. Used by subclasses that create a DocWriter in their getDeletedDoc method.

Parameters:
isMakingDeletedDoc - Sets the making deleted doc status

isMakingDeletedDoc

protected final boolean isMakingDeletedDoc()
True if a deleted doc is being created in the current execution.

Returns:
True if a deleted doc is being created

abortIndexing

protected void abortIndexing()
Aborts the indexing process by returning a null index document.


addDocToRemove

protected void addDocToRemove(String field,
                              String value)
Removes a matching item from the index during the FileIndexingService update. This method should be called to instruct the indexer to remove documents that should no longer be in the index.

Parameters:
field - The field to search in.
value - The matching value for the item to remove.

create

public FileIndexingServiceData create(File sourceFile,
                                      org.apache.lucene.document.Document existingLuceneDoc,
                                      FileIndexingPlugin plugin,
                                      HashMap sessionAttr)
                               throws Throwable
Creates the Lucene Document for the given resource or returns null if unable to create. This method is called by class FileIndexingService.

Parameters:
sourceFile - The source file to be indexed
existingLuceneDoc - An existing Document that currently resides in the index for the given resource, or null if none was previously present
plugin - The FileIndexingPlugin being used, or null
sessionAttr - Attributes used in a given indexing session
Returns:
A Lucene Document with its fields populated, or null.
Throws:
Throwable - Thrown if error occurs
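
How create might tie the subclass hooks together can be sketched as a template method. This is a simplified model under stated assumptions, not the library's code: SketchWriter and TitleWriter are hypothetical, a Map stands in for the Lucene Document, a String stands in for the source File, and the call order (init, addCustomFields, then destroy in a finally block) is inferred from the method descriptions on this page rather than confirmed.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for the writer; a Map stands in for a Lucene Document. */
abstract class SketchWriter {
    private boolean aborted;

    protected abstract void init(String source);
    protected abstract void addCustomFields(Map<String, String> newDoc, String source);
    protected abstract void destroy();

    /** Marks the current document so that create() returns null. */
    protected void abortIndexing() {
        aborted = true;
    }

    /** Assumed order: set-up, field creation, then tear-down even on failure. */
    final Map<String, String> create(String source) {
        aborted = false;
        Map<String, String> doc = new LinkedHashMap<>();
        try {
            init(source);
            doc.put("docsource", source);
            addCustomFields(doc, source);
        } finally {
            destroy();
        }
        return aborted ? null : doc;
    }
}

/** Hypothetical concrete writer that adds one custom field. */
final class TitleWriter extends SketchWriter {
    @Override protected void init(String source) { }

    @Override protected void addCustomFields(Map<String, String> newDoc, String source) {
        if (source.isEmpty()) {
            abortIndexing(); // Nothing to index for an empty source.
            return;
        }
        newDoc.put("title", "Title of " + source);
    }

    @Override protected void destroy() { }

    public static void main(String[] args) {
        System.out.println(new TitleWriter().create("a.xml"));
        System.out.println(new TitleWriter().create(""));
    }
}
```

The real create method also takes the existing Document, a FileIndexingPlugin, and session attributes, and returns a FileIndexingServiceData; those are omitted here to keep the control flow visible.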

setFileIndexingPlugin

public void setFileIndexingPlugin(FileIndexingPlugin plugin)
Sets the FileIndexingPlugin that will be used during the indexing process to index additional fields. Set to null to remove.

Parameters:
plugin - A FileIndexingPlugin to use during indexing.

getFileIndexingPlugin

public FileIndexingPlugin getFileIndexingPlugin()
Gets the FileIndexingPlugin that has been set for use during indexing, or null if none.

Returns:
The FileIndexingPlugin configured for use, or null.

prtlnErr

protected final void prtlnErr(String s)
Output a line of text to error out, with datestamp.

Parameters:
s - The text that will be output to error out.

prtln

protected final void prtln(String s)
Output a line of text to standard out, with datestamp, if debug is set to true.

Parameters:
s - The String that will be output.

setDebug

public static final void setDebug(boolean db)
Sets the debug attribute of the FileIndexingServiceWriter object

Parameters:
db - The new debug value
