XMLFileIndexingWriter (DLESE Tools API Documentation v1.6.0)

org.dlese.dpc.index.writer
Class XMLFileIndexingWriter

java.lang.Object
  org.dlese.dpc.index.writer.FileIndexingServiceWriter
      org.dlese.dpc.index.writer.XMLFileIndexingWriter

All Implemented Interfaces:: DocWriter

Direct Known Subclasses:: DleseAnnoFileIndexingServiceWriter, DleseCollectionFileIndexingWriter, ItemFileIndexingWriter, NCSCollectionFileIndexingWriter, NewsOppsFileIndexingWriter, SimpleXMLFileIndexingWriter

public abstract class XMLFileIndexingWriter
extends FileIndexingServiceWriter

Creates a Lucene Document from any XML file by stripping the XML tags to extract and index the content. The reader for this type of Document is XMLDocReader.

The Lucene Document fields that are created by this class are (in addition to the ones listed for FileIndexingServiceWriter):

collection - The collection associated with this resource.

Author:: John Weatherley
See Also:: FileIndexingService, XMLDocReader

Constructor Summary
`XMLFileIndexingWriter()` Constructor for the XMLFileIndexingWriter.

Method Summary
`protected abstract String[]`	`_getIds()` Return unique IDs for the item being indexed, one for each collection that catalogs the resource.
`protected void`	`addCustomFields(org.apache.lucene.document.Document newDoc, org.apache.lucene.document.Document existingDoc, File sourceFile)` Adds the full content of the XML to the default search field.
`protected abstract void`	`addFields(org.apache.lucene.document.Document newDoc, org.apache.lucene.document.Document existingDoc, File sourceFile)` Adds additional fields that are unique to the document format being indexed.
`protected BoundingBox`	`getBoundingBox()` Return the geospatial BoundingBox footprint that represents the resource being indexed, or null if none applies.
`protected String[]`	`getCollections()` Returns unique collection keys for the item being indexed.
`org.apache.lucene.document.Document`	`getDeletedDoc(org.apache.lucene.document.Document existingDoc)` Creates a Lucene Document for the XML that is equal to the existing Document.
`abstract String`	`getDescription()` Return a description for the document being indexed, or null if none applies.
`String`	`getDocGroup()` Gets the collection specifier, for example 'dcc', 'comet'.
`protected Document`	`getDom4jDoc()` Gets the dom4j Document for use by sub-classes
`protected String`	`getFieldContent(String[] values, String useVocabMapping, String metadataFormat)` Gets the vocab encoded keys for the given values, separated by the '+' symbol.
`protected String`	`getFieldContent(String value, String useVocabMapping, String metadataFormat)` Gets the encoded vocab key for the given content.
`protected String`	`getFieldName(String vocabFieldString, String metadataFormat)` Gets the field ID, for example 'gr', for a given vocab, for example 'gradeRange'.
`String[]`	`getIds()` Returns the ids for the item being indexed.
`protected SimpleLuceneIndex`	`getIndex()` Gets the index used by this XML File Indexer
`protected ResultDocList`	`getMyAnnoResultDocs()` Gets the annotations for this record, null or zero length if none available.
`protected DleseCollectionDocReader`	`getMyCollectionDoc()` Gets the DleseCollectionDocReader for the collection of which this item is a part, or null if not available.
`static String`	`getOaiModtime(File sourceFile, org.apache.lucene.document.Document existingDoc)` Gets the oaiModtime for the given File or Document, set to 3 minutes in the future to account for any delay in indexing updates.
`String`	`getPrimaryId()` Returns the unique primary record ID for the item being indexed.
`protected RecordDataService`	`getRecordDataService()` Gets the recordDataService used by this XML File Indexer
`List`	`getRelatedIds()` Gets the ids of related records.
`Map`	`getRelatedIdsMap()` Gets the ids of related records.
`List`	`getRelatedUrls()` Gets the urls of related records.
`Map`	`getRelatedUrlsMap()` Gets the urls of related records.
`protected String`	`getTermStringFromStringArray(String[] vals)` Gets the appropriate terms from a string array of metadata fields.
`abstract String`	`getTitle()` Return a title for the document being indexed, or null if none applies.
`abstract String[]`	`getUrls()` Return the URL(s) to the resource being indexed, or null if none apply.
`protected abstract Date`	`getWhatsNewDate()` Returns the date used to determine "What's new" in the library, or null if none is available.
`protected abstract String`	`getWhatsNewType()` Returns the type of category for "What's new" in the library, or null if none is available.
`protected XMLIndexer`	`getXmlIndexer()` Gets the XMLIndexer for use by sub-classes
`protected XMLIndexerFieldsConfig`	`getXmlIndexerFieldsConfig()` Gets the XMLIndexerFieldsConfig to use for XML indexing, or null if none available.
`abstract boolean`	`indexFullContentInDefaultAndStems()` Return true to have the full XML content indexed in the 'default' and 'stems' fields, false if handled by the sub-class.
`abstract void`	`init(File source, org.apache.lucene.document.Document existingDoc)` This method is called prior to processing and may be used for any necessary set-up.

Methods inherited from class org.dlese.dpc.index.writer.FileIndexingServiceWriter
abortIndexing, addDocToRemove, addToAdminDefaultField, addToDefaultField, create, destroy, getConfigAttributes, getDocsource, getDocType, getFileContent, getFileIndexingPlugin, getFileIndexingService, getLuceneDoc, getPreviousRecordDoc, getReaderClass, getSessionAttributes, getSourceDir, getSourceFile, getValidationReport, isMakingDeletedDoc, isValidationEnabled, prtln, prtlnErr, setConfigAttributes, setDebug, setFileIndexingPlugin, setFileIndexingService, setIsMakingDeletedDoc, setValidationEnabled

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Constructor Detail

XMLFileIndexingWriter

public XMLFileIndexingWriter()

Constructor for the XMLFileIndexingWriter.

Method Detail

getIds

public String[] getIds()
                throws Exception

Returns the ids for the item being indexed. If more than one ID is present, the first one is the primary.

Returns:: The id String(s)
Throws:: Exception - If error
See Also:: _getIds()

getPrimaryId

public String getPrimaryId()
                    throws Exception

Returns the unique primary record ID for the item being indexed. If more than one record catalogs the same item, this represents the primary ID.

Returns:: The id String
Throws:: Exception - If error
See Also:: getIds()

getRelatedIds

public List getRelatedIds()
                   throws IllegalStateException,
                          Exception

Gets the ids of related records.

Returns:: The related ids value, or null if none
Throws:: IllegalStateException - If called prior to calling method #indexFields; Exception - If error

getRelatedUrls

public List getRelatedUrls()
                    throws IllegalStateException,
                           Exception

Gets the urls of related records.

Returns:: The related urls value, or null if none
Throws:: IllegalStateException - If called prior to calling method #indexFields; Exception - If error

getRelatedIdsMap

public Map getRelatedIdsMap()
                     throws IllegalStateException,
                            Exception

Gets the ids of related records. The Map key contains the relationship (isAnnotatedBy, etc.) and the Map value contains a List of Strings that indicate the ids of the target records.

Returns:: The related ids value, or null if none
Throws:: IllegalStateException - If called prior to calling method #indexFields; Exception - If error

getRelatedUrlsMap

public Map getRelatedUrlsMap()
                      throws IllegalStateException,
                             Exception

Gets the urls of related records. The Map key contains the relationship (isAnnotatedBy, etc.) and the Map value contains a List of Strings that indicate the urls of the target records.

Returns:: The related urls value, or null if none
Throws:: IllegalStateException - If called prior to calling method #indexFields; Exception - If error
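The Map shape described for getRelatedIdsMap and getRelatedUrlsMap can be illustrated with plain collections. The following is a minimal sketch, not DLESE code: the class RelatedIdsSketch, the helper idsFor, the sample ids and the 'isReplacedBy' relationship are all hypothetical, and the generic types are an assumption, since the documented return type is a raw Map.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not part of the DLESE Tools API.
final class RelatedIdsSketch {

    // Returns the target ids for one relationship, or an empty list when the
    // map is null (documented as "null if none") or has no such relationship.
    static List<String> idsFor(Map<String, List<String>> relatedIdsMap, String relationship) {
        if (relatedIdsMap == null) {
            return Collections.<String>emptyList();
        }
        List<String> ids = relatedIdsMap.get(relationship);
        return ids == null ? Collections.<String>emptyList() : ids;
    }

    public static void main(String[] args) {
        // Key: the relationship. Value: ids of the target records (made-up samples).
        Map<String, List<String>> related = new HashMap<>();
        related.put("isAnnotatedBy", Arrays.asList("ANNO-001", "ANNO-002"));

        System.out.println(idsFor(related, "isAnnotatedBy"));
        System.out.println(idsFor(related, "isReplacedBy"));
        System.out.println(idsFor(null, "isAnnotatedBy"));
    }
}
```

Guarding against null in one helper keeps callers from repeating the check, since both map getters are documented to return null when there are no related records.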

getCollections

protected String[] getCollections()
                           throws Exception

Returns unique collection keys for the item being indexed. For example "dcc" (single collection) or "dcc dwel" (multiple collections). If more than one collection is provided, the first one must be the primary collection. May be overridden by sub-classes as appropriate (overridden by ADNFileIndexingWriter).

Returns:: The collection keys
Throws:: Exception - This method should throw an Exception with an appropriate error message if an error occurs.

getDocGroup

public String getDocGroup()
                   throws Exception

Gets the collection specifier, for example 'dcc', 'comet'.

Specified by:: getDocGroup in class FileIndexingServiceWriter

Returns:: The collection specifier
Throws:: Exception - If an error occurred

getBoundingBox

protected BoundingBox getBoundingBox()
                              throws Exception

Return the geospatial BoundingBox footprint that represents the resource being indexed, or null if none applies. Override if necessary.

Returns:: BoundingBox, or null
Throws:: Exception - This method should throw an Exception with an appropriate error message if an error occurs.

init

public abstract void init(File source,
                          org.apache.lucene.document.Document existingDoc)
                   throws Exception

This method is called prior to processing and may be used for any necessary set-up. This method should throw an Exception with an appropriate message if an error occurs.

Specified by:: init in class FileIndexingServiceWriter

Parameters:: source - The source file being indexed; existingDoc - An existing Document that currently resides in the index for the given resource, or null if none was previously present
Throws:: Exception - If an error occurred during set-up.

_getIds

protected abstract String[] _getIds()
                             throws Exception

Return unique IDs for the item being indexed, one for each collection that catalogs the resource. For example "DLESE-000-000-000-001" (single ID) or "DLESE-000-000-000-036 COMET-60" (multiple IDs). If more than one ID is present, the first one is the primary.

Returns:: The id(s)
Throws:: Exception - This method should throw an Exception with an appropriate error message if an error occurs.
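The ordering rule for _getIds (the first ID is the primary) can be sketched as follows. This is an illustration only: PrimaryIdSketch and primaryId are hypothetical names, and returning null for an empty or null array is an assumption rather than documented behavior of getPrimaryId.

```java
// Hypothetical illustration; not part of the DLESE Tools API.
final class PrimaryIdSketch {

    // Returns the first id in the array, which the documentation designates as
    // the primary. Returning null when there are no ids is an assumption.
    static String primaryId(String[] ids) {
        return (ids == null || ids.length == 0) ? null : ids[0];
    }

    public static void main(String[] args) {
        // One id per collection that catalogs the resource; primary first.
        String[] ids = {"DLESE-000-000-000-036", "COMET-60"};
        System.out.println(primaryId(ids));
    }
}
```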

getTitle

public abstract String getTitle()
                         throws Exception

Return a title for the document being indexed, or null if none applies. The String is tokenized, stored and indexed under the field key 'title' and is also indexed in the 'default' field.

Returns:: The title String
Throws:: Exception - This method should throw an Exception with an appropriate error message if an error occurs.

getDescription

public abstract String getDescription()
                               throws Exception

Return a description for the document being indexed, or null if none applies. The String is tokenized, stored and indexed under the field key 'description' and is also indexed in the 'default' field.

Returns:: The description String
Throws:: Exception - This method should throw an Exception with an appropriate error message if an error occurs.

getUrls

public abstract String[] getUrls()
                          throws Exception

Return the URL(s) to the resource being indexed, or null if none apply. If more than one URL references the resource, the first one is the primary. The URL Strings are tokenized and indexed under the field key 'uri' and are also indexed in the 'default' field. They are also stored in the index untokenized under the field key 'url'.

Returns:: The url String(s)
Throws:: Exception - This method should throw an Exception with an appropriate error message if an error occurs.

indexFullContentInDefaultAndStems

public abstract boolean indexFullContentInDefaultAndStems()

Return true to have the full XML content indexed in the 'default' and 'stems' fields, false if handled by the sub-class. If true, the content is indexed using the #addToDefaultField method.

Returns:: True to have the full XML content indexed in the 'default' and 'stems' fields
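The boolean hook described above follows a template-method pattern: the base class decides whether to index the full content by asking the sub-class. A minimal sketch of that pattern, assuming nothing about the real implementation, is shown here. FullContentHookSketch, addFullContent and the defaultField list are hypothetical stand-ins; the real class works on Lucene Documents through #addToDefaultField.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the writer; not the DLESE implementation.
abstract class FullContentHookSketch {

    // Stands in for the 'default' search field of the Lucene Document.
    final List<String> defaultField = new ArrayList<>();

    // Hook: sub-classes return true to let the base class index the full content.
    abstract boolean indexFullContentInDefaultAndStems();

    // Base-class step that consults the hook before indexing the stripped content.
    final void addFullContent(String strippedContent) {
        if (indexFullContentInDefaultAndStems()) {
            defaultField.add(strippedContent);
        }
    }

    public static void main(String[] args) {
        FullContentHookSketch indexAll = new FullContentHookSketch() {
            boolean indexFullContentInDefaultAndStems() { return true; }
        };
        FullContentHookSketch handledBySubclass = new FullContentHookSketch() {
            boolean indexFullContentInDefaultAndStems() { return false; }
        };
        indexAll.addFullContent("Plate tectonics");
        handledBySubclass.addFullContent("Plate tectonics");
        System.out.println(indexAll.defaultField.size() + " " + handledBySubclass.defaultField.size());
    }
}
```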

getWhatsNewDate

protected abstract Date getWhatsNewDate()
                                 throws Exception

Returns the date used to determine "What's new" in the library, or null if none is available.

Returns:: The what's new date for the item or null if not available.
Throws:: Exception - This method should throw an Exception with an appropriate error message if an error occurs.

getWhatsNewType

protected abstract String getWhatsNewType()
                                   throws Exception

Returns the type of category for "What's new" in the library, or null if none is available. Must be a simple lower case String with no spaces, for example 'itemnew', 'itemannocomplete', 'itemannoinprogress', 'annocomplete', 'annoinprogress', 'collection'.

Returns:: The what's new type.
Throws:: Exception - This method should throw an Exception with an appropriate error message if an error occurs.
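An implementer may want to check a candidate value against the "simple lower case String with no spaces" rule before returning it. The check below is a sketch under one reading of that rule (non-empty, no whitespace, no upper case characters); WhatsNewTypeSketch and isValidType are hypothetical names and the library itself may enforce something different.

```java
import java.util.Locale;

// Hypothetical illustration; not part of the DLESE Tools API.
final class WhatsNewTypeSketch {

    // Accepts null (documented as "none available"); otherwise requires a
    // non-empty value with no whitespace and no upper case characters.
    static boolean isValidType(String type) {
        if (type == null) {
            return true;
        }
        return !type.isEmpty()
                && type.equals(type.toLowerCase(Locale.ROOT))
                && type.chars().noneMatch(Character::isWhitespace);
    }

    public static void main(String[] args) {
        System.out.println(isValidType("itemnew"));
        System.out.println(isValidType("Item New"));
    }
}
```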

addFields

protected abstract void addFields(org.apache.lucene.document.Document newDoc,
                                  org.apache.lucene.document.Document existingDoc,
                                  File sourceFile)
                           throws Exception

Adds additional fields that are unique to the document format being indexed. When implementing this method, use the add method of the Document class to add a Field.

The following Lucene Field types are available for indexing with the Document:
Field.Text(String name, String value) -- tokenized, indexed, stored
Field.UnStored(String name, String value) -- tokenized, indexed, not stored
Field.Keyword(String name, String value) -- not tokenized, indexed, stored
Field.UnIndexed(String name, String value) -- not tokenized, not indexed, stored
Field(String name, String string, boolean store, boolean index, boolean tokenize) -- gives full control over whether the value is stored, indexed and tokenized

Example code:
    protected void addFields(Document newDoc, Document existingDoc, File sourceFile) throws Exception {
        // Add a tokenized, indexed and stored field to the new Document
        String customContent = "Some content";
        newDoc.add(Field.Text("mycustomfield", customContent));
    }

Parameters:: newDoc - The new Document that is being created for this resource; existingDoc - An existing Document that currently resides in the index for the given resource, or null if none was previously present; sourceFile - The sourceFile that is being indexed
Throws:: Exception - This method should throw an Exception with an appropriate error message if an error occurs.

addCustomFields

protected void addCustomFields(org.apache.lucene.document.Document newDoc,
                               org.apache.lucene.document.Document existingDoc,
                               File sourceFile)
                        throws Exception

Adds the full content of the XML to the default search field. Strips the XML tags to extract the content. Will not work properly if the XML is not well-formed.

Specified by:: addCustomFields in class FileIndexingServiceWriter

Parameters:: newDoc - The new Document that is being created for this resource; existingDoc - An existing Document that currently resides in the index for the given resource, or null if none was previously present; sourceFile - The source file that is being indexed
Throws:: Exception - This method should throw an Exception with an appropriate error message if an error occurs.
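The tag stripping described for addCustomFields can be approximated with the XML parser in the Java standard library. This is a sketch of the general technique, not the DLESE implementation, which is documented elsewhere to use dom4j. XmlTextExtractor and extractText are hypothetical names, and collapsing whitespace is an assumption. It also shows why well-formed input matters: the parser rejects anything else.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper; illustrates tag stripping only.
final class XmlTextExtractor {

    // Parses the XML and returns the text of all elements with runs of
    // whitespace collapsed. Throws IllegalArgumentException if parsing fails,
    // for example when the XML is not well-formed.
    static String extractText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            String raw = doc.getDocumentElement().getTextContent();
            return raw.trim().replaceAll("\\s+", " ");
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<record><title>Plate tectonics</title> <desc>An intro</desc></record>";
        System.out.println(extractText(xml));
    }
}
```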

getDeletedDoc

public org.apache.lucene.document.Document getDeletedDoc(org.apache.lucene.document.Document existingDoc)
                                                  throws Throwable

Creates a Lucene Document for the XML that is equal to the existing Document.

Overrides:: getDeletedDoc in class FileIndexingServiceWriter

Parameters:: existingDoc - An existing FileIndexingService Document that currently resides in the index for the given file
Returns:: A Lucene FileIndexingService Document
Throws:: Throwable - Thrown if error occurs

getMyAnnoResultDocs

protected ResultDocList getMyAnnoResultDocs()
                                     throws Exception

Gets the annotations for this record, null or zero length if none available.

Returns:: The myAnnoResultDocs value
Throws:: Exception - NOT YET DOCUMENTED

getXmlIndexerFieldsConfig

protected XMLIndexerFieldsConfig getXmlIndexerFieldsConfig()

Gets the XMLIndexerFieldsConfig to use for XML indexing, or null if none available.

Returns:: The xmlIndexerFieldsConfig value

getFieldContent

protected String getFieldContent(String[] values,
                                 String useVocabMapping,
                                 String metadataFormat)
                          throws Exception

Gets the vocab encoded keys for the given values, separated by the '+' symbol.

Parameters:: values - The values to encode; useVocabMapping - The mapping to use, for example "contentStandards"; metadataFormat - The metadata format, for example 'adn'
Returns:: The encoded vocab keys.
Throws:: Exception - If error.

getFieldContent

protected String getFieldContent(String value,
                                 String useVocabMapping,
                                 String metadataFormat)
                          throws Exception

Gets the encoded vocab key for the given content.

Parameters:: value - The value to encode; useVocabMapping - The vocab mapping to use, for example 'contentStandard'; metadataFormat - The metadata format, for example 'adn'
Returns:: The encoded value, or unchanged if unable to encode
Throws:: Exception - If error
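The two getFieldContent overloads can be pictured as a lookup followed by a join. The sketch below uses a plain Map in place of the vocabulary service; VocabKeySketch, encode, encodeAll and the sample keys are hypothetical. Passing unmapped values through unchanged is documented for the single-value overload and is only assumed here for the array overload.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical illustration; a Map stands in for the vocabulary mapping.
final class VocabKeySketch {

    // Returns the encoded key for a value, or the value unchanged if unmapped.
    static String encode(String value, Map<String, String> vocab) {
        String key = vocab.get(value);
        return key == null ? value : key;
    }

    // Encodes each value and joins the results with the '+' symbol.
    static String encodeAll(String[] values, Map<String, String> vocab) {
        StringJoiner joined = new StringJoiner("+");
        for (String value : values) {
            joined.add(encode(value, vocab));
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        Map<String, String> vocab = new HashMap<>();
        vocab.put("Primary elementary", "07"); // made-up key
        vocab.put("High school", "02");        // made-up key
        String[] values = {"Primary elementary", "High school", "Unmapped value"};
        System.out.println(encodeAll(values, vocab));
    }
}
```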

getFieldName

protected String getFieldName(String vocabFieldString,
                              String metadataFormat)
                       throws Exception

Gets the field ID, for example 'gr', for a given vocab, for example 'gradeRange'. If unable to get the field ID, the vocab field String is returned unchanged.

Parameters:: vocabFieldString - The field, for example 'gradeRange'; metadataFormat - The metadata format, for example 'adn'
Returns:: The field key, for example 'gr', or unchanged if unable to determine
Throws:: Exception - If error

getTermStringFromStringArray

protected String getTermStringFromStringArray(String[] vals)

Gets the appropriate terms from a string array of metadata fields. Uses all terms found after the last colon ":" found in the string.

Parameters:: vals - Metadata fields that must be delimited by colons.
Returns:: The individual terms used for indexing.
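The "after the last colon" rule can be sketched in a few lines. This is an illustration, not the library code: TermStringSketch and termsAfterLastColon are hypothetical names, the sample values are made up, and joining the extracted terms with a single space is an assumption about the output format.

```java
// Hypothetical illustration; not part of the DLESE Tools API.
final class TermStringSketch {

    // For each value, keeps the text after the last ':' (or the whole value if
    // there is no colon) and joins the results with single spaces.
    static String termsAfterLastColon(String[] vals) {
        StringBuilder terms = new StringBuilder();
        if (vals == null) {
            return "";
        }
        for (String val : vals) {
            if (val == null) {
                continue;
            }
            // lastIndexOf returns -1 when there is no colon, so +1 yields index 0.
            String term = val.substring(val.lastIndexOf(':') + 1).trim();
            if (term.isEmpty()) {
                continue;
            }
            if (terms.length() > 0) {
                terms.append(' ');
            }
            terms.append(term);
        }
        return terms.toString();
    }

    public static void main(String[] args) {
        String[] vals = {"DLESE:Primary elementary", "DLESE:Geoscience:Geology"};
        System.out.println(termsAfterLastColon(vals));
    }
}
```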

getXmlIndexer

protected XMLIndexer getXmlIndexer()
                            throws Exception

Gets the XMLIndexer for use by sub-classes

Returns:: The XMLIndexer
Throws:: Exception - If error

getDom4jDoc

protected Document getDom4jDoc()
                        throws Exception

Gets the dom4j Document for use by sub-classes

Returns:: The Document
Throws:: Exception - If error

getMyCollectionDoc

protected DleseCollectionDocReader getMyCollectionDoc()

Gets the DleseCollectionDocReader for the collection of which this item is a part, or null if not available.

Returns:: The myCollectionDoc value

getOaiModtime

public static final String getOaiModtime(File sourceFile,
                                         org.apache.lucene.document.Document existingDoc)

Gets the oaiModtime for the given File or Document, set to 3 minutes in the future to account for any delay in indexing updates.

Parameters:: sourceFile - The source file; existingDoc - The existing Doc
Returns:: The oaiModtime value
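The three-minute offset can be illustrated with java.time. This sketch shows only the padding arithmetic. OaiModtimeSketch, padModtime and INDEXING_DELAY are hypothetical names, and the documentation does not state which timestamp is used as the base or how the returned String is encoded, so neither is modeled here.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical illustration; not part of the DLESE Tools API.
final class OaiModtimeSketch {

    // Offset added to allow for delay in indexing updates, per the documentation.
    static final Duration INDEXING_DELAY = Duration.ofMinutes(3);

    // Returns the given modification time moved three minutes into the future.
    static Instant padModtime(Instant lastModified) {
        return lastModified.plus(INDEXING_DELAY);
    }

    public static void main(String[] args) {
        Instant modified = Instant.parse("2010-05-01T12:00:00Z"); // sample value
        System.out.println(padModtime(modified));
    }
}
```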

getRecordDataService

protected RecordDataService getRecordDataService()

Gets the recordDataService used by this XML File Indexer

Returns:: The recordDataService, or null if not available.

getIndex

protected SimpleLuceneIndex getIndex()

Gets the index used by this XML File Indexer

Returns:: The index, or null if not available.
