Regex EntityConnector: encoding bug

DESCRIPTION

When the Regex entity connector is placed before the Tika connector, the stored field may be incorrectly encoded if the source file is not encoded in UTF-8. In this example, the document title is extracted with the Regex connector from an XML file encoded in ISO-8859-1:

[Screenshot: image.png]

The title is displayed incorrectly, whereas the snippet shows special characters as expected. This is because Tika decodes the file correctly when extracting the snippet.
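The symptom can be reproduced outside the connector with a minimal Java sketch. It assumes (this is not confirmed by the ticket) that the title is extracted from raw bytes decoded as UTF-8; the class name, the `extractTitle` helper, and the `<title>` pattern are hypothetical and only serve as illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EncodingDemo {
    // Hypothetical pattern: the actual regex configured in the connector is not shown in this ticket.
    private static final Pattern TITLE = Pattern.compile("<title>(.*?)</title>", Pattern.DOTALL);

    // Decodes the raw bytes with the given charset, then applies the regex.
    // Returns the first captured title, or null when nothing matches.
    static String extractTitle(byte[] raw, Charset charset) {
        String text = new String(raw, charset);
        Matcher m = TITLE.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Sample document with accented characters, stored as ISO-8859-1 bytes.
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                + "<doc><title>R\u00e9sum\u00e9</title></doc>";
        byte[] raw = xml.getBytes(StandardCharsets.ISO_8859_1);

        // Assumed faulty path: the bytes are treated as UTF-8, so each accented
        // character is malformed input and becomes U+FFFD.
        System.out.println(extractTitle(raw, StandardCharsets.UTF_8));

        // Decoding with the file's real charset keeps the accented characters.
        System.out.println(extractTitle(raw, StandardCharsets.ISO_8859_1));
    }
}
```

If this assumption holds, one possible fix is to decode the content with the charset declared in the file (or detected from it) before the regex is applied, rather than relying on a fixed charset.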

VERSION CONCERNED

6.0

CHECKLIST BEFORE CLOSING TICKET

  • Documentation
    • I have created the functional documentation in the wiki: N/A
    • I have created the technical documentation in the wiki: N/A
    • I have added javadoc comments on key functions in my code
  • Security
    • I have cleaned up any input coming from users: N/A (processing is done on the contents of the file received from a Repo Connector)
    • I have not put any API tokens, passwords or the like in my code
    • I am not using 3rd party libraries that are deprecated or not maintained