Regex Entity Connector: trim extracted values
DESCRIPTION
Extracted values should be trimmed, because there is no benefit in keeping spaces before or after an extracted value.
Here is a use case where the result is bad if these spaces are present. Take these 3 files with the following text:
file 1
SUBJECT A ; SUBJECT B ; SUBJECT C
file 2
SUBJECT B;SUBJECT C
file 3
SUBJECT A ; SUBJECT B ; SUBJECT C
If we want to extract the text between the ";" separators, a simple regular expression is ([^;]+), which matches every group of characters that does not contain ";", as many times as it occurs. But the result will be:
for file 1:
" SUBJECT A ", " SUBJECT B " and " SUBJECT C"
for file 2:
"SUBJECT B" and "SUBJECT C"
If we extract these values into a multivalued field, say keywords, it will be filled with these values:
" SUBJECT A ", " SUBJECT B ", " SUBJECT C", "SUBJECT B" and "SUBJECT C"
In the Search UI, the corresponding facet then lists values that differ only by surrounding spaces, such as " SUBJECT B " and "SUBJECT B", as separate entries.
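The behaviour described above can be reproduced outside the connector with plain java.util.regex. This is only a minimal sketch: the class name RawExtractionDemo and the extract helper are hypothetical and are not part of the connector code, and file 1 is assumed to start with a leading space, as the result listed for it suggests.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not connector code.
class RawExtractionDemo {
    // The regex from the ticket: any run of characters that does not contain ';'.
    private static final Pattern VALUE = Pattern.compile("([^;]+)");

    // Returns every group-1 match exactly as matched (no trimming).
    static List<String> extract(String text) {
        List<String> values = new ArrayList<>();
        Matcher m = VALUE.matcher(text);
        while (m.find()) {
            values.add(m.group(1));
        }
        return values;
    }

    public static void main(String[] args) {
        // Assumed file contents; the leading space in file1 is an assumption.
        String file1 = " SUBJECT A ; SUBJECT B ; SUBJECT C";
        String file2 = "SUBJECT B;SUBJECT C";

        // Distinct values, similar to what a facet on the field would list.
        Set<String> keywords = new LinkedHashSet<>();
        keywords.addAll(extract(file1));
        keywords.addAll(extract(file2));

        for (String keyword : keywords) {
            System.out.println("\"" + keyword + "\"");
        }
    }
}
```

With these inputs the set holds five distinct values, although only three different subjects are present.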
We get a better result with this regex: \s*([^;]+)\s*. By adding \s* on both sides of the original regex, we try to remove the spaces before and after the group, but this only works for leading spaces: the group [^;]+ is greedy and also consumes the trailing spaces, so the final \s* has nothing left to match. It is often very difficult, if not impossible, to create a regular expression that removes spaces both before and after extracted terms.
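The limitation of the \s* variant can be checked with a short standalone test. This is a sketch under assumptions: the class name LeadingOnlyDemo and its extract helper are hypothetical, and the input string is the assumed content of file 1 with a leading space.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not connector code.
class LeadingOnlyDemo {
    // The second regex from the ticket, with \s* on both sides of the group.
    private static final Pattern VALUE = Pattern.compile("\\s*([^;]+)\\s*");

    // Returns every group-1 match exactly as matched.
    static List<String> extract(String text) {
        List<String> values = new ArrayList<>();
        Matcher m = VALUE.matcher(text);
        while (m.find()) {
            values.add(m.group(1));
        }
        return values;
    }

    public static void main(String[] args) {
        // Leading spaces are consumed by the first \s*, but the greedy group
        // swallows the trailing spaces before the last \s* can match them.
        for (String value : extract(" SUBJECT A ; SUBJECT B ; SUBJECT C")) {
            System.out.println("\"" + value + "\"");
        }
    }
}
```

The extracted values no longer start with a space, but "SUBJECT A " and "SUBJECT B " still end with one, so they would still not merge with the untrimmed-free values from file 2.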
Given that there are very few cases where it would be useful to keep these spaces, we might as well make the connector more user-friendly by trimming the extracted terms by default.
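One possible shape for the proposed default is to apply String.trim() to each captured group after matching, instead of asking users to encode the trimming in the regex. The sketch below is illustrative only: TrimmedExtractionDemo and extractTrimmed are hypothetical names, the way the real connector reads its groups may differ, and skipping values that become empty after trimming is an extra assumption not stated in the ticket.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not connector code.
class TrimmedExtractionDemo {
    // Extracts group 1 of every match and trims it before keeping it.
    static List<String> extractTrimmed(Pattern pattern, String text) {
        List<String> values = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            String value = m.group(1).trim();
            // Assumption: a value made only of spaces is dropped.
            if (!value.isEmpty()) {
                values.add(value);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // The simple regex from the ticket is enough once values are trimmed.
        Pattern simple = Pattern.compile("([^;]+)");

        Set<String> keywords = new LinkedHashSet<>();
        keywords.addAll(extractTrimmed(simple, " SUBJECT A ; SUBJECT B ; SUBJECT C"));
        keywords.addAll(extractTrimmed(simple, "SUBJECT B;SUBJECT C"));

        for (String keyword : keywords) {
            System.out.println("\"" + keyword + "\"");
        }
    }
}
```

With trimming applied after extraction, the two assumed files yield only three distinct keywords. Note that String.trim() removes characters up to U+0020 only; String.strip() (Java 11+) would also cover other Unicode whitespace, if that matters for the indexed content.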
VERSION CONCERNED
Datafari 6.0
CHECKLIST BEFORE CLOSING TICKET
- Documentation:
  - I have created the functional documentation in the wiki
  - I have created the technical documentation in the wiki: N/A
  - I have added javadoc comments on key functions in my code: N/A
- Security: N/A
  - I have cleaned up any input coming from users
  - I have not put any token APIs, passwords or the like in my code
  - I am not using 3rd party libraries that are deprecated or not maintained
