FileListFilter
A FileListFilter lets you filter the files that the connector is allowed to process. Files that are filtered out are simply
ignored and remain untouched on the file system until the next file listing operation. During the next execution, the
previously filtered files are evaluated again to determine whether they should be processed.
You can configure multiple FileListFilters using the following connector configuration property:
| Configuration | Description | Type | Default |
|---|---|---|---|
| fs.listing.filters | A comma-separated list of FileListFilter classes used to list eligible input files. | list | - |
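To illustrate how several filters can work together, here is a minimal Java sketch. It assumes that a file is kept only when every configured filter accepts it, which is not stated explicitly above. The class `FilterChainSketch` and its method are hypothetical names used for illustration and are not part of the FilePulse API.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical illustration only: none of these names belong to the FilePulse API.
class FilterChainSketch {

    // Keeps only the file names accepted by every filter (assumed AND semantics).
    static List<String> applyAll(List<String> fileNames, List<Predicate<String>> filters) {
        Predicate<String> combined = filters.stream().reduce(name -> true, Predicate::and);
        return fileNames.stream().filter(combined).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Predicate<String> notHidden = name -> !name.startsWith(".");
        Predicate<String> isLog = name -> name.endsWith(".log");
        List<String> listed = List.of("app.log", ".app.log", "data.csv");
        // Files rejected here are simply skipped; they would be evaluated again on the next listing.
        System.out.println(applyAll(listed, List.of(notHidden, isLog))); // prints [app.log]
    }
}
```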
FilePulse provides several built-in FileListFilters:
IgnoreHiddenFileFilter
You can use the IgnoreHiddenFileFilter to ignore hidden files.
Configuration example
fs.listing.filters=io.streamthoughts.kafka.connect.filepulse.fs.filter.IgnoreHiddenFileListFilter
Limitation
The IgnoreHiddenFileFilter can only be used when the LocalFSDirectoryListing is configured.
LastModifiedFileFilter
You can use the LastModifiedFileFilter to select only the files that have not been modified for at least a given duration.
fs.listing.filters=io.streamthoughts.kafka.connect.filepulse.fs.filter.LastModifiedFileFilter
# Minimum time, in milliseconds, since a file was last modified before it is accepted (default: 5000)
file.filter.minimum.age.ms=10000
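The age check described above can be sketched in Java as follows. This is an illustration under stated assumptions, not the FilePulse implementation: the class `MinimumAgeSketch` is hypothetical, and treating a file that is exactly the minimum age as accepted is an assumption.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical illustration only: not the FilePulse implementation.
class MinimumAgeSketch {

    // Returns true when at least minimumAge has elapsed between lastModified and now.
    // Whether the boundary is inclusive is an assumption of this sketch.
    static boolean isOldEnough(Instant lastModified, Instant now, Duration minimumAge) {
        return Duration.between(lastModified, now).compareTo(minimumAge) >= 0;
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2023-08-23T10:00:10Z");
        Duration minimumAge = Duration.ofMillis(10_000); // mirrors file.filter.minimum.age.ms=10000
        System.out.println(isOldEnough(Instant.parse("2023-08-23T10:00:00Z"), now, minimumAge)); // true
        System.out.println(isOldEnough(Instant.parse("2023-08-23T10:00:05Z"), now, minimumAge)); // false
    }
}
```

A minimum age of this kind is typically used to avoid picking up files that another process is still writing.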
RegexFileFilter
You can use the RegexFileFilter to select only the files that match a given regular expression.
fs.listing.filters=io.streamthoughts.kafka.connect.filepulse.fs.filter.RegexFileListFilter
# The regex pattern used to match input files
file.filter.regex.pattern="\\.log$"
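When writing a pattern, it helps to know whether it must match the whole file name or only part of it. This page does not say which behavior the filter uses, so the following Java sketch simply contrasts the two standard `java.util.regex` behaviors for the pattern above. The class `RegexMatchSketch` is a hypothetical name used for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only: shows standard java.util.regex behavior,
// not how FilePulse applies the configured pattern.
class RegexMatchSketch {

    // True when the pattern is found anywhere in the file name.
    static boolean containsMatch(String regex, String fileName) {
        return Pattern.compile(regex).matcher(fileName).find();
    }

    // True only when the pattern matches the entire file name.
    static boolean fullMatch(String regex, String fileName) {
        return Pattern.compile(regex).matcher(fileName).matches();
    }

    public static void main(String[] args) {
        System.out.println(containsMatch("\\.log$", "server.log")); // true
        System.out.println(fullMatch("\\.log$", "server.log"));     // false
        System.out.println(fullMatch(".*\\.log$", "server.log"));   // true
    }
}
```

If a pattern such as `\.log$` does not select the expected files, prefixing it with `.*` makes it work under both behaviors.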
SizeFileListFilter
You can use the SizeFileListFilter to filter files that are smaller or larger than a specific byte size.
fs.listing.filters=io.streamthoughts.kafka.connect.filepulse.fs.filter.SizeFileListFilter
file.filter.minimum.size.bytes=0
file.filter.maximum.size.bytes=9223372036854775807
DateInFilenameFileListFilter
You can use the DateInFilenameFileListFilter to filter files whose names contain a date that is earlier or later than
a specific date.
fs.listing.filters=io.streamthoughts.kafka.connect.filepulse.fs.filter.DateInFilenameFileListFilter
file.filter.date.regex.extractor.pattern="^.*(\\d{4}-\\d{2}-\\d{2})_.*$"
file.filter.date.formatter.pattern="yyyy-MM-dd"
file.filter.date.min.date="2023-08-23"
file.filter.date.max.date="2023-08-24"
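The date extraction and comparison can be sketched in Java as follows. This is an illustration under stated assumptions, not the FilePulse implementation: the class `DateInFilenameSketch` is hypothetical, both bounds are assumed to be inclusive, files without a recognizable date are assumed to be rejected, and the formatter pattern is chosen to match the dash-separated date captured by the regex.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only: not the FilePulse implementation.
class DateInFilenameSketch {

    // Same regex as in the configuration example; group 1 captures the date.
    static final Pattern EXTRACTOR = Pattern.compile("^.*(\\d{4}-\\d{2}-\\d{2})_.*$");
    // Must describe the text captured by group 1 (dash-separated here).
    static final DateTimeFormatter FORMATTER = DateTimeFormatter.ofPattern("yyyy-MM-dd");

    // Extracts the date from the file name, or returns empty when the name does not match.
    static Optional<LocalDate> extractDate(String fileName) {
        Matcher matcher = EXTRACTOR.matcher(fileName);
        if (!matcher.matches()) {
            return Optional.empty();
        }
        return Optional.of(LocalDate.parse(matcher.group(1), FORMATTER));
    }

    // Accepts files whose date lies in [min, max]; inclusive bounds are an assumption.
    static boolean accept(String fileName, LocalDate min, LocalDate max) {
        return extractDate(fileName)
                .map(date -> !date.isBefore(min) && !date.isAfter(max))
                .orElse(false);
    }

    public static void main(String[] args) {
        LocalDate min = LocalDate.parse("2023-08-23");
        LocalDate max = LocalDate.parse("2023-08-24");
        System.out.println(accept("app-2023-08-23_events.log", min, max)); // true
        System.out.println(accept("app-2023-08-25_events.log", min, max)); // false
        System.out.println(accept("notes.txt", min, max));                 // false
    }
}
```

Note that the formatter pattern has to agree with the text captured by the regex; otherwise parsing the captured date fails.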