index.py
A script for reliably indexing large amounts of data.
Usage
$ python3 -m utils.index --help
usage: index.py [-h] [--mode {prepare,index,prepare-and-index}]
[--ursadb URSADB] [--workdir WORKDIR] [--batch BATCH]
[--path PATH] [--path-mount PATH_MOUNT]
[--max-file-size-mb MAX_FILE_SIZE_MB]
[--type {gram3,text4,hash4,wide8}] [--tag TAGS]
[--workers WORKERS] [--working-datasets WORKING_DATASETS]
Reindex local files.
optional arguments:
-h, --help show this help message and exit
--mode {prepare,index,prepare-and-index}
Mode of operation. Only prepare batches, index them,
or both.
--ursadb URSADB URL of the ursadb instance.
--workdir WORKDIR Path to a working directory.
--batch BATCH Size of indexing batch.
--path PATH Path of samples to be indexed.
--path-mount PATH_MOUNT
Path to the samples to be indexed, as seen by ursadb
(if different).
--max-file-size-mb MAX_FILE_SIZE_MB
Maximum file size, in MB, to index. 128 By default.
--type {gram3,text4,hash4,wide8}
Index types. By default [gram3, text4, wide8, hash4]
--tag TAGS Additional tags for indexed datasets.
--workers WORKERS Number of parallel indexing jobs.
--working-datasets WORKING_DATASETS
Numer of working datasets (uses sane value by
default).
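The relationship between --path and --path-mount can be easier to see in code. The sketch below is only an illustration of the idea, not the script's actual implementation: the function name `to_mounted_path` and its parameters are hypothetical, and it assumes absolute POSIX-style paths. It rewrites a file path found under the local samples directory into the path that ursadb would see under its own mount point.

```python
from pathlib import PurePosixPath


def to_mounted_path(file_path: str, local_root: str, mount_root: str) -> str:
    """Map a file under local_root to the same file under mount_root.

    Hypothetical helper: illustrates what --path / --path-mount express,
    not how utils.index does it internally.
    """
    # relative_to() raises ValueError if file_path is not inside local_root.
    relative = PurePosixPath(file_path).relative_to(PurePosixPath(local_root))
    return str(PurePosixPath(mount_root) / relative)


if __name__ == "__main__":
    print(to_mounted_path(
        "/home/user/samples/a/b.bin",  # file as seen by the indexing script
        "/home/user/samples",          # value passed as --path
        "/mnt/samples",                # value passed as --path-mount
    ))
```

If ursadb sees the samples at the same location as the script does, both roots are identical and the mapping leaves the path unchanged.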
Example
This is probably the most complex script shipped with mquery. See the indexing guide for a complete tutorial. Basic usage is relatively simple, though. To index files with ursadb running natively, run:
$ python3 -m utils.index --workdir /tmp/work --path ../samples --path-mount /mnt/samples
ERROR:root:Can't connect to ursadb instance at tcp://localhost:9281
INFO:root:Prepare.1: load all indexed files into memory.
INFO:root:Prepare.2: find all new files.
INFO:root:Prepare.3: Got 1 files in 1 batches to index.
INFO:root:Index.1: Determine compacting threshold.
INFO:root:Index.1: Compact threshold = 84.
INFO:root:Index.2: Find prepared batches.
INFO:root:Index.2: Got 1 batches to run.
INFO:root:Index.3: Run index commands with 2 workers.
INFO:root:Index.4: Batch /tmp/work/batch_0000000000.txt done [1/1].
INFO:root:Index.5: Unlinking the workdir.
INFO:root:Indexing finished. Consider compacting the database now
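The "Got 1 files in 1 batches" line in the log reflects the --batch option: new files are grouped into batches of at most that many entries, and each batch is indexed as a unit. A minimal sketch of this kind of grouping follows. It is an assumption about the general technique, not code taken from the script, and the name `make_batches` is hypothetical.

```python
from typing import Iterator, List


def make_batches(files: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `files`, each with at most `batch_size` items.

    Hypothetical helper: shows the batching idea behind --batch only.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(files), batch_size):
        yield files[start:start + batch_size]


if __name__ == "__main__":
    new_files = ["/mnt/samples/a", "/mnt/samples/b", "/mnt/samples/c"]
    for number, batch in enumerate(make_batches(new_files, 2)):
        print(number, batch)
```

With a single new file, as in the log above, any batch size of 1 or more yields exactly one batch.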
Caveats
This script can be stopped with Ctrl+C at any point, but the last started indexing batch will continue.
Don’t set the --workers parameter too high! It can cause out-of-memory (OOM) crashes.