Supported File Types

Documents

| Extension | Extractor | Optional Dependency | Notes |
| --- | --- | --- | --- |
| .txt | TextExtractor | | Plain text |
| .md | TextExtractor | | Markdown (structure-aware chunking) |
| .pdf | PdfExtractor | | Uses PyMuPDF |
| .docx | DocxExtractor | | Microsoft Word |
| .html | TextExtractor | | HTML (tags stripped) |

Code

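| Extension | Extractor | Optional Dependency | Notes |
| --- | --- | --- | --- |
| .py | CodeExtractor | | Python |
| .js | CodeExtractor | | JavaScript |
| .ts | CodeExtractor | | TypeScript |
| .go | CodeExtractor | | Go |
| .rs | CodeExtractor | | Rust |
| .java | CodeExtractor | | Java |
| .c | CodeExtractor | | C |
| .cpp | CodeExtractor | | C++ |
| .rb | CodeExtractor | | Ruby |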

Data

| Extension | Extractor | Optional Dependency | Notes |
| --- | --- | --- | --- |
| .csv | SpreadsheetExtractor | | Comma-separated values |
| .tsv | SpreadsheetExtractor | | Tab-separated values |
| .xlsx | SpreadsheetExtractor | locallens[parsing] | Excel (requires openpyxl) |
| .xls | SpreadsheetExtractor | locallens[parsing] | Legacy Excel |
| .pptx | LiteParseExtractor | locallens[parsing] | PowerPoint (requires liteparse) |

Email

| Extension | Extractor | Optional Dependency | Notes |
| --- | --- | --- | --- |
| .eml | EmailExtractor | locallens[email] | Standard email format |
| .msg | EmailExtractor | locallens[email] | Outlook message format |

Books

| Extension | Extractor | Optional Dependency | Notes |
| --- | --- | --- | --- |
| .epub | EpubExtractor | locallens[ebooks] | EPUB e-books |
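
The tables above show that only some formats need an optional extra. As a rough illustration, the lookup below maps a file's extension to the extra listed for it. The `OPTIONAL_EXTRAS` dictionary and the `required_extra` helper are hypothetical names written for this sketch and are not part of LocalLens; the mapping is copied from the tables.

```python
from pathlib import Path
from typing import Optional

# Hypothetical lookup table, transcribed from the "Optional Dependency"
# column above. Extensions not listed here need no extra.
OPTIONAL_EXTRAS = {
    ".xlsx": "locallens[parsing]",
    ".xls": "locallens[parsing]",
    ".pptx": "locallens[parsing]",
    ".eml": "locallens[email]",
    ".msg": "locallens[email]",
    ".epub": "locallens[ebooks]",
}


def required_extra(path: str) -> Optional[str]:
    """Return the extra listed for this file's extension, or None."""
    # Lower-case the suffix so "Report.XLSX" matches ".xlsx".
    return OPTIONAL_EXTRAS.get(Path(path).suffix.lower())
```

For example, `required_extra("deck.pptx")` gives `"locallens[parsing]"`, while a `.md` or `.py` file gives `None` because no extra is listed for those formats.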

Adding a custom extractor

LocalLens uses Python entry points for extractor plugins. Create a class that extends LocalLensExtractor:

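```python
from pathlib import Path

from locallens.extractors.base import LocalLensExtractor


class MyExtractor(LocalLensExtractor):
    """Example plugin that handles files with the .xyz extension."""

    def supported_extensions(self) -> list[str]:
        # Extensions are given with the leading dot.
        return [".xyz"]

    def name(self) -> str:
        # Identifier for this extractor.
        return "my-extractor"

    def extract(self, file_path: Path) -> str:
        # Return the file's text. UTF-8 is assumed here; without an explicit
        # encoding, read_text() falls back to a platform-dependent default.
        return file_path.read_text(encoding="utf-8")
```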

Register it as an entry point in your package's pyproject.toml:

```toml
[project.entry-points."locallens.extractors"]
my_extractor = "my_package.extractor:MyExtractor"
```

Released under the MIT License.