Data sources contain data items.
A data item is anything from which one or more keys can be extracted:
A data source may be implemented as:
Ecore resources and resource factories may be used as an abstraction layer to represent all types of data sources and data items as Ecore resources in which all elements are identified by URIs:
- The gitlab URI scheme may load from a GitLab server using the REST API without cloning repositories, and the maven URI handler may load from Maven repositories/archives.
- The .md extension would be treated as a Markdown file: convert to HTML and then use the HTML loader to convert to some internal implementation. Nasdanika provides resource factories for Drawio diagrams and Excel files.
- The .html and .htm extensions would be handled by an HTML factory which may parse HTML using Jsoup and then structure HTML contents into sections using the H tag hierarchy, then into paragraphs and sentences. It may compute cross-references between files (resources) and parts of the document. These cross-references may be taken into account when computing similarity.

Ecore resource sets do not provide functionality for iterating over different storage systems; they load resources from URIs using resource factories.
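The dispatch described above (pick a loader by URI scheme, otherwise by file extension) can be sketched with the JDK alone. This is a minimal illustration, not the Ecore or Nasdanika API: `ItemFactory` and `FactoryRegistry` are hypothetical names, and the "scheme first, then extension" lookup order is an assumption made for the example.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-in for a resource factory: describes how a URI would be loaded.
interface ItemFactory {
    String describe(URI uri);
}

// Resolves a factory by URI scheme first, then by the extension of the last path segment.
final class FactoryRegistry {
    private final Map<String, ItemFactory> byScheme = new HashMap<>();
    private final Map<String, ItemFactory> byExtension = new HashMap<>();

    void registerScheme(String scheme, ItemFactory factory) {
        byScheme.put(scheme.toLowerCase(Locale.ROOT), factory);
    }

    void registerExtension(String extension, ItemFactory factory) {
        byExtension.put(extension.toLowerCase(Locale.ROOT), factory);
    }

    Optional<ItemFactory> resolve(URI uri) {
        String scheme = uri.getScheme();
        if (scheme != null) {
            ItemFactory factory = byScheme.get(scheme.toLowerCase(Locale.ROOT));
            if (factory != null) {
                return Optional.of(factory);
            }
        }
        String path = uri.getPath(); // null for opaque URIs such as mailto:
        if (path == null) {
            return Optional.empty();
        }
        int dot = path.lastIndexOf('.');
        int slash = path.lastIndexOf('/');
        if (dot <= slash) {
            return Optional.empty(); // the last segment has no extension
        }
        String extension = path.substring(dot + 1).toLowerCase(Locale.ROOT);
        return Optional.ofNullable(byExtension.get(extension));
    }
}
```

With `gitlab` registered as a scheme and `md`, `html`, and `htm` registered as extensions, a `gitlab://...` URI resolves to the GitLab factory regardless of its extension, while a `file:` URI ending in `.md` falls through to the Markdown factory.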
As such, the data source ecosystem would include the following:
Iterable<URI>
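As one possible reading of a data source exposed as `Iterable<URI>`, the sketch below walks a local directory and yields one URI per regular file. `DirectoryDataSource` is a hypothetical name, and eager collection of the file list is a simplification chosen for the example; a real implementation might stream lazily or iterate over a remote store.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Iterator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical data source: every regular file under a root directory is one data item URI.
final class DirectoryDataSource implements Iterable<URI> {
    private final Path root;

    DirectoryDataSource(Path root) {
        this.root = root;
    }

    @Override
    public Iterator<URI> iterator() {
        // Collect eagerly so the directory stream is closed before iteration starts.
        try (Stream<Path> paths = Files.walk(root)) {
            List<URI> uris = paths
                    .filter(Files::isRegularFile)
                    .sorted()
                    .map(Path::toUri)
                    .collect(Collectors.toList());
            return uris.iterator();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The URIs produced this way can then be handed to whatever component loads resources, which keeps iteration over storage separate from loading.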
The data sources ecosystem doesn’t have to be RAG-specific - it can be used for other purposes as well. For example, for reasoning. Reasoning and RAG might be combined with RAG/AI rules in which conclusions may be used as prompts/chains of thought.
common module for common functionality.

Use String for sentences. For plain text, use a List of Strings for paragraphs.
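A minimal sketch of that representation for plain text, assuming paragraphs are separated by blank lines and sentences are found with the JDK's `java.text.BreakIterator` (both are assumptions for illustration; `PlainText` is a hypothetical name):

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: plain text becomes a list of paragraphs, each a list of sentence Strings.
final class PlainText {

    // Splits on blank lines, collapses whitespace inside a paragraph, then splits into sentences.
    static List<List<String>> parse(String text) {
        List<List<String>> paragraphs = new ArrayList<>();
        for (String block : text.split("\\R\\s*\\R")) {
            String paragraph = block.trim().replaceAll("\\s+", " ");
            if (!paragraph.isEmpty()) {
                paragraphs.add(sentences(paragraph));
            }
        }
        return paragraphs;
    }

    // Uses the locale-aware sentence boundaries provided by the JDK.
    static List<String> sentences(String paragraph) {
        List<String> result = new ArrayList<>();
        BreakIterator boundaries = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        boundaries.setText(paragraph);
        int start = boundaries.first();
        for (int end = boundaries.next(); end != BreakIterator.DONE; start = end, end = boundaries.next()) {
            String sentence = paragraph.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                result.add(sentence);
            }
        }
        return result;
    }
}
```

Keeping sentences as plain `String` values means downstream code (key extraction, similarity) needs no dedicated model class for the simplest case.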
Resources shall implement lookup by URI fragment: for example, line/column number for plain text and Markdown, CSS path for HTML, TBD for PDF.
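For the plain-text case, fragment lookup could look like the sketch below. The fragment syntax `line,column` (1-based) is an assumption invented for this example, as is the `TextResource` class; the source does not fix either.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;

// Hypothetical plain-text resource that resolves "#line,column" fragments (1-based).
final class TextResource {
    private final List<String> lines;

    TextResource(String text) {
        // Keep trailing empty lines so that line numbers match the original text.
        this.lines = Arrays.asList(text.split("\\R", -1));
    }

    // Returns the text from the addressed position to the end of that line.
    String lookup(URI uri) {
        String fragment = uri.getFragment();
        if (fragment == null) {
            throw new IllegalArgumentException("URI has no fragment: " + uri);
        }
        String[] parts = fragment.split(",");
        if (parts.length != 2) {
            throw new IllegalArgumentException("Expected line,column but got: " + fragment);
        }
        int line = Integer.parseInt(parts[0].trim());
        int column = Integer.parseInt(parts[1].trim());
        if (line < 1 || line > lines.size()) {
            throw new IndexOutOfBoundsException("No such line: " + line);
        }
        String content = lines.get(line - 1);
        if (column < 1 || column > content.length() + 1) {
            throw new IndexOutOfBoundsException("No such column: " + column);
        }
        return content.substring(column - 1);
    }
}
```

An HTML resource would follow the same shape with a CSS path in the fragment instead of a line/column pair.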