Developer Interface¶
Annotator Module¶
The annotator module
- spinneret.annotator.add_predicate_annotations_to_workbook(predicate: str, workbook: str | DataFrame, eml: str | _ElementTree, output_path: str = None, overwrite: bool = False, local_model: str = None, temperature: float | None = None, return_ungrounded: bool = False, sample_size: int = 1) DataFrame[source]¶
- Parameters:
predicate – The predicate label for the annotation. This determines which OntoGPT template is used. The options are: contains measurements of type, contains process, env_broad_scale, env_local_scale, environmental material, research topic, usesMethod, uses standard.
workbook – Either the path to the workbook to be annotated, or the workbook itself as a pandas DataFrame.
eml – Either the path to the EML file corresponding to the workbook, or the EML file itself as an lxml etree.
output_path – The path to write the annotated workbook.
overwrite – If True, overwrite existing annotations in the workbook, so a fresh set may be created. Only annotations with the same predicate as the predicate input will be removed.
local_model – See get_ontogpt_annotation documentation for details.
temperature – The temperature parameter for the model. If None, the OntoGPT default will be used.
return_ungrounded – See get_ontogpt_annotation documentation for details.
sample_size – Executes multiple replicates of the annotation request to reduce variability of outputs. Variability is inherent in OntoGPT.
- Returns:
Workbook with predicate annotations.
- Notes:
This function retrieves annotations using OntoGPT, except for the uses standard predicate, which uses a deterministic method. OntoGPT requires the setup and configuration described in the get_ontogpt_annotation function.
- spinneret.annotator.add_qudt_annotations_to_workbook(workbook: str | DataFrame, eml: str | _ElementTree, output_path: str = None, overwrite: bool = False) DataFrame[source]¶
- Parameters:
workbook – Either the path to the workbook to be annotated, or the workbook itself as a pandas DataFrame.
eml – Either the path to the EML file corresponding to the workbook, or the EML file itself as an lxml etree.
output_path – The path to write the annotated workbook.
overwrite – If True, overwrite existing QUDT annotations in the workbook, so a fresh set may be created.
- Returns:
Workbook with QUDT annotations.
- spinneret.annotator.annotate_eml(eml: str | _ElementTree, workbook: str | DataFrame, output_path: str = None) _ElementTree[source]¶
Annotate an EML file with terms from the corresponding workbook
- Parameters:
eml – Either the path to the EML file corresponding to the workbook, or the EML file itself as an lxml etree.
workbook – Either the path to the workbook corresponding to the eml, or the workbook itself as a pandas DataFrame.
output_path – The path to write the annotated EML file.
- Returns:
The annotated EML file as an lxml etree.
- Notes:
The EML file is annotated with terms from the corresponding workbook. Terms from the workbook are added even if they are already present in the EML file.
- spinneret.annotator.annotate_workbook(workbook_path: str, eml_path: str, output_path: str, local_model: str = None, temperature: float | None = None, return_ungrounded: bool = False, sample_size: int = 1) None[source]¶
Annotate a workbook with automated annotation
- Parameters:
workbook_path – The path to the workbook to be annotated corresponding to the EML file.
eml_path – The path to the EML file corresponding to the workbook.
output_path – The path to write the annotated workbook.
local_model – See get_ontogpt_annotation documentation for details.
temperature – The temperature parameter for the model. If None, the OntoGPT default will be used.
return_ungrounded – See get_ontogpt_annotation documentation for details.
sample_size – Executes multiple replicates of the annotation request to reduce variability of outputs. Variability is inherent in OntoGPT.
- Returns:
None
- Notes:
The workbook is annotated by annotators best suited for the XPaths in the EML file. The annotated workbook is written back to the same path as the original workbook.
- spinneret.annotator.create_annotation_element(predicate_label, predicate_id, object_label, object_id)[source]¶
Create an EML annotation element
- Parameters:
predicate_label – The predicate label of the annotation.
predicate_id – The URI of the predicate.
object_label – The object label of the annotation.
object_id – The URI of the object.
- spinneret.annotator.get_annotation_from_workbook(workbook: str | DataFrame, element: str, description: str, predicate: str) list | None[source]¶
- Parameters:
workbook – Either the path to the workbook to be annotated, or the workbook itself as a pandas DataFrame.
element – The element to retrieve annotations for.
description – The description of the element to retrieve annotations for.
predicate – The predicate to retrieve annotations for.
- Returns:
A list of dictionaries, each with the annotation keys label (same as object column in workbook), uri (same as object_id column in workbook). None if no annotations are found for the given element name.
- Notes:
This function returns existing annotations from the workbook if the element, description, and predicate match, and the object and object_id are not empty. This is useful when one or more data entities have several attributes of different names but the same meaning.
- spinneret.annotator.get_bioportal_annotation(text: str, api_key: str, ontologies: str, semantic_types: str = '', expand_semantic_types_hierarchy: str = 'false', expand_class_hierarchy: str = 'false', class_hierarchy_max_level: int = 0, expand_mappings: str = 'false', stop_words: str = '', minimum_match_length: int = 3, exclude_numbers: str = 'false', whole_word_only: str = 'true', exclude_synonyms: str = 'false', longest_only: str = 'false') list | None[source]¶
Get an annotation from the BioPortal API
- Parameters:
text – The text to be annotated.
api_key – The BioPortal API key.
ontologies – The ontologies to use for annotation.
semantic_types – The semantic types to use for annotation.
expand_semantic_types_hierarchy – true means to use the semantic types passed in the “semantic_types” parameter as well as all their immediate children. false means to use ONLY the semantic types passed in the “semantic_types” parameter.
expand_class_hierarchy – used only in conjunction with “class_hierarchy_max_level” parameter; determines whether or not to include ancestors of the given class when performing an annotation.
class_hierarchy_max_level – the depth of the hierarchy to use when performing an annotation.
expand_mappings – true means that the following manual mappings will be used in annotation: UMLS, REST, CUI, OBOXREF.
stop_words – a comma-separated list of words to ignore in the text.
minimum_match_length – the minimum number of characters in a term that must be matched in the text.
exclude_numbers – true means to exclude numbers from annotation.
whole_word_only – true means to match whole words only.
exclude_synonyms – true means to exclude synonyms from annotation.
longest_only – true means that only the longest match for a given phrase will be returned.
- Returns:
A list of dictionaries, each with the annotation keys label and uri, corresponding to the preferred label and URI of the annotated concept. None if the request fails.
- Notes:
This function is a wrapper for the BioPortal API. The BioPortal API is a repository of biomedical ontologies with a RESTful API that allows users to annotate text with ontology concepts. The API is documented at https://data.bioontology.org/documentation#nav_annotator.
This function requires an API key from BioPortal. To obtain an API key, users must register at https://bioportal.bioontology.org/account. The key can be loaded as an environment variable from the configuration file (see utilities.load_configuration).
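As a rough illustration of how the parameters above could map onto a request, the sketch below builds (but does not send) a query URL for the BioPortal Annotator endpoint. The helper name build_annotator_url and the placeholder key are hypothetical, and the query parameter names are assumed to mirror the argument names listed above (with the key passed as apikey); this is not spinneret's implementation.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Base endpoint of the BioPortal Annotator service.
ANNOTATOR_URL = "https://data.bioontology.org/annotator"


def build_annotator_url(text, api_key, ontologies, **options):
    """Hypothetical helper: assemble a BioPortal Annotator query URL."""
    params = {"text": text, "apikey": api_key, "ontologies": ontologies}
    params.update(options)  # e.g. longest_only="true"
    return f"{ANNOTATOR_URL}?{urlencode(params)}"


url = build_annotator_url(
    "lake water temperature",
    api_key="YOUR_API_KEY",  # placeholder, not a real key
    ontologies="ENVO",
    longest_only="true",
    minimum_match_length=3,
)

# Parse the URL back to confirm which parameters would be sent.
query = parse_qs(urlparse(url).query)
print(query["text"], query["longest_only"])
```

Building the URL separately from sending it makes it easy to inspect exactly which options a request would carry before spending API quota.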
- spinneret.annotator.get_ontogpt_annotation(text: str, template: str, local_model: str = None, temperature: float | None = None, return_ungrounded: bool = False) list | None[source]¶
- Parameters:
text – The text to be annotated.
template – Name of OntoGPT template to use for grounding. Available templates are in src/data/ontogpt/templates. Omit the file extension.
local_model – The local language model to use (e.g. llama3.2). This should be one of the options available from ollama (see https://ollama.com/library) and should be installed locally. If None, the configured remote model will be used. See the OntoGPT documentation for more information.
temperature – The temperature parameter for the model. If None, the OntoGPT default will be used.
return_ungrounded – If True, return ungrounded annotations. These may be useful in identifying potential concepts to add to a vocabulary, or to identify concepts that a human curator may be capable of grounding.
- Returns:
A list of dictionaries, each with the annotation keys label and uri. None if the request fails or no annotations are found.
- Notes:
This function is a wrapper for the OntoGPT API. OntoGPT must be set up before this function can be used. For more information, see: https://monarch-initiative.github.io/ontogpt/.
- spinneret.annotator.get_qudt_annotation(text: str) list | None[source]¶
Get an annotation from the QUDT API
- Parameters:
text – The text to be annotated. This should be the value from the EML standardUnit or customUnit element.
- Returns:
A list of dictionaries, each with the annotation keys label and uri, corresponding to the preferred label and URI of the annotated concept. None if the request fails.
- Notes:
This function queries the Unit Annotations Service https://vocab.lternet.edu/unitsws.html, developed by the EDI and LTER units working group, for a match of the input text to a QUDT unit via the service mapping.
- spinneret.annotator.has_annotation(workbook: str | DataFrame, element_xpath: str, predicate: str) bool[source]¶
- Parameters:
workbook – Either the path to the workbook to be annotated, or the workbook itself as a pandas DataFrame.
element_xpath – The XPath of the element to check for annotations.
predicate – The predicate to check for annotations.
- Returns:
True if the workbook contains an element_xpath that has an annotation for the given predicate. False otherwise.
Benchmark Module¶
The benchmark module
- spinneret.benchmark.benchmark_against_standard(standard_dir: str, test_dirs: list) DataFrame[source]¶
Benchmarks the performance of test data against a standard. Currently supports select ontologies from the OBO Foundry.
- Parameters:
standard_dir – Directory containing the standard annotated workbook files.
test_dirs – List of directories containing the test annotated workbook files. Each directory represents a different test condition.
- Returns:
A pandas DataFrame containing the benchmark results. Comparisons are made between the standard and test data for each predicate and element_xpath combination. The DataFrame contains the following columns:
standard_dir: The directory containing the standard annotated workbook files.
test_dir: The directory containing the test annotated workbook files.
standard_file: The name of the standard annotated workbook file.
predicate_value: The value of the predicate column.
element_xpath_value: The value of the element_xpath column.
standard_set: The set of object_ids from the standard data.
test_set: The set of object_ids from the test data.
average_score: The average termset similarity score between the standard and test sets.
best_score: The best termset similarity score between the standard and test sets.
average_jaccard_similarity: The average Jaccard similarity score between the standard and test sets.
best_jaccard_similarity: The best Jaccard similarity score between the standard and test sets.
average_phenodigm_score: The average Phenodigm score between the standard and test sets.
best_phenodigm_score: The best Phenodigm score between the standard and test sets.
average_standard_information_content: The average information content score of the standard set.
best_standard_information_content: The best information content score of the standard set.
average_test_information_content: The average information content score of the test set.
best_test_information_content: The best information content score of the test set.
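For intuition about the Jaccard columns, the sketch below computes a plain Jaccard index over two sets of CURIEs. This is only an assumption-laden illustration: the benchmark obtains its scores from oaklib, which may compute them differently (for example over ontology structure rather than the raw term sets). The function name and the example terms are hypothetical.

```python
def jaccard_similarity(set1, set2):
    """Jaccard index: size of the intersection over size of the union."""
    a, b = set(set1), set(set2)
    if not a and not b:
        return 0.0  # assumed convention when both sets are empty
    return len(a & b) / len(a | b)


# Hypothetical standard and test sets sharing one of three distinct terms.
standard_set = ["ENVO:00000020", "ENVO:00002006"]
test_set = ["ENVO:00002006", "ENVO:01000297"]
score = jaccard_similarity(standard_set, test_set)
print(round(score, 3))
```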
- spinneret.benchmark.clean_workbook(workbook: DataFrame) DataFrame[source]¶
Clean a workbook for benchmarking.
- Parameters:
workbook – The workbook to clean.
- Returns:
The cleaned workbook.
- spinneret.benchmark.compress_object_ids(object_id_groups: dict) dict[source]¶
Convert object_ids to CURIEs for comparison.
- Parameters:
object_id_groups – The return value from group_object_ids.
- Returns:
The object_id_groups dictionary with object_ids converted to CURIEs.
- spinneret.benchmark.default_similarity_scores() dict[source]¶
- Returns:
A dictionary containing default similarity scores. Values are set following oaklib conventions.
- spinneret.benchmark.delete_terms_from_unsupported_ontologies(curies: list) list[source]¶
Similarity scoring works for some ontologies and not others, so remove terms that are not from supported ontologies. Supported ontologies are hard-coded in this function.
- Parameters:
curies – List of CURIEs.
- Returns:
List of CURIEs from supported ontologies.
- spinneret.benchmark.get_grounding_rates(test_dir: str) dict[source]¶
Get the OntoGPT grounding rates of the test data, by predicate.
Predicates may have different grounding rates, due to differences in LLM prompting and the nature of the vocabularies/ontologies being grounded to.
- Parameters:
test_dir – Path to a directory containing the test annotated workbook files.
- Returns:
A nested set of dictionaries containing the grounding rates of the test data. The first level of dictionary keys are the predicates, and the values are a second dictionary with keys “grounded” and “ungrounded”. The values of these keys are the number of grounded and ungrounded terms, respectively.
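The shape of the returned dictionary can be sketched as follows. This minimal example tallies (predicate, object_id) pairs using the definition of grounded given under is_grounded (an object_id that starts with "http"). The function name count_grounding and the example.org identifiers are hypothetical; the real function reads workbook files from a directory instead of an in-memory list.

```python
from collections import defaultdict


def count_grounding(annotations):
    """Tally grounded and ungrounded object_ids per predicate."""
    rates = defaultdict(lambda: {"grounded": 0, "ungrounded": 0})
    for predicate, object_id in annotations:
        # Grounded terms start with "http"; "AUTO:" terms and None do not.
        grounded = isinstance(object_id, str) and object_id.startswith("http")
        rates[predicate]["grounded" if grounded else "ungrounded"] += 1
    return dict(rates)


annotations = [
    ("contains process", "http://example.org/term/1"),
    ("contains process", "AUTO:leaf%20fall"),
    ("research topic", "http://example.org/term/2"),
]
grounding_rates = count_grounding(annotations)
print(grounding_rates)
```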
Get the most shared ontology of two sets based on the most frequently occurring CURIE prefix.
- Parameters:
set1 – List of CURIEs for the first set of terms.
set2 – List of CURIEs for the second set of terms.
- Returns:
The shared ontology. This value is returned as a string conforming to the oaklib conventions for specifying the ontology database input to the termset-similarity function. If no shared ontology is found, None is returned.
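One way to read "most shared ontology" is sketched below: count CURIE prefixes in each set, keep only prefixes present in both, and pick the one with the highest combined count. Both the tie-breaking rule and the returned selector format ("sqlite:obo:<prefix>") are assumptions made for illustration, as is the function name; the actual implementation may differ.

```python
from collections import Counter


def most_shared_prefix(set1, set2):
    """Return a selector for the prefix most common to both sets, or None."""
    counts1 = Counter(c.split(":", 1)[0] for c in set1 if ":" in c)
    counts2 = Counter(c.split(":", 1)[0] for c in set2 if ":" in c)
    shared = counts1.keys() & counts2.keys()
    if not shared:
        return None
    # Sorting first makes ties resolve alphabetically (assumed rule).
    best = max(sorted(shared), key=lambda p: counts1[p] + counts2[p])
    return f"sqlite:obo:{best.lower()}"  # assumed selector format


set1 = ["ENVO:00000020", "ENVO:00002006", "PATO:0000146"]
set2 = ["ENVO:01000297", "PATO:0001018"]
print(most_shared_prefix(set1, set2))
```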
- spinneret.benchmark.get_termset_similarity(set1: list, set2: list) dict[source]¶
Calculate the similarity between two sets of terms.
- Parameters:
set1 – List of CURIEs for the first set of terms.
set2 – List of CURIEs for the second set of terms.
- Returns:
A dictionary containing termset similarity and information content scores. Default values, defined in benchmark.default_similarity_scores are returned if the similarity scores cannot be calculated or if an error occurs. For more information on scoring, see the oaklib documentation: https://incatools.github.io/ontology-access-kit/guide/similarity.html.
- spinneret.benchmark.group_object_ids(workbook: DataFrame) dict[source]¶
Group object_id values by predicate and element_xpath, i.e. the context of the object_id values that we are comparing.
- Parameters:
workbook – The workbook to apply the grouping to.
- Returns:
The grouped workbook as a dictionary, where the keys are tuples of the workbook predicate and element_xpath values, and the dictionary values are lists of object_id values.
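The grouping described above can be sketched with plain dictionaries. The example below treats each workbook row as a dict and is not the library's code; the function name group_ids, the XPath, and the example.org identifiers are hypothetical.

```python
from collections import defaultdict


def group_ids(rows):
    """Group object_id values by (predicate, element_xpath)."""
    groups = defaultdict(list)
    for row in rows:
        key = (row["predicate"], row["element_xpath"])
        groups[key].append(row["object_id"])
    return dict(groups)


rows = [
    {"predicate": "research topic", "element_xpath": "/eml:eml/dataset",
     "object_id": "http://example.org/term/1"},
    {"predicate": "research topic", "element_xpath": "/eml:eml/dataset",
     "object_id": "http://example.org/term/2"},
    {"predicate": "contains process", "element_xpath": "/eml:eml/dataset",
     "object_id": "http://example.org/term/3"},
]
grouped = group_ids(rows)
print(grouped)
```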
- spinneret.benchmark.is_grounded(data: list) bool[source]¶
Determine if the list contains a grounded object_id.
- Parameters:
data – List of object_ids.
- Returns:
True if the list contains a grounded object_id, False otherwise. A grounded term is defined as a term that starts with “http”. Ungrounded terms are those that begin with “AUTO:” or are None.
- spinneret.benchmark.monitor(name: str) None[source]¶
Context manager to monitor the duration and memory usage of a function using the daiquiri package logger.
- Parameters:
name – The name of the function being monitored.
- Returns:
None
- spinneret.benchmark.parse_similarity_scores(scores: list) dict[source]¶
Parse similarity scores from the output of the oaklib termset-similarity command into the format expected by the benchmarking function.
- Parameters:
scores – The output of the oaklib termset-similarity command.
- Returns:
A dictionary containing the parsed similarity scores.
- spinneret.benchmark.plot_grounding_rates(grounding_rates: dict, configuration: str, output_file: str = None) None[source]¶
Plot the grounding rates of the test data.
- Parameters:
grounding_rates – The return value from the get_grounding_rates function.
configuration – The configuration of OntoGPT that was used to generate the test data. This is typically the directory name of the test data.
output_file – The path to save the plot to, as a PNG file.
- Returns:
None
- spinneret.benchmark.plot_similarity_scores_by_configuration(benchmark_results: DataFrame, metric: str, output_file: str = None) None[source]¶
Plot configuration-level performance for an OntoGPT predicate.
- Parameters:
benchmark_results – The return value from the benchmark_against_standard function.
metric – The metric to plot. This should be a column name from the benchmark_results DataFrame, e.g. “average_score”, “best_score”, etc.
output_file – The path to save the plot to, as a PNG file.
- Returns:
None
- spinneret.benchmark.plot_similarity_scores_by_predicate(benchmark_results: DataFrame, test_dir_path: str, metric: str, output_file: str = None) None[source]¶
Plot predicate-level performance for an OntoGPT test configuration.
- Parameters:
benchmark_results – The return value from the benchmark_against_standard function.
test_dir_path – Path to the test directory containing the test annotated workbook files for the desired configuration. This should be a value from the test_dir column of the benchmark_results DataFrame, which indicates the configuration comparison to plot.
metric – The metric to plot. This should be a column name from the benchmark_results DataFrame, e.g. “average_score”, “best_score”, etc.
output_file – The path to save the plot to, as a PNG file.
- Returns:
None
Datasets Module¶
The datasets module
EML Module¶
EML metadata related operations
- class spinneret.eml.GeographicCoverage(gc)[source]¶
GeographicCoverage class
- altitude_maximum(to_meters=False) float | None[source]¶
Get altitudeMaximum element value from geographicCoverage
- Parameters:
to_meters – Convert to meters?
- Returns:
altitudeMaximum
- Notes:
A conversion to meters is based on the value retrieved from the altitudeUnits element of the geographic coverage, and a conversion table from the EML specification. If the altitudeUnits element is not present, and the to_meters parameter is True, then the altitude value is returned as-is and a warning issued.
- altitude_minimum(to_meters=False) float | None[source]¶
Get altitudeMinimum element value from geographicCoverage
- Parameters:
to_meters – Convert to meters?
- Returns:
altitudeMinimum
- Notes:
A conversion to meters is based on the value retrieved from the altitudeUnits element of the geographic coverage, and a conversion table from the EML specification. If the altitudeUnits element is not present, and the to_meters parameter is True, then the altitude value is returned as-is and a warning issued.
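The conversion behavior described in the notes for altitude_maximum and altitude_minimum can be sketched as below. The conversion table here is a small assumed subset written for illustration (the library uses a table from the EML specification), and the function name is hypothetical.

```python
import warnings

# Assumed subset of a unit-to-meter conversion table (illustrative only).
TO_METERS = {"meter": 1.0, "kilometer": 1000.0, "foot": 0.3048}


def altitude_in_meters(value, units):
    """Convert an altitude to meters, or return it as-is if units are missing."""
    if units is None:
        warnings.warn("altitudeUnits not present; returning value as-is")
        return value
    # Units outside the assumed table raise KeyError in this sketch.
    return value * TO_METERS[units]


print(altitude_in_meters(2.0, "kilometer"))
```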
- altitude_units() str | None[source]¶
Get altitudeUnits element value from geographicCoverage
- Returns:
altitudeUnits
- description() str | None[source]¶
Get geographicDescription element value from geographicCoverage
- Returns:
geographicDescription
- east() float | None[source]¶
Get eastBoundingCoordinate element value from geographicCoverage
- Returns:
eastBoundingCoordinate
- exclusion_gring() str | None[source]¶
Get datasetGPolygonExclusionGRing/gRing element value from geographicCoverage
- Returns:
datasetGPolygonExclusionGRing/gRing
- geom_type(schema='eml') str | None[source]¶
Get geometry type from geographicCoverage
- Parameters:
schema – Schema dialect to use when returning values, either “eml” or “esri”.
- Returns:
geometry type as “polygon”, “point”, or “envelope” for schema=”eml”, or “esriGeometryPolygon”, “esriGeometryPoint”, or “esriGeometryEnvelope” for schema=”esri”
- north() float | None[source]¶
Get northBoundingCoordinate element value from geographicCoverage
- Returns:
northBoundingCoordinate
- outer_gring() str | None[source]¶
Get datasetGPolygonOuterGRing/gRing element value from geographicCoverage
- Returns:
datasetGPolygonOuterGRing/gRing element
- south() float | None[source]¶
Get southBoundingCoordinate element value from geographicCoverage
- Returns:
southBoundingCoordinate
- to_esri_geometry() str | None[source]¶
Convert geographicCoverage to ESRI JSON geometry
- Returns:
ESRI JSON geometry type as “polygon”, “point”, or “envelope”
- Notes:
The logic here presumes that if a polygon is listed, it is the true feature of interest, rather than the associated boundingCoordinates, which are required to be listed by the EML spec alongside all polygon listings.
Geographic coverage latitude and longitude are assumed to be in the spatial reference system of WKID 4326 and are inserted into the ESRI geometry as x and y values. Geographic coverages with altitudes and associated units are converted to units of meters and added to the ESRI geometry as z values.
Geographic coverages that are point locations, as indicated by their bounding box latitude min and max values and longitude min and max values being equivalent, are converted to ESRI envelopes rather than ESRI points, because the envelope geometry type is more expressive and handles more use cases than the point geometry alone. Furthermore, point locations represented as envelope geometries produce the same results as if the point location was represented as a point geometry.
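A minimal sketch of an envelope built from bounding coordinates follows, assuming the standard ESRI JSON envelope keys (xmin, ymin, xmax, ymax, spatialReference). It omits altitude (z) handling and polygons, and the function name is hypothetical; it is meant only to show how a point location becomes a zero-area envelope.

```python
import json


def envelope_from_bounds(west, east, south, north, wkid=4326):
    """Serialize bounding coordinates as an ESRI JSON envelope string."""
    geometry = {
        "xmin": west,   # longitude minimum
        "ymin": south,  # latitude minimum
        "xmax": east,   # longitude maximum
        "ymax": north,  # latitude maximum
        "spatialReference": {"wkid": wkid},
    }
    return json.dumps(geometry)


# A point location: min and max coincide, giving a zero-area envelope.
point_envelope = json.loads(envelope_from_bounds(-89.4, -89.4, 43.1, 43.1))
print(point_envelope)
```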
- to_geojson_geometry() str | None[source]¶
Convert geographicCoverage to GeoJSON geometry
- Returns:
GeoJSON geometry type as “polygon” or “point”
- Notes:
The logic here presumes that if a polygon is listed, it is the true feature of interest, rather than the associated boundingCoordinates, which are required to be listed by the EML spec alongside all polygon listings.
Geographic coverage latitude and longitude are assumed to be in the spatial reference system of WKID 4326 and are inserted into the GeoJSON geometry as x and y values. Geographic coverages with altitudes and associated units are converted to units of meters and added to the GeoJSON geometry as z values.
Geographic coverages that are point locations, as indicated by their bounding box latitude min and max values and longitude min and max values being equivalent, are converted to GeoJSON points.
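The point-versus-polygon decision can be sketched as follows. This example derives the geometry from bounding coordinates only and treats any non-point bounding box as a polygon, which is an assumption; it ignores gRing polygons and altitude values, and the function name is hypothetical.

```python
import json


def geojson_from_bounds(west, east, south, north):
    """Serialize bounding coordinates as a GeoJSON Point or Polygon string."""
    if west == east and south == north:
        geometry = {"type": "Point", "coordinates": [west, south]}
    else:
        # Exterior ring, counter-clockwise, closed by repeating the first position.
        ring = [
            [west, south],
            [east, south],
            [east, north],
            [west, north],
            [west, south],
        ]
        geometry = {"type": "Polygon", "coordinates": [ring]}
    return json.dumps(geometry)


point = json.loads(geojson_from_bounds(-89.4, -89.4, 43.1, 43.1))
polygon = json.loads(geojson_from_bounds(-90.0, -89.0, 43.0, 44.0))
print(point["type"], polygon["type"])
```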
- spinneret.eml.get_geographic_coverage(eml: str) List[GeographicCoverage][source]¶
Get GeographicCoverage objects from EML metadata
- Parameters:
eml – path to EML metadata
- Returns:
list of geographicCoverage objects
Graph Module¶
The graph module
- spinneret.graph.convert_keyword_url_to_uri(graph: Graph) Graph[source]¶
- Parameters:
graph – Graph of metadata and vocabularies
- Returns:
Graph with keyword URLs converted to URIs
- Notes:
Converts values of schema:keyword/schema:DefinedTerm/schema:url to URI references if the value appears to be a URL.
- spinneret.graph.convert_license_to_uri(graph: Graph) Graph[source]¶
- Parameters:
graph – Graph of metadata and vocabularies
- Returns:
Graph with licenses converted to URIs
- Notes:
Converts values of schema:license to URI references if the value appears to be a URL.
- spinneret.graph.convert_variable_measurement_technique_to_uri(graph: Graph) Graph[source]¶
- Parameters:
graph – Graph of metadata and vocabularies
- Returns:
Graph with variable measurement techniques converted to URIs
- Notes:
Converts values of schema:variableMeasured/schema:PropertyValue/ schema:measurementTechnique to URI references if the value appears to be a URL.
- spinneret.graph.convert_variable_property_id_to_uri(graph: Graph) Graph[source]¶
- Parameters:
graph – Graph of metadata and vocabularies
- Returns:
Graph with variable property IDs converted to URIs
- Notes:
Converts values of schema:variableMeasured/schema:PropertyValue/ schema:propertyID to URI references if the value appears to be a URL.
- spinneret.graph.convert_variable_unit_code_to_uri(graph: Graph) Graph[source]¶
- Parameters:
graph – Graph of metadata and vocabularies
- Returns:
Graph with variable unit codes converted to URIs
- Notes:
Converts values of schema:variableMeasured/schema:PropertyValue/ schema:unitCode to URI references if the value appears to be a URL.
- spinneret.graph.create_graph(metadata_files: list = None, vocabulary_files: list = None) Graph[source]¶
- Parameters:
metadata_files – List of file paths to metadata in JSON-LD format
vocabulary_files – List of file paths to vocabularies
- Returns:
Graph of the combined metadata and vocabularies
- Notes:
If no vocabulary files are provided, only the metadata are loaded into the graph, and vice versa if no metadata files are provided.
Vocabulary formats are identified by the file extension according to rdflib.util.guess_format
Main Module¶
Plot Module¶
The plot module provides a simple interface for plotting data.
Shadow Module¶
A module for creating shadow metadata
- spinneret.shadow.convert_userid_to_url(eml: ElementTree) ElementTree[source]¶
- Parameters:
eml – An EML document
- Returns:
An EML document with userId elements converted to URLs, if not already, and if possible.
- spinneret.shadow.create_shadow_eml(eml_path: str, output_path: str) None[source]¶
- Parameters:
eml_path – The path to the EML file to be annotated.
output_path – The path to write the annotated EML file.
- Returns:
None
- Notes:
This function wraps a set of enrichment functions to create a shadow EML file.
SSSOM Module¶
The SSSOM module
- spinneret.sssom.from_lter(path_in: str, path_out: str) dict[source]¶
Create SSSOM for the LTER Controlled Vocabulary
Create SSSOM files (Simple Standard for Sharing Ontological Mappings) for aligning the LTER Controlled Vocabulary with other vocabularies and ontologies. The returned SSSOM files embody a 3/5 star rating based on https://mapping-commons.github.io/sssom/spec/#minimum. See the related Python toolkit to parse, convert, etc. the SSSOM: https://mapping-commons.github.io/sssom-py/index.html#. For definitions of fields returned in the SSSOM files see: https://mapping-commons.github.io/sssom/.
- Parameters:
path_in – Absolute path to LTER CV .rdf file
path_out – Absolute path to directory where SSSOM files will be written
- Returns:
Dictionary with keys ‘data_path’ and ‘meta_path’ and values as the absolute paths to the SSSOM data and metadata files
- Notes:
Subsequent calls of this function do not overwrite existing SSSOM files.
Utilities Module¶
The utilities module
- spinneret.utilities.compress_uri(uri: str) str[source]¶
Compress a URI into a CURIE based on the prefix mappings in the OBO and BioPortal converters.
- Parameters:
uri – The URI to be compressed into a CURIE.
- Returns:
The compressed CURIE. Returns the original URI if the prefix does not have a mapping.
- Notes:
This is a wrapper function around the prefixmaps and curies libraries.
- spinneret.utilities.delete_empty_tags(xml: _ElementTree) _ElementTree[source]¶
Deletes empty tags from an XML file
- Parameters:
xml – The XML file to be cleaned.
- Returns:
The cleaned XML file.
- spinneret.utilities.expand_curie(curie: str) str[source]¶
Expand a CURIE into a URI based on the prefix mappings in the OBO and BioPortal converters.
- Parameters:
curie – The CURIE to be expanded.
- Returns:
The expanded URI. Returns the original CURIE if the prefix does not have a mapping.
- Notes:
This is a wrapper function around the prefixmaps and curies libraries.
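The round trip between compress_uri and expand_curie can be illustrated with a hand-written prefix map. The sketch below does not use the prefixmaps or curies libraries; the one-entry map and the function names compress and expand are assumptions for illustration only.

```python
# Assumed, minimal prefix map (the library loads OBO and BioPortal converters).
PREFIX_MAP = {"ENVO": "http://purl.obolibrary.org/obo/ENVO_"}


def compress(uri):
    """Compress a URI to a CURIE, or return it unchanged if unmapped."""
    for prefix, base in PREFIX_MAP.items():
        if uri.startswith(base):
            return f"{prefix}:{uri[len(base):]}"
    return uri


def expand(curie):
    """Expand a CURIE to a URI, or return it unchanged if unmapped."""
    prefix, sep, local_id = curie.partition(":")
    if sep and prefix in PREFIX_MAP:
        return PREFIX_MAP[prefix] + local_id
    return curie


uri = "http://purl.obolibrary.org/obo/ENVO_00000020"
curie = compress(uri)
print(curie, expand(curie) == uri)
```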
- spinneret.utilities.get_elements_for_predicate(eml: _ElementTree, predicate: str) list[source]¶
Get the EML elements that correspond to a predicate. Elements contain the information from which annotations are derived.
- Parameters:
eml – An EML document.
predicate – The predicate to be used to find the element(s).
- Returns:
The element(s) that correspond to the predicate, each as an etree._Element. If the predicate is not found, returns an empty list.
- spinneret.utilities.get_predicate_id_for_predicate(predicate: str) str | None[source]¶
- Parameters:
predicate – The predicate to be used to find the predicate ID.
- Returns:
The predicate ID for the predicate. Returns None if the predicate is not found.
- spinneret.utilities.get_template_for_predicate(predicate: str) str | None[source]¶
- Parameters:
predicate – The predicate to be used to find the template.
- Returns:
The OntoGPT template for the predicate. Returns None if the predicate is not found.
- spinneret.utilities.is_url(text: str) bool[source]¶
- Parameters:
text – The string to be checked.
- Returns:
True if the string is likely a URL, False otherwise.
- Notes:
A string is considered a URL if it has scheme and network location values.
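The scheme-and-network-location rule can be sketched with the standard library's urlparse. The function name looks_like_url is hypothetical, and the library's own check may differ in detail.

```python
from urllib.parse import urlparse


def looks_like_url(text):
    """True if the text has both a scheme and a network location."""
    parts = urlparse(text)
    return bool(parts.scheme and parts.netloc)


print(looks_like_url("https://example.org/term"))
print(looks_like_url("example.org/term"))
```

Under this rule a bare host such as example.org/term is not treated as a URL, because it lacks a scheme.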
- spinneret.utilities.load_configuration(config_file: str) None[source]¶
Loads the configuration file as global environment variables for use by spinneret functions.
- Parameters:
config_file – The path to the configuration file.
- Returns:
None
- Notes:
Create a configuration file from config.json.template in the root directory of the project. Simply copy the template file and rename it to config.json. Fill in the values for the keys in the file.
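Loading a JSON configuration file into environment variables can be sketched as below. This assumes the file is a flat JSON object of string keys and values; the key EXAMPLE_API_KEY and the function name are hypothetical, and the actual keys are those listed in config.json.template.

```python
import json
import os
import tempfile


def load_config_to_env(config_file):
    """Read a flat JSON file and export each entry as an environment variable."""
    with open(config_file, encoding="utf-8") as f:
        config = json.load(f)
    for key, value in config.items():
        os.environ[key] = str(value)


# Write a throwaway config file, then load it.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "config.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"EXAMPLE_API_KEY": "abc123"}, f)
    load_config_to_env(path)

print(os.environ["EXAMPLE_API_KEY"])
```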
- spinneret.utilities.load_eml(eml: str | _ElementTree) _ElementTree[source]¶
- Parameters:
eml – The EML file to be loaded.
- Returns:
The loaded EML file.
- spinneret.utilities.load_prefixmaps() dict[source]¶
Load ontology prefix maps. To be used with expand_curie and compress_uri.
- Returns:
The ontology prefix maps
- spinneret.utilities.load_workbook(workbook: str | DataFrame) DataFrame[source]¶
- Parameters:
workbook – The workbook to be loaded.
- Returns:
The loaded workbook.
Workbook Module¶
The workbook module
- spinneret.workbook.create(eml_file: str, elements: list, env: str = 'production', path_out: str = False) DataFrame[source]¶
Create an annotation workbook from an EML file
- Parameters:
eml_file – Path to a single EML file
elements – List of EML elements to include in the workbook. Can be one or more of: ‘dataset’, ‘dataTable’, ‘otherEntity’, ‘spatialVector’, ‘spatialRaster’, ‘storedProcedure’, ‘view’, ‘attribute’.
env – The environment to use for the base URL. Options are: ‘production’, ‘staging’, ‘development’.
path_out – Path to a directory where the workbook will be written. Will result in a file ‘[packageId]_annotation_workbook.tsv’.
- Returns:
DataFrame of the annotation workbook with columns:
package_id: Data package identifier listed in the EML at the xpath attribute packageId
url: Link to the data package landing page corresponding to ‘package_id’
element: Element to be annotated
element_id: UUID assigned at the time of workbook creation
element_xpath: xpath of element to be annotated
context: The broader context in which the subject is found. When the subject is a dataset the context is the packageId. When the subject is a data object/entity, the context is dataset. When the subject is an attribute, the context is the data object/entity the attribute is a part of.
subject: The subject of annotation
predicate: The label of the predicate relating the subject to the object
predicate_id: The identifier of the predicate. Typically, this is a URI/IRI.
object: The label of the object
object_id: The identifier of the object. Typically, this is a URI/IRI.
author: Identifier of the data curator authoring the annotation. Typically, this value is an ORCiD.
date: Date of the annotation. Can be helpful when revisiting annotations.
comment: Comments related to the annotation. Can be useful when revisiting an annotation at a later date.
- spinneret.workbook.delete_annotations(workbook: DataFrame, criteria: dict) DataFrame[source]¶
- Parameters:
workbook – The workbook to delete rows of annotations from.
criteria – A dictionary of key-value pairs to define rows to delete. Each key corresponds to a column in the workbook and each value is a string to match in the column.
- Returns:
The workbook with annotations deleted corresponding to the criteria.
- Notes:
A matching row must contain all key-value pairs in the criteria dictionary to be deleted. Matching is case-sensitive. Partial matches are supported.
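The matching rule can be sketched with rows as plain dicts. This example interprets "partial matches" as case-sensitive substring matching, which is an assumption, and it leaves the rows untouched when the criteria are empty (also an assumption). The function name and example values are hypothetical.

```python
def delete_matching(rows, criteria):
    """Drop rows where every criterion is a substring of the named column."""
    if not criteria:
        return list(rows)  # assumed: empty criteria delete nothing

    def matches(row):
        return all(
            value in str(row.get(column, ""))
            for column, value in criteria.items()
        )

    return [row for row in rows if not matches(row)]


rows = [
    {"predicate": "research topic", "object_id": "http://example.org/term/1"},
    {"predicate": "contains process", "object_id": "http://example.org/term/2"},
]
remaining = delete_matching(rows, {"predicate": "research"})
print(remaining)
```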
- spinneret.workbook.delete_duplicate_annotations(workbook: DataFrame) DataFrame[source]¶
- Parameters:
workbook – The annotation workbook
- Returns:
The workbook with duplicate annotations removed
- Notes:
The function removes duplicate annotations based on the following columns: element_xpath, predicate, predicate_id, object, and object_id. The most recent annotation, based on date, is preferred to allow improvements to other fields set by the annotator.
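The de-duplication rule can be sketched as follows, with rows as plain dicts. The example assumes dates are ISO 8601 strings, so that string comparison orders them chronologically; the function name and example values are hypothetical.

```python
KEY_COLUMNS = ("element_xpath", "predicate", "predicate_id", "object", "object_id")


def drop_duplicates_keep_latest(rows):
    """Keep one row per key, preferring the most recent date."""
    latest = {}
    for row in rows:
        key = tuple(row[column] for column in KEY_COLUMNS)
        # Assumes ISO 8601 date strings, which sort chronologically.
        if key not in latest or row["date"] > latest[key]["date"]:
            latest[key] = row
    return list(latest.values())


base = {
    "element_xpath": "/eml:eml/dataset",
    "predicate": "research topic",
    "predicate_id": "http://example.org/predicate/1",
    "object": "lakes",
    "object_id": "http://example.org/term/1",
}
rows = [
    {**base, "date": "2024-01-01", "comment": "first pass"},
    {**base, "date": "2024-06-01", "comment": "reviewed"},
]
deduplicated = drop_duplicates_keep_latest(rows)
print(deduplicated)
```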
- spinneret.workbook.delete_unannotated_rows(workbook: DataFrame) DataFrame[source]¶
- Parameters:
workbook – The workbook to remove unannotated rows from.
- Returns:
The workbook with rows that do not have an annotation deleted.
- Notes:
This function may remove potential human annotation opportunities, i.e. rows that have not been annotated by an automated annotator but may be annotated by a human annotator. It is recommended that this function is not applied within existing workbook annotators or the annotate_workbook wrapper due to this limitation.
- spinneret.workbook.get_description(element: _Element) str[source]¶
Get the description of an element
- Parameters:
element – The EML element to be annotated.
- Returns:
The description of the element.
- spinneret.workbook.get_package_id(eml: _ElementTree) str[source]¶
- Parameters:
eml – The EML file as an lxml etree object
- Returns:
The packageId of the EML file
- spinneret.workbook.get_package_url(eml: _ElementTree, env: str = 'production') str[source]¶
- Parameters:
eml – The EML file as an lxml etree object
env – The environment to use for the base URL. Options are: ‘production’, ‘staging’, ‘development’.
- Returns:
The URL to the data package landing page
- spinneret.workbook.get_subject_and_context(element: _Element) dict[source]¶
Get subject and context values for a given element
This function is called by ‘workbook.create’ to get the subject and context values. See ‘workbook.create’ for explanation of parameters.
- Parameters:
element – The EML element to be annotated.
- Returns:
Dictionary with keys ‘subject’ and ‘context’ and values as the subject and context of the element.
- Notes:
Values for the ‘subject’ and ‘context’ of each annotatable element are defined on a case-by-case basis. This approach is taken because a generalizable pattern to derive recognizable and meaningful values for these fields is difficult since annotatable elements (specified by the EML schema) aren’t constrained to leaf nodes with text values.
- spinneret.workbook.initialize_workbook_row() Series[source]¶
Initialize a row for the annotation workbook
- Returns:
A pandas Series with the initialized row