Developer Interface

Annotator Module

The annotator module

spinneret.annotator.add_predicate_annotations_to_workbook(predicate: str, workbook: str | DataFrame, eml: str | _ElementTree, output_path: str = None, overwrite: bool = False, local_model: str = None, temperature: float | None = None, return_ungrounded: bool = False, sample_size: int = 1) → DataFrame

Parameters:

predicate – The predicate label for the annotation. This determines which OntoGPT template the annotation process uses. The options are: contains measurements of type, contains process, env_broad_scale, env_local_scale, environmental material, research topic, usesMethod, uses standard.
workbook – Either the path to the workbook to be annotated, or the workbook itself as a pandas DataFrame.
eml – Either the path to the EML file corresponding to the workbook, or the EML file itself as an lxml etree.
output_path – The path to write the annotated workbook.
overwrite – If True, overwrite existing annotations in the workbook, so a fresh set may be created. Only annotations with the same predicate as the predicate input will be removed.
local_model – See get_ontogpt_annotation documentation for details.
temperature – The temperature parameter for the model. If None, the OntoGPT default will be used.
return_ungrounded – See get_ontogpt_annotation documentation for details.
sample_size – Executes multiple replicates of the annotation request to reduce variability of outputs. Variability is inherent in OntoGPT.

Returns:

Workbook with predicate annotations.

Notes:

This function retrieves annotations using OntoGPT, except for the uses standard predicate, which uses a deterministic method. OntoGPT requires the setup and configuration described in the get_ontogpt_annotation function.

spinneret.annotator.add_qudt_annotations_to_workbook(workbook: str | DataFrame, eml: str | _ElementTree, output_path: str = None, overwrite: bool = False) → DataFrame

Parameters:

workbook – Either the path to the workbook to be annotated, or the workbook itself as a pandas DataFrame.
eml – Either the path to the EML file corresponding to the workbook, or the EML file itself as an lxml etree.
output_path – The path to write the annotated workbook.
overwrite – If True, overwrite existing QUDT annotations in the workbook, so a fresh set may be created.

Returns:

Workbook with QUDT annotations.

spinneret.annotator.annotate_eml(eml: str | _ElementTree, workbook: str | DataFrame, output_path: str = None) → _ElementTree

Annotate an EML file with terms from the corresponding workbook

Parameters:

eml – Either the path to the EML file corresponding to the workbook, or the EML file itself as an lxml etree.
workbook – Either the path to the workbook corresponding to the eml, or the workbook itself as a pandas DataFrame.
output_path – The path to write the annotated EML file.

Returns:

The annotated EML file as an lxml etree.

Notes:

The EML file is annotated with terms from the corresponding workbook. Terms from the workbook are added even if they are already present in the EML file.

spinneret.annotator.annotate_workbook(workbook_path: str, eml_path: str, output_path: str, local_model: str = None, temperature: float | None = None, return_ungrounded: bool = False, sample_size: int = 1) → None

Annotate a workbook with automated annotation

Parameters:

workbook_path – The path to the workbook to be annotated corresponding to the EML file.
eml_path – The path to the EML file corresponding to the workbook.
output_path – The path to write the annotated workbook.
local_model – See get_ontogpt_annotation documentation for details.
temperature – The temperature parameter for the model. If None, the OntoGPT default will be used.
return_ungrounded – See get_ontogpt_annotation documentation for details.
sample_size – Executes multiple replicates of the annotation request to reduce variability of outputs. Variability is inherent in OntoGPT.

Returns:

None

Notes:

The workbook is annotated by annotators best suited for the XPaths in the EML file. The annotated workbook is written back to the same path as the original workbook.

spinneret.annotator.create_annotation_element(predicate_label, predicate_id, object_label, object_id)

Create an EML annotation element

Parameters:

predicate_label – The predicate label of the annotation.
predicate_id – The URI of the predicate.
object_label – The object label of the annotation.
object_id – The URI of the object.

spinneret.annotator.get_annotation_from_workbook(workbook: str | DataFrame, element: str, description: str, predicate: str) → list | None

Parameters:

workbook – Either the path to the workbook to be annotated, or the workbook itself as a pandas DataFrame.
element – The element to retrieve annotations for.
description – The description of the element to retrieve annotations for.
predicate – The predicate to retrieve annotations for.

Returns:

A list of dictionaries, each with the annotation keys label (same as the object column in the workbook) and uri (same as the object_id column in the workbook). None if no annotations are found for the given element name.

Notes:

This function returns existing annotations from the workbook if the element, description, and predicate match, and the object and object_id are not empty. This is useful when one or more data entities have several attributes of different names but the same meaning.

spinneret.annotator.get_bioportal_annotation(text: str, api_key: str, ontologies: str, semantic_types: str = '', expand_semantic_types_hierarchy: str = 'false', expand_class_hierarchy: str = 'false', class_hierarchy_max_level: int = 0, expand_mappings: str = 'false', stop_words: str = '', minimum_match_length: int = 3, exclude_numbers: str = 'false', whole_word_only: str = 'true', exclude_synonyms: str = 'false', longest_only: str = 'false') → list | None

Get an annotation from the BioPortal API

Parameters:

text – The text to be annotated.
api_key – The BioPortal API key.
ontologies – The ontologies to use for annotation.
semantic_types – The semantic types to use for annotation.
expand_semantic_types_hierarchy – true means to use the semantic types passed in the “semantic_types” parameter as well as all their immediate children. false means to use ONLY the semantic types passed in the “semantic_types” parameter.
expand_class_hierarchy – used only in conjunction with “class_hierarchy_max_level” parameter; determines whether or not to include ancestors of the given class when performing an annotation.
class_hierarchy_max_level – the depth of the hierarchy to use when performing an annotation.
expand_mappings – true means that the following manual mappings will be used in annotation: UMLS, REST, CUI, OBOXREF.
stop_words – a comma-separated list of words to ignore in the text.
minimum_match_length – the minimum number of characters in a term that must be matched in the text.
exclude_numbers – true means to exclude numbers from annotation.
whole_word_only – true means to match whole words only.
exclude_synonyms – true means to exclude synonyms from annotation.
longest_only – true means that only the longest match for a given phrase will be returned.

Returns:

A list of dictionaries, each with the annotation keys label and uri, corresponding to the preferred label and URI of the annotated concept. None if the request fails.

Notes:

This function is a wrapper for the BioPortal API. BioPortal is a repository of biomedical ontologies with a RESTful API that allows users to annotate text with ontology concepts. The API is documented at https://data.bioontology.org/documentation#nav_annotator.

This function requires an API key from BioPortal. To obtain an API key, users must register at https://bioportal.bioontology.org/account. The key can be loaded as an environment variable from the configuration file (see utilities.load_configuration).
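The request described above can be sketched as a URL built from the function's arguments. This is a rough illustration, not the library's implementation: it assumes the query parameter names mirror the argument names, that the key is passed as `apikey`, and that the endpoint is the `/annotator` path of the documented API host. The helper name `build_annotator_url` is hypothetical, and no request is sent.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Assumed endpoint of the BioPortal Annotator service.
ANNOTATOR_URL = "https://data.bioontology.org/annotator"


def build_annotator_url(text: str, api_key: str, ontologies: str, **options) -> str:
    """Build (but do not send) an annotator request URL.

    Assumption: optional arguments such as longest_only="true" are passed
    through unchanged as query parameters.
    """
    params = {"text": text, "apikey": api_key, "ontologies": ontologies}
    params.update(options)
    return f"{ANNOTATOR_URL}?{urlencode(params)}"


url = build_annotator_url(
    "lake water temperature", "my-key", "ENVO", longest_only="true"
)
# Inspect the query string that would be sent.
print(parse_qs(urlparse(url).query))
```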

spinneret.annotator.get_ontogpt_annotation(text: str, template: str, local_model: str = None, temperature: float | None = None, return_ungrounded: bool = False) → list | None

Parameters:

text – The text to be annotated.
template – Name of OntoGPT template to use for grounding. Available templates are in src/data/ontogpt/templates. Omit the file extension.
local_model – The local language model to use (e.g. llama3.2). This should be one of the options available from ollama (see https://ollama.com/library) and should be installed locally. If None, the configured remote model will be used. See the OntoGPT documentation for more information.
temperature – The temperature parameter for the model. If None, the OntoGPT default will be used.
return_ungrounded – If True, return ungrounded annotations. These may be useful in identifying potential concepts to add to a vocabulary, or to identify concepts that a human curator may be capable of grounding.

Returns:

A list of dictionaries, each with the annotation keys label and uri. None if the request fails or no annotations are found.

Notes:

This function is a wrapper for the OntoGPT API. OntoGPT must be set up before this function can be used. For more information, see: https://monarch-initiative.github.io/ontogpt/.

spinneret.annotator.get_qudt_annotation(text: str) → list | None

Get an annotation from the QUDT API

Parameters: text – The text to be annotated. This should be the value from the EML standardUnit or customUnit element.
Returns: A list of dictionaries, each with the annotation keys label and uri, corresponding to the preferred label and URI of the annotated concept. None if the request fails.
Notes: This function queries the Unit Annotations Service https://vocab.lternet.edu/unitsws.html, developed by the EDI and LTER units working group, for a match of the input text to a QUDT unit via the service mapping.

spinneret.annotator.has_annotation(workbook: str | DataFrame, element_xpath: str, predicate: str) → bool

Parameters:

workbook – Either the path to the workbook to be annotated, or the workbook itself as a pandas DataFrame.
element_xpath – The XPath of the element to check for annotations.
predicate – The predicate to check for annotations.

Returns:

True if the workbook contains an element_xpath that has an annotation for the given predicate. False otherwise.

Benchmark Module

The benchmark module

spinneret.benchmark.benchmark_against_standard(standard_dir: str, test_dirs: list) → DataFrame

Benchmarks the performance of test data against a standard. Currently supports select ontologies from the OBO Foundry.

Parameters:

standard_dir – Directory containing the standard annotated workbook files.
test_dirs – List of directories containing the test annotated workbook files. Each directory represents a different test condition.

Returns:

A pandas DataFrame containing the benchmark results. Comparisons are made between the standard and test data for each predicate and element_xpath combination. The DataFrame contains the following columns:

standard_dir: The directory containing the standard annotated workbook files.
test_dir: The directory containing the test annotated workbook files.
standard_file: The name of the standard annotated workbook file.
predicate_value: The value of the predicate column.
element_xpath_value: The value of the element_xpath column.
standard_set: The set of object_ids from the standard data.
test_set: The set of object_ids from the test data.
average_score: The average termset similarity score between the standard and test sets.
best_score: The best termset similarity score between the standard and test sets.
average_jaccard_similarity: The average Jaccard similarity score between the standard and test sets.
best_jaccard_similarity: The best Jaccard similarity score between the standard and test sets.
average_phenodigm_score: The average Phenodigm score between the standard and test sets.
best_phenodigm_score: The best Phenodigm score between the standard and test sets.
average_standard_information_content: The average information content score of the standard set.
best_standard_information_content: The best information content score of the standard set.
average_test_information_content: The average information content score of the test set.
best_test_information_content: The best information content score of the test set.

spinneret.benchmark.clean_workbook(workbook: DataFrame) → DataFrame

Clean a workbook for benchmarking.

Parameters: workbook – The workbook to clean.
Returns: The cleaned workbook.

spinneret.benchmark.compress_object_ids(object_id_groups: dict) → dict

Convert object_ids to CURIEs for comparison.

Parameters: object_id_groups – The return value from group_object_ids.
Returns: The object_id_groups dictionary with object_ids converted to CURIEs.

spinneret.benchmark.default_similarity_scores() → dict

Returns: A dictionary containing default similarity scores. Values are set following oaklib conventions.

spinneret.benchmark.delete_terms_from_unsupported_ontologies(curies: list) → list

Similarity scoring works for some ontologies and not others, so remove terms that are not from supported ontologies. Supported ontologies are hard-coded in this function.

Parameters: curies – List of CURIEs.
Returns: List of CURIEs from supported ontologies.

spinneret.benchmark.get_grounding_rates(test_dir: str) → dict

Get the OntoGPT grounding rates of the test data, by predicate.

Predicates may have different grounding rates, due to differences in LLM prompting and the nature of the vocabularies/ontologies being grounded to.

Parameters: test_dir – Path to a directory containing the test annotated workbook files.
Returns: A nested set of dictionaries containing the grounding rates of the test data. The first-level keys are the predicates, and each value is a second dictionary with keys “grounded” and “ungrounded”, whose values are the number of grounded and ungrounded terms, respectively.
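The shape of the returned dictionary can be illustrated with a small counting sketch. It is only an approximation: it assumes rows are plain dicts with predicate and object_id keys (the library reads workbook files from a directory instead), and that a term is grounded when its object_id starts with "http". The name `grounding_rates_sketch` is hypothetical.

```python
def grounding_rates_sketch(rows: list) -> dict:
    """Count grounded and ungrounded object_ids per predicate."""
    rates = {}
    for row in rows:
        counts = rates.setdefault(
            row["predicate"], {"grounded": 0, "ungrounded": 0}
        )
        object_id = row["object_id"]
        # Assumption: grounded terms are URIs; "AUTO:" values and None are not.
        grounded = isinstance(object_id, str) and object_id.startswith("http")
        counts["grounded" if grounded else "ungrounded"] += 1
    return rates


example_rows = [
    {"predicate": "contains process",
     "object_id": "http://purl.obolibrary.org/obo/ENVO_01000000"},
    {"predicate": "contains process", "object_id": "AUTO:mixing"},
    {"predicate": "research topic", "object_id": None},
]
print(grounding_rates_sketch(example_rows))
```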

spinneret.benchmark.get_shared_ontology(set1: list, set2: list) → str | None

Get the most shared ontology of two sets based on the most frequently occurring CURIE prefix.

Parameters:

set1 – List of CURIEs for the first set of terms.
set2 – List of CURIEs for the second set of terms.

Returns:

The shared ontology. This value is returned as a string conforming to the oaklib conventions for specifying the ontology database input to the termset-similarity function. If no shared ontology is found, None is returned.
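The prefix-counting idea described above might look roughly like the sketch below. It is an interpretation, not the library's code: it assumes "most shared" means the prefix that occurs in both sets and has the highest combined count, and it returns the bare prefix rather than the oaklib-style selector string that the library returns. The name `shared_prefix_sketch` is hypothetical.

```python
from collections import Counter


def shared_prefix_sketch(set1: list, set2: list):
    """Return the most frequent CURIE prefix found in both sets, or None."""
    prefixes1 = Counter(c.split(":", 1)[0] for c in set1 if ":" in c)
    prefixes2 = Counter(c.split(":", 1)[0] for c in set2 if ":" in c)
    shared = set(prefixes1) & set(prefixes2)
    if not shared:
        return None
    combined = prefixes1 + prefixes2
    # Sort first so that ties are broken alphabetically and deterministically.
    return max(sorted(shared), key=lambda prefix: combined[prefix])


print(shared_prefix_sketch(
    ["ENVO:00000020", "ENVO:00002006", "PATO:0000146"],
    ["ENVO:00000020", "PATO:0000146"],
))
```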

spinneret.benchmark.get_termset_similarity(set1: list, set2: list) → dict

Calculate the similarity between two sets of terms.

Parameters:

set1 – List of CURIEs for the first set of terms.
set2 – List of CURIEs for the second set of terms.

Returns:

A dictionary containing termset similarity and information content scores. Default values, defined in benchmark.default_similarity_scores are returned if the similarity scores cannot be calculated or if an error occurs. For more information on scoring, see the oaklib documentation: https://incatools.github.io/ontology-access-kit/guide/similarity.html.

spinneret.benchmark.group_object_ids(workbook: DataFrame) → dict

Group object_id values by predicate and element_xpath, i.e. the context of the object_id values that we are comparing.

Parameters: workbook – The workbook to apply the grouping to.
Returns: The grouped workbook as a dictionary, where the keys are tuples of the workbook predicate and element_xpath values, and the dictionary values are lists of object_id values.
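A minimal sketch of this grouping follows. It assumes the workbook is represented as a list of dict rows rather than a pandas DataFrame, purely to keep the example dependency-free; the name `group_object_ids_sketch` is hypothetical.

```python
from collections import defaultdict


def group_object_ids_sketch(rows: list) -> dict:
    """Group object_id values by (predicate, element_xpath)."""
    groups = defaultdict(list)
    for row in rows:
        key = (row["predicate"], row["element_xpath"])
        groups[key].append(row["object_id"])
    return dict(groups)


example_rows = [
    {"predicate": "contains process", "element_xpath": "/eml:eml/dataset",
     "object_id": "http://example.org/a"},
    {"predicate": "contains process", "element_xpath": "/eml:eml/dataset",
     "object_id": "http://example.org/b"},
    {"predicate": "research topic", "element_xpath": "/eml:eml/dataset",
     "object_id": "http://example.org/c"},
]
print(group_object_ids_sketch(example_rows))
```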

spinneret.benchmark.is_grounded(data: list) → bool

Determine if the list contains a grounded object_id.

Parameters: data – List of object_ids.
Returns: True if the list contains a grounded object_id, False otherwise. A grounded term is defined as a term that starts with “http”. Ungrounded terms are those that begin with “AUTO:” or are None.
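The grounding rule stated above can be sketched in a few lines. This illustrates the rule only and is not the library's implementation; the name `is_grounded_sketch` is hypothetical.

```python
def is_grounded_sketch(data: list) -> bool:
    """Return True if any object_id in the list looks grounded.

    Assumption: "grounded" means the value is a string starting with "http".
    None and "AUTO:" values count as ungrounded.
    """
    return any(
        isinstance(item, str) and item.startswith("http") for item in data
    )


print(is_grounded_sketch(["AUTO:lake%20water", None]))
print(is_grounded_sketch(["http://purl.obolibrary.org/obo/ENVO_00000020"]))
```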

spinneret.benchmark.monitor(name: str) → None

Context manager to monitor the duration and memory usage of a function using the daiquiri package logger.

Parameters: name – The name of the function being monitored.
Returns: None

spinneret.benchmark.parse_similarity_scores(scores: list) → dict

Parse similarity scores from the output of the oaklib termset-similarity command into the format expected by the benchmarking function.

Parameters: scores – The output of the oaklib termset-similarity command.
Returns: A dictionary containing the parsed similarity scores.

spinneret.benchmark.plot_grounding_rates(grounding_rates: dict, configuration: str, output_file: str = None) → None

Plot the grounding rates of the test data.

Parameters:

grounding_rates – The return value from the get_grounding_rates function.
configuration – The configuration of OntoGPT that was used to generate the test data. This is typically the directory name of the test data.
output_file – The path to save the plot to, as a PNG file.

Returns:

None

spinneret.benchmark.plot_similarity_scores_by_configuration(benchmark_results: DataFrame, metric: str, output_file: str = None) → None

Plot configuration-level performance for an OntoGPT predicate.

Parameters:

benchmark_results – The return value from the benchmark_against_standard function.
metric – The metric to plot. This should be a column name from the benchmark_results DataFrame, e.g. “average_score”, “best_score”, etc.
output_file – The path to save the plot to, as a PNG file.

Returns:

None

spinneret.benchmark.plot_similarity_scores_by_predicate(benchmark_results: DataFrame, test_dir_path: str, metric: str, output_file: str = None) → None

Plot predicate-level performance for an OntoGPT test configuration.

Parameters:

benchmark_results – The return value from the benchmark_against_standard function.
test_dir_path – Path to the test directory containing the test annotated workbook files for the desired configuration. This should be a value from the test_dir column of the benchmark_results DataFrame, which indicates the configuration comparison to plot.
metric – The metric to plot. This should be a column name from the benchmark_results DataFrame, e.g. “average_score”, “best_score”, etc.
output_file – The path to save the plot to, as a PNG file.

Returns:

None

Datasets Module

The datasets module

spinneret.datasets.get_example_eml_dir()

Returns: Path to the directory of EML files for use in examples.

EML Module

EML metadata related operations

class spinneret.eml.GeographicCoverage(gc)

GeographicCoverage class

altitude_maximum(to_meters=False) → float | None

Get altitudeMaximum element value from geographicCoverage

Parameters: to_meters – If True, convert the value to meters.
Returns: altitudeMaximum
Notes: A conversion to meters is based on the value retrieved from the altitudeUnits element of the geographic coverage, and a conversion table from the EML specification. If the altitudeUnits element is not present, and the to_meters parameter is True, then the altitude value is returned as-is and a warning is issued.

altitude_minimum(to_meters=False) → float | None

Get altitudeMinimum element value from geographicCoverage

Parameters: to_meters – If True, convert the value to meters.
Returns: altitudeMinimum
Notes: A conversion to meters is based on the value retrieved from the altitudeUnits element of the geographic coverage, and a conversion table from the EML specification. If the altitudeUnits element is not present, and the to_meters parameter is True, then the altitude value is returned as-is and a warning is issued.
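The conversion behaviour described in the notes for altitude_maximum and altitude_minimum can be sketched as follows. The conversion table here is a small hypothetical subset (the library uses a table drawn from the EML specification), and the sketch also assumes that an unrecognized unit is treated like a missing one. The name `altitude_in_meters_sketch` is hypothetical.

```python
import warnings

# Hypothetical subset of a unit-to-meters conversion table.
TO_METERS = {"meter": 1.0, "kilometer": 1000.0, "foot": 0.3048}


def altitude_in_meters_sketch(value, units):
    """Convert an altitude to meters, or return it as-is with a warning."""
    if value is None:
        return None
    factor = TO_METERS.get(units)
    if factor is None:
        # Missing (or, by assumption, unrecognized) units: warn, do not convert.
        warnings.warn(
            f"No conversion for altitudeUnits={units!r}; returning value as-is"
        )
        return value
    return value * factor


print(altitude_in_meters_sketch(2.0, "kilometer"))
```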

altitude_units() → str | None

Get altitudeUnits element value from geographicCoverage

Returns: altitudeUnits

description() → str | None

Get geographicDescription element value from geographicCoverage

Returns: geographicDescription

east() → float | None

Get eastBoundingCoordinate element value from geographicCoverage

Returns: eastBoundingCoordinate

exclusion_gring() → str | None

Get datasetGPolygonExclusionGRing/gRing element value from geographicCoverage

Returns: datasetGPolygonExclusionGRing/gRing

geom_type(schema='eml') → str | None

Get geometry type from geographicCoverage

Parameters: schema – Schema dialect to use when returning values, either “eml” or “esri”.
Returns: geometry type as “polygon”, “point”, or “envelope” for schema=“eml”, or “esriGeometryPolygon”, “esriGeometryPoint”, or “esriGeometryEnvelope” for schema=“esri”

north() → float | None

Get northBoundingCoordinate element value from geographicCoverage

Returns: northBoundingCoordinate

outer_gring() → str | None

Get datasetGPolygonOuterGRing/gRing element value from geographicCoverage

Returns: datasetGPolygonOuterGRing/gRing element

south() → float | None

Get southBoundingCoordinate element value from geographicCoverage

Returns: southBoundingCoordinate

to_esri_geometry() → str | None

Convert geographicCoverage to ESRI JSON geometry

Returns:

ESRI JSON geometry type as “polygon”, “point”, or “envelope”

Notes:

The logic here presumes that if a polygon is listed, it is the true feature of interest, rather than the associated boundingCoordinates, which are required to be listed by the EML spec alongside all polygon listings.

Geographic coverage latitude and longitude are assumed to be in the spatial reference system of WKID 4326 and are inserted into the ESRI geometry as x and y values. Geographic coverages with altitudes and associated units are converted to units of meters and added to the ESRI geometry as z values.

Geographic coverages that are point locations, as indicated by their bounding box latitude min and max values and longitude min and max values being equivalent, are converted to ESRI envelopes rather than ESRI points, because the envelope geometry type is more expressive and handles more use cases than the point geometry alone. Furthermore, point locations represented as envelope geometries produce the same results as if the point location were represented as a point geometry.

to_geojson_geometry() → str | None

Convert geographicCoverage to GeoJSON geometry

Returns:

GeoJSON geometry type as “polygon” or “point”

Notes:

The logic here presumes that if a polygon is listed, it is the true feature of interest, rather than the associated boundingCoordinates, which are required to be listed by the EML spec alongside all polygon listings.

Geographic coverage latitude and longitude are assumed to be in the spatial reference system of WKID 4326 and are inserted into the GeoJSON geometry as x and y values. Geographic coverages with altitudes and associated units are converted to units of meters and added to the GeoJSON geometry as z values.

Geographic coverages that are point locations, as indicated by their bounding box latitude min and max values and longitude min and max values being equivalent, are converted to GeoJSON points.
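The bounding-box handling described in these notes can be sketched as below. It is a simplified illustration, not the library's code: it ignores gRing polygons, takes coordinates as plain numbers, and assumes the altitude has already been converted to meters. The name `bbox_to_geojson_sketch` and its argument order are hypothetical.

```python
import json


def bbox_to_geojson_sketch(west, east, south, north, altitude_m=None) -> str:
    """Convert a bounding box to a GeoJSON geometry string.

    A box whose west/east and south/north values coincide becomes a Point;
    any other box becomes a Polygon. Longitude is x, latitude is y, and an
    optional altitude in meters is appended as z.
    """

    def position(x, y):
        return [x, y] if altitude_m is None else [x, y, altitude_m]

    if west == east and south == north:
        geometry = {"type": "Point", "coordinates": position(west, south)}
    else:
        # Exterior ring, counter-clockwise, closed on the first position.
        ring = [
            position(west, south),
            position(east, south),
            position(east, north),
            position(west, north),
            position(west, south),
        ]
        geometry = {"type": "Polygon", "coordinates": [ring]}
    return json.dumps(geometry)


print(bbox_to_geojson_sketch(-89.7, -89.7, 43.0, 43.0))
```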

west() → float | None

Get westBoundingCoordinate element value from geographicCoverage

Returns: westBoundingCoordinate

spinneret.eml.get_geographic_coverage(eml: str) → List[GeographicCoverage]

Get GeographicCoverage objects from EML metadata

Parameters: eml – Path to EML metadata
Returns: List of GeographicCoverage objects

Graph Module

The graph module

spinneret.graph.convert_keyword_url_to_uri(graph: Graph) → Graph

Parameters: graph – Graph of metadata and vocabularies
Returns: Graph with keyword URLs converted to URIs
Notes: Converts values of schema:keyword/schema:DefinedTerm/schema:url to URI references if the value appears to be a URL.

spinneret.graph.convert_license_to_uri(graph: Graph) → Graph

Parameters: graph – Graph of metadata and vocabularies
Returns: Graph with licenses converted to URIs
Notes: Converts values of schema:license to URI references if the value appears to be a URL.

spinneret.graph.convert_variable_measurement_technique_to_uri(graph: Graph) → Graph

Parameters: graph – Graph of metadata and vocabularies
Returns: Graph with variable measurement techniques converted to URIs
Notes: Converts values of schema:variableMeasured/schema:PropertyValue/schema:measurementTechnique to URI references if the value appears to be a URL.

spinneret.graph.convert_variable_property_id_to_uri(graph: Graph) → Graph

Parameters: graph – Graph of metadata and vocabularies
Returns: Graph with variable property IDs converted to URIs
Notes: Converts values of schema:variableMeasured/schema:PropertyValue/schema:propertyID to URI references if the value appears to be a URL.

spinneret.graph.convert_variable_unit_code_to_uri(graph: Graph) → Graph

Parameters: graph – Graph of metadata and vocabularies
Returns: Graph with variable unit codes converted to URIs
Notes: Converts values of schema:variableMeasured/schema:PropertyValue/schema:unitCode to URI references if the value appears to be a URL.

spinneret.graph.create_graph(metadata_files: list = None, vocabulary_files: list = None) → Graph

Parameters:

metadata_files – List of file paths to metadata in JSON-LD format
vocabulary_files – List of file paths to vocabularies

Returns:

Graph of the combined metadata and vocabularies

Notes:

If no vocabulary files are provided, only the metadata are loaded into the graph, and vice versa if no metadata files are provided.

Vocabulary formats are identified by the file extension, according to rdflib.util.guess_format.

Main Module

Plot Module

The plot module provides a simple interface for plotting data.

Shadow Module

A module for creating shadow metadata

spinneret.shadow.convert_userid_to_url(eml: ElementTree) → ElementTree

Parameters: eml – An EML document
Returns: An EML document with userId elements converted to URLs, if not already, and if possible.

spinneret.shadow.create_shadow_eml(eml_path: str, output_path: str) → None

Parameters:

eml_path – The path to the EML file to be annotated.
output_path – The path to write the annotated EML file.

Returns:

None

Notes:

This function wraps a set of enrichment functions to create a shadow EML file.

SSSOM Module

The SSSOM module

spinneret.sssom.from_lter(path_in: str, path_out: str) → dict

Create SSSOM for the LTER Controlled Vocabulary

Create SSSOM files (Simple Standard for Sharing Ontological Mappings) for aligning the LTER Controlled Vocabulary with other vocabularies and ontologies. The returned SSSOM files embody a 3/5 star rating based on https://mapping-commons.github.io/sssom/spec/#minimum. See the related Python toolkit to parse, convert, etc. the SSSOM: https://mapping-commons.github.io/sssom-py/index.html#. For definitions of fields returned in the SSSOM files see: https://mapping-commons.github.io/sssom/.

Parameters:

path_in – Absolute path to LTER CV .rdf file
path_out – Absolute path to directory where SSSOM files will be written

Returns:

Dictionary with keys ‘data_path’ and ‘meta_path’ and values as the absolute paths to the SSSOM data and metadata files

Notes:

Existing SSSOM files are not overwritten by subsequent calls of this function.

Utilities Module

The utilities module

spinneret.utilities.compress_uri(uri: str) → str

Compress a URI into a CURIE based on the prefix mappings in the OBO and BioPortal converters.

Parameters: uri – The URI to be compressed into a CURIE.
Returns: The compressed CURIE. Returns the original URI if the prefix does not have a mapping.
Notes: This is a wrapper function around the prefixmaps and curies libraries.

spinneret.utilities.delete_empty_tags(xml: _ElementTree) → _ElementTree

Deletes empty tags from an XML file

Parameters: xml – The XML file to be cleaned.
Returns: The cleaned XML file.

spinneret.utilities.expand_curie(curie: str) → str

Expand a CURIE into a URI based on the prefix mappings in the OBO and BioPortal converters.

Parameters: curie – The CURIE to be expanded.
Returns: The expanded URI. Returns the original CURIE if the prefix does not have a mapping.
Notes: This is a wrapper function around the prefixmaps and curies libraries.
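The round trip between compress_uri and expand_curie, including the fall-through when no prefix mapping exists, can be sketched with a hand-written prefix map. This does not use the prefixmaps or curies libraries that the real functions wrap; the single-entry map and the `_sketch` function names are hypothetical.

```python
# Hypothetical one-entry prefix map; the library loads far larger maps.
PREFIX_MAP = {"ENVO": "http://purl.obolibrary.org/obo/ENVO_"}


def compress_uri_sketch(uri: str) -> str:
    """Return a CURIE for the URI, or the URI unchanged if no prefix matches."""
    for prefix, base in PREFIX_MAP.items():
        if uri.startswith(base):
            return f"{prefix}:{uri[len(base):]}"
    return uri


def expand_curie_sketch(curie: str) -> str:
    """Return a URI for the CURIE, or the CURIE unchanged if unmapped."""
    prefix, separator, local_id = curie.partition(":")
    if separator and prefix in PREFIX_MAP:
        return PREFIX_MAP[prefix] + local_id
    return curie


curie = compress_uri_sketch("http://purl.obolibrary.org/obo/ENVO_00000020")
print(curie)
print(expand_curie_sketch(curie))
```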

spinneret.utilities.get_elements_for_predicate(eml: _ElementTree, predicate: str) → list

Get the EML elements that correspond to a predicate. Elements contain the information from which annotations are derived.

Parameters:

eml – An EML document.
predicate – The predicate to be used to find the element(s).

Returns:

The element(s) that correspond to the predicate, each as an etree._Element. If the predicate is not found, returns an empty list.

spinneret.utilities.get_predicate_id_for_predicate(predicate: str) → str | None

Parameters: predicate – The predicate to be used to find the predicate ID.
Returns: The predicate ID for the predicate. Returns None if the predicate is not found.

spinneret.utilities.get_template_for_predicate(predicate: str) → str | None

Parameters: predicate – The predicate to be used to find the template.
Returns: The OntoGPT template for the predicate. Returns None if the predicate is not found.

spinneret.utilities.is_url(text: str) → bool

Parameters: text – The string to be checked.
Returns: True if the string is likely a URL, False otherwise.
Notes: A string is considered a URL if it has scheme and network location values.
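The scheme-and-network-location test described in the note can be sketched with the standard library's urlparse. This is an illustration of the stated rule, not necessarily how the library implements it; the name `is_url_sketch` is hypothetical.

```python
from urllib.parse import urlparse


def is_url_sketch(text: str) -> bool:
    """True if the string parses with both a scheme and a network location."""
    parsed = urlparse(text)
    return bool(parsed.scheme) and bool(parsed.netloc)


print(is_url_sketch("https://vocab.lternet.edu/unitsws.html"))
print(is_url_sketch("vocab.lternet.edu"))
```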

spinneret.utilities.load_configuration(config_file: str) → None

Loads the configuration file as global environment variables for use by spinneret functions.

Parameters: config_file – The path to the configuration file.
Returns: None
Notes: Create a configuration file from config.json.template in the root directory of the project. Copy the template file, rename it to config.json, and fill in the values for the keys in the file.
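Loading a JSON configuration file into environment variables might look like the sketch below. It assumes config.json is a flat JSON object of key-value pairs; the key `EXAMPLE_API_KEY` and the function name `load_configuration_sketch` are hypothetical and are not taken from the project's template.

```python
import json
import os
import tempfile


def load_configuration_sketch(config_file: str) -> None:
    """Set each top-level key of a JSON config file as an environment variable."""
    with open(config_file, "r", encoding="utf-8") as handle:
        config = json.load(handle)
    for key, value in config.items():
        # Environment variables must be strings.
        os.environ[key] = str(value)


# Usage with a throwaway config file.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "config.json")
    with open(path, "w", encoding="utf-8") as handle:
        json.dump({"EXAMPLE_API_KEY": "abc123"}, handle)
    load_configuration_sketch(path)

print(os.environ["EXAMPLE_API_KEY"])
```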

spinneret.utilities.load_eml(eml: str | _ElementTree) → _ElementTree

Parameters: eml – The EML file to be loaded.
Returns: The loaded EML file.

spinneret.utilities.load_prefixmaps() → dict

Load ontology prefix maps. To be used with expand_curie and compress_uri.

Returns: The ontology prefix maps

spinneret.utilities.load_workbook(workbook: str | DataFrame) → DataFrame

Parameters: workbook – The workbook to be loaded.
Returns: The loaded workbook.

spinneret.utilities.write_eml(eml: _ElementTree, output_path: str) → None

Parameters:

eml – The EML file to be written.
output_path – The path to write the EML file to.

Returns:

None

spinneret.utilities.write_workbook(workbook: DataFrame, output_path: str) → None

Parameters:

workbook – The workbook to be written.
output_path – The path to write the workbook to.

Returns:

None

Workbook Module

The workbook module

spinneret.workbook.create(eml_file: str, elements: list, env: str = 'production', path_out: str = False) → DataFrame

Create an annotation workbook from an EML file

Parameters:

eml_file – Path to a single EML file
elements – List of EML elements to include in the workbook. Can be one or more of: ‘dataset’, ‘dataTable’, ‘otherEntity’, ‘spatialVector’, ‘spatialRaster’, ‘storedProcedure’, ‘view’, ‘attribute’.
env – The environment to use for the base URL. Options are: ‘production’, ‘staging’, ‘development’.
path_out – Path to a directory where the workbook will be written. Will result in a file ‘[packageId]_annotation_workbook.tsv’.

Returns:

DataFrame of the annotation workbook with columns:

package_id: Data package identifier listed in the EML at the xpath attribute packageId
url: Link to the data package landing page corresponding to ‘package_id’
element: Element to be annotated
element_id: UUID assigned at the time of workbook creation
element_xpath: xpath of element to be annotated
context: The broader context in which the subject is found. When the subject is a dataset, the context is the packageId. When the subject is a data object/entity, the context is dataset. When the subject is an attribute, the context is the data object/entity the attribute is a part of.
subject: The subject of annotation
predicate: The label of the predicate relating the subject to the object
predicate_id: The identifier of the predicate. Typically, this is a URI/IRI.
object: The label of the object
object_id: The identifier of the object. Typically, this is a URI/IRI.
author: Identifier of the data curator authoring the annotation. Typically, this value is an ORCID.
date: Date of the annotation. Can be helpful when revisiting annotations.
comment: Comments related to the annotation. Can be useful when revisiting an annotation at a later date.
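
To make the row layout concrete, the sketch below builds one empty workbook row as a plain dictionary using the column names listed above. It is only an approximation of what the library produces: the helper names, the sample package identifier and xpath, the use of `uuid.uuid4`, and the column order are assumptions for this example.

```python
import uuid

# Column names copied from the list above, in the order they are documented.
COLUMNS = [
    "package_id", "url", "element", "element_id", "element_xpath",
    "context", "subject", "predicate", "predicate_id", "object",
    "object_id", "author", "date", "comment",
]


def new_row(package_id: str, element: str, element_xpath: str) -> dict:
    """Build one unannotated row; annotation fields start empty."""
    row = dict.fromkeys(COLUMNS, "")
    row["package_id"] = package_id
    row["element"] = element
    row["element_xpath"] = element_xpath
    # A UUID is assigned when the row is created (uuid4 is an assumption).
    row["element_id"] = str(uuid.uuid4())
    return row


def workbook_file_name(package_id: str) -> str:
    """File name pattern documented for path_out."""
    return f"{package_id}_annotation_workbook.tsv"


# Hypothetical package identifier and xpath, for illustration only.
row = new_row("edi.3.9", "dataset", "/eml:eml/dataset")
file_name = workbook_file_name("edi.3.9")
```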

spinneret.workbook.delete_annotations(workbook: DataFrame, criteria: dict) → DataFrame[source]¶

Parameters:

workbook – The workbook to delete rows of annotations from.
criteria – A dictionary of key-value pairs to define rows to delete. Each key corresponds to a column in the workbook and each value is a string to match in the column.

Returns:

The workbook with the annotations matching the criteria deleted.

Notes:

A matching row must contain all key-value pairs in the criteria dictionary to be deleted. Matching is case-sensitive. Partial matches are supported.
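
The matching rule in the note can be sketched in plain Python, with a list of dictionaries standing in for workbook rows. This is not the library's code: `delete_matching` is a hypothetical helper, and treating empty criteria as "delete nothing" is an assumption of this sketch.

```python
def delete_matching(rows: list, criteria: dict) -> list:
    """Return the rows that do NOT satisfy every criterion."""
    if not criteria:
        # Assumption: empty criteria match nothing, so every row is kept.
        return list(rows)

    def matches(row: dict) -> bool:
        # Every criterion must appear in its column as a case-sensitive substring.
        return all(
            value in str(row.get(column, ""))
            for column, value in criteria.items()
        )

    return [row for row in rows if not matches(row)]


rows = [
    {"element": "dataset", "predicate": "research topic", "object": "lakes"},
    {"element": "dataset", "predicate": "contains process", "object": "mixing"},
    {"element": "attribute", "predicate": "research topic", "object": "Lakes"},
]

# Only the first row contains both "dataset" and the partial match "topic".
kept = delete_matching(rows, {"element": "dataset", "predicate": "topic"})
```

Because matching is case-sensitive, the criterion `{"object": "lakes"}` would remove the first row but keep the row whose object is "Lakes".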

spinneret.workbook.delete_duplicate_annotations(workbook: DataFrame) → DataFrame[source]¶

Parameters:

workbook – The annotation workbook

Returns:

The workbook with duplicate annotations removed

Notes:

The function removes duplicate annotations based on the following columns: element_xpath, predicate, predicate_id, object, and object_id. The most recent annotation, based on date, is preferred to allow improvements to other fields set by the annotator.
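
The deduplication rule can be sketched as follows, again with dictionaries standing in for workbook rows. The helper name is hypothetical, and the sketch assumes dates are ISO 8601 strings (YYYY-MM-DD), which sort chronologically when compared as text; the library itself may parse dates differently.

```python
KEY_COLUMNS = ("element_xpath", "predicate", "predicate_id", "object", "object_id")


def drop_duplicates_keep_latest(rows: list) -> list:
    """Keep one row per key, preferring the row with the latest date."""
    latest = {}
    for row in rows:
        key = tuple(row[column] for column in KEY_COLUMNS)
        # Assumes ISO 8601 dates, which compare correctly as strings.
        if key not in latest or row["date"] > latest[key]["date"]:
            latest[key] = row
    return list(latest.values())


base = {
    "element_xpath": "/eml:eml/dataset",
    "predicate": "research topic",
    "predicate_id": "P1",  # placeholder identifiers, for illustration only
    "object": "lakes",
    "object_id": "O1",
}
rows = [
    {**base, "date": "2024-01-05", "comment": "first pass"},
    {**base, "date": "2024-03-10", "comment": "reviewed"},
    {**base, "object": "rivers", "object_id": "O2", "date": "2024-01-05", "comment": ""},
]

deduplicated = drop_duplicates_keep_latest(rows)
```

The two "lakes" rows share all five key columns, so only the later one (with the updated comment) survives.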

spinneret.workbook.delete_unannotated_rows(workbook: DataFrame) → DataFrame[source]¶

Parameters:

workbook – The workbook to remove unannotated rows from.

Returns:

The workbook with rows that do not have an annotation deleted.

Notes:

This function may remove potential human annotation opportunities, i.e. rows that have not been annotated by an automated annotator but may be annotated by a human annotator. It is recommended that this function is not applied within existing workbook annotators or the annotate_workbook wrapper due to this limitation.

spinneret.workbook.get_description(element: _Element) → str[source]¶

Get the description of an element

Parameters:

element – The EML element to be annotated.

Returns:

The description of the element.

spinneret.workbook.get_package_id(eml: _ElementTree) → str[source]¶

Parameters:

eml – The EML file as an lxml etree object

Returns:

The packageId of the EML file

spinneret.workbook.get_package_url(eml: _ElementTree, env: str = 'production') → str[source]¶

Parameters:

eml – The EML file as an lxml etree object
env – The environment to use for the base URL. Options are: ‘production’, ‘staging’, ‘development’.

Returns:

The URL to the data package landing page

spinneret.workbook.get_subject_and_context(element: _Element) → dict[source]¶

Get subject and context values for a given element

This function is called by ‘workbook.create’ to get the subject and context values. See ‘workbook.create’ for explanation of parameters.

Parameters:

element – The EML element to be annotated.

Returns:

Dictionary with keys ‘subject’ and ‘context’ and values as the subject and context of the element.

Notes:

Values for the ‘subject’ and ‘context’ of each annotatable element are defined on a case-by-case basis. This approach is taken because a generalizable pattern for deriving recognizable and meaningful values for these fields is difficult to define: annotatable elements (specified by the EML schema) aren’t constrained to leaf nodes with text values.

spinneret.workbook.initialize_workbook_row() → Series[source]¶

Initialize a row for the annotation workbook

Returns:

A pandas Series with the initialized row

spinneret.workbook.is_unannotated_row(row: Series) → bool[source]¶

Parameters:

row – A row from the workbook

Returns:

True if the row is unannotated, i.e. one or more of predicate, predicate_id, object, object_id are missing. Otherwise, False.
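
A minimal sketch of this check, using a dictionary in place of a pandas Series. The helper names are hypothetical, and treating None, NaN, and blank strings as "missing" is an assumption of this sketch rather than documented behavior.

```python
import math

ANNOTATION_COLUMNS = ("predicate", "predicate_id", "object", "object_id")


def is_missing(value) -> bool:
    """Treat None, NaN, and blank strings as missing (an assumption of this sketch)."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value.strip() == ""


def row_is_unannotated(row: dict) -> bool:
    """True if one or more of the four annotation fields is missing."""
    return any(is_missing(row.get(column)) for column in ANNOTATION_COLUMNS)


# Placeholder identifiers, for illustration only.
complete = {"predicate": "research topic", "predicate_id": "P1",
            "object": "lakes", "object_id": "O1"}
partial = {"predicate": "research topic", "predicate_id": "P1",
           "object": "lakes", "object_id": float("nan")}

# Filtering on the check keeps only fully annotated rows.
annotated_only = [r for r in (complete, partial) if not row_is_unannotated(r)]
```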

spinneret.workbook.list_workbook_columns() → list[source]¶

Returns:

A list of the columns in the workbook.
