
LangChain off-the-shelf evaluators

LangChain's evaluation module provides evaluators you can use as-is for common evaluation scenarios. To learn how to use these evaluators, please refer to the following guide.

note

We currently support off-the-shelf evaluators for Python only, but are adding support for TypeScript soon.

note

Most of these evaluators are useful but imperfect. We recommend against blindly trusting any single automated metric; always incorporate these evaluators as part of a holistic testing and evaluation strategy. Many of the LLM-based evaluators return a binary score for a given data point, so differences in prompt or model performance are measured most reliably in aggregate over a larger dataset.
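To illustrate why aggregate results over a larger dataset are more trustworthy, here is a minimal sketch in plain Python. It does not use LangChain or LangSmith; the helper name `summarize_binary_scores` and the score lists are hypothetical, and the standard error of a proportion is just one simple way to gauge how noisy a pass rate is.

```python
import math


def summarize_binary_scores(scores):
    """Return (mean, standard_error) for a list of 0/1 evaluator scores."""
    n = len(scores)
    if n == 0:
        raise ValueError("need at least one score")
    mean = sum(scores) / n
    # Standard error of a proportion: sqrt(p * (1 - p) / n).
    stderr = math.sqrt(mean * (1 - mean) / n)
    return mean, stderr


# Hypothetical binary scores for the same prompt on a small and a larger dataset.
small = [1, 0, 1, 1, 0]
large = small * 20  # 100 data points with the same pass rate

small_mean, small_se = summarize_binary_scores(small)
large_mean, large_se = summarize_binary_scores(large)

print(f"n={len(small)}: pass rate {small_mean:.2f} +/- {small_se:.2f}")
print(f"n={len(large)}: pass rate {large_mean:.2f} +/- {large_se:.2f}")
```

Both datasets have the same pass rate, but the uncertainty around it shrinks as the number of data points grows, which is why a difference between two prompts or models is easier to trust on a larger dataset.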

The following table enumerates the off-the-shelf evaluators available in LangSmith, along with their output keys and a simple code sample.

| Evaluator name | Output Key | Simple Code Example |
| --- | --- | --- |
| Q&A | correctness | `LangChainStringEvaluator("qa")` |
| Contextual Q&A | contextual accuracy | `LangChainStringEvaluator("context_qa")` |
| Chain of Thought Q&A | cot contextual accuracy | `LangChainStringEvaluator("cot_qa")` |
| Criteria | Depends on criteria key | `LangChainStringEvaluator("criteria", config={ "criteria": <criterion> })` |
| Labeled Criteria | Depends on criteria key | `LangChainStringEvaluator("labeled_criteria", config={ "criteria": <criterion> })` |
| Score | Depends on criteria key | `LangChainStringEvaluator("score_string", config={ "criteria": <criterion>, "normalize_by": 10 })` |
| Labeled Score | Depends on criteria key | `LangChainStringEvaluator("labeled_score_string", config={ "criteria": <criterion>, "normalize_by": 10 })` |
| Embedding distance | embedding_cosine_distance | `LangChainStringEvaluator("embedding_distance")` |
| String Distance | string_distance | `LangChainStringEvaluator("string_distance", config={"distance": "damerau_levenshtein" })` |
| Exact Match | exact_match | `LangChainStringEvaluator("exact_match")` |
| Regex Match | regex_match | `LangChainStringEvaluator("regex_match")` |
| JSON Validity | json_validity | `LangChainStringEvaluator("json_validity")` |
| JSON Equality | json_equality | `LangChainStringEvaluator("json_equality")` |
| JSON Edit Distance | json_edit_distance | `LangChainStringEvaluator("json_edit_distance")` |
| JSON Schema | json_schema | `LangChainStringEvaluator("json_schema")` |

For the Criteria, Labeled Criteria, Score, and Labeled Score evaluators, criterion may be one of the default implemented criteria: conciseness, relevance, correctness, coherence, harmfulness, maliciousness, helpfulness, controversiality, misogyny, and criminality.

Or, you may define your own criteria in a custom dict as follows:
{ "criterion_key": "criterion description" }

For the Score and Labeled Score evaluators, scores are out of 10, so normalize_by will cast this to a score from 0 to 1.

For the String Distance evaluator, distance defines the string difference metric to be applied, such as levenshtein or jaro_winkler.
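The custom criteria dict and the normalize_by arithmetic described above can be sketched in plain Python. This is only an illustration of the shapes and the division involved, not LangChain's implementation: the helper `normalize_score`, the criterion key `cites_sources`, and the raw score of 7 are all hypothetical.

```python
def normalize_score(raw_score, normalize_by=10):
    """Divide a raw score by normalize_by (hypothetical helper, for illustration only)."""
    if normalize_by <= 0:
        raise ValueError("normalize_by must be positive")
    return raw_score / normalize_by


# A custom criteria dict has the shape { "criterion_key": "criterion description" }.
# The key and description below are made up for this example.
custom_criteria = {
    "cites_sources": "Does the answer cite a source for each factual claim?"
}

# The config shape shown in the table for the Score and Labeled Score evaluators.
score_config = {"criteria": custom_criteria, "normalize_by": 10}

# A hypothetical raw score of 7 out of 10 becomes 0.7 after normalization.
normalized = normalize_score(7, score_config["normalize_by"])
print(normalized)  # 0.7
```

A dict like `score_config` is what would be passed as the config argument in the Score and Labeled Score rows of the table.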
