Leveraging Search API Endpoints
The AlphaLens search endpoints use semantic similarity and AI-powered matching to find relevant organizations and products. Understanding how to craft effective queries and interpret results will significantly improve your search accuracy.
Query Formulation Best Practices
1. Be Specific but Not Too Narrow
Good queries include meaningful context:
- ✅ "AI-powered platform for legal workflows"
- ✅ "B2B SaaS customer relationship management software"
- ✅ "bipedal humanoid robotics for industrial automation"
Avoid overly broad or vague queries:
- ❌ "software" (too generic)
- ❌ "technology" (no specificity)
- ❌ "AI" (needs more context)
Avoid overly narrow queries:
- ❌ "blue-colored CRM with exactly 5 integrations" (too restrictive)
- ❌ "founded in 2023 in San Francisco by ex-Google engineers" (focus on what they do, not who/when/where)
2. Focus on Functionality in the Query
The search engine works best when you describe what companies or products do, rather than their demographic or firmographic characteristics:
Effective approach:
{
  "query": "project management software with kanban boards and time tracking"
}
Ineffective approach:
{
  "query": "companies founded after 2020 with 50-100 employees"
}
Note that demographic and firmographic characteristics can be applied directly in the filters instead.
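As a sketch of how that might look, the firmographic criteria above can move into a filters object while the query text stays focused on functionality. The filter field names below are hypothetical placeholders, not the documented schema; consult the filter reference for the exact keys:
payload = {
    "query": "project management software with kanban boards and time tracking",
    "filters": {
        # Hypothetical field names for illustration only; the real schema may differ
        "founded_after": 2020,
        "employee_count_min": 50,
        "employee_count_max": 100
    }
}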
3. Use Domain-Specific Terminology
Include industry-standard terms and technical keywords:
- "HIPAA-compliant electronic health records"
- "multi-tenant SaaS architecture"
- "lithium-ion batteries using liquified gas electrolytes"
This helps the semantic search engine understand your domain and find more relevant matches.
Working with Search Results
1. Understanding Relevance Scores
Search results include cosine distance scores for different aspects of similarity (for both organization and product searches):
- *_description_cosine_distance: How well the description matches your query
- *_similar_cosine_distance: Overall product/organization similarity
- *_target_audience_cosine_distance: Target market alignment
Setting thresholds is always case dependent and involves precision-recall trade-offs: a tighter (lower) distance cutoff improves precision at the cost of recall, and a lower cosine distance means a closer match. As a rule of thumb, "similar" searches tend to return lower distances than "description" searches, so calibrate thresholds per field.
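Here is a minimal sketch of threshold filtering on these fields, assuming the product_* prefix returned by product searches; the cutoff values are illustrative starting points rather than recommended defaults:
# Illustrative cutoffs only; tune per use case (lower cosine distance = closer match)
DESCRIPTION_MAX_DISTANCE = 0.55
SIMILAR_MAX_DISTANCE = 0.45

def passes_distance_thresholds(result: dict) -> bool:
    """Keep a product result only if both distance fields clear the cutoffs."""
    return (
        result.get("product_description_cosine_distance", 1.0) <= DESCRIPTION_MAX_DISTANCE
        and result.get("product_similar_cosine_distance", 1.0) <= SIMILAR_MAX_DISTANCE
    )

# search_products wraps the product search endpoint (also used in the next subsection)
results = search_products("AI-powered legal workflow automation")
filtered = [r for r in results if passes_distance_thresholds(r)]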
2. Advanced Search Patterns to Improve Recall
Semantic search engines don't always capture every nuance of meaning, so to improve recall it can be useful to generate query variants that phrase the same intent from different angles:
# Original query
query = "AI-powered legal workflow automation"

# Generate variants
variants = [
    "AI legal document management software",
    "automated legal workflow platforms",
    "machine learning tools for law firms",
]

# Search with all variants
all_results = []
for q in [query] + variants:
    results = search_products(q)
    all_results.extend(results)

# Deduplicate by domain
unique_results = deduplicate_by_domain(all_results)
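The deduplicate_by_domain helper above is not an API call; one possible implementation, assuming each result dict carries the active_domain field shown in the examples later in this guide:
def deduplicate_by_domain(results: list[dict]) -> list[dict]:
    """Keep only the first (highest-ranked) result for each active_domain."""
    seen = set()
    unique = []
    for result in results:
        domain = result.get("active_domain")
        if domain in seen:
            continue
        seen.add(domain)
        unique.append(result)
    return unique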
3. Optionally Filtering Results to Improve Precision
If precision is important, consider building your integration with an "LLM-in-the-loop" workflow to filter out irrelevant results. Use an LLM to score relevance (0-10) relative to your original query by passing in rich context about each company:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate


def llm_score_company(company: dict, original_query: str, threshold: int = 7) -> tuple:
    """
    Score a company's relevance to the original query.
    Returns: (relevance_score, should_keep)
    """
    # Fetch full organization context
    org_data = get_company_by_domain(company['active_domain'])

    prompt = ChatPromptTemplate.from_messages([
        ("system", """
You are an expert at identifying false positives in market research.
Score how relevant a company's CORE PRODUCT is to a specific search query on a scale of 0-10.

SCORING CRITERIA (BE STRICT):
- 10 = Perfect match: This IS the company's core product/business
- 8-9 = Strong match: Core product offering, central to their business
- 7 = Good match: Relevant and genuine product in this space
- 4-6 = Moderate: Related but NOT a core offering
- 1-3 = Weak: Barely related, minor feature, or mentioned in passing
- 0 = FALSE POSITIVE: Not relevant, wrong industry, or just a customer

CRITICAL EVALUATION:
1. Is this a CORE OFFERING? (Check Company Overview)
2. Are they BUILDING this technology or just USING it?
3. Would an industry analyst place them in this specific market?

Output ONLY a single number from 0-10.
"""),
        ("user", """Search Query: {query}

Company: {company_name}
Product Name: {product_name}
Product Description: {product_description}
Company Overview: {organization_description}

Relevance Score (0-10):""")
    ])

    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    chain = prompt | llm

    response = chain.invoke({
        "query": original_query,
        "company_name": company.get('organization_name'),
        "product_name": company.get('product_name'),
        "product_description": company.get('product_description'),
        "organization_description": org_data.get('organization_description', 'N/A')
    })

    score = int(response.content.strip())
    return score, score >= threshold


# Usage
relevance_score, keep = llm_score_company(company, original_query, threshold=7)
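To tie the recall and precision steps together, the scorer can be applied across the deduplicated results from the earlier snippet (a sketch; unique_results comes from the variant-search example above):
original_query = "AI-powered legal workflow automation"

# Keep only companies the LLM judges as genuinely relevant to the query
kept_results = []
for company in unique_results:
    score, keep = llm_score_company(company, original_query, threshold=7)
    if keep:
        kept_results.append(company)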
Example Scenario: Searching for "AI-powered legal workflow automation"
Result 1: Clio (Legal practice management software)
company_data = {
    "organization_name": "Clio",
    "product_name": "Clio Manage",
    "product_description": "Cloud-based legal practice management software with case management, time tracking, and billing",
    "active_domain": "clio.com"
}

org_context = {
    "organization_description": "Clio is a legal technology company that builds cloud-based software for law firms, providing practice management, client intake, and document automation."
}

# LLM scores this: 8/10
# Reasoning: Core legal workflow product, but AI is not the primary differentiator
# Verdict: KEEP (score >= 7)
Result 2: Salesforce (CRM with legal industry customers)
company_data = {
    "organization_name": "Salesforce",
    "product_name": "Salesforce for Legal",
    "product_description": "CRM platform with custom configurations for legal teams to manage client relationships and matter tracking",
    "active_domain": "salesforce.com"
}

org_context = {
    "organization_description": "Salesforce is a cloud-based customer relationship management platform serving businesses across all industries with sales, service, and marketing automation."
}

# LLM scores this: 3/10
# Reasoning: Generic CRM with legal configuration, not purpose-built for legal workflows
# Verdict: FILTER OUT (score < 7)
Result 3: Harvey AI (AI legal co-pilot)
company_data = {
    "organization_name": "Harvey",
    "product_name": "Harvey AI",
    "product_description": "Generative AI platform built specifically for legal workflows, including document drafting, research, and analysis",
    "active_domain": "harvey.ai"
}

org_context = {
    "organization_description": "Harvey is an AI company building generative AI tools specifically for legal professionals, focusing on automating legal research, document generation, and workflow optimization."
}

# LLM scores this: 10/10
# Reasoning: Perfect match - AI-first, legal-specific, workflow automation core product
# Verdict: KEEP (score >= 7)
Result 4: Microsoft (Uses legal AI internally)
company_data = {
    "organization_name": "Microsoft",
    "product_name": "Microsoft 365 Copilot",
    "product_description": "AI assistant integrated into Office apps, used by some legal teams for document drafting",
    "active_domain": "microsoft.com"
}

org_context = {
    "organization_description": "Microsoft is a multinational technology company that develops software, hardware, and cloud services. Products span operating systems, productivity software, gaming, and enterprise solutions."
}

# LLM scores this: 2/10
# Reasoning: General productivity tool, legal use case is incidental, not core product focus
# Verdict: FILTER OUT (score < 7)