DataAxleFilter: Automating Firm Data Extraction & Filtering

Introduction

Last semester, I worked on a confidential consultancy project in my MBA cohort under Prof. John Burr’s guidance. Our client wanted to expand operations across the U.S. by investing up to $50 MM in the industrial casting sector, but the sheer volume of firms made manual data gathering impossible. Necessity being the mother of invention, I built a pipeline to pull firm data by NAICS/SIC codes (or keywords) and automate filtering based on custom criteria.

In this post, you’ll see:

How to grab a full CSV from DataAxle by NAICS, SIC, or keywords
Why we augment code-based filters with simple web-crawling (to weed out false positives)
How to layer on AI (using Agno with Perplexity) for thesis-driven “fit” scoring
A walk-through of the complete Python script that outputs a ready-to-go Excel sheet

Downloading Your Base Dataset from DataAxle

Use the filtering options to narrow down your dataset:
1. Location, industry, and other relevant criteria.
2. For an exhaustive list, select Business Type and then Keywords/SIC/NAICS.

Enter the relevant keywords or NAICS codes, select all that apply, and click Update Count.

On the View Results page, select all records (page by page) and click Download.
In the download dialog:
1. Choose Comma Delimited (CSV) format.
2. Select the Detailed option in step two.

Filtering by NAICs vs Keyword

Code-only Limitations

Some firms tag themselves under a broad NAICS code but don’t actually operate in that space. For example, you might find a financial services firm classified under a manufacturing NAICS simply because of legacy or broad categorization.

Keyword Crawling

To weed out these false positives, we crawl each firm’s website:

Fetch the homepage (and subpages) using requests/BeautifulSoup.
Search for context signals e.g. “casting,” “foundry,” “precision,” “semiconductor,” “chip,” “electronics.”
Drop any firms that never mention your target terms in their site content.

Core Python Pipeline

The Python pipeline is straightforward:

Load the CSV into a Pandas DataFrame.
Apply code-based filters (Location, Employee Size, Revenue, Expenses, Year Established, etc).
Perform keyword crawling on filtered results.
Compute custom ratios, if required.
Append word counts and AI scores.
Export the final DataFrame to Excel.

This basic flow powered our consultancy project then AI advancements made it even more powerful.

Adding AI-Driven Scoring

For AI capabilities, we need some more libraries like agno, openai, perplexity. The requirement.txt can be found at the end of the blog. To use this, you need define your thesis, what is the ideal strategic fit for your requirements and then use it as an instruction for the ai agent.

Setup the Agent

Define your acquisition thesis as instructions for the AI agent. Below is an example prompt for scoring mode (1–10 scale):

agent = Agent(
    model=Perplexity(id="sonar"),
    markdown=True,
    instructions=[
        "You are a business analyst tasked with analyzing company data.",
        "For the provided company data, based on the thesis determine strategic fit:",
        "Thesis: Our acquisition thesis centers on identifying U.S.-based semiconductor manufacturers ...",
        "Final Output should be a single digit number between 1 and 10.",
        "ONLY OUTPUT THE FINAL NUMBER, DO NOT INCLUDE ANY INTERMEDIATE STEPS OR REASONING."
    ],
    tools=[
      GoogleSearchTools(),
      ReasoningTools(),
    ],
)

Loop over your filtered list, feed the company info as a string, and append the integer score:

for i, row in sheet_one_df.iterrows():
    company_info = ', '.join(f"{k}: {v}" for k, v in row.to_dict().items())
    res = agent.run(company_info)
    sheet_one_df.at[i, 'AI Rating (1-10)'] = int(res.content)

For reasoning mode, add instructions to output a short rationale alongside the numeric score:

instructions=[
    ...,
    "OUTPUT FORMAT:\nSTRATEGIC FIT (1-10): <digit>\nREASON: <your reason>"
]

Putting It All Together

Below is a simplified, end-to-end snippet:

# 1. Load and filter
import pandas as pd
from agno.agent import Agent
# ... other imports

df = pd.read_csv('data.csv')
df_filtered = filter_df(df, codes=[...])
df_keyword = crawl_websites(df_filtered, terms=['casting', 'foundry'])

# 2. Define thesis and AI agent
# (see code above)

# 3. Run AI scoring
for i, row in df_keyword.iterrows():
    info = ', '.join(f"{k}: {v}" for k, v in row.items())
    score = int(agent.run(info).content)
    df_keyword.at[i, 'AI Rating'] = score

# 4. Export
with pd.ExcelWriter('final_output.xlsx') as writer:
    df_keyword.to_excel(writer, sheet_name='Filtered', index=False)
    df.to_excel(writer, sheet_name='All', index=False)

You can find the full, detailed script on GitHub.

Conclusion & Next Steps

By combining DataAxle’s exhaustive dataset with code-based filters, keyword crawling, and AI-driven scoring, you can accelerate your target identification process while maintaining precision.

Next steps:

Integrate with your CRM or business intelligence tools.
Implement advanced NLP filters or topic modeling.
Build a simple GUI to make the tool accessible to non-technical colleagues.

DataAxleFilter: Automating Firm Data Extraction & Filtering

Introduction

Downloading Your Base Dataset from DataAxle

Filtering by NAICs vs Keyword

Code-only Limitations

Keyword Crawling

Core Python Pipeline

Adding AI-Driven Scoring

Setup the Agent

Putting It All Together

Conclusion & Next Steps

Comments

About Author

Sponsored Ad

Read Next

Octagram Composite Online Presence Score

CPR: A Change Model

Introduction

Downloading Your Base Dataset from DataAxle

Filtering by NAICs vs Keyword

Code-only Limitations

Keyword Crawling

Core Python Pipeline

Adding AI-Driven Scoring

Setup the Agent

Putting It All Together

Conclusion & Next Steps

Comments

About Author

Sponsored Ad

Read Next

Octagram Composite Online Presence Score

CPR: A Change Model

Stay in the Loop