For the next few weeks, I’ve decided to share with you the journey of building Limpid.cloud.
Once a week, I intend to present a feature I’m working on and answer questions such as:
- Why did I choose to work on that?
- Who needs this feature?
- How did I build it?
- Which challenges did I encounter?
- What’s the initial feedback from my users/customers?
- ...and so on.
Committing to doing this once a week will push me to ship features & iterate FAST, which is absolutely key when building a startup.
I spent this week building a dataset of 250,000+ SaaS, Cloud, and IT Providers.
Every Limpid.cloud user starts by automatically listing all the SaaS & Cloud providers used within their company. To do this reliably, I need three components:
- An exhaustive dataset of all the existing providers in the world
- Connections with as many data sources as possible within the company: Google Workspace, web browsers, accounting, bank accounts, etc.
- A strong algorithm to accurately match the raw data from the sources with the providers in the dataset.
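To give a feel for that third component, here's a minimal fuzzy-matching sketch. The provider list, normalization rule, and 0.8 threshold are all made-up assumptions for illustration, not Limpid's actual algorithm:

```python
from difflib import SequenceMatcher

# Hypothetical slice of the provider dataset (the real one has 250,000+ entries)
PROVIDERS = ["Slack", "Notion", "Google Workspace", "Salesforce"]

def normalize(name: str) -> str:
    """Lowercase and strip common noise so 'SLACK TECHNOLOGIES' ~ 'Slack'."""
    return name.lower().replace("technologies", "").strip()

def match_provider(raw: str, threshold: float = 0.8):
    """Return the best-matching known provider, or None if nothing scores
    above the threshold (to limit false positives)."""
    best, best_score = None, 0.0
    for provider in PROVIDERS:
        score = SequenceMatcher(None, normalize(raw), normalize(provider)).ratio()
        if score > best_score:
            best, best_score = provider, score
    return best if best_score >= threshold else None

print(match_provider("SLACK TECHNOLOGIES"))  # Slack
print(match_provider("Unknown Vendor SARL"))  # None
```

The threshold is exactly the knob that trades false positives against false negatives, which is why tuning the matching step matters so much.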
🧐 Why I prioritised this: Since the initial "discovery" phase is the most important aspect of Limpid, I need the dataset to contain as many providers as possible. I previously had only 3,000, which was far from enough!
👨‍💼 Who is this for: All users of Limpid (on Free or Paid plans).
🛠️ How I built it:
- I initially thought about buying a premade dataset of SaaS providers (there are plenty available on the web), but those were out of my budget. Also, I feared that they were not exhaustive enough and that I would have to buy them again in a few months to have up-to-date data.
- I also thought about scraping popular SaaS websites such as G2 or Capterra, but they didn't seem exhaustive enough for my use case. Also, doing that would be quite expensive and against their terms of service.
- After looking at different options, I ended up finding a variety of Open Source / Free data sources about all companies in the world, which I joined together and filtered/cleaned using a powerful data transformation tool: Acho.
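For flavour, here's roughly what that join/deduplicate step might look like in plain Python rather than Acho. The column names (`name`, `domain`), the "first source wins" rule, and the sample rows are invented for illustration:

```python
import csv
from io import StringIO

def merge_sources(sources):
    """Merge provider rows from several CSV sources (file-like objects),
    deduplicating on the company's domain. The first source to mention
    a domain wins, so order sources from most to least trusted."""
    providers = {}
    for source in sources:
        for row in csv.DictReader(source):
            domain = row.get("domain", "").lower().strip()
            if domain and domain not in providers:
                providers[domain] = row.get("name", "").strip()
    return providers

# Two tiny stand-ins for the open data sources
src_a = StringIO("name,domain\nSlack,slack.com\nNotion,notion.so\n")
src_b = StringIO("name,domain\nSLACK INC,slack.com\nLinear,linear.app\n")
merged = merge_sources([src_a, src_b])
print(merged)  # {'slack.com': 'Slack', 'notion.so': 'Notion', 'linear.app': 'Linear'}
```

Keying on domain rather than company name avoids duplicates like "Slack" vs "SLACK INC" across sources.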
🤯 What I struggled with: To my astonishment, it turns out that Excel cannot handle files of more than 1,048,576 rows, which is ridiculously low in today's world of big data (this limit hasn't changed since 2007!). I started from a CSV file with 17m rows, which forced me to look at other options.
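One generic way around that row limit (a sketch of the streaming idea, not the actual Acho workflow, with invented column names) is to process the CSV row by row, so memory use stays constant no matter how many millions of rows the file has:

```python
import csv
from io import StringIO

def filter_rows(source, keep):
    """Stream a huge CSV one row at a time instead of loading
    all 17M rows into memory like a spreadsheet would."""
    for row in csv.DictReader(source):
        if keep(row):
            yield row

# Tiny stand-in for the 17M-row file
big_csv = StringIO("name,category\nSlack,saas\nAcme Corp,retail\nNotion,saas\n")
saas_only = [r["name"] for r in filter_rows(big_csv, lambda r: r["category"] == "saas")]
print(saas_only)  # ['Slack', 'Notion']
```

Because `filter_rows` is a generator, it works the same on a 3-row test file and a 17-million-row export.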
🔜 Next steps: Now that I have an exhaustive dataset, I'll improve the "matching" algorithm to ensure we don't have too many false positives or false negatives.
I'll see you all next week! ✌️
- Valentin