Last month, I worked on the initial version of the data pipeline for CodeKN.com’s text-processor. In its current version, it fetches URLs, performs basic validation to find eligible content, and then passes that content on for further text-analysis.
Each of the rectangles in the image above is a separate step in the pipeline. To connect these steps, I use the PubSub pattern within the same application instance. For example, whenever there is a new URL, the responsible sub-module publishes a message on the corresponding topic.
Other modules, for example the URL-matcher (a rule-based validation engine that checks whether each given URL is eligible as an engineering article), subscribe to the topics they need. When a new message arrives, a subscriber processes it based on its internal logic.
In turn, such a sub-system can publish another message to another topic for further processing down the pipeline.
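To make the pattern concrete, here is a minimal sketch of an in-process PubSub hub. The class, topic name, and handlers are hypothetical, not the actual CodeKN implementation:

```python
from collections import defaultdict


class PubSub:
    """Minimal in-process publish/subscribe hub (illustrative sketch)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        # Register a callable to be invoked for every message on `topic`.
        self._subscribers[topic].append(handler)

    def publish(self, topic, message):
        # Deliver the message synchronously to every subscriber of `topic`.
        for handler in self._subscribers[topic]:
            handler(message)


bus = PubSub()
received = []

# A validator like the URL-matcher would subscribe to a topic...
bus.subscribe("new-url", lambda url: received.append(url))

# ...and another sub-module publishes whenever it discovers a URL.
bus.publish("new-url", "https://example.com/abc1")
```

Because everything lives inside one application instance, a plain dictionary of handler lists is enough; no broker or network transport is involved.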
You can read more about the PubSub implementation here:
By using this pipeline, the system performs the following steps:
- Collect the initial pool of URLs.
- Parse each URL into sub-parts (host, scheme, path) for further validation.
- Validate each URL for eligibility.
- Fetch and save the page’s content for eligible URLs.
- Record and mark eligible pages for text-analysis.
Each of these steps produces new data. For the sake of clarity, I split the data into separate scopes:
The scope of the data identifies the completeness of a particular URL’s record. For example, let’s say the URL example.com/abc1 was fetched, parsed, and marked as eligible. But until the URL’s content is downloaded and its meta and refs are parsed, this URL cannot be used for any further processing (for example, topic identification or text-analysis).
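One way to model that completeness check is a record with a flag per scope; a URL is ready for analysis only when every earlier scope is complete. The field names here are assumptions for illustration:

```python
from dataclasses import dataclass


@dataclass
class UrlRecord:
    """Hypothetical record tracking how far a URL has moved through the pipeline."""
    url: str
    parsed: bool = False
    eligible: bool = False
    content_fetched: bool = False
    meta_and_refs_parsed: bool = False

    def ready_for_analysis(self):
        # Text-analysis may start only once every earlier scope is complete.
        return all([self.parsed, self.eligible,
                    self.content_fetched, self.meta_and_refs_parsed])


# Parsed and eligible, but the content scope is still incomplete:
record = UrlRecord(url="example.com/abc1", parsed=True, eligible=True)
print(record.ready_for_analysis())  # False
```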
Another nuance is the enrichment of the data based on the initial source: for example, tags in an RSS feed or a URL’s priority in sitemap.xml. We can also use these attributes for text-analysis.
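A sketch of what such enrichment could look like, attaching source-specific attributes to a URL record. The function and field names are hypothetical:

```python
def enrich(record, source_type, source_data):
    """Attach source-specific attributes to a URL record (illustrative sketch)."""
    enrichment = {}
    if source_type == "rss":
        # RSS entries may carry category tags useful for topic identification.
        enrichment["tags"] = source_data.get("tags", [])
    elif source_type == "sitemap":
        # sitemap.xml entries may carry a priority hint (0.0-1.0, default 0.5).
        enrichment["priority"] = float(source_data.get("priority", 0.5))
    record["enrichment"] = enrichment
    return record


record = {"url": "example.com/abc1"}
enrich(record, "sitemap", {"priority": "0.8"})
print(record["enrichment"])  # {'priority': 0.8}
```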
But data completeness and data enrichment are topics that I want to cover in a separate post. Stay tuned for the next updates.