
Golang HTML Tokenizer: extract text from a web page

The Golang HTML tokenizer allows us to parse a web page and distinguish elements such as tags, text data, comments, and doctypes. With it, we can extract only the text content of a page, or pick out specific token types such as self-closing tags.

Loading the content of a web page is one thing; extracting valuable information from it is another. For the latter, it is often helpful to get only the textual information, because most of the time it is the text itself that we are interested in. One way to do this in Golang is to use the HTML tokenizer.

In Go, there is a sub-repository package called html (golang.org/x/net/html) which implements an HTML5-compliant tokenizer. Using this package, it is possible to retrieve information about the page in the form of tokens – tag names, attributes, and text data.

Fetch webpage and parse just text content

The following simple example will fetch the given URL and print the data of every text token it finds:

package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"strings"

	"golang.org/x/net/html"
)

func main() {
	response, err := http.Get("https://kenanbek.github.io/about")
	if err != nil {
		log.Fatal(err)
	}
	defer response.Body.Close()

	// Tokenize the response body.
	tokenizer := html.NewTokenizer(response.Body)
	for {
		tt := tokenizer.Next()
		t := tokenizer.Token()

		// io.EOF from Err() means the whole page has been consumed.
		err := tokenizer.Err()
		if err == io.EOF {
			break
		}

		switch tt {
		case html.ErrorToken:
			log.Fatal(err)
		case html.TextToken:
			// Print the trimmed data of every text token.
			data := strings.TrimSpace(t.Data)
			fmt.Println(data)
		}
	}
}

The above example is quite straightforward:

  1. Here, http.Get("https://kenanbek.github.io/about"), we load the content of the web page.
  2. If there is no error, we initialize a new tokenizer with the body of the response: tokenizer := html.NewTokenizer(response.Body).
  3. We iterate through the tokens and check each token's type. To do so, we use tokenizer.Next() to fetch the next token, and tokenizer.Token() to get additional information about the current token.
  4. tokenizer.Next() returns the type of the current token, which helps us identify whether it is an error, an opening or closing tag, or a text token (there are also token types for comment, doctype, and self-closing tokens; all of them appear in the snippet after this list).
  5. If the token is a text token, using data := strings.TrimSpace(t.Data) we can get its text data.
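
To make the token types concrete, here is a small self-contained variation of the example above (my sketch, not part of the original article) that runs the tokenizer over an inline string and prints each token's type; it touches all seven TokenType values the package defines:

package main

import (
	"fmt"
	"strings"

	"golang.org/x/net/html"
)

func main() {
	// A tiny document that produces a doctype, a comment, start and
	// end tags, text, and a self-closing tag.
	doc := `<!DOCTYPE html><!-- a comment --><p>Hi<br/></p>`

	tokenizer := html.NewTokenizer(strings.NewReader(doc))
	for {
		tt := tokenizer.Next()
		if tt == html.ErrorToken {
			// With an in-memory reader the only error is io.EOF.
			return
		}
		fmt.Println(tt, tokenizer.Token().Data)
	}
}

Running it should print the token type next to its data, for example "StartTag p" and "Text Hi".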

The function tokenizer.Next() scans the next token and returns its type, while tokenizer.Token() returns the current token. The result's Data and Attr values remain valid after subsequent Next calls. For more on this, please refer to the official documentation for type Tokenizer.

How to get content of only text tags?

Our above example will return all text data, including scripts, meta tags, etc. Usually, this is not what we want as the textual content of a web site. There are various reference lists of text-level tags; my short list of text-only tags is this – “a”, “p”, “span”, “em”, “strong”, “blockquote”, “q”, “cite”, “h1”, “h2”, “h3”, “h4”, “h5”, “h6”.

Our goal is to mark the start of each text-only tag, and to collect text data only while inside such a marked tag. We can achieve this by using the html.StartTagToken and html.SelfClosingTagToken token types.

package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"strings"

	"golang.org/x/net/html"
)

func main() {
	response, err := http.Get("https://kenanbek.github.io/about")
	if err != nil {
		log.Fatal(err)
	}
	defer response.Body.Close()
	
	// Tags whose text content we want to keep.
	textTags := []string{
		"a",
		"p", "span", "em", "strong", "blockquote", "q", "cite",
		"h1", "h2", "h3", "h4", "h5", "h6",
	}

	// tag holds the most recent start tag; enter marks whether we
	// are inside one of the text tags.
	tag := ""
	enter := false

	tokenizer := html.NewTokenizer(response.Body)
	for {
		tt := tokenizer.Next()
		token := tokenizer.Token()

		err := tokenizer.Err()
		if err == io.EOF {
			break
		}

		switch tt {
		case html.ErrorToken:
			log.Fatal(err)
		case html.StartTagToken, html.SelfClosingTagToken:
			enter = false

			// Enter text mode only for tags in our text-tags list.
			tag = token.Data
			for _, ttt := range textTags {
				if tag == ttt {
					enter = true
					break
				}
			}
		case html.TextToken:
			if enter {
				data := strings.TrimSpace(token.Data)

				if len(data) > 0 {
					fmt.Println(data)
				}
			}
		}
	}
}
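
As a small variation (my sketch, not part of the original example), the linear scan over textTags can be replaced with a set-style map, so that the membership check becomes a single lookup; here it is run against an inline string instead of a live URL:

package main

import (
	"fmt"
	"io"
	"log"
	"strings"

	"golang.org/x/net/html"
)

// Set-style lookup: membership becomes a single map access instead
// of a linear scan over a slice.
var textTags = map[string]bool{
	"a": true, "p": true, "span": true, "em": true, "strong": true,
	"blockquote": true, "q": true, "cite": true,
	"h1": true, "h2": true, "h3": true,
	"h4": true, "h5": true, "h6": true,
}

func main() {
	doc := `<p>kept</p><script>dropped()</script><h1>also kept</h1>`
	enter := false

	tokenizer := html.NewTokenizer(strings.NewReader(doc))
	for {
		tt := tokenizer.Next()
		token := tokenizer.Token()

		if err := tokenizer.Err(); err == io.EOF {
			break
		}

		switch tt {
		case html.ErrorToken:
			log.Fatal(tokenizer.Err())
		case html.StartTagToken, html.SelfClosingTagToken:
			enter = textTags[token.Data]
		case html.TextToken:
			if data := strings.TrimSpace(token.Data); enter && len(data) > 0 {
				fmt.Println(data)
			}
		}
	}
}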

Handling errors with ErrorToken or tokenizer.Err()

As we can see in both examples, error handling is done through the tokenizer.Err() function. Most of the time, it will report the end-of-file (io.EOF) error, meaning that we should stop looking for the next token. For safety reasons, I also added a check for the token being an error token – html.ErrorToken – in which case we log it as a fatal error.

Generally, as the documentation states, the tokenizer.Err() function returns the error associated with the most recent ErrorToken.
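
A slightly more compact arrangement (again my sketch, not code from the article) consults tokenizer.Err() only inside the html.ErrorToken case, treating io.EOF as normal termination and anything else as a real error:

package main

import (
	"fmt"
	"io"
	"log"
	"strings"

	"golang.org/x/net/html"
)

func main() {
	tokenizer := html.NewTokenizer(strings.NewReader("<p>done</p>"))
	for {
		switch tokenizer.Next() {
		case html.ErrorToken:
			// Err returns the error associated with the most recent
			// ErrorToken; io.EOF is the normal end of input.
			if err := tokenizer.Err(); err != io.EOF {
				log.Fatal(err)
			}
			return
		case html.TextToken:
			fmt.Println(tokenizer.Token().Data)
		}
	}
}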

Video demo: Go HTML Tokenizer

The demo video presents a simple web site parser that transforms a web page's content into simple markdown using the Golang HTML tokenizer.

The source code in the video has only one major difference – or feature, we should call it: based on the tag names, it can transform HTML into very basic markdown. Headers h1, h2, and h3 become “##” in markdown, headers h4, h5, and h6 become “###”, and anchor tags become Markdown links.

package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"strings"

	"golang.org/x/net/html"
)

func main() {
	response, err := http.Get("https://kenanbek.github.io/about")
	if err != nil {
		log.Fatal(err)
	}
	defer response.Body.Close()
	
	textTags := []string{
		"a",
		"p", "span", "em", "strong", "blockquote", "q", "cite",
		"h1", "h2", "h3", "h4", "h5", "h6",
	}

	// attrs collects the attributes of the current text tag (for
	// example, an anchor's href).
	tag := ""
	attrs := map[string]string{}
	enter := false

	tokenizer := html.NewTokenizer(response.Body)
	for {
		tt := tokenizer.Next()
		token := tokenizer.Token()

		err := tokenizer.Err()
		if err == io.EOF {
			break
		}

		switch tt {
		case html.ErrorToken:
			log.Fatal(err)
		case html.StartTagToken, html.SelfClosingTagToken:
			enter = false
			attrs = map[string]string{}

			tag = token.Data
			for _, ttt := range textTags {
				if tag == ttt {
					enter = true
					for _, attr := range token.Attr {
						attrs[attr.Key] = attr.Val
					}
					break
				}
			}
		case html.TextToken:
			if enter {
				data := strings.TrimSpace(token.Data)

				if len(data) > 0 {
					// Map the enclosing tag to a basic markdown form.
					switch tag {
					case "a":
						fmt.Printf("[%s](%s)\n", data, attrs["href"])
					case "h1", "h2", "h3":
						fmt.Println("## ", data)
					case "h4", "h5", "h6":
						fmt.Println("### ", data)
					default:
						fmt.Println(data)
					}
				}
			}
		}
	}
}
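
One possible refinement, not shown in the video, is to collect the generated markdown into a strings.Builder and return it from a helper instead of printing line by line. The toMarkdown function below is a hypothetical sketch, with the tag-to-markdown mapping elided for brevity:

package main

import (
	"fmt"
	"io"
	"log"
	"strings"

	"golang.org/x/net/html"
)

// toMarkdown drains the tokenizer and returns the collected text as a
// single string instead of printing it; the tag handling from the
// example above would slot into the TextToken case.
func toMarkdown(r io.Reader) (string, error) {
	var b strings.Builder
	tokenizer := html.NewTokenizer(r)
	for {
		switch tokenizer.Next() {
		case html.ErrorToken:
			if err := tokenizer.Err(); err != io.EOF {
				return "", err
			}
			return b.String(), nil
		case html.TextToken:
			if data := strings.TrimSpace(tokenizer.Token().Data); data != "" {
				b.WriteString(data)
				b.WriteByte('\n')
			}
		}
	}
}

func main() {
	md, err := toMarkdown(strings.NewReader("<p>hello</p>"))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Print(md)
}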

Parsing attributes

Another interesting difference in the above code is the usage of attributes:

for _, attr := range token.Attr {
	attrs[attr.Key] = attr.Val
}

Here, we use the token.Attr field to iterate over the attributes of the current token. Each attribute is represented by the Attribute struct, whose two main fields are Key and Val – Key for the attribute name and Val for its value. For example, given the following HTML snippet:

<a href="https://kenanbek.github.io">Kenan Bek</a>

token.Attr would return a list of one element, with Key set to "href" and Val set to "https://kenanbek.github.io".
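
As a side note, the tokenizer also offers a lower-level way to read attributes: after a start tag, TagName() and TagAttr() return byte slices into the tokenizer's internal buffer, avoiding the string allocations that Token() makes. A minimal sketch over an inline string:

package main

import (
	"fmt"
	"strings"

	"golang.org/x/net/html"
)

func main() {
	doc := `<a href="https://kenanbek.github.io">Kenan Bek</a>`
	tokenizer := html.NewTokenizer(strings.NewReader(doc))
	for {
		tt := tokenizer.Next()
		if tt == html.ErrorToken {
			return // io.EOF: end of the inline document
		}
		if tt == html.StartTagToken {
			// TagName and TagAttr return slices into the tokenizer's
			// buffer; copy them if they must outlive the next call
			// to Next.
			name, hasAttr := tokenizer.TagName()
			fmt.Println("tag:", string(name))
			for hasAttr {
				key, val, more := tokenizer.TagAttr()
				fmt.Printf("  %s=%q\n", key, val)
				hasAttr = more
			}
		}
	}
}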

By Kanan Rahimov

Sr. Software Engineer
