
[Video] GZIP compress web page’s content and save in MySQL using GORM (Golang)

In the attached video, I discuss the following topics:

  1. Save links and images from the webpage.
  2. Mark a URL as complete in the pipeline once it is fully parsed (see the sketch after this list).
  3. Refactor: extract the text compressor into a separate function (similar to the decompressor).
  4. To-do: define a task for the “webpage data parser” worker.
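
Item 2 is not covered by code in this post, so here is a minimal sketch of how marking a URL as complete could look with GORM. The URLRecord model, its Status column, and the markURLComplete helper are my assumptions for illustration, not the project’s actual schema.

import "github.com/jinzhu/gorm"

// URLRecord is a hypothetical model for the pipeline's URL queue;
// the real project's schema may differ.
type URLRecord struct {
	URL    string `gorm:"index:idx_url;not null"`
	Status string
}

// markURLComplete flags a URL as fully parsed so that later
// pipeline stages can skip it.
func markURLComplete(db *gorm.DB, url string) error {
	return db.Model(&URLRecord{}).
		Where("url = ?", url).
		Update("status", "complete").Error
}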

Compress using GZIP

We retrieve the web page content as a text body. Since we expect to save many URLs locally, it is best to compress this data so that it takes up less storage. In my benchmarks, I see an average compression ratio of 60-85%. See the video for examples.

I use gzip (compress/gzip) to compress the text. In this video, I refactored the text compression into a separate function, then executed the whole pipeline to verify that the webpage’s data was fetched and saved in compressed form (a manual test).

Here is the main gzip function:

// gzipWrite compresses respBody with gzip and writes the result to w.
func gzipWrite(w io.Writer, respBody []byte) error {
	gz := gzip.NewWriter(w)

	if _, err := gz.Write(respBody); err != nil {
		return err
	}
	// Close flushes buffered data and writes the gzip footer;
	// skipping it would leave the stream truncated.
	return gz.Close()
}
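
The video also references a matching decompressor. Here is a minimal sketch of what that counterpart could look like; the name gunzipRead is my assumption, and the project’s actual function may differ.

import (
	"compress/gzip"
	"io"
	"io/ioutil"
)

// gunzipRead reverses gzipWrite: it decompresses the gzip stream
// from r and returns the original bytes.
func gunzipRead(r io.Reader) ([]byte, error) {
	gz, err := gzip.NewReader(r)
	if err != nil {
		return nil, err
	}
	defer gz.Close()
	return ioutil.ReadAll(gz)
}

Since gzipWrite accepts any io.Writer, a bytes.Buffer is a convenient target: compress into the buffer, store buf.Bytes() in a BLOB column via GORM, and later pass bytes.NewReader(data) to gunzipRead to restore the text.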

In this section, we introduce two new models: URLLink and URLImage. Their tables store the link and image references extracted from a given web page.

type URLLink struct {
	ID        uint `gorm:"primary_key"`
	CreatedAt time.Time
	UpdatedAt time.Time
	DeletedAt *time.Time `sql:"index"`

	URL       string `gorm:"index:idx_url;not null"`
	LinkURL   string `gorm:"not null"`
	LinkTitle string
}

func (URLLink) TableName() string {
	return TablePrefix + "url_links"
}

type URLImage struct {
	ID         uint `gorm:"primary_key"`
	CreatedAt  time.Time
	UpdatedAt  time.Time
	DeletedAt  *time.Time `sql:"index"`

	URL        string `gorm:"index:idx_url;not null"`
	ImageURL   string `gorm:"not null"`
	ImageTitle string
}

func (URLImage) TableName() string {
	return TablePrefix + "url_images"
}
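
To connect these models back to item 1 of the list, here is a sketch of how the extracted references might be persisted. The saveReferences helper and its inputs are my assumptions; the actual pipeline code may structure this differently.

import "github.com/jinzhu/gorm"

// saveReferences persists the links and images extracted from pageURL,
// stamping each row with the source URL before inserting it.
func saveReferences(db *gorm.DB, pageURL string, links []URLLink, images []URLImage) error {
	for i := range links {
		links[i].URL = pageURL
		if err := db.Create(&links[i]).Error; err != nil {
			return err
		}
	}
	for i := range images {
		images[i].URL = pageURL
		if err := db.Create(&images[i]).Error; err != nil {
			return err
		}
	}
	return nil
}

A single db.AutoMigrate(&URLLink{}, &URLImage{}) call at startup would create both tables under their prefixed names.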

By Kanan Rahimov

Sr. Software Engineer
