Extract information from websites, PDFs or screenshots

Kantu's screen scraping solution allows you to visually mark the data that you want to extract ("scrape"). You simply draw pink frame(s) around the data that you need. Kantu then retrieves the data directly from the HTML source or extracts it visually by using high-quality OCR (Optical Character Recognition). The OCR approach works not only for web scraping, but also for PDF scraping, images (screen scraping) and videos.

Visual Web Scraping with Kantu in Chromium

This screenshot shows the Extraction wizard inside the Kantu Editor. Essentially this is a tiny graphical editor that allows you to draw, move and delete green and pink frames.

Web Scraping Step by Step

1. Record a screenshot

Record a screenshot around the area of your data. For this, you can use the recording wizard and record it, for example, as a FIND command. Then, in the Kantu Editor, change the FIND command to EXTRACT. Alternatively, start directly in the editor: add a line and select EXTRACT as the command. The Extract wizard in the editor has a built-in screenshot tool.

2. Mark the extraction anchor and data-to-be-extracted

Now comes the magic. In the Extraction wizard of the editor, you mark a fixed point as the extraction anchor with a green frame. An extraction anchor is something that does not change between page loads, for example some fixed text like "Price:", a $ symbol or an image/icon. In the example template below we use the "BTC" (Bitcoin) symbol as anchor.

Next, you draw pink frames around the data that you want to extract. Kantu searches the document for whatever is inside the green frame first, and then extracts the content of the areas marked by the pink frames. So a template always has one green frame, but there can be many pink frames.

In this example we use the Bitcoin symbol as anchor (green frame), and then extract the Bitcoin exchange rate (pink frame).

In this example we find the USD/Euro symbols as anchor, and then extract six different exchange rates (six pink frames).
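The anchor idea can be pictured as simple offset arithmetic. The following Python sketch is only an illustration of that idea, not Kantu's actual implementation: the `Frame` type, the `locate_data_frames` function and all coordinates are hypothetical. It assumes that once the green anchor has been found on the page, each pink frame keeps the same position relative to the anchor that it had in the template.

```python
from typing import List, NamedTuple


class Frame(NamedTuple):
    """Rectangle in pixel coordinates (hypothetical helper type)."""
    left: int
    top: int
    width: int
    height: int


def locate_data_frames(template_anchor: Frame,
                       template_data_frames: List[Frame],
                       found_anchor: Frame) -> List[Frame]:
    """Shift every pink frame by the offset the green anchor has moved."""
    dx = found_anchor.left - template_anchor.left
    dy = found_anchor.top - template_anchor.top
    return [Frame(f.left + dx, f.top + dy, f.width, f.height)
            for f in template_data_frames]


# Template: green frame around the "BTC" symbol, one pink frame to its right.
template_anchor = Frame(left=20, top=40, width=30, height=16)
template_data = [Frame(left=60, top=40, width=80, height=16)]

# Assumed situation: on the live page the anchor sits 100 pixels further down.
found_anchor = Frame(left=20, top=140, width=30, height=16)

regions = locate_data_frames(template_anchor, template_data, found_anchor)
print(regions)
```

This is why the anchor should be something that stays the same between page loads: the pink frames are only meaningful relative to it.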

Important: The colors of the frames need to be exactly as shown here, that is, color code #00FF00 for green and #FE1492 for pink. If you draw the frames with the Kantu Editor, this is taken care of automatically. If you use an external image editor such as Microsoft Paint or Snagit, make sure you use exactly these color codes for your frames. Also make sure that each frame is one solid color and that the tool does not apply shadow effects, color gradients or similar visual effects, because these can disturb the frame-finding logic.

You can verify that your frames are sharp and single-color by zooming into the image. Of course, you need this check only if you use external graphics software like Snagit or MS Paint. If you use the Kantu Editor to draw the frames, they are automatically correct.
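If you prefer to check the colors programmatically instead of zooming, the comparison could look like the Python sketch below. It is a minimal illustration under stated assumptions: the function names and the `tolerance` value are made up for this example, and pixels are passed in as plain (R, G, B) tuples. Reading the pixels from an image file would need an imaging library of your choice and is not shown.

```python
ANCHOR_GREEN = "#00FF00"  # green frame color code from the text above
DATA_PINK = "#FE1492"     # pink frame color code from the text above


def hex_to_rgb(code):
    """Convert a color code like '#FE1492' into an (R, G, B) tuple."""
    code = code.lstrip("#")
    return tuple(int(code[i:i + 2], 16) for i in (0, 2, 4))


FRAME_COLORS = {
    hex_to_rgb(ANCHOR_GREEN): "green",
    hex_to_rgb(DATA_PINK): "pink",
}


def classify_pixel(rgb, tolerance=40):
    """Return 'green' or 'pink' for an exact match, 'blurred' for a near miss.

    A near miss hints at anti-aliasing, shadows or gradients added by an
    external editor. The tolerance is an arbitrary value for illustration.
    """
    if rgb in FRAME_COLORS:
        return FRAME_COLORS[rgb]
    for frame_rgb in FRAME_COLORS:
        if all(abs(a - b) <= tolerance for a, b in zip(rgb, frame_rgb)):
            return "blurred"
    return None


print(hex_to_rgb(DATA_PINK))             # (254, 20, 146)
print(classify_pixel((250, 30, 150)))    # blurred
```

Any frame pixel reported as "blurred" suggests that the frame should be redrawn with a hard-edged, single-color tool.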

3. Get the data as text, html,... or with OCR

In the next step you need to tell Kantu how to extract the information.

A. TEXT, HTM, URL - Here Kantu gets the data from the web page element that is located at the center of ("below") the pink frame.

Classic web scraping made easy - no DOM, ID, xpath or CSS selectors required

In the method field you define the extraction method: TEXT, HTM, URL - or OCR.

B. OCR - Kantu offers OCR-powered screen scraping. With Method=OCR, Kantu runs OCR (optical character recognition) on the image that is inside the pink frame. It reads the image and then returns the text or number inside it. This OCR-powered data extraction works not only on websites (web scraping) but equally well on PDFs (PDF scraping) and images (text on images or inside videos, like subtitles or news).

Web Scraping, Screen scraping, PDF Scraping, Image and Video Scraping with OCR

4. Storing the data via the API

The data is returned to the calling script via the GetExtractImageData() API command. This works with any programming or scripting language, and you can manage the further data processing and storage directly in your favorite language. Here is a VBS data scraping script. The data is returned as a string:

0.1[DATA]0.12[REC]Monday[DATA]Tuesday[DATA]

The different fields are separated by the [DATA] separator, and between the data from different EXTRACT commands there is the [REC] separator.

Every scripting language has some kind of "split" command. Here is how the syntax looks in Visual Basic Scripting (VBS): line = Split(allData, "[REC]") and data = Split(line(x), "[DATA]").
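The same two-level split can be sketched in other languages as well. The Python example below is an illustration only: the function name `parse_extract_data` is hypothetical, and the input is the sample string from above pasted in as a literal rather than fetched through the API. It also assumes that a trailing [DATA] separator, as in the sample, should not produce an empty last field.

```python
REC_SEPARATOR = "[REC]"    # between the results of different EXTRACT commands
DATA_SEPARATOR = "[DATA]"  # between the fields of one EXTRACT command


def parse_extract_data(all_data):
    """Split the returned string into a list of records, each a list of fields."""
    records = []
    for line in all_data.split(REC_SEPARATOR):
        fields = line.split(DATA_SEPARATOR)
        # A trailing [DATA] separator leaves an empty last field; drop it.
        if fields and fields[-1] == "":
            fields.pop()
        records.append(fields)
    return records


all_data = "0.1[DATA]0.12[REC]Monday[DATA]Tuesday[DATA]"
records = parse_extract_data(all_data)
print(records)  # [['0.1', '0.12'], ['Monday', 'Tuesday']]
```

All values arrive as strings, so numeric fields such as "0.12" still need to be converted before doing arithmetic with them.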

For ready-to-run example code, please see the Kantu GitHub repository at https://github.com/A9T9/Kantu (for example WebScraping-Data-Extraction.vbs) or simply ask our tech support.

5. Testing the macro

If you run a macro that includes one or several EXTRACT commands, the scraping result is displayed automatically:

Web Scraping with OCR

Kantu displays the extracted information as a yellow text overlay during the macro run.
