Extract information from websites, PDFs or screenshots

Web Scraping Step by Step

1. Record a screenshot

Record a screenshot of the area around your data. For this, you can use the recording wizard and record it, for example, as a FIND command. Then, in the Kantu Editor, change the FIND command to EXTRACT. Or start directly in the editor: add a line and select EXTRACT as the command. The Extract wizard in the editor has a built-in screenshot tool.

2. Mark the extraction anchor and data-to-be-extracted

Now comes the magic. In the Extraction wizard of the editor, you mark a fixed point as the extraction anchor with a green frame. An extraction anchor is something that does not change between page loads, for example some fixed text like "Price:", a $ symbol, or an image/icon. In the example template below we use the "BTC" (Bitcoin) symbol as the anchor.

Next you draw pink frames around the data that you want to extract. Kantu searches the document, first finds whatever is inside the green frame, and then extracts the content of the areas marked by the pink frames. So a template always has exactly one green frame, but there can be many pink frames.

In this example we use the Bitcoin symbol as anchor (green frame), and then extract the Bitcoin exchange rate (pink frame).

In this example we find the USD/Euro symbols as anchor, and then extract six different exchange rates (six pink frames).

Important: The colors of the frames must be exactly as shown here, that is color code #00FF00 for green and #FE1492 for pink. If you draw the frames with the Kantu editor, this is taken care of automatically. If you use an external image editor such as Microsoft Paint or Snagit, make sure you use exactly these color codes for your frames. Also make sure that the frames are one solid color and that the tool does not apply shadow effects, color gradients or similar visual effects, as these could disturb the frame-finding logic.

You can verify that your frames are sharp and single-colored by zooming into the image. Of course, this check is only needed if you use external graphics software like Snagit or MS Paint. If you use the Kantu Editor to draw the frames, they are automatically correct.

3. Get the data as text, html, image, table... or with OCR

In the next step you need to tell Kantu how to extract the information.

A. TEXT, HTM, URL - Here Kantu gets the data from the web page element located at the center of (i.e. "below") the pink frame.

In the method field you define the extraction method: TEXT, HTM, URL - or OCR.

B. IMAGE - Have some hard-to-download images? Or want to save a snapshot of a certain part of the website? Kantu image extraction does that. You mark the area that you want to extract with a pink frame and use IMAGE as the extraction mode. Kantu then saves a screenshot of the part inside the pink frame as an image, no matter whether that part is an image, text, video, PDF or anything else. You can have more than one pink frame, too. Kantu saves the images in the download folder with the default name macro name_image name_(number)_ImageExtract.png

C. TABLE - Extract a complete HTML table at once, converted automatically to a ready-to-use CSV format. To do so, point the location of the pink box to anything inside the table – this tells Kantu which table you want. Kantu marks the table during the extraction with an orange frame, so you can visually verify that it selected the correct table. The table data is saved to the Kantu download folder with a file name of macro name_image name_TableExtract.csv

D. OCR - Kantu offers OCR-powered screen scraping. With Method=OCR Kantu runs OCR (optical character recognition) on the image that is inside the pink frame. It reads the image and then returns the text or number inside it. This OCR-powered data extraction works not only on websites (web scraping) but equally well on PDFs (PDF scraping) and images (text on images or inside videos, like subtitles or news).

Classic web scraping made easy - no DOM, ID, xpath or CSS selectors required

Web Scraping, Screen scraping, PDF Scraping, Image and Video Scraping with OCR

All the different extraction options listed in this chapter are demonstrated with the "Demo-Extract" example macro that ships with Kantu.

4. Where is the extracted data?

All extracted data is saved in CSV format in the Kantu /download folder. In addition, you can use the API to access the extracted data directly (see below).
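
If you want to process these CSV files in your own script, a minimal sketch in VBS could look like the following; the file path and name here are only placeholders, the real name follows the patterns described in step 3:

' Read an extracted CSV file from the Kantu download folder (path and file name are example placeholders)
Set fso = CreateObject("Scripting.FileSystemObject")
Set file = fso.OpenTextFile("C:\Users\Me\Documents\Kantu\download\MyMacro_TableExtract.csv", 1)

Do Until file.AtEndOfStream
    line = file.ReadLine
    ' Each CSV line holds one extracted row; split it into columns
    columns = Split(line, ",")
    WScript.Echo "First column: " & columns(0)
Loop

file.Close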

5. Access the data via the API [PRO]

The data is returned to the calling script via the GetExtractImageData() API command. This works with any programming or scripting language, and you can manage the further data processing and storage directly in your favorite language. In a VBS data scraping script, for example, the data is returned as a string:

0.1[DATA]0.12[DATA]Monday[DATA]Tuesday[DATA]

The different fields are separated by the [DATA] separator.

Every scripting language has some kind of "split" command. Here is how the syntax looks in Visual Basic Scripting (VBS): data = Split(line(x), "[DATA]").
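
As a minimal sketch, here is how splitting the returned string and looping over the fields could look in VBS; the string is hardcoded from the example above, and in a real script it would come from the GetExtractImageData() call:

' Example result string, as shown above
line = "0.1[DATA]0.12[DATA]Monday[DATA]Tuesday[DATA]"

' Split the string into its individual fields
data = Split(line, "[DATA]")

' Loop over the fields (the trailing separator yields one empty field at the end)
For i = 0 To UBound(data)
    WScript.Echo "Field " & i & ": " & data(i)
Next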

For ready-to-run example code please see the Kantu Github repository at https://github.com/A9T9/Kantu (for example WebScraping-Data-Extraction.vbs) or simply ask our tech support.

6. Testing the macro

If you run a macro that includes one or several EXTRACT commands, the scraping result is displayed automatically:

Web Scraping with OCR

Kantu displays the extracted information as yellow text overlay during the macro run.
