Web Scraping (Data Extraction)
Extract information from websites, PDF or screenshots
Kantu's screen scraping solution allows you to
visually mark the data that you want to extract ("scrape"). You simply draw pink frame(s) around the data that you need.
Kantu then retrieves the data directly from the HTML source or extracts it visually by using high-quality OCR (Optical Character Recognition).
The OCR approach works not only for web scraping, but also for PDF scraping, images (screen scraping) and videos.
This screenshot shows the Extraction wizard inside the Kantu Editor. Essentially this is a tiny graphical editor that allows you to draw, move and delete green and pink frames.
Web Scraping Step by Step
1. Record a screenshot
Record a screenshot around the area of your data. For this, you can use the recording wizard and record it, for example, as a FIND command. Then, in the Kantu Editor, change the FIND command to EXTRACT. Alternatively, start directly in the editor: add a line and select EXTRACT as the command. The Extract wizard in the editor has a built-in screenshot tool.
2. Mark the extraction anchor and data-to-be-extracted
Now comes the magic. In the Extraction wizard of the editor you mark a fixed point as the extraction anchor with a green frame. An extraction anchor is something that does not change
between page loads, for example some fixed text like "Price:", a $ symbol or an image/icon. In the example template below we use the "BTC" (Bitcoin) symbol as anchor.
Next you draw pink frames around the data that you want to extract. Kantu searches the document, first finds whatever is inside the green frame, and then extracts the content of the areas marked by the pink frames. So a template always has exactly one green frame, but there can be many pink frames.
In this example we use the Bitcoin symbol as anchor (green frame), and then extract the Bitcoin exchange rate (pink frame).
In this example we find the USD/Euro symbols as anchor, and then extract six different exchange rates (six pink frames).
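The anchor-plus-frames idea can be illustrated with a little geometry. The following is only a sketch of the concept, not Kantu's actual implementation: the Rect class, both helper functions and all coordinates are hypothetical. It stores each pink frame relative to the green anchor and re-creates the frames once the anchor has been found at a different position.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle in pixel coordinates (hypothetical helper)."""
    x: int
    y: int
    width: int
    height: int


def frame_offsets(anchor: Rect, data_frames: list[Rect]) -> list[tuple[int, int, int, int]]:
    """Store each pink frame relative to the green anchor's top-left corner."""
    return [(f.x - anchor.x, f.y - anchor.y, f.width, f.height) for f in data_frames]


def locate_data_frames(found_anchor: Rect, offsets: list[tuple[int, int, int, int]]) -> list[Rect]:
    """Re-create the pink frames once the anchor has been found on a page."""
    return [Rect(found_anchor.x + dx, found_anchor.y + dy, w, h) for dx, dy, w, h in offsets]


# Template: the anchor was drawn at (100, 50), with one pink frame to its right.
template_anchor = Rect(100, 50, 40, 20)
template_frames = [Rect(150, 50, 80, 20)]
offsets = frame_offsets(template_anchor, template_frames)

# On the live page the anchor sits 30 px further right and 200 px further down.
live_anchor = Rect(130, 250, 40, 20)
print(locate_data_frames(live_anchor, offsets))
```

Because the pink frames are expressed relative to the anchor, the data is still found when the page layout shifts, as long as the distance between anchor and data stays the same.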
Important: The colors of the frames need to be exactly as shown here. That is color code #00FF00 for green, and #FE1492 for pink.
If you draw the frames with the Kantu Editor, this is taken care of automatically. If you use an external
image editor such as Microsoft Paint or Snagit, make sure you use exactly these color codes for your frames. Make sure that the frames are one solid color, and that the tools
do not apply shadow effects, color gradients or similar visual effects, as these can disturb the frame-finding logic.
You can verify that your frames are sharp and single-colored by zooming into the image. This check is only needed for frames drawn with external graphics software.
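If you prefer to check the colors programmatically rather than by zooming, a check along the following lines may help. It is a minimal sketch that assumes you have already loaded the image pixels as (r, g, b) tuples with an imaging library of your choice. The function names and the tolerance value are assumptions for illustration and are not part of Kantu.

```python
GREEN = (0, 255, 0)     # #00FF00, anchor frame
PINK = (254, 20, 146)   # #FE1492, data frames


def near(rgb, target, tolerance=40):
    """True if every channel of rgb is within `tolerance` of target."""
    return all(abs(a - b) <= tolerance for a, b in zip(rgb, target))


def suspicious_pixels(pixels):
    """Return pixels that resemble a frame color but do not match it exactly.

    Such pixels typically come from anti-aliasing, shadows or gradients
    added by an external image editor.
    """
    bad = []
    for rgb in pixels:
        for target in (GREEN, PINK):
            if rgb != target and near(rgb, target):
                bad.append(rgb)
                break
    return bad


# Two exact frame pixels, one slightly off green, one white background pixel.
sample = [(0, 255, 0), (254, 20, 146), (10, 240, 12), (255, 255, 255)]
print(suspicious_pixels(sample))
```

An empty result suggests the frames use only the exact color codes. Any reported pixel is a hint that the editor blended the frame with its surroundings.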
3. Get the data as text, html, image, table... or with OCR
In the next step you need to tell Kantu how to extract the information. In the method field you define the extraction method: TEXT, HTM, URL - or OCR.
TEXT, HTM, URL - Here Kantu gets the data from the web page element that is in the middle ("below") of the pink frame.
IMAGE - Have some hard to download images? Or want to save a snapshot of a certain part of the website?
Kantu image extraction does that. You mark the area that you want to extract with a pink frame and use IMAGE as extraction mode.
Then Kantu saves a snapshot (screenshot) of the part inside the pink frame. This can be an image, text, video, PDF or anything else. Kantu "screenshots" the part inside
the pink frame and saves it as image. You can have more than one pink frame, too. Kantu saves the images in the download folder with a
default name macro name_image name_(number)_ImageExtract.png
TABLE - Extract a complete HTML table at once, converted automatically to a ready-to-use CSV format.
To do so, place the pink box on anything inside the table. This tells Kantu which table you want. Kantu marks the table during the extraction
with an orange frame, so you can visually verify it selected the correct table. The table data is saved to the Kantu download folder with a file name of
OCR - Kantu offers OCR-powered screen scraping. With Method=OCR Kantu runs OCR (optical character recognition) on the image that is inside the pink frame. It reads the image and then returns the text or number inside it.
This OCR-powered data extraction does not only work on websites (web scraping) but it works equally well on PDFs (PDF scraping) and images (text on images or inside videos, like subtitles or news).
All the different extraction options listed in this chapter are demo'ed with the "Demo-Extract" example macro that ships with Kantu.
4. Where is the extracted data?
All extracted data is saved in CSV format in the Kantu /download folder. In addition, you can use the API to access the extracted data directly (see below).
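Once the CSV file is in the download folder, any CSV reader can load it. The sketch below uses Python's standard csv module. The file name, the sample rows and the UTF-8 encoding are assumptions made for this example (it writes its own sample file to a temporary folder so that it runs anywhere). Substitute the path of the file Kantu actually produced.

```python
import csv
import tempfile
from pathlib import Path


def read_extracted_rows(csv_path):
    """Return all rows of a CSV file as lists of strings."""
    with Path(csv_path).open(newline="", encoding="utf-8") as handle:
        return list(csv.reader(handle))


# Stand-in for a file in the Kantu download folder (hypothetical name and content).
with tempfile.TemporaryDirectory() as folder:
    sample_file = Path(folder) / "extract-demo.csv"
    sample_file.write_text("Currency,Rate\nUSD,1.13\nGBP,0.88\n", encoding="utf-8")
    rows = read_extracted_rows(sample_file)

for row in rows:
    print(row)
```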
5. Access the data via the API [PRO]
The data is returned to the calling script via the GetExtractImageData() API command. This works with any programming or scripting language, so you
can handle further data processing and storage directly in your favorite language. Here is a VBS data scraping script.
The data is returned as a string. The different fields are separated by the [DATA] separator.
Every scripting language has some kind of "split" command. Here is how the syntax looks in Visual Basic Scripting (VBS):
data = Split(line(x), "[DATA]")
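The same split can be written in other languages. As a rough Python equivalent, assuming the string returned by the API has already been stored in a variable, the following sketch breaks it into its fields. The sample values and the function name are made up for illustration.

```python
SEPARATOR = "[DATA]"


def split_fields(line):
    """Split one returned string into its fields, trimming stray whitespace."""
    return [field.strip() for field in line.split(SEPARATOR)]


# Hypothetical return value containing three extracted fields.
sample = "6543.21[DATA]0.8812[DATA]1.1348"
fields = split_fields(sample)
print(fields)
```

Note that the fields arrive as text. Convert them explicitly, for example with float(), before doing arithmetic on extracted numbers.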
6. Testing the macro
If you run a macro that includes one or several EXTRACT commands, the scraping result is displayed automatically:
Kantu displays the extracted information as a yellow text overlay during the macro run.