Web Scraping (Data Extraction)
Extract information from websites, PDFs or screenshots
Kantu's screen scraping solution allows you to visually mark the data that you want to extract ("scrape"). You simply draw pink frame(s) around the data that you need.
Kantu then retrieves the data directly from the HTML source or extracts it visually by using high-quality OCR (Optical Character Recognition).
The OCR approach works not only for web scraping, but also for PDF scraping, images (screen scraping) and videos.
This screenshot shows the Extraction wizard inside the Kantu Editor. Essentially this is a tiny graphical editor that allows you to draw, move and delete green and pink frames.
Web Scraping Step by Step
1. Record a screenshot
Record a screenshot around the area of your data. For this, you can use the recording wizard and record it, for example, as a FIND command. Then, in the Kantu Editor, you change the FIND command to EXTRACT. Or you can start directly in the editor: add a line and select EXTRACT as the command. The Extraction wizard in the editor has a built-in screenshot tool.
2. Mark the extraction anchor and data-to-be-extracted
Now comes the magic. In the Extraction wizard of the editor you mark a fixed point as extraction anchor with a green frame. An extraction anchor is something that does not change between page loads, for example some fixed text like "Price:", a $ symbol or an image/icon. In the example template below we use the "BTC" (Bitcoin) symbol as anchor.
Next you draw pink frames around the data that you want to extract. Kantu first searches the document for whatever is inside the green frame, and then extracts the content of the areas marked by the pink frames. So a template always has exactly one green frame, but there can be many pink frames.
In this example we use the Bitcoin symbol as anchor (green frame), and then extract the Bitcoin exchange rate (pink frame).
In this example we find the USD/Euro symbols as anchor, and then extract six different exchange rates (six pink frames).
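The relationship between the green anchor frame and the pink frames can be pictured as a simple offset calculation. The following Python sketch only illustrates the idea and is not Kantu's actual implementation: the `Rect` type, the `locate_pink_frames` function and all coordinates are hypothetical. It assumes the pink frames keep the same position relative to the anchor as in the template.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle in pixel coordinates (hypothetical helper type)."""
    x: int
    y: int
    width: int
    height: int


def locate_pink_frames(template_anchor, template_pinks, found_anchor):
    """Shift every pink frame by the offset between the anchor position in the
    template and the anchor position found on the current page."""
    dx = found_anchor.x - template_anchor.x
    dy = found_anchor.y - template_anchor.y
    return [Rect(p.x + dx, p.y + dy, p.width, p.height) for p in template_pinks]


# Made-up coordinates for illustration only.
template_anchor = Rect(10, 20, 30, 30)                          # green frame in the template
template_pinks = [Rect(50, 20, 80, 30), Rect(50, 60, 80, 30)]   # pink frames in the template
found_anchor = Rect(110, 220, 30, 30)                           # where the anchor was found on the page

located = locate_pink_frames(template_anchor, template_pinks, found_anchor)
for rect in located:
    print(rect)
```

Because everything is measured relative to the anchor, the data can still be located if the page content moves, as long as the anchor itself is found.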
Important: The colors of the frames need to be exactly as shown here. That is color code #00FF00 for green, and #FE1492 for pink.
If you draw the frames with the Kantu Editor, this is automatically taken care of. If you use an external image editor such as Microsoft Paint or Snagit, make sure you use exactly these color codes for your frames. Make sure that the frames are one solid color, and that the tool does not apply shadow effects, color gradients or similar visual effects. These could disturb the frame-finding logic:
You can verify that your frames are sharp and single color by zooming the image. Of course, you need this check only if you use external graphics software like Snagit or MS Paint. If you use the Kantu Editor to draw the frames they are automatically correct.
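If you prepare templates with an external tool, a small script can complement the visual zoom check. The sketch below is a hypothetical helper, not part of Kantu. It assumes the image has already been loaded into rows of `(R, G, B)` tuples by an image library of your choice, counts the pixels that match the two frame colors exactly, and flags "near miss" pixels that are close to a frame color but not identical, which typically indicates anti-aliasing, gradients or shadows. The `tolerance` value is an arbitrary assumption.

```python
GREEN = (0, 255, 0)     # color code #00FF00
PINK = (254, 20, 146)   # color code #FE1492


def is_close(pixel, color, tolerance):
    """True if every channel differs from the target color by at most `tolerance`."""
    return all(abs(a - b) <= tolerance for a, b in zip(pixel, color))


def count_frame_pixels(pixels, tolerance=40):
    """Count exact frame-colored pixels and suspicious near-miss pixels.

    `pixels` is a list of rows, each row a list of (R, G, B) tuples.
    """
    counts = {"green_exact": 0, "pink_exact": 0, "near_miss": 0}
    for row in pixels:
        for px in row:
            if px == GREEN:
                counts["green_exact"] += 1
            elif px == PINK:
                counts["pink_exact"] += 1
            elif is_close(px, GREEN, tolerance) or is_close(px, PINK, tolerance):
                counts["near_miss"] += 1
    return counts


# Tiny made-up 2x3 image: one pixel is a slightly-off green.
sample = [
    [(0, 255, 0), (254, 20, 146), (255, 255, 255)],
    [(10, 245, 10), (0, 255, 0), (0, 0, 0)],
]
counts = count_frame_pixels(sample)
print(counts)
```

A non-zero `near_miss` count suggests that the frames are not one solid color and should be redrawn.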
3. Get the data as text, html,... or with OCR
In the next step you need to tell Kantu how to extract the information. In the method field you define the extraction method: TEXT, HTM, URL - or OCR.
TEXT, HTM, URL - Here Kantu gets the data from the web page element that is in the middle ("below") of the pink frame.
OCR - Kantu offers OCR-powered screen scraping. With Method=OCR Kantu runs OCR (optical character recognition) on the image that is inside the pink frame. It reads the image and then returns the text or number inside it.
This OCR-powered data extraction works not only on websites (web scraping) but equally well on PDFs (PDF scraping) and images (text on images or inside videos, like subtitles or news).
4. Storing the data via the API
The data is returned to the calling script via the GetExtractImageData() API command. This works with any programming or scripting language, and you can manage the further data processing and storage directly in your favorite language. Here is a VBS data scraping script.
The data is returned as a string:
The different fields are separated by the [DATA] separator, and between the data from different EXTRACT commands there is the [REC] separator.
Every scripting language has some kind of "split" command. Here is how the syntax looks in Visual Basic Scripting (VBS):
line = Split(allData, "[REC]") and
data = Split(line(x), "[DATA]").
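The same splitting logic can be sketched in Python. The sample string below is made up for illustration, and `parse_extract_data` and `records_to_csv` are hypothetical helper names. The sketch assumes the string has already been obtained from GetExtractImageData(), and it skips empty records in case the string starts or ends with a separator, which is an assumption about the format.

```python
import csv
import io

REC_SEP = "[REC]"    # separates the data of different EXTRACT commands
DATA_SEP = "[DATA]"  # separates the fields within one EXTRACT command


def parse_extract_data(all_data):
    """Split the returned string into a list of records, each a list of fields."""
    records = []
    for record in all_data.split(REC_SEP):
        if record == "":
            continue  # assumption: ignore empty records around stray separators
        records.append(record.split(DATA_SEP))
    return records


def records_to_csv(records):
    """Serialize the parsed records as CSV text, one record per line."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(records)
    return buffer.getvalue()


# Made-up example: two EXTRACT commands with two and three fields.
all_data = "6432.10[DATA]5510.45[REC]1.17[DATA]0.85[DATA]110.20"
records = parse_extract_data(all_data)
csv_text = records_to_csv(records)
print(records)
print(csv_text)
```

From here the records can be written to a file or a database with the usual tools of the language.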
5. Testing the macro
If you run a macro that includes one or several EXTRACT commands, the scraping result is displayed automatically:
Kantu displays the extracted information as yellow text overlay during the macro run.