How is the data collected from the videos?
I used Python with the following libraries:
Library | Reason |
---|---|
Pytube | Used to extract metadata from YouTube videos and download them |
numpy | |
cv2 | Used VideoCapture to read frames from video |
Pillow | Image manipulation, needed to crop images before passing them to the OCR software |
pytesseract | Python wrapper for the Tesseract-OCR software |
re | Regular expressions |
scrapetube | To get all the video identifiers from a YouTube channel |
json | Dump and load data in json format |
The procedure*
*Main idea
First, I get all the videos uploaded to the DGR YouTube channel.
Then I check whether I have already extracted data from the current video; if I processed it in the past, the video is skipped.
Otherwise, the video is downloaded so that its frames can be analysed.
After the video is downloaded, a frame is extracted from it and analysed.
The frame is cropped at two specific places, where the level code is located.
OCR (optical character recognition) is run on the cropped images.
Because the OCR does not always recognize letters reliably, a regex is used to filter the recognized text.
The level code is extracted and saved.
The extracted code is used on the webpage WEBPAGE to retrieve all the metadata of the level.
Finally, all the data is saved together with the YouTube metadata [Thumbnail, Description, URL].
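
The text-only parts of these steps (skipping videos that were already processed, filtering the OCR output with a regex, and saving the results as JSON) can be sketched with the standard library alone. This is a minimal sketch and not the project's actual code: the function names, the idea of one JSON file keyed by video ID, and especially the level-code pattern (three groups of three letters or digits separated by hyphens) are assumptions made for illustration.

```python
import json
import re
from pathlib import Path

# Assumption: a level code looks like "ABC-123-XYZ". The real format is not
# stated here, so replace this pattern with the one that matches your codes.
LEVEL_CODE_PATTERN = re.compile(r"[0-9A-Z]{3}-[0-9A-Z]{3}-[0-9A-Z]{3}")


def extract_level_code(ocr_texts):
    """Return the first level code found in the OCR results, or None."""
    for text in ocr_texts:
        # OCR output often contains stray spaces and newlines: remove them
        # and normalise the case before applying the pattern.
        cleaned = re.sub(r"\s+", "", text.upper())
        match = LEVEL_CODE_PATTERN.search(cleaned)
        if match:
            return match.group(0)
    return None


def load_records(path):
    """Load saved records (a dict keyed by video ID); empty if no file yet."""
    path = Path(path)
    if not path.exists():
        return {}
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)


def save_records(path, records):
    """Write all records back to disk as JSON."""
    with Path(path).open("w", encoding="utf-8") as fh:
        json.dump(records, fh, indent=2, ensure_ascii=False)


def pending_videos(video_ids, records):
    """Keep only the videos that have no saved record, preserving order."""
    return [video_id for video_id in video_ids if video_id not in records]
```

In use, the list of video IDs from the channel would go through `pending_videos`, the text read from the two cropped regions of each frame would go through `extract_level_code`, and each result would be stored under its video ID before calling `save_records`.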