OCR of tables (e.g. WTTs) 12/01/2023 at 14:18 #150131
DonRiver
166 posts
Has anyone had a go at using OCR to parse scanned timetables, e.g. those in Network Rail's archive? Looking at the Tesseract OCR documentation (tesseract-ocr.github.io), it's designed for reading paragraphs of text, not tables. I'm wondering whether there are off-the-shelf image processing techniques for recognising each column by its borders, cropping it out of the image, and OCR'ing it in isolation… it _might_ not actually be difficult in Python.
(named for the one in Tasmania, not in Russia)
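For what it's worth, the "recognise each column by its borders, crop it out" idea can be roughed out with a vertical projection profile: count the ink in every pixel column, treat the nearly solid ones as ruling lines, and cut between them. This is only a minimal sketch under assumptions, not a tested pipeline. It assumes the page has already been deskewed and thresholded into rows of 0/1 values (1 = ink), and the names `find_column_bounds`, `crop_columns` and the `min_fill` threshold are made up for illustration. Each crop could then be handed to Tesseract separately.

```python
def find_column_bounds(binary, min_fill=0.9):
    """Return (start, end) x-ranges of the gaps between vertical ruling lines.

    `binary` is a list of rows, each a list of 0/1 ints where 1 means ink.
    A pixel column is treated as a ruling line when at least `min_fill`
    of its pixels are ink (hypothetical threshold, tune per scan).
    """
    if not binary:
        return []
    height = len(binary)
    width = len(binary[0])

    # Vertical projection: number of ink pixels in each pixel column.
    ink_per_x = [sum(row[x] for row in binary) for x in range(width)]
    is_rule = [count >= min_fill * height for count in ink_per_x]

    bounds = []
    start = None
    for x, rule in enumerate(is_rule):
        if not rule and start is None:
            start = x                    # first pixel after a ruling line
        elif rule and start is not None:
            bounds.append((start, x))    # hit the next ruling line
            start = None
    if start is not None:
        bounds.append((start, width))    # last column has no closing rule
    return bounds


def crop_columns(binary, bounds):
    """Slice the image into one sub-image per (start, end) range."""
    return [[row[a:b] for row in binary] for a, b in bounds]
```

Faint, broken or skewed rules in a real scan would need a lower `min_fill` or a deskew step first, so treat the threshold as a starting point only.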
OCR of tables (e.g. WTTs) 12/01/2023 at 16:08 #150132
bill_gensheet
1413 posts
No, but I just tried it to see how it would go: https://www.onlineocr.net/pdftoexcel
It seemed quite good, except for times ending in ½, which came out as % or 1/2. While fixing the % is easy, getting 11/221/2 to 11/22 ½ is more complicated.
However, that was a 2015 file, which looked like it had been printed to PDF rather than scanned.
The following user said thank you: DonRiver
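The ½ clean-up could perhaps be done in one regular expression, provided the figure before the half is always two digits. That is an assumption about the timetable layout, not something checked against a real WTT. The sketch below (the name `fix_half_minutes` is made up) rewrites a trailing `%` or `1/2` that follows exactly that pattern, and normalises to the spaced form "11/22 ½" used in the example above.

```python
import re

# A misread half: optional space, then "%" or "1/2", directly after two
# digits and not followed by another digit. The two-digit lookbehind is
# what stops the "1/2" inside "11/22" from matching (assumed format).
_HALF = re.compile(r"(?<=\d{2})\s?(?:%|1/2)(?!\d)")


def fix_half_minutes(text):
    """Replace OCR misreads of a trailing half with ' ½'."""
    return _HALF.sub(" ½", text)


print(fix_half_minutes("11/221/2"))
print(fix_half_minutes("11/22%"))
```

It would still misfire on any genuine entry that happens to end in two digits followed by `1/2`, so a spot check against the original page is worthwhile.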