Comments
Brilliant, thanks for this!
Yeah, I've been pondering scanning this for ages - nice one!
Do your Anarchy's Cossack instead.
That's been donated to the Sparrow's Nest I'm afraid.
I see how it is, 'mate'.
I did a quick test OCR'ing the scan.
Dumped the PDF to PPM > turned the pages into greyscale TIFFs > cropped the spreads using ImageMagick > ran tesseract-ocr on the result. It's fairly messy but actually a little better than I expected. It needs serious spellchecking and more; the notes are totally f'ed. Have you OCR'd stuff before, and how successful have you been?
I automated it all using pdftoppm, ImageMagick, Tesseract and a *tiny* bit of shell scripting (Linux tools). I had to blank out the photos manually, though, and add spaces between paragraphs. I was using Tesseract 2.04-2.1 on Debian Sid; it might be worthwhile trying the latest version.
It was mainly an experiment and didn't take much more than 30 minutes. Do you reckon it's worthwhile to do this sort of stuff?
Text can be found at http://pastebin.com/qxpJuf50
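For anyone wanting to try the same thing, here is a rough sketch of the steps described above (PDF to greyscale images, split the spreads, OCR each half). It is in Python rather than shell and is not the script that was actually used. The 300 dpi resolution, the 50/50 split down the middle of each spread, the file names and the function names are all assumptions for illustration. It also assumes `pdftoppm`, ImageMagick's `convert` and `tesseract` are installed and on the PATH. It does not blank out photos or fix paragraph spacing; that still has to be done by hand.

```python
import subprocess
from pathlib import Path


def pdf_to_images_cmd(pdf, prefix, dpi=300):
    """pdftoppm: render each PDF page as a greyscale image (prefix-N.pgm)."""
    return ["pdftoppm", "-r", str(dpi), "-gray", str(pdf), str(prefix)]


def split_spread_cmd(image, out_pattern):
    """ImageMagick: cut a two-page spread into left/right halves.

    Assumes the gutter is in the middle of the scan; %d in out_pattern
    is replaced by 0 (left half) and 1 (right half).
    """
    return ["convert", str(image), "-crop", "50%x100%", "+repage",
            str(out_pattern)]


def ocr_cmd(image, out_base, lang="eng"):
    """tesseract: OCR one image; the text is written to out_base + '.txt'."""
    return ["tesseract", str(image), str(out_base), "-l", lang]


def run_pipeline(pdf, workdir):
    """Run all three steps and return the OCR text as one string."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)

    # Step 1: one greyscale image per scanned spread.
    subprocess.run(pdf_to_images_cmd(pdf, workdir / "page"), check=True)

    # Step 2: split every spread into two single-page TIFFs.
    for page in sorted(workdir.glob("page-*.pgm")):
        pattern = workdir / f"{page.stem}-%d.tif"
        subprocess.run(split_spread_cmd(page, pattern), check=True)

    # Step 3: OCR each half in page order and collect the text.
    texts = []
    for tif in sorted(workdir.glob("page-*-*.tif")):
        subprocess.run(ocr_cmd(tif, tif.with_suffix("")), check=True)
        texts.append(tif.with_suffix(".txt").read_text(errors="replace"))
    return "\n\n".join(texts)
```

Usage would be something like `text = run_pipeline("scan.pdf", "ocr-work")`, where both names are placeholders. Expect to spend far longer proofreading the output than running the tools.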
Hey, that's really great.
We have a lot of OCRed content. It's generally pretty painstaking and takes a long time to correct, especially when, as here, the quality of the scan isn't that great.
Especially with more people using Kindles and screen readers, it's great having stuff in text format. Would you be able to edit this article and paste in a formatted text version?
If you could do this for anything else in the library as well (under the PDFs tag), that would be amazing. Or could you write a short guide so that others can do it too?
Perhaps I'm missing something here, but if you want to use a work which comrades in Ukraine and Canada have gone to a lot of trouble to prepare, why not ask them for a digital copy instead of going through this arcane procedure and possibly ending up with an inferior product?
Good point. Black Cat Press/Thoughtcrime are very reasonable. They will most likely send the text file if asked (they did this with the Atamansha text, I think).
"Perhaps I'm missing something here, but if you want to use a work which comrades in Ukraine and Canada have gone to a lot of trouble to prepare, why not ask them for a digital copy instead of going through this arcane procedure and possibly ending up with an inferior product?"
As I said, ask the authors of texts first before putting them up on libcom. It's not only good manners but, as Kareltelnik says, it can save a lot of bother.
Can anyone clarify whether the current PDF is the "flawed" one, or whether someone has actually contacted the publishers and asked them for a text or PDF?