QickStart#
This is a quick start guide to get you up and running with the CleverDoc
library.
Dependencies#
Python 3.9
PyTorch 1.11+
Spark 3.4.0+
Onnxruntime 1.10.0+
Transformers 4.11.3+
Installation#
The following command installs CleverDoc from the Python Package Index. You will need a working installation of Python 3.9 and pip.
pip install cleverdoc[inference]
Quick tour#
Start Spark session with CleverDoc#
from cleverdoc import *
spark = start(license='your_license_key')
Load an image to Spark DataFrame#
import pkg_resources
doc_example = pkg_resources.resource_filename('cleverdoc', 'resources/images/Personal_Health_Record_Example.png')
df = spark.read.format("binaryFile").load(doc_example)
Show the image
show_images(df, "content")
Define Spark ML pipeline for extract text#
from pyspark.ml.pipeline import PipelineModel
pipeline = PipelineModel(stages=[
BinaryToImage(),
ImageToString()
])
Run the pipeline#
text_df = pipeline.transform(df)
text_df.show()
Get extracted text#
print(text_df.select("text.text").collect()[0][0])