Python client for learningOrchestra.
Requires Python 3.x. Install the package with pip:
pip install learning-orchestra-client
Import learning_orchestra_client:
from learning_orchestra_client import *
Create a Context object, passing the IP address of a machine in your cluster:
cluster_ip = "34.95.222.197"
Context(cluster_ip)
After creating the Context object, you will be able to use learningOrchestra.
Each functionality in learningOrchestra is contained in its own class. The available API functions are listed after the example below.
Shown below is an example usage of learning-orchestra-client using the Titanic Dataset:
from learning_orchestra_client import *

cluster_ip = "34.95.187.26"
Context(cluster_ip)

# Create the training and testing files from their CSV URLs
database_api = DatabaseApi()
print(database_api.create_file(
    "titanic_training",
    "https://filebin.net/rpfdy8clm5984a4c/titanic_training.csv?t=gcnjz1yo"))
print(database_api.create_file(
    "titanic_testing",
    "https://filebin.net/mguee52ke97k0x9h/titanic_testing.csv?t=ub4nc1rc"))
print(database_api.read_resume_files())

# Create a projection of each file
projection = Projection()
non_required_columns = ["Name", "Ticket", "Cabin",
                        "Embarked", "Sex", "Initial"]
print(projection.create("titanic_training",
                        "titanic_training_projection",
                        non_required_columns))
print(projection.create("titanic_testing",
                        "titanic_testing_projection",
                        non_required_columns))

# Change the listed fields to the number type
data_type_handler = DataTypeHandler()
type_fields = {
    "Age": "number",
    "Fare": "number",
    "Parch": "number",
    "PassengerId": "number",
    "Pclass": "number",
    "SibSp": "number"
}
print(data_type_handler.change_file_type(
    "titanic_testing_projection",
    type_fields))

# Only the training file has the "Survived" field
type_fields["Survived"] = "number"
print(data_type_handler.change_file_type(
    "titanic_training_projection",
    type_fields))
# Raw string, so that backslashes reach the PySpark code unchanged
preprocessing_code = r'''
from pyspark.ml import Pipeline
from pyspark.sql.functions import (
    mean, col, split,
    regexp_extract, when, lit)
from pyspark.ml.feature import (
    VectorAssembler,
    StringIndexer
)

TRAINING_DF_INDEX = 0
TESTING_DF_INDEX = 1

training_df = training_df.withColumnRenamed('Survived', 'label')
testing_df = testing_df.withColumn('label', lit(0))
datasets_list = [training_df, testing_df]

# Extract the title ("Initial") that precedes the dot in each name
for index, dataset in enumerate(datasets_list):
    dataset = dataset.withColumn(
        "Initial",
        regexp_extract(col("Name"), r"([A-Za-z]+)\.", 1))
    datasets_list[index] = dataset

# Map rare or misspelled titles to a small set of common ones
misspelled_initials = [
    'Mlle', 'Mme', 'Ms', 'Dr',
    'Major', 'Lady', 'Countess',
    'Jonkheer', 'Col', 'Rev',
    'Capt', 'Sir', 'Don'
]
correct_initials = [
    'Miss', 'Miss', 'Miss', 'Mr',
    'Mr', 'Mrs', 'Mrs',
    'Other', 'Other', 'Other',
    'Mr', 'Mr', 'Mr'
]
for index, dataset in enumerate(datasets_list):
    dataset = dataset.replace(misspelled_initials, correct_initials)
    datasets_list[index] = dataset

# Fill missing ages with a fixed value per title
initials_age = {"Miss": 22,
                "Other": 46,
                "Master": 5,
                "Mr": 33,
                "Mrs": 36}
for index, dataset in enumerate(datasets_list):
    for initial, initial_age in initials_age.items():
        dataset = dataset.withColumn(
            "Age",
            when((dataset["Initial"] == initial) &
                 (dataset["Age"].isNull()), initial_age).otherwise(
                    dataset["Age"]))
    datasets_list[index] = dataset

# Fill missing ports of embarkation with 'S'
for index, dataset in enumerate(datasets_list):
    dataset = dataset.na.fill({"Embarked": 'S'})
    datasets_list[index] = dataset

# Add the Family_Size and Alone features
for index, dataset in enumerate(datasets_list):
    dataset = dataset.withColumn("Family_Size", col('SibSp')+col('Parch'))
    dataset = dataset.withColumn('Alone', lit(0))
    dataset = dataset.withColumn(
        "Alone",
        when(dataset["Family_Size"] == 0, 1).otherwise(dataset["Alone"]))
    datasets_list[index] = dataset

# Index the text fields into "<field>_index" columns
text_fields = ["Sex", "Embarked", "Initial"]
for column in text_fields:
    for index, dataset in enumerate(datasets_list):
        dataset = StringIndexer(
            inputCol=column, outputCol=column+"_index").\
            fit(dataset).\
            transform(dataset)
        datasets_list[index] = dataset

# Drop the original text columns
non_required_columns = ["Name", "Embarked", "Sex", "Initial"]
for index, dataset in enumerate(datasets_list):
    dataset = dataset.drop(*non_required_columns)
    datasets_list[index] = dataset

training_df = datasets_list[TRAINING_DF_INDEX]
testing_df = datasets_list[TESTING_DF_INDEX]

# Assemble the feature vectors and define the variables expected by create_model
assembler = VectorAssembler(
    inputCols=training_df.columns[:],
    outputCol="features")
assembler.setHandleInvalid('skip')

features_training = assembler.transform(training_df)
(features_training, features_evaluation) =\
    features_training.randomSplit([0.8, 0.2], seed=33)
features_testing = assembler.transform(testing_df)
'''
model_builder = Model()
print(model_builder.create_model(
    "titanic_training_projection",
    "titanic_testing_projection",
    preprocessing_code,
    ["lr", "dt", "gb", "rf", "nb"]))

read_resume_files(pretty_response=True)
- pretty_response: returns an indented string for visualization (default: True; returns a dict if False)
read_file(filename, skip=0, limit=10, query={}, pretty_response=True)
- filename: name of the file
- skip: number of rows to skip in pagination (default: 0)
- limit: number of rows to return in pagination (default: 10; the maximum is 20 rows per request)
- query: query to run in MongoDB (default: empty query)
- pretty_response: returns an indented string for visualization (default: True; returns a dict if False)
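Because a single request returns at most 20 rows, reading a larger file means calling read_file several times with increasing skip values. The sketch below shows only that arithmetic. page_windows is a hypothetical helper, not part of the client, and it assumes the caller already knows the total number of rows.

```python
def page_windows(total_rows, page_size=20):
    """Yield (skip, limit) pairs that together cover total_rows rows.

    Hypothetical helper, not part of learning-orchestra-client.
    page_size is capped at 20, the per-request maximum documented above.
    """
    if total_rows < 0:
        raise ValueError("total_rows must be non-negative")
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    page_size = min(page_size, 20)
    for skip in range(0, total_rows, page_size):
        # The last window may be shorter than page_size
        yield skip, min(page_size, total_rows - skip)


windows = list(page_windows(45))
print(windows)
```

Each pair could then be passed on as read_file(filename, skip=skip, limit=limit).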
create_file(filename, url, pretty_response=True)
- filename: name of the file to be created
- url: URL of the CSV file
- pretty_response: returns an indented string for visualization (default: True; returns a dict if False)

delete_file(filename, pretty_response=True)
- filename: name of the file to be deleted
- pretty_response: returns an indented string for visualization (default: True; returns a dict if False)

create_projection(filename, projection_filename, fields, pretty_response=True)
- filename: name of the file to make the projection from
- projection_filename: name of the file that will hold the projection
- fields: list of fields used to make the projection
- pretty_response: returns an indented string for visualization (default: True; returns a dict if False)

change_file_type(filename, fields_dict, pretty_response=True)
- filename: name of the file
- fields_dict: dictionary that maps each field name to "number" or "string"
- pretty_response: returns an indented string for visualization (default: True; returns a dict if False)
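The fields_dict argument can be built from plain lists of column names, as a small sketch. build_fields_dict is a hypothetical helper, not part of the client; the column names are the ones used in the Titanic example.

```python
def build_fields_dict(number_fields=(), string_fields=()):
    """Map each field name to "number" or "string".

    Hypothetical helper, not part of learning-orchestra-client.
    """
    overlap = set(number_fields) & set(string_fields)
    if overlap:
        raise ValueError(f"fields listed under both types: {sorted(overlap)}")
    fields = {name: "number" for name in number_fields}
    fields.update({name: "string" for name in string_fields})
    return fields


type_fields = build_fields_dict(
    number_fields=["Age", "Fare", "Parch", "PassengerId", "Pclass", "SibSp"])
print(type_fields)
```

The resulting dictionary has the same shape as the type_fields dictionary passed to change_file_type in the example.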
create_histogram(filename, histogram_filename, fields, pretty_response=True)
- filename: name of the file to make the histogram from
- histogram_filename: name of the file that will hold the histogram
- fields: list of fields used to make the histogram
- pretty_response: returns an indented string for visualization (default: True; returns a dict if False)

create_image_plot(tsne_filename, parent_filename, label_name=None, pretty_response=True)
- parent_filename: name of the file to make the image plot from
- tsne_filename: name of the file that will hold the image plot
- label_name: name of the label field, for datasets with labeled tuples (default: None, for datasets without labeled tuples)
- pretty_response: returns an indented string for visualization (default: True; returns a dict if False)

read_image_plot_filenames(pretty_response=True)
- pretty_response: returns an indented string for visualization (default: True; returns a dict if False)

read_image_plot(tsne_filename, pretty_response=True)
- tsne_filename: filename of a created image plot
- pretty_response: returns an indented string for visualization (default: True; returns a dict if False)

delete_image_plot(tsne_filename, pretty_response=True)
- tsne_filename: filename of a created image plot
- pretty_response: returns an indented string for visualization (default: True; returns a dict if False)
create_image_plot(pca_filename, parent_filename, label_name=None, pretty_response=True)
- parent_filename: name of the file to make the image plot from
- pca_filename: name of the file that will hold the image plot
- label_name: name of the label field, for datasets with labeled tuples (default: None, for datasets without labeled tuples)
- pretty_response: returns an indented string for visualization (default: True; returns a dict if False)

read_image_plot_filenames(pretty_response=True)
- pretty_response: returns an indented string for visualization (default: True; returns a dict if False)

read_image_plot(pca_filename, pretty_response=True)
- pca_filename: filename of a created image plot
- pretty_response: returns an indented string for visualization (default: True; returns a dict if False)

delete_image_plot(pca_filename, pretty_response=True)
- pca_filename: filename of a created image plot
- pretty_response: returns an indented string for visualization (default: True; returns a dict if False)

create_model(training_filename, test_filename, preprocessor_code, model_classificator, pretty_response=True)
- training_filename: name of the file to be used for training
- test_filename: name of the file to be used for testing
- preprocessor_code: Python 3 code for the PySpark preprocessing step
- model_classificator: list of classifier codes to be used in the model
- pretty_response: returns an indented string for visualization (default: True; returns a dict if False)
Available values for model_classificator:
- lr: LogisticRegression
- dt: DecisionTreeClassifier
- rf: RandomForestClassifier
- gb: Gradient-boosted tree classifier
- nb: NaiveBayes
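A list of classifier codes can be checked locally before it is sent to the cluster. The sketch below is an assumption about how one might do that; CLASSIFIER_CODES and check_classifier_codes are hypothetical names, not part of the client, and the mapping simply restates the codes listed above.

```python
# Codes and classifier names as listed in this document
CLASSIFIER_CODES = {
    "lr": "LogisticRegression",
    "dt": "DecisionTreeClassifier",
    "rf": "RandomForestClassifier",
    "gb": "Gradient-boosted tree classifier",
    "nb": "NaiveBayes",
}


def check_classifier_codes(codes):
    """Return the codes as a list, or raise ValueError on an unknown code.

    Hypothetical helper, not part of learning-orchestra-client.
    """
    unknown = [code for code in codes if code not in CLASSIFIER_CODES]
    if unknown:
        raise ValueError(f"unknown classifier codes: {unknown}")
    return list(codes)


selected = check_classifier_codes(["lr", "nb"])
print([CLASSIFIER_CODES[code] for code in selected])
```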
To send a request with the LogisticRegression and NaiveBayes classifiers:
create_model(training_filename, test_filename, preprocessor_code, ["lr", "nb"])

The Python 3 preprocessing code must use the environment instances listed below:
- training_df (instantiated): Spark DataFrame instance for the training filename
- testing_df (instantiated): Spark DataFrame instance for the testing filename
The preprocessing code must instantiate the variables listed below; every instance must be transformed by the PySpark VectorAssembler:
- features_training (not instantiated): Spark DataFrame instance for training the model
- features_evaluation (not instantiated): Spark DataFrame instance for evaluating the trained model
- features_testing (not instantiated): Spark DataFrame instance for testing the model
If you don't want to evaluate the model, set features_evaluation to None.
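Since the preprocessing code is sent as a string, a missing variable only shows up once the request reaches the cluster. As a rough local check, the sketch below parses the string with Python's ast module and reports which of the three required names are never assigned. missing_outputs is a hypothetical helper, not part of the client, and the check is purely static: it does not run the code or verify that VectorAssembler was applied.

```python
import ast

REQUIRED_OUTPUTS = (
    "features_training",
    "features_evaluation",
    "features_testing",
)


def missing_outputs(preprocessing_code):
    """Return the required variable names that the code never assigns.

    Hypothetical helper, not part of learning-orchestra-client.
    """
    tree = ast.parse(preprocessing_code)
    # Collect every name that appears as an assignment target
    assigned = {
        node.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store)
    }
    return [name for name in REQUIRED_OUTPUTS if name not in assigned]


sample_code = '''
features_training = assembler.transform(training_df)
features_evaluation = None
'''
print(missing_outputs(sample_code))  # ['features_testing']
```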

self.fields_from_dataframe(dataframe, is_string)
This method returns the string or number fields of a DataFrame as a list of strings.
- dataframe: DataFrame instance
- is_string: Boolean parameter (if True, the method returns the string fields of the DataFrame; otherwise, it returns the number fields)
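The implementation of this method is not shown here. To illustrate the idea only, the sketch below works on a list of (column, type) pairs in the form that a Spark DataFrame exposes through its dtypes attribute. fields_from_dtypes and the set of type names treated as numbers are assumptions made for this example; the real method may classify columns differently.

```python
# Spark type names treated as numbers in this sketch (an assumption)
NUMBER_TYPES = {"tinyint", "smallint", "int", "bigint", "float", "double"}


def fields_from_dtypes(dtypes, is_string):
    """Return the names of the string or number columns.

    Hypothetical stand-in for fields_from_dataframe; dtypes is a list of
    (column_name, type_name) pairs, as in a Spark DataFrame's dtypes.
    """
    if is_string:
        return [name for name, dtype in dtypes if dtype == "string"]
    return [
        name for name, dtype in dtypes
        if dtype in NUMBER_TYPES or dtype.startswith("decimal")
    ]


sample_dtypes = [("Name", "string"), ("Age", "double"),
                 ("Pclass", "int"), ("Sex", "string")]
print(fields_from_dtypes(sample_dtypes, is_string=True))
print(fields_from_dtypes(sample_dtypes, is_string=False))
```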
