TypedDataset: Feature Overview

This tutorial introduces TypedDataset using a simple example. The following imports are needed to make all code examples compile.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession
import frameless.functions.aggregate._
import frameless.TypedDataset

val conf = new SparkConf().setMaster("local[*]").setAppName("Frameless repl").set("spark.ui.enabled", "false")
implicit val spark = SparkSession.builder().config(conf).appName("REPL").getOrCreate()
spark.sparkContext.setLogLevel("WARN")

import spark.implicits._

Creating TypedDataset instances

We start by defining a case class:

case class Apartment(city: String, surface: Int, price: Double, bedrooms: Int)

And few Apartment instances:

val apartments = Seq(
  Apartment("Paris", 50,  300000.0, 2),
  Apartment("Paris", 100, 450000.0, 3),
  Apartment("Paris", 25,  250000.0, 1),
  Apartment("Lyon",  83,  200000.0, 2),
  Apartment("Lyon",  45,  133000.0, 1),
  Apartment("Nice",  74,  325000.0, 3)
)

We are now ready to instantiate a TypedDataset[Apartment]:

val aptTypedDs = TypedDataset.create(apartments)
// aptTypedDs: TypedDataset[Apartment] = [city: string, surface: int ... 2 more fields]

We can also create one from an existing Spark Dataset:

val aptDs = spark.createDataset(apartments)
// aptDs: org.apache.spark.sql.Dataset[Apartment] = [city: string, surface: int ... 2 more fields]
val aptTypedDs = TypedDataset.create(aptDs)
// aptTypedDs: TypedDataset[Apartment] = [city: string, surface: int ... 2 more fields]

Or use the Frameless syntax:

import frameless.syntax._

val aptTypedDs2 = aptDs.typed
// aptTypedDs2: TypedDataset[Apartment] = [city: string, surface: int ... 2 more fields]

Typesafe column referencing

This is how we select a particular column from a TypedDataset:

val cities: TypedDataset[String] = aptTypedDs.select(aptTypedDs('city))
// cities: TypedDataset[String] = [value: string]

This is completely type-safe, for instance suppose we misspell city as citi:

aptTypedDs.select(aptTypedDs('citi))
// error: No column Symbol with shapeless.tag.Tagged[String("citi")] of type A in repl.MdocSession.MdocApp0.Apartment

This gets raised at compile time, whereas with the standard Dataset API the error appears at runtime (enjoy the stack trace):

aptDs.select('citi)
// org.apache.spark.sql.AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `citi` cannot be resolved. Did you mean one of the following? [`city`, `price`, `surface`, `bedrooms`].;
// 'Project ['citi]
// +- LocalRelation [city#64, surface#65, price#66, bedrooms#67]
// 
// 	at org.apache.spark.sql.errors.QueryCompilationErrors$.unresolvedAttributeError(QueryCompilationErrors.scala:306)
// 	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.org$apache$spark$sql$catalyst$analysis$CheckAnalysis$$failUnresolvedAttribute(CheckAnalysis.scala:141)
// 	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$6(CheckAnalysis.scala:299)
// 	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$6$adapted(CheckAnalysis.scala:297)
// 	at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:244)
// 	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$5(CheckAnalysis.scala:297)
// 	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$5$adapted(CheckAnalysis.scala:297)
// 	at scala.collection.immutable.Stream.foreach(Stream.scala:533)
// 	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$2(CheckAnalysis.scala:297)
// 	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$2$adapted(CheckAnalysis.scala:215)
// 	at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:244)
// 	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis0(CheckAnalysis.scala:215)
// 	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis0$(CheckAnalysis.scala:197)
// 	at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis0(Analyzer.scala:202)
// 	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:193)
// 	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:171)
// 	at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:202)
// 	at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:225)
// 	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:330)
// 	at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:222)
// 	at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:77)
// 	at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:138)
// 	at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$2(QueryExecution.scala:219)
// 	at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:546)
// 	at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:219)
// 	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
// 	at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:218)
// 	at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:77)
// 	at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:74)
// 	at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:66)
// 	at org.apache.spark.sql.Dataset$.$anonfun$ofRows$1(Dataset.scala:91)
// 	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
// 	at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:89)
// 	at org.apache.spark.sql.Dataset.withPlan(Dataset.scala:4352)
// 	at org.apache.spark.sql.Dataset.select(Dataset.scala:1542)
// 	at repl.MdocSession$MdocApp0$$anonfun$15.apply(FeatureOverview.md:95)
// 	at repl.MdocSession$MdocApp0$$anonfun$15.apply(FeatureOverview.md:95)

select() supports arbitrary column operations:

aptTypedDs.select(aptTypedDs('surface) * 10, aptTypedDs('surface) + 2).show().run()
// +----+---+
// |  _1| _2|
// +----+---+
// | 500| 52|
// |1000|102|
// | 250| 27|
// | 830| 85|
// | 450| 47|
// | 740| 76|
// +----+---+
//

Note that unlike the standard Spark API, where some operations are lazy and some are not, all TypedDatasets operations are lazy. In the above example, show() is lazy. It requires to apply run() for the show job to materialize. A more detailed explanation of Job is given here.

Next we compute the price by surface unit:

val priceBySurfaceUnit = aptTypedDs.select(aptTypedDs('price) / aptTypedDs('surface))
// error: overloaded method value / with alternatives:
//   (u: Double)(implicit n: frameless.CatalystNumeric[Double])frameless.TypedColumn[repl.MdocSession.MdocApp0.Apartment,Double] <and>
//   [Out, TT, W](other: frameless.TypedColumn[TT,Double])(implicit n: frameless.CatalystDivisible[Double,Out], implicit e: frameless.TypedEncoder[Out], implicit w: frameless.With[repl.MdocSession.MdocApp0.Apartment,TT]{type Out = W})frameless.TypedColumn[W,Out]
//  cannot be applied to (frameless.TypedColumn[repl.MdocSession.MdocApp0.Apartment,Int])
// val priceBySurfaceUnit = aptTypedDs.select(aptTypedDs('price) / aptTypedDs('surface))
//                                            ^^^^^^^^^^^^^^^^^^^^

As the error suggests, we can't divide a TypedColumn of Double by Int. For safety, in Frameless only math operations between same types is allowed:

val priceBySurfaceUnit = aptTypedDs.select(aptTypedDs('price) / aptTypedDs('surface).cast[Double])
// priceBySurfaceUnit: TypedDataset[Double] = [value: double]
priceBySurfaceUnit.collect().run()
// res5: Seq[Double] = WrappedArray(
//   6000.0,
//   4500.0,
//   10000.0,
//   2409.6385542168673,
//   2955.5555555555557,
//   4391.891891891892
// )

Looks like it worked, but that cast seems unsafe right? Actually it is safe. Let's try to cast a TypedColumn of String to Double:

aptTypedDs('city).cast[Double]
// error: could not find implicit value for parameter c: frameless.CatalystCast[String,Double]

The compile-time error tells us that to perform the cast, an evidence (in the form of CatalystCast[String, Double]) must be available. Since casting from String to Double is not allowed, this results in a compilation error.

Check here for the set of available CatalystCast.

Working with Optional columns

When working with real data we have to deal with imperfections, such as missing fields. Columns that may have missing data should be represented using Options. For this example, let's assume that the Apartments dataset may have missing values.

case class ApartmentOpt(city: Option[String], surface: Option[Int], price: Option[Double], bedrooms: Option[Int])

val apartmentsOpt = Seq(
  ApartmentOpt(Some("Paris"), Some(50),  Some(300000.0), None),
  ApartmentOpt(None, None, Some(450000.0), Some(3))
)

val aptTypedDsOpt = TypedDataset.create(apartmentsOpt)
// aptTypedDsOpt: TypedDataset[ApartmentOpt] = [city: string, surface: int ... 2 more fields]
aptTypedDsOpt.show().run()
// +-----+-------+--------+--------+
// | city|surface|   price|bedrooms|
// +-----+-------+--------+--------+
// |Paris|     50|300000.0|    NULL|
// | NULL|   NULL|450000.0|       3|
// +-----+-------+--------+--------+
//

Unfortunately the syntax used above with select() will not work here:

aptTypedDsOpt.select(aptTypedDsOpt('surface) * 10, aptTypedDsOpt('surface) + 2).show().run()
// error: overloaded method value * with alternatives:
//   (u: Option[Int])(implicit n: frameless.CatalystNumeric[Option[Int]])frameless.TypedColumn[ApartmentOpt,Option[Int]] <and>
//   [TT, W](other: frameless.TypedColumn[TT,Option[Int]])(implicit n: frameless.CatalystNumeric[Option[Int]], implicit w: frameless.With[ApartmentOpt,TT]{type Out = W}, implicit t: scala.reflect.ClassTag[Option[Int]])frameless.TypedColumn[W,Option[Int]]
//  cannot be applied to (Int)
// aptTypedDsOpt.select(aptTypedDsOpt('surface) * 10, aptTypedDsOpt('surface) + 2).show().run()
//                      ^^^^^^^^^^^^^^^^^^^^^^^^^
// error: overloaded method value + with alternatives:
//   (u: Option[Int])(implicit n: frameless.CatalystNumeric[Option[Int]])frameless.TypedColumn[ApartmentOpt,Option[Int]] <and>
//   [TT, W](other: frameless.TypedColumn[TT,Option[Int]])(implicit n: frameless.CatalystNumeric[Option[Int]], implicit w: frameless.With[ApartmentOpt,TT]{type Out = W})frameless.TypedColumn[W,Option[Int]]
//  cannot be applied to (Int)
// aptTypedDsOpt.select(aptTypedDsOpt('surface) * 10, aptTypedDsOpt('surface) + 2).show().run()
//                                                    ^^^^^^^^^^^^^^^^^^^^^^^^^

This is because we cannot multiple an Option with an Int. In Scala, Option has a map() method to help address exactly this (e.g., Some(10).map(c => c * 2)). Frameless follows a similar convention. By applying the opt method on any Option[X] column you can then use map() to provide a function that works with the unwrapped type X. This is best shown in the example bellow:

aptTypedDsOpt.select(aptTypedDsOpt('surface).opt.map(c => c * 10), aptTypedDsOpt('surface).opt.map(_ + 2)).show().run()

Known issue: map() will throw a runtime exception when the applied function includes a udf(). If you want to apply a udf() to an optional column, we recommend changing your udf to work directly with Optional fields.

Casting and projections

In the general case, select() returns a TypedDataset of type TypedDataset[TupleN[...]] (with N in [1...10]). For example, if we select three columns with types String, Int, and Boolean the result will have type TypedDataset[(String, Int, Boolean)].

We often want to give more expressive types to the result of our computations. as[T] allows us to safely cast a TypedDataset[U] to another of type TypedDataset[T] as long as the types in U and T align.

When the cast is valid the expression compiles:

case class UpdatedSurface(city: String, surface: Int)
val updated = aptTypedDs.select(aptTypedDs('city), aptTypedDs('surface) + 2).as[UpdatedSurface]
// updated: TypedDataset[UpdatedSurface] = [city: string, surface: int]
updated.show(2).run()
// +-----+-------+
// | city|surface|
// +-----+-------+
// |Paris|     52|
// |Paris|    102|
// +-----+-------+
// only showing top 2 rows
//

Next we try to cast a (String, String) to an UpdatedSurface (which has types String, Int). The cast is not valid and the expression does not compile:

aptTypedDs.select(aptTypedDs('city), aptTypedDs('city)).as[UpdatedSurface]
// error: could not find implicit value for parameter as: frameless.ops.As[(String, String),UpdatedSurface]

Advanced topics with `select()`

When you select() a single column that has type A, the resulting type is TypedDataset[A] and not TypedDataset[Tuple1[A]]. This behavior makes working with nested schema easier (i.e., in the case where A is a complex data type) and simplifies type-checking column operations (e.g., verify that two columns can be added, divided, etc.). However, when A is scalar, say a Long, it makes it harder to select and work with the resulting TypedDataset[Long]. For instance, it's harder to reference this single scalar column using select(). If this becomes an issue, you can bypass this behavior by using the selectMany() method instead of select(). In the previous example, selectMany() will return TypedDataset[Tuple1[Long]] and you can reference its single column using the name _1. selectMany() should also be used when you need to select more than 10 columns. select() has better IDE support and compiles faster than the macro based selectMany(), so prefer select() for the most common use cases.

When you are handed a single scalar column TypedDataset (e.g., TypedDataset[Double]) the best way to reference its single column is using the asCol (short for "as a column") method. This is best shown in the example below. We will see more usages of asCol later in this tutorial.

val priceBySurfaceUnit = aptTypedDs.select(aptTypedDs('price) / aptTypedDs('surface).cast[Double])
// priceBySurfaceUnit: TypedDataset[Double] = [value: double]
priceBySurfaceUnit.select(priceBySurfaceUnit.asCol * 2).show(2).run()
// +-------+
// |  value|
// +-------+
// |12000.0|
// | 9000.0|
// +-------+
// only showing top 2 rows
//

Projections

We often want to work with a subset of the fields in a dataset. Projections allow us to easily select our fields of interest while preserving their initial names and types for extra safety.

Here is an example using the TypedDataset[Apartment] with an additional column:

val aptds = aptTypedDs // For shorter expressions
// aptds: TypedDataset[Apartment] = [city: string, surface: int ... 2 more fields]

case class ApartmentDetails(city: String, price: Double, surface: Int, ratio: Double)
val aptWithRatio =
  aptds.select(
    aptds('city),
    aptds('price),
    aptds('surface),
    aptds('price) / aptds('surface).cast[Double]
  ).as[ApartmentDetails]
// aptWithRatio: TypedDataset[ApartmentDetails] = [city: string, price: double ... 2 more fields]

Suppose we only want to work with city and ratio:

case class CityInfo(city: String, ratio: Double)

val cityRatio = aptWithRatio.project[CityInfo]
// cityRatio: TypedDataset[CityInfo] = [city: string, ratio: double]

cityRatio.show(2).run()
// +-----+------+
// | city| ratio|
// +-----+------+
// |Paris|6000.0|
// |Paris|4500.0|
// +-----+------+
// only showing top 2 rows
//

Suppose we only want to work with price and ratio:

case class PriceInfo(ratio: Double, price: Double)

val priceInfo = aptWithRatio.project[PriceInfo]
// priceInfo: TypedDataset[PriceInfo] = [ratio: double, price: double]

priceInfo.show(2).run()
// +------+--------+
// | ratio|   price|
// +------+--------+
// |6000.0|300000.0|
// |4500.0|450000.0|
// +------+--------+
// only showing top 2 rows
//

We see that the order of the fields does not matter as long as the names and the corresponding types agree. However, if we make a mistake in any of the names and/or their types, then we get a compilation error.

Say we make a typo in a field name:

case class PriceInfo2(ratio: Double, pricEE: Double)

aptWithRatio.project[PriceInfo2]
// error: Cannot prove that ApartmentDetails can be projected to PriceInfo2. Perhaps not all member names and types of PriceInfo2 are the same in ApartmentDetails?

Say we make a mistake in the corresponding type:

case class PriceInfo3(ratio: Int, price: Double) // ratio should be Double

aptWithRatio.project[PriceInfo3]
// error: Cannot prove that ApartmentDetails can be projected to PriceInfo3. Perhaps not all member names and types of PriceInfo3 are the same in ApartmentDetails?

Union of TypedDatasets

Lets create a projection of our original dataset with a subset of the fields.

case class ApartmentShortInfo(city: String, price: Double, bedrooms: Int)

val aptTypedDs2: TypedDataset[ApartmentShortInfo] = aptTypedDs.project[ApartmentShortInfo]

The union of aptTypedDs2 with aptTypedDs uses all the fields of the caller (aptTypedDs2) and expects the other dataset (aptTypedDs) to include all those fields. If field names/types do not match you get a compilation error.

aptTypedDs2.union(aptTypedDs).show().run
// +-----+--------+--------+
// | city|   price|bedrooms|
// +-----+--------+--------+
// |Paris|300000.0|       2|
// |Paris|450000.0|       3|
// |Paris|250000.0|       1|
// | Lyon|200000.0|       2|
// | Lyon|133000.0|       1|
// | Nice|325000.0|       3|
// |Paris|300000.0|       2|
// |Paris|450000.0|       3|
// |Paris|250000.0|       1|
// | Lyon|200000.0|       2|
// | Lyon|133000.0|       1|
// | Nice|325000.0|       3|
// +-----+--------+--------+
//

The other way around will not compile, since aptTypedDs2 has only a subset of the fields.

aptTypedDs.union(aptTypedDs2).show().run
// error: Cannot prove that ApartmentShortInfo can be projected to repl.MdocSession.MdocApp0.Apartment. Perhaps not all member names and types of repl.MdocSession.MdocApp0.Apartment are the same in ApartmentShortInfo?
// Error occurred in an application involving default arguments.

Finally, as with project, union will align fields that have same names/types, so fields do not have to be in the same order.

TypedDataset functions and transformations

Frameless supports many of Spark's functions and transformations. However, whenever a Spark function does not exist in Frameless, calling .dataset will expose the underlying Dataset (from org.apache.spark.sql, the original Spark APIs), where you can use anything that would be missing from the Frameless' API.

These are the main imports for Frameless' aggregate and non-aggregate functions.

import frameless.functions._                // For literals
import frameless.functions.nonAggregate._   // e.g., concat, abs
import frameless.functions.aggregate._      // e.g., count, sum, avg

Drop/Replace/Add fields

dropTupled() drops a single column and results in a tuple-based schema.

aptTypedDs2.dropTupled('price): TypedDataset[(String,Int)]
// res18: TypedDataset[(String, Int)] = [_1: string, _2: int]

To drop a column and specify a new schema use drop().

case class CityBeds(city: String, bedrooms: Int)
val cityBeds: TypedDataset[CityBeds] = aptTypedDs2.drop[CityBeds] 
// cityBeds: TypedDataset[CityBeds] = [city: string, bedrooms: int]

Often, you want to replace an existing column with a new value.

val inflation = aptTypedDs2.withColumnReplaced('price, aptTypedDs2('price) * 2)
// inflation: TypedDataset[ApartmentShortInfo] = [city: string, price: double ... 1 more field]
 
inflation.show(2).run()
// +-----+--------+--------+
// | city|   price|bedrooms|
// +-----+--------+--------+
// |Paris|600000.0|       2|
// |Paris|900000.0|       3|
// +-----+--------+--------+
// only showing top 2 rows
//

Or use a literal instead.

import frameless.functions.lit
aptTypedDs2.withColumnReplaced('price, lit(0.001)) 
// res20: TypedDataset[ApartmentShortInfo] = [city: string, price: double ... 1 more field]

Adding a column using withColumnTupled() results in a tupled-based schema.

aptTypedDs2.withColumnTupled(lit(Array("a","b","c"))).show(2).run()
// +-----+--------+---+---------+
// |   _1|      _2| _3|       _4|
// +-----+--------+---+---------+
// |Paris|300000.0|  2|[a, b, c]|
// |Paris|450000.0|  3|[a, b, c]|
// +-----+--------+---+---------+
// only showing top 2 rows
//

Similarly, withColumn() adds a column and explicitly expects a schema for the result.

case class CityBedsOther(city: String, bedrooms: Int, other: List[String])

cityBeds.
   withColumn[CityBedsOther](lit(List("a","b","c"))).
   show(1).run()
// +-----+--------+---------+
// | city|bedrooms|    other|
// +-----+--------+---------+
// |Paris|       2|[a, b, c]|
// +-----+--------+---------+
// only showing top 1 row
//

To conditionally change a column use the when/otherwise operation.

import frameless.functions.nonAggregate.when
aptTypedDs2.withColumnTupled(
   when(aptTypedDs2('city) === "Paris", aptTypedDs2('price)).
   when(aptTypedDs2('city) === "Lyon", lit(1.1)).
   otherwise(lit(0.0))).show(8).run()
// +-----+--------+---+--------+
// |   _1|      _2| _3|      _4|
// +-----+--------+---+--------+
// |Paris|300000.0|  2|300000.0|
// |Paris|450000.0|  3|450000.0|
// |Paris|250000.0|  1|250000.0|
// | Lyon|200000.0|  2|     1.1|
// | Lyon|133000.0|  1|     1.1|
// | Nice|325000.0|  3|     0.0|
// +-----+--------+---+--------+
//

A simple way to add a column without losing important schema information is to project the entire source schema into a single column using the asCol() method.

val c = cityBeds.select(cityBeds.asCol, lit(List("a","b","c")))
// c: TypedDataset[(CityBeds, List[String])] = [_1: struct<city: string, bedrooms: int>, _2: array<string>]
c.show(1).run()
// +----------+---------+
// |        _1|       _2|
// +----------+---------+
// |{Paris, 2}|[a, b, c]|
// +----------+---------+
// only showing top 1 row
//

When working with Spark's DataFrames, you often select all columns using .select($"*", ...). In a way, asCol() is a typed equivalent of $"*".

To access nested columns, use the colMany() method.

c.select(c.colMany('_1, 'city), c('_2)).show(2).run()
// +-----+---------+
// |   _1|       _2|
// +-----+---------+
// |Paris|[a, b, c]|
// |Paris|[a, b, c]|
// +-----+---------+
// only showing top 2 rows
//

Working with collections

import frameless.functions._
import frameless.functions.nonAggregate._

val t = cityRatio.select(cityRatio('city), lit(List("abc","c","d")))
// t: TypedDataset[(String, List[String])] = [_1: string, _2: array<string>]
t.withColumnTupled(
   arrayContains(t('_2), "abc")
).show(1).run()
// +-----+-----------+----+
// |   _1|         _2|  _3|
// +-----+-----------+----+
// |Paris|[abc, c, d]|true|
// +-----+-----------+----+
// only showing top 1 row
//

If accidentally you apply a collection function on a column that is not a collection, you get a compilation error.

t.withColumnTupled(
   arrayContains(t('_1), "abc")
)
// error: no type parameters for method arrayContains: (column: frameless.AbstractTypedColumn[T,C[A]], value: A)(implicit evidence$1: frameless.CatalystCollection[C])column.ThisType[T,Boolean] exist so that it can be applied to arguments (frameless.TypedColumn[(String, List[String]),String], String)
//  --- because ---
// argument expression's type is not compatible with formal parameter type;
//  found   : frameless.TypedColumn[(String, List[String]),String]
//  required: frameless.AbstractTypedColumn[?T,?C[?A]]
// 
// Error occurred in an application involving default arguments.
// error: type mismatch;
//  found   : frameless.TypedColumn[(String, List[String]),String]
//  required: frameless.AbstractTypedColumn[T,C[A]]
// Error occurred in an application involving default arguments.
// error: type mismatch;
//  found   : String("abc")
//  required: A
// Error occurred in an application involving default arguments.
//    arrayContains(t('_1), "abc")
//                          ^^^^^
// error: Cannot do collection operations on columns of type C.
// Error occurred in an application involving default arguments.

Flattening columns in Spark is done with the explode() method. Unlike vanilla Spark, in Frameless explode() is part of TypedDataset and not a function of a column. This provides additional safety since more than one explode() applied in a single statement results in runtime error in vanilla Spark.

val t2 = cityRatio.select(cityRatio('city), lit(List(1,2,3,4)))
// t2: TypedDataset[(String, List[Int])] = [_1: string, _2: array<int>]
val flattened = t2.explode('_2): TypedDataset[(String, Int)]
// flattened: TypedDataset[(String, Int)] = [_1: string, _2: int]
flattened.show(4).run()
// +-----+---+
// |   _1| _2|
// +-----+---+
// |Paris|  1|
// |Paris|  2|
// |Paris|  3|
// |Paris|  4|
// +-----+---+
// only showing top 4 rows
//

Here is an example of how explode() may fail in vanilla Spark. The Frameless implementation does not suffer from this problem since, by design, it can only be applied to a single column at a time.

{
  import org.apache.spark.sql.functions.{explode => sparkExplode}
  t2.dataset.toDF().select(sparkExplode($"_2"), sparkExplode($"_2"))
}
// error: Unit does not take parameters
// Error occurred in an application involving default arguments.

Collecting data to the driver

In Frameless all Spark actions (such as collect()) are safe.

Take the first element from a dataset (if the dataset is empty return None).

cityBeds.headOption.run()
// res30: Option[CityBeds] = Some(CityBeds("Paris", 2))

Take the first n elements.

cityBeds.take(2).run()
// res31: Seq[CityBeds] = WrappedArray(
//   CityBeds("Paris", 2),
//   CityBeds("Paris", 3)
// )

cityBeds.head(3).run()
// res32: Seq[CityBeds] = WrappedArray(
//   CityBeds("Paris", 2),
//   CityBeds("Paris", 3),
//   CityBeds("Paris", 1)
// )

cityBeds.limit(4).collect().run()
// res33: Seq[CityBeds] = WrappedArray(
//   CityBeds("Paris", 2),
//   CityBeds("Paris", 3),
//   CityBeds("Paris", 1),
//   CityBeds("Lyon", 2)
// )

Sorting columns

Only column types that can be sorted are allowed to be selected for sorting.

aptTypedDs.orderBy(aptTypedDs('city).asc).show(2).run()
// +----+-------+--------+--------+
// |city|surface|   price|bedrooms|
// +----+-------+--------+--------+
// |Lyon|     83|200000.0|       2|
// |Lyon|     45|133000.0|       1|
// +----+-------+--------+--------+
// only showing top 2 rows
//

The ordering can be changed by selecting .acs or .desc.

aptTypedDs.orderBy(
   aptTypedDs('city).asc, 
   aptTypedDs('price).desc
).show(2).run()
// +----+-------+--------+--------+
// |city|surface|   price|bedrooms|
// +----+-------+--------+--------+
// |Lyon|     83|200000.0|       2|
// |Lyon|     45|133000.0|       1|
// +----+-------+--------+--------+
// only showing top 2 rows
//

User Defined Functions

Frameless supports lifting any Scala function (up to five arguments) to the context of a particular TypedDataset:

// The function we want to use as UDF
val priceModifier =
    (name: String, price:Double) => if(name == "Paris") price * 2.0 else price
// priceModifier: (String, Double) => Double = <function2>

val udf = aptTypedDs.makeUDF(priceModifier)
// udf: (frameless.TypedColumn[Apartment, String], frameless.TypedColumn[Apartment, Double]) => frameless.TypedColumn[Apartment, Double] = frameless.functions.Udf$$Lambda$14297/0x0000000803d24040@59b58096

val aptds = aptTypedDs // For shorter expressions
// aptds: TypedDataset[Apartment] = [city: string, surface: int ... 2 more fields]

val adjustedPrice = aptds.select(aptds('city), udf(aptds('city), aptds('price)))
// adjustedPrice: TypedDataset[(String, Double)] = [_1: string, _2: double]

adjustedPrice.show().run()
// +-----+--------+
// |   _1|      _2|
// +-----+--------+
// |Paris|600000.0|
// |Paris|900000.0|
// |Paris|500000.0|
// | Lyon|200000.0|
// | Lyon|133000.0|
// | Nice|325000.0|
// +-----+--------+
//

GroupBy and Aggregations

Let's suppose we wanted to retrieve the average apartment price in each city

val priceByCity = aptTypedDs.groupBy(aptTypedDs('city)).agg(avg(aptTypedDs('price)))
// priceByCity: TypedDataset[(String, Double)] = [_1: string, _2: double]
priceByCity.collect().run()
// res37: Seq[(String, Double)] = WrappedArray(
//   ("Paris", 333333.3333333333),
//   ("Lyon", 166500.0),
//   ("Nice", 325000.0)
// )

Again if we try to aggregate a column that can't be aggregated, we get a compilation error

aptTypedDs.groupBy(aptTypedDs('city)).agg(avg(aptTypedDs('city)))
// error: Cannot compute average of type String.
// Error occurred in an application involving default arguments.

Next, we combine select and groupBy to calculate the average price/surface ratio per city:

val aptds = aptTypedDs // For shorter expressions
// aptds: TypedDataset[Apartment] = [city: string, surface: int ... 2 more fields]

val cityPriceRatio =  aptds.select(aptds('city), aptds('price) / aptds('surface).cast[Double])
// cityPriceRatio: TypedDataset[(String, Double)] = [_1: string, _2: double]

cityPriceRatio.groupBy(cityPriceRatio('_1)).agg(avg(cityPriceRatio('_2))).show().run()
// +-----+------------------+
// |   _1|                _2|
// +-----+------------------+
// |Paris| 6833.333333333333|
// | Lyon|2682.5970548862115|
// | Nice| 4391.891891891892|
// +-----+------------------+
//

We can also use pivot to further group data on a secondary column. For example, we can compare the average price across cities by number of bedrooms.

case class BedroomStats(
   city: String,
   AvgPriceBeds1: Option[Double], // Pivot values may be missing, so we encode them using Options
   AvgPriceBeds2: Option[Double],
   AvgPriceBeds3: Option[Double],
   AvgPriceBeds4: Option[Double])

val bedroomStats = aptds.
   groupBy(aptds('city)).
   pivot(aptds('bedrooms)).
   on(1,2,3,4). // We only care for up to 4 bedrooms
   agg(avg(aptds('price))).
   as[BedroomStats]  // Typesafe casting
// bedroomStats: TypedDataset[BedroomStats] = [city: string, AvgPriceBeds1: double ... 3 more fields]

bedroomStats.show().run()
// +-----+-------------+-------------+-------------+-------------+
// | city|AvgPriceBeds1|AvgPriceBeds2|AvgPriceBeds3|AvgPriceBeds4|
// +-----+-------------+-------------+-------------+-------------+
// | Nice|         NULL|         NULL|     325000.0|         NULL|
// |Paris|     250000.0|     300000.0|     450000.0|         NULL|
// | Lyon|     133000.0|     200000.0|         NULL|         NULL|
// +-----+-------------+-------------+-------------+-------------+
//

With pivot, collecting data preserves typesafety by encoding potentially missing columns with Option.

bedroomStats.collect().run().foreach(println)
// BedroomStats(Nice,None,None,Some(325000.0),None)
// BedroomStats(Paris,Some(250000.0),Some(300000.0),Some(450000.0),None)
// BedroomStats(Lyon,Some(133000.0),Some(200000.0),None,None)

Working with Optional fields

Optional fields can be converted to non-optional using getOrElse().

val sampleStats = bedroomStats.select(
   bedroomStats('AvgPriceBeds2).getOrElse(0.0),
   bedroomStats('AvgPriceBeds3).getOrElse(0.0))
// sampleStats: TypedDataset[(Double, Double)] = [_1: double, _2: double]

sampleStats.show().run()   
// +--------+--------+
// |      _1|      _2|
// +--------+--------+
// |     0.0|325000.0|
// |300000.0|450000.0|
// |200000.0|     0.0|
// +--------+--------+
//

In addition, optional columns can be flatten using the .flattenOption method on TypedDatset. The result contains the rows for which the flattened column is not None (or null). The schema is automatically adapted to reflect this change.

val flattenStats = bedroomStats.flattenOption('AvgPriceBeds2)
// flattenStats: TypedDataset[shapeless.ops.TuplerInstances.<refinement>.this.type.Out] = [_1: string, _2: double ... 3 more fields]


// The second Option[Double] is now of type Double, since all 'null' values are removed
flattenStats: TypedDataset[(String, Option[Double], Double, Option[Double], Option[Double])]
// res43: TypedDataset[(String, Option[Double], Double, Option[Double], Option[Double])] = [_1: string, _2: double ... 3 more fields]

In a DataFrame, if you just ignore types, this would equivelantly be written as:

bedroomStats.dataset.toDF().filter($"AvgPriceBeds2".isNotNull)
// res44: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [city: string, AvgPriceBeds1: double ... 3 more fields]

Entire TypedDataset Aggregation

We often want to aggregate the entire TypedDataset and skip the groupBy() clause. In Frameless you can do this using the agg() operator directly on the TypedDataset. In the following example, we compute the average price, the average surface, the minimum surface, and the set of cities for the entire dataset.

case class Stats(
   avgPrice: Double,
   avgSurface: Double,
   minSurface: Int,
   allCities: Vector[String])

aptds.agg(
   avg(aptds('price)),
   avg(aptds('surface)),
   min(aptds('surface)),
   collectSet(aptds('city))
).as[Stats].show().run()
// +-----------------+------------------+----------+-------------------+
// |         avgPrice|        avgSurface|minSurface|          allCities|
// +-----------------+------------------+----------+-------------------+
// |276333.3333333333|62.833333333333336|        25|[Paris, Nice, Lyon]|
// +-----------------+------------------+----------+-------------------+
//

You may apply any TypedColumn operation to a TypedAggregate column as well.

import frameless.functions._
aptds.agg(
   avg(aptds('price)) * min(aptds('surface)).cast[Double], 
   avg(aptds('surface)) * 0.2,
   litAggr("Hello World")
).show().run()
// +-----------------+------------------+-----------+
// |               _1|                _2|         _3|
// +-----------------+------------------+-----------+
// |6908333.333333333|12.566666666666668|Hello World|
// +-----------------+------------------+-----------+
//

Joins

case class CityPopulationInfo(name: String, population: Int)

val cityInfo = Seq(
  CityPopulationInfo("Paris", 2229621),
  CityPopulationInfo("Lyon", 500715),
  CityPopulationInfo("Nice", 343629)
)

val citiInfoTypedDS = TypedDataset.create(cityInfo)

Here is how to join the population information to the apartment's dataset:

val withCityInfo = aptTypedDs.joinInner(citiInfoTypedDS) { aptTypedDs('city) === citiInfoTypedDS('name) }
// withCityInfo: TypedDataset[(Apartment, CityPopulationInfo)] = [_1: struct<city: string, surface: int ... 2 more fields>, _2: struct<name: string, population: int>]

withCityInfo.show().run()
// +--------------------+----------------+
// |                  _1|              _2|
// +--------------------+----------------+
// |{Paris, 50, 30000...|{Paris, 2229621}|
// |{Paris, 100, 4500...|{Paris, 2229621}|
// |{Paris, 25, 25000...|{Paris, 2229621}|
// |{Lyon, 83, 200000...|  {Lyon, 500715}|
// |{Lyon, 45, 133000...|  {Lyon, 500715}|
// |{Nice, 74, 325000...|  {Nice, 343629}|
// +--------------------+----------------+
//

The joined TypedDataset has type TypedDataset[(Apartment, CityPopulationInfo)].

We can then select which information we want to continue to work with:

case class AptPriceCity(city: String, aptPrice: Double, cityPopulation: Int)

withCityInfo.select(
   withCityInfo.colMany('_2, 'name), withCityInfo.colMany('_1, 'price), withCityInfo.colMany('_2, 'population)
).as[AptPriceCity].show().run
// +-----+--------+--------------+
// | city|aptPrice|cityPopulation|
// +-----+--------+--------------+
// |Paris|300000.0|       2229621|
// |Paris|450000.0|       2229621|
// |Paris|250000.0|       2229621|
// | Lyon|200000.0|        500715|
// | Lyon|133000.0|        500715|
// | Nice|325000.0|        343629|
// +-----+--------+--------------+
//

Chained Joins

Joins, or any similar operation, may be chained using a thrush combinator removing the need for intermediate values. Instead of:

val withBedroomInfoInterim = aptTypedDs.joinInner(citiInfoTypedDS)( aptTypedDs('city) === citiInfoTypedDS('name) )
// withBedroomInfoInterim: TypedDataset[(Apartment, CityPopulationInfo)] = [_1: struct<city: string, surface: int ... 2 more fields>, _2: struct<name: string, population: int>]
val withBedroomInfo = withBedroomInfoInterim 
  .joinLeft(bedroomStats)( withBedroomInfoInterim.col('_1).field('city) === bedroomStats('city) )
// withBedroomInfo: TypedDataset[((Apartment, CityPopulationInfo), Option[BedroomStats])] = [_1: struct<_1: struct<city: string, surface: int ... 2 more fields>, _2: struct<name: string, population: int>>, _2: struct<city: string, AvgPriceBeds1: double ... 3 more fields>]

withBedroomInfo.show().run()
// +--------------------+--------------------+
// |                  _1|                  _2|
// +--------------------+--------------------+
// |{{Paris, 50, 3000...|{Paris, 250000.0,...|
// |{{Paris, 100, 450...|{Paris, 250000.0,...|
// |{{Paris, 25, 2500...|{Paris, 250000.0,...|
// |{{Lyon, 83, 20000...|{Lyon, 133000.0, ...|
// |{{Lyon, 45, 13300...|{Lyon, 133000.0, ...|
// |{{Nice, 74, 32500...|{Nice, NULL, NULL...|
// +--------------------+--------------------+
//

You can use thrush from mouse:

libraryDependencies += "org.typelevel" %% "mouse" % "1.2.1"

import mouse.all._

val withBedroomInfoChained = aptTypedDs.joinInner(citiInfoTypedDS)( aptTypedDs('city) === citiInfoTypedDS('name) )
  .thrush( interim => interim.joinLeft(bedroomStats)( interim.col('_1).field('city) === bedroomStats('city) ) )
// withBedroomInfoChained: TypedDataset[((Apartment, CityPopulationInfo), Option[BedroomStats])] = [_1: struct<_1: struct<city: string, surface: int ... 2 more fields>, _2: struct<name: string, population: int>>, _2: struct<city: string, AvgPriceBeds1: double ... 3 more fields>]

withBedroomInfoChained.show().run()
// +--------------------+--------------------+
// |                  _1|                  _2|
// +--------------------+--------------------+
// |{{Paris, 50, 3000...|{Paris, 250000.0,...|
// |{{Paris, 100, 450...|{Paris, 250000.0,...|
// |{{Paris, 25, 2500...|{Paris, 250000.0,...|
// |{{Lyon, 83, 20000...|{Lyon, 133000.0, ...|
// |{{Lyon, 45, 13300...|{Lyon, 133000.0, ...|
// |{{Nice, 74, 32500...|{Nice, NULL, NULL...|
// +--------------------+--------------------+
//