Spark - The Elite List
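
The snippets below assume a SparkSession bound to the name spark and the functions module imported as F; a minimal setup sketch (the app name is illustrative):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# "elite-list" is a placeholder app name; adjust to taste.
spark = SparkSession.builder.appName("elite-list").getOrCreate()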


Filter rows by column value in PySpark.

# Keep only non-empty rows: col_name is not null and not blank.
df.where(F.col("col_name").isNotNull() & (F.trim(F.col("col_name")) != "")).show()
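
The same filter can also be written as a SQL predicate string, which some find easier to read; a sketch using the same hypothetical col_name:

# Equivalent filter expressed as a SQL expression string.
df.where("col_name IS NOT NULL AND trim(col_name) != ''").show()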

Create a new column in a Spark DataFrame from an existing column using the withColumn method.

# Column slicing is 1-based: old_col[1:2] is shorthand for old_col.substr(1, 2),
# i.e. a substring starting at position 1 with length 2.
df = df.withColumn('new_col', df.old_col.substr(1, 2))
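
withColumn accepts any Column expression, not just slices; a sketch with hypothetical column names:

# Derive a column from an expression: upper-case old_col with a literal prefix.
df = df.withColumn('tagged', F.concat(F.lit('ID-'), F.upper(df.old_col)))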

Groupby and aggregate example in PySpark.

# Minimum of agg_col within each group, sorted by the grouping column.
df.groupBy('col').agg(F.min('agg_col').alias('agg')).orderBy('col', ascending=True)
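
agg can compute several aggregates in one pass; a sketch over the same hypothetical columns:

# min, max, and row count per group in a single aggregation.
df.groupBy('col').agg(
    F.min('agg_col').alias('min_val'),
    F.max('agg_col').alias('max_val'),
    F.count('*').alias('n'),
).orderBy('col')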

Save a Spark DataFrame as Apache Parquet files in HDFS.

# Write the DataFrame out as Parquet files under the given HDFS path.
df.write.parquet("hdfs:///user/abc123/outputs/df.parquet")
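
On re-runs the write fails if the path already exists; mode("overwrite") avoids that, and Parquet stores the schema, so reading it back needs no column definitions (same illustrative path):

# Overwrite any existing output instead of failing on re-runs.
df.write.mode("overwrite").parquet("hdfs:///user/abc123/outputs/df.parquet")

# The schema travels with the Parquet files.
df2 = spark.read.parquet("hdfs:///user/abc123/outputs/df.parquet")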

Load a fixed-width text file as a DataFrame in PySpark.

df = spark.read.text("hdfs:///path/to/dir")  # one string column named "value" per line
# substr(startPos, length) is 1-based: ID spans positions 1-2, NAME spans positions 4-60.
df = df.select(df.value.substr(1, 2).alias('ID'), df.value.substr(4, 57).alias('NAME'))
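
Fixed-width fields often carry padding; a sketch that trims NAME and casts ID to an integer (column names as above):

# Strip padding and cast types after slicing the fixed-width fields.
df = df.withColumn('NAME', F.trim(df.NAME)).withColumn('ID', df.ID.cast('int'))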