Spark - The Elite List
Filter rows by column value in PySpark.
from pyspark.sql import functions as F
# Keep rows whose value does not contain a space ("% %" matches any string with an embedded space).
df.where(~F.col("col_name").like("% %")).show()
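To drop empty or null values outright, a direct comparison also works (a minimal sketch; col_name is a placeholder):
# Drop nulls and empty strings instead of pattern matching.
df.where(F.col("col_name").isNotNull() & (F.col("col_name") != "")).show()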
Create a new column in a Spark DataFrame from an existing column using the withColumn method.
# Column slicing is 1-based: old_col[1:2] is shorthand for old_col.substr(1, 2), i.e. (start position, length).
df = df.withColumn('new_col', df.old_col[1:2])
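withColumn accepts any Column expression, so derived columns can combine built-in functions (a sketch with placeholder column names, using the F import above):
# Derive a trimmed, upper-cased variant of the existing column.
df = df.withColumn('name_upper', F.upper(F.trim(df.old_col)))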
GroupBy and aggregate example in PySpark.
df.groupBy('col').agg(F.min('agg_col').alias('agg')).orderBy('col', ascending=True)
Save a Spark DataFrame as Apache Parquet in HDFS.
df.write.parquet("hdfs:///user/abc123/outputs/df.parquet")
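To overwrite an existing output at the same path and read it back later (same placeholder path as above):
# Overwrite any previous run's output instead of failing if the path exists.
df.write.mode('overwrite').parquet("hdfs:///user/abc123/outputs/df.parquet")
# Read the Parquet files back into a DataFrame.
df2 = spark.read.parquet("hdfs:///user/abc123/outputs/df.parquet")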
Load a fixed-width text file as a DataFrame in PySpark.
# Read each line of the file as a single string column named "value".
df = spark.read.text("hdfs:///path/to/dir")
# Slice fixed-width fields out of each line; substr(start, length) is 1-based.
df = df.select(df.value.substr(1, 2).alias('ID'), df.value.substr(4, 57).alias('NAME'))
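Fields sliced out of fixed-width text are strings, so trimming padding and casting usually follow (a sketch assuming the ID field holds integers):
# Trim whitespace padding and cast the numeric field.
df = df.select(
    F.trim(df.value.substr(1, 2)).cast('int').alias('ID'),
    F.trim(df.value.substr(4, 57)).alias('NAME'),
)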