PySpark

PySpark commands for data analysis, manipulation, and visualization

Commands

15 commands available

Initialize Spark

Create a SparkSession, the entry point to PySpark.

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("app").getOrCreate()

Tags: pyspark, init, spark

Create DataFrame

Create a DataFrame from a list of rows (or an RDD).

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])

Tags: pyspark, dataframe, create

Read CSV

Load a CSV file into a DataFrame.

df = spark.read.csv("path/to/file.csv", header=True, inferSchema=True)

Tags: pyspark, read, csv

Show DataFrame

Display the first rows of a DataFrame (20 by default; pass a number to change it).

df.show()

Tags: pyspark, dataframe, view

Print Schema

Print the schema of a DataFrame in tree format.

df.printSchema()

Tags: pyspark, dataframe, schema

Select Columns

Select specific columns from a DataFrame.

df.select("col1", "col2")

Tags: pyspark, dataframe, select

Filter Rows

Filter a DataFrame's rows by a condition.

df.filter(df["col1"] > 0)

Tags: pyspark, dataframe, filter

Where

Alias for filter().

df.where(df["col1"] > 0)

Tags: pyspark, dataframe, where

Group By

Group a DataFrame by one or more columns, then apply an aggregation.

df.groupBy("col").count()

Tags: pyspark, dataframe, aggregate

Aggregation

Aggregate with functions such as sum, avg, and max.

from pyspark.sql import functions as F
df.agg(F.sum("col1"), F.avg("col2"))

Tags: pyspark, dataframe, aggregate

Order By

Sort a DataFrame by one or more columns.

df.orderBy("col", ascending=False)

Tags: pyspark, dataframe, sort

Join DataFrames

Join two DataFrames on a key column.

df1.join(df2, on="col", how="inner")

Tags: pyspark, dataframe, join

Drop Column

Drop one or more columns from a DataFrame.

df.drop("col")

Tags: pyspark, dataframe, drop

Run SQL

Execute SQL queries against DataFrames registered as temporary views.

df.createOrReplaceTempView("table")
spark.sql("SELECT * FROM table")

Tags: pyspark, sql, query

Save DataFrame

Write a DataFrame to storage in a chosen format.

df.write.format("csv").option("header", True).save("path")

Tags: pyspark, dataframe, save