How to rename all the columns in a DataFrame in one go?

In Python, there are many ways to achieve the same thing, and the choice varies from person to person.

Below is the first method.

>>> df=spark.read.csv("/employees/employees.csv",header=True,inferSchema=True)
>>> df.printSchema()
root
 |-- Emp ID: integer (nullable = true)
 |-- Name Prefix: string (nullable = true)
 |-- First Name: string (nullable = true)
 |-- Middle Initial: string (nullable = true)
 |-- Last Name: string (nullable = true)
 |-- Gender: string (nullable = true)
 |-- E Mail: string (nullable = true)
 |-- Father's Name: string (nullable = true)
 |-- Mother's Name: string (nullable = true)
 |-- Mother's Maiden Name: string (nullable = true)
 |-- Date of Birth: string (nullable = true)
 |-- Date of Joining: string (nullable = true)
 |-- Salary: integer (nullable = true)
 |-- Phone No. : string (nullable = true)
 |-- Place Name: string (nullable = true)
 |-- County: string (nullable = true)
 |-- City: string (nullable = true)
 |-- State: string (nullable = true)
 |-- Zip: integer (nullable = true)
 |-- Region: string (nullable = true)

The schema above has spaces, periods ("."), and apostrophes ("'s") in the column names.

Here is the one-line code to rename all the columns. toDF takes the new column names positionally, so a generator expression over df.columns can clean every name in a single pass.

>>> df=df.toDF(*(x.replace(" ","_").replace("._","").replace("'s_","_").lower() for x in df.columns))
>>> df.printSchema()
root
 |-- emp_id: integer (nullable = true)
 |-- name_prefix: string (nullable = true)
 |-- first_name: string (nullable = true)
 |-- middle_initial: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- e_mail: string (nullable = true)
 |-- father_name: string (nullable = true)
 |-- mother_name: string (nullable = true)
 |-- mother_maiden_name: string (nullable = true)
 |-- date_of_birth: string (nullable = true)
 |-- date_of_joining: string (nullable = true)
 |-- salary: integer (nullable = true)
 |-- phone_no: string (nullable = true)
 |-- place_name: string (nullable = true)
 |-- county: string (nullable = true)
 |-- city: string (nullable = true)
 |-- state: string (nullable = true)
 |-- zip: integer (nullable = true)
 |-- region: string (nullable = true)

Now all the column names are lowercase, with underscores in place of the problem characters.

Here is the second method, which renames the columns one at a time with withColumnRenamed.

def col_rename(df):
    for old_col in df.columns:
        # Replace spaces first so the later patterns ("._" and "'s_") can match
        new_col = old_col.replace(" ", "_").replace("._", "").replace("'s_", "_").lower()
        df = df.withColumnRenamed(old_col, new_col)
    return df

df = col_rename(df)
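
A third, equivalent approach (a sketch of my own, not from the original post) builds all the renames as a single select with alias. The backticks around each original name matter, because names like "Phone No. " contain a dot, which Spark would otherwise treat as a struct-field separator:

from pyspark.sql.functions import col

df = df.select([
    col("`" + c + "`").alias(c.replace(" ", "_").replace("._", "").replace("'s_", "_").lower())
    for c in df.columns
])

Unlike the loop in the second method, this builds one projection for all the columns rather than chaining one rename at a time.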

How to load a text file into a Spark RDD and convert it to a Spark DataFrame?

First, take a look at the file. Use the path where the file is on your machine; on my machine I have saved it at the path below.

import os
os.system("cat /home/mehaa/family.csv")

102,Gokula,37,Mother
103,Mehaa,5,Daughter
104,Rithihaa,2,Daughter

Load the file into an RDD using the SparkContext (sc) that the pyspark shell provides:

>>> rdd=sc.textFile('/home/mehaa/family.csv')
>>> rdd.collect()
['102,Gokula,37,Mother', '103,Mehaa,5,Daughter', '104,Rithihaa,2,Daughter']

Now, we need to split the records. 

>>> rdd1=rdd.map(lambda x:x.split(','))
>>> rdd1.collect()
[['102', 'Gokula', '37', 'Mother'], ['103', 'Mehaa', '5', 'Daughter'], ['104', 'Rithihaa', '2', 'Daughter']]

As you can see from the above output, we have 4 columns and 3 rows.

We need to give those columns meaningful names.

We must import the Row class to create named columns from the RDD.

>>> from pyspark.sql import Row

>>> rdd2=rdd1.map(lambda x:Row(id=x[0],name=x[1],age=int(x[2]),reln=x[3]))
>>> df=rdd2.toDF()
>>> df.show()
+---+--------+---+--------+
| id|    name|age|    reln|
+---+--------+---+--------+
|102|  Gokula| 37|  Mother|
|103|   Mehaa|  5|Daughter|
|104|Rithihaa|  2|Daughter|
+---+--------+---+--------+
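
For reference, toDF infers the column types from the Row fields, so age (a Python int) becomes a long while the other fields stay strings. A quick check (output reconstructed by me, not from the original post):

>>> df.printSchema()
root
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
 |-- reln: string (nullable = true)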

We can create a DataFrame using the createDataFrame method as well.

>>> df1=spark.createDataFrame(rdd2)
>>> df1.show()
+---+--------+---+--------+
| id|    name|age|    reln|
+---+--------+---+--------+
|102|  Gokula| 37|  Mother|
|103|   Mehaa|  5|Daughter|
|104|Rithihaa|  2|Daughter|
+---+--------+---+--------+
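
If you want explicit control over the types instead of relying on Row, createDataFrame also accepts a schema. A minimal sketch, reusing the rdd1 from above (the StructType below is my own, not from the original post):

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("id", StringType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("reln", StringType(), True),
])

# rdd1 holds lists of strings, so cast age to int to match the schema
rdd3 = rdd1.map(lambda x: (x[0], x[1], int(x[2]), x[3]))
df2 = spark.createDataFrame(rdd3, schema)
df2.show()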

How to read Hive tables in PySpark? 

Here is the video. 
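
In short, the usual pattern looks like the sketch below. It assumes the SparkSession was created with Hive support enabled and uses a hypothetical table name (employees.departments):

from pyspark.sql import SparkSession

# Hive support must be requested when the session is created
spark = SparkSession.builder \
    .appName("hive-read") \
    .enableHiveSupport() \
    .getOrCreate()

# Read a Hive table directly...
df = spark.table("employees.departments")

# ...or through a SQL query
df = spark.sql("SELECT * FROM employees.departments")
df.show()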


How to read a MySQL table in PySpark?

PySpark supports many data sources, such as CSV, JSON, ORC, Parquet, and JDBC.

Here is an example that reads data from MySQL.

Code:

spark.read.format("jdbc").options(
    url="jdbc:mysql://localhost:3306/employees?useSSL=false&user=root&password=mehaa1903",
    dbtable="departments"
).load().show()
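
Note that this needs the MySQL JDBC connector jar on the classpath, for example by launching pyspark with --jars /path/to/mysql-connector-java.jar (use your own path). The same read can also be written with individual option calls, which is easier to scan; a sketch using the connection details from the example above:

df = (spark.read.format("jdbc")
      .option("url", "jdbc:mysql://localhost:3306/employees?useSSL=false")
      .option("user", "root")
      .option("password", "mehaa1903")
      .option("dbtable", "departments")
      .load())
df.show()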