How to rename all the columns in a DataFrame in one go?
In Python there are usually several ways to achieve the same result, and which one you choose is largely a matter of preference.
Below is the first method.
>>> df=spark.read.csv("/employees/employees.csv",header=True,inferSchema=True)
>>> df.printSchema()
root
|-- Emp ID: integer (nullable = true)
|-- Name Prefix: string (nullable = true)
|-- First Name: string (nullable = true)
|-- Middle Initial: string (nullable = true)
|-- Last Name: string (nullable = true)
|-- Gender: string (nullable = true)
|-- E Mail: string (nullable = true)
|-- Father's Name: string (nullable = true)
|-- Mother's Name: string (nullable = true)
|-- Mother's Maiden Name: string (nullable = true)
|-- Date of Birth: string (nullable = true)
|-- Date of Joining: string (nullable = true)
|-- Salary: integer (nullable = true)
|-- Phone No. : string (nullable = true)
|-- Place Name: string (nullable = true)
|-- County: string (nullable = true)
|-- City: string (nullable = true)
|-- State: string (nullable = true)
|-- Zip: integer (nullable = true)
|-- Region: string (nullable = true)
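The schema above has spaces, "." and "'s" in the column names.
Here is a one-liner that renames all the columns: it replaces spaces with underscores, drops the "._" left behind by "Phone No. ", turns "'s_" into "_", and lowercases the result.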
>>> df=df.toDF(*(x.replace(" ","_").replace("._","").replace("'s_","_").lower() for x in df.columns))
>>> df.printSchema()
root
|-- emp_id: integer (nullable = true)
|-- name_prefix: string (nullable = true)
|-- first_name: string (nullable = true)
|-- middle_initial: string (nullable = true)
|-- last_name: string (nullable = true)
|-- gender: string (nullable = true)
|-- e_mail: string (nullable = true)
|-- father_name: string (nullable = true)
|-- mother_name: string (nullable = true)
|-- mother_maiden_name: string (nullable = true)
|-- date_of_birth: string (nullable = true)
|-- date_of_joining: string (nullable = true)
|-- salary: integer (nullable = true)
|-- phone_no: string (nullable = true)
|-- place_name: string (nullable = true)
|-- county: string (nullable = true)
|-- city: string (nullable = true)
|-- state: string (nullable = true)
|-- zip: integer (nullable = true)
|-- region: string (nullable = true)
Now every column name is lowercase, with the spaces replaced by underscores and the "." and "'s" removed.
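The order of the replacements in the one-liner matters: the spaces are turned into underscores first, and only then can the "._" and "'s_" patterns match. As a small illustration, the sketch below applies the same chain to a few of the original column names in plain Python, without Spark. The helper name clean is made up for this example.

```python
def clean(name):
    # Same chain of replacements as the one-liner, applied to a single name.
    # Spaces are replaced first so that "._" and "'s_" can match afterwards.
    return name.replace(" ", "_").replace("._", "").replace("'s_", "_").lower()

# A few column names taken from the original schema.
samples = ["Emp ID", "Father's Name", "Mother's Maiden Name", "Phone No. "]
for s in samples:
    print(repr(s), "->", repr(clean(s)))
```

Note that this chain only removes a "." that is followed by a space in the original name; a name ending in "." with no trailing space would keep its period.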
Here is the second method, which renames the columns one at a time in a loop.
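def col_rename(df):
    # Rename each column in turn; withColumnRenamed returns a new DataFrame.
    for old_col in df.columns:
        # Replace spaces first so that the "._" and "'s_" patterns can match.
        new_col = old_col.replace(" ", "_").replace("._", "").replace("'s_", "_").lower()
        df = df.withColumnRenamed(old_col, new_col)
    return df

# Apply to the DataFrame as originally read from the CSV.
df = col_rename(df)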
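Both methods hard-code the replacements for this particular file. If the headers are less predictable, a regular expression may be easier to maintain. The following is only a sketch under that assumption: normalize and normalize_all are hypothetical helper names, not part of PySpark, and the duplicate check is an extra safeguard that the original methods do not have.

```python
import re

def normalize(name):
    """Hypothetical helper: turn an arbitrary header into a snake_case name."""
    name = re.sub(r"'s\b", "", name)            # drop a possessive 's
    name = re.sub(r"[^0-9A-Za-z]+", "_", name)  # any other run of characters becomes "_"
    return name.strip("_").lower()              # trim stray underscores, lowercase

def normalize_all(names):
    """Normalize every name and fail early if two names would collide."""
    new_names = [normalize(n) for n in names]
    if len(set(new_names)) != len(new_names):
        raise ValueError("renaming would produce duplicate column names")
    return new_names

print(normalize_all(["Emp ID", "Father's Name", "Phone No. ", "E Mail"]))
```

With a DataFrame at hand, the result could be passed to the same call used in the first method, for example df.toDF(*normalize_all(df.columns)).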