How to load a text file into a Spark RDD and convert it to a Spark DataFrame?

import os

# Print the file so we can see what we are working with.
# Use the path where the file lives on your machine; on mine it is saved at the path below.
os.system("cat /home/mehaa/family.csv")
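The file holds one record per line, with four comma-separated fields (an id, a name, an age, and a relation), as the outputs below confirm:

102,Gokula,37,Mother
103,Mehaa,5,Daughter
104,Rithihaa,2,Daughter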

>>> rdd=sc.textFile('/home/mehaa/family.csv')
>>> rdd.collect()
['102,Gokula,37,Mother', '103,Mehaa,5,Daughter', '104,Rithihaa,2,Daughter']

Now we need to split each record on the comma.

>>> rdd1=rdd.map(lambda x:x.split(','))
>>> rdd1.collect()
[['102', 'Gokula', '37', 'Mother'], ['103', 'Mehaa', '5', 'Daughter'], ['104', 'Rithihaa', '2', 'Daughter']]

As you can see from the above output, we have 4 columns and 3 rows.
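A quick caveat: split(',') works here only because none of the fields contain commas. If your file has quoted fields with embedded commas, a more robust sketch is to run Python's csv module over each partition (mapPartitions and csv.reader are standard; this is just an alternative to the split above, not what the rest of this post uses):

>>> import csv
>>> rdd1=rdd.mapPartitions(lambda lines: csv.reader(lines))

For this file it produces the same lists of fields as the split above.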

We need to give these columns meaningful names.

To do that, we import the Row class, which lets us build named columns from the RDD.

>>> from pyspark.sql import Row
>>> rdd2=rdd1.map(lambda x:Row(id=x[0],name=x[1],age=int(x[2]),reln=x[3]))

>>> df=rdd2.toDF()
>>> df.show()
+---+--------+---+--------+
| id|    name|age|    reln|
+---+--------+---+--------+
|102|  Gokula| 37|  Mother|
|103|   Mehaa|  5|Daughter|
|104|Rithihaa|  2|Daughter|
+---+--------+---+--------+
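It is worth a quick schema check at this point. We cast age to an int but left id as a string, so the inferred schema should look something like this (the exact type names can vary slightly by Spark version):

>>> df.printSchema()
root
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
 |-- reln: string (nullable = true)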

We can create the DataFrame with the createDataFrame method as well.

>>> df1=spark.createDataFrame(rdd2)
>>> df1.show()
+---+--------+---+--------+
| id|    name|age|    reln|
+---+--------+---+--------+
|102|  Gokula| 37|  Mother|
|103|   Mehaa|  5|Daughter|
|104|Rithihaa|  2|Daughter|
+---+--------+---+--------+
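As a side note, if you only need the DataFrame, recent Spark versions let you skip the RDD step entirely and have the DataFrameReader parse the CSV. A minimal sketch, assuming the same file path and the column names we chose above (read.csv accepts a DDL-formatted schema string in Spark 2.3+):

>>> df2=spark.read.csv('/home/mehaa/family.csv', schema='id string, name string, age int, reln string')
>>> df2.show()

This should display the same table as above.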

