How to load a text file into a Spark RDD and convert it to a Spark DataFrame?

import os

# Print the file so we can see what we are working with.
# Use the path where the file lives on your machine; on mine it is saved at the path below.
os.system("cat /home/mehaa/family.csv")
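The file holds one record per line, with four comma-separated fields (an id, a name, an age, and a relation), as the outputs below confirm:

102,Gokula,37,Mother
103,Mehaa,5,Daughter
104,Rithihaa,2,Daughter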

>>> rdd=sc.textFile('/home/mehaa/family.csv')
>>> rdd.collect()
['102,Gokula,37,Mother', '103,Mehaa,5,Daughter', '104,Rithihaa,2,Daughter']

Now we need to split each record on the comma.

>>> rdd1=rdd.map(lambda x:x.split(','))
>>> rdd1.collect()
[['102', 'Gokula', '37', 'Mother'], ['103', 'Mehaa', '5', 'Daughter'], ['104', 'Rithihaa', '2', 'Daughter']]

As you can see from the above output, we have 4 columns and 3 rows.
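A quick caveat: split(',') works here only because none of the fields contain commas. If your file has quoted fields with embedded commas, a more robust sketch is to run Python's csv module over each partition (mapPartitions and csv.reader are standard; this is just an alternative to the split above, not what the rest of this post uses):

>>> import csv
>>> rdd1=rdd.mapPartitions(lambda lines: csv.reader(lines))

For this file it produces the same lists of fields as the split above.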

We need to give these columns meaningful names.

To do that, we import the Row class, which lets us build named columns from the RDD.

>>> from pyspark.sql import Row
>>> rdd2=rdd1.map(lambda x:Row(id=x[0],name=x[1],age=int(x[2]),reln=x[3]))

>>> df=rdd2.toDF()
>>> df.show()
+---+--------+---+--------+
| id|    name|age|    reln|
+---+--------+---+--------+
|102|  Gokula| 37|  Mother|
|103|   Mehaa|  5|Daughter|
|104|Rithihaa|  2|Daughter|
+---+--------+---+--------+
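It is worth a quick schema check at this point. We cast age to an int but left id as a string, so the inferred schema should look something like this (the exact type names can vary slightly by Spark version):

>>> df.printSchema()
root
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
 |-- reln: string (nullable = true)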

We can create the DataFrame with the createDataFrame method as well.

>>> df1=spark.createDataFrame(rdd2)
>>> df1.show()
+---+--------+---+--------+
| id|    name|age|    reln|
+---+--------+---+--------+
|102|  Gokula| 37|  Mother|
|103|   Mehaa|  5|Daughter|
|104|Rithihaa|  2|Daughter|
+---+--------+---+--------+
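As a side note, if you only need the DataFrame, recent Spark versions let you skip the RDD step entirely and have the DataFrameReader parse the CSV. A minimal sketch, assuming the same file path and the column names we chose above (read.csv accepts a DDL-formatted schema string in Spark 2.3+):

>>> df2=spark.read.csv('/home/mehaa/family.csv', schema='id string, name string, age int, reln string')
>>> df2.show()

This should display the same table as above.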

