An easy way to convert JSON into CSV

Thu 05 March 2020

I was recently asked to export data from a service that's going offline soon. That alone wouldn't be an interesting topic to write about, but there was a catch. The exported data is in JSON format, and I was asked to provide CSV files. When I started the export I noticed that some rows were duplicated. I could have added a simple fix for that to the export script, but running the script took a couple of days. So, apart from converting JSON to CSV, I also had to remove duplicates.

While the export script was a one-time thing, data conversion seems like something I will need to do on various occasions. With that in mind, I created a simple script that uses pandas for the conversion and duplicate removal. The script requires a path argument, which is the location of the JSON files to convert.

import argparse
import os
from glob import glob

import pandas as pd


def convert(input_file):
    df = pd.read_json(input_file)

    # lists and dictionaries need to be converted to strings,
    # otherwise pandas raises a TypeError during drop_duplicates
    columns = list(df)
    for col in columns:
        if df[col].dtype == 'object':
            df[col] = df[col].astype('str')

    dirname = os.path.dirname(input_file)
    filename = os.path.basename(input_file)

    df.drop_duplicates(inplace=True, ignore_index=True)
    df.to_csv(os.path.join(dirname, filename.replace('.json', '.csv')), index=False)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--path', required=True)
    args = parser.parse_args()

    for file in glob(f'{args.path}/**/*.json', recursive=True):
        print(f'processing {file}')
        convert(file)
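To illustrate why the object columns get cast to strings: drop_duplicates hashes row values, and lists or dictionaries coming from nested JSON are unhashable, so pandas throws a TypeError. A minimal sketch with made-up sample data:

```python
import pandas as pd

# a column holding lists, as nested JSON often produces
df = pd.DataFrame({'id': [1, 1], 'tags': [['a', 'b'], ['a', 'b']]})

try:
    df.drop_duplicates()
except TypeError as e:
    print(e)  # unhashable type: 'list'

# casting the object column to str makes the rows comparable
df['tags'] = df['tags'].astype('str')
deduped = df.drop_duplicates(ignore_index=True)
print(len(deduped))  # 1
```

The cast is lossy in the sense that the CSV cells end up containing Python-repr strings like "['a', 'b']", but for an archival export that was good enough for me.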

One modification I will probably make soon is a second option that allows skipping the duplicate removal.
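That option could be a simple argparse flag; a sketch, assuming a flag named --skip-dedup (the name is just my placeholder):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--path', required=True)
# hypothetical flag: when passed, duplicated rows are kept as-is
parser.add_argument('--skip-dedup', action='store_true')

# example invocation; the real script would call parser.parse_args()
args = parser.parse_args(['--path', 'exports', '--skip-dedup'])
print(args.skip_dedup)  # True

# inside convert(), the deduplication would then become conditional:
# if not skip_dedup:
#     df.drop_duplicates(inplace=True, ignore_index=True)
```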

This was my first real-life pandas experience, so don't be too harsh if I made a mistake.